libextractor
============

libextractor is a simple library for keyword extraction. It does not
support all formats, but it provides a simple plugin mechanism so that
you can quickly add extractors for additional formats, even without
recompiling libextractor. libextractor typically ships with a dozen
helper libraries that can be used to obtain keywords from common file
types.

libextractor is a part of the GNU project (http://www.gnu.org/).


extract
=======

extract is a simple command-line interface to libextractor.


Dependencies
============

libextractor requires Python (2.3, or better 2.4, including development
files) and a JNI header file (jni.h) for Java. Further requirements
include:

* libvorbisfile
* zlib (compression library)
* c++ compiler
* libltdl 2.2.x (from GNU libtool)
* libtool 1.5 or higher
* GNU gettext
* glib 2.6
* gtk 2.6 (for thumbnails, gdk-pixbuf)

When building libextractor binaries, please make sure all of these
dependencies are available. Otherwise the build system may
automatically build only a subset of libextractor.


Writing plugins
===============

If you want to write your own extractor for some file type, all you
need to do is write a small library that implements a single function
with this signature:

    KeywordList * LIBNAME_extract(const char * filename,
                                  const char * data,
                                  size_t size,
                                  KeywordList * prev,
                                  const char * options);

where LIBNAME is the name of the library file that you will tell
libextractor to load, minus the suffix. For example, if you link your
extractor into a file called 'myextractor.so', the function above
should be called 'myextractor_extract'.

The filename is the name of the file and may be NULL, data is a pointer
to the contents of the file, and size is the size of the file in bytes.
The extract function must prepend the keywords that it finds to the
linked list 'prev' and return the new head.
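
The following is a minimal sketch of such an extractor, assuming a
library named 'myextractor.so' as in the example above. The KeywordList
struct shown here is a hypothetical stand-in so that the sketch is
self-contained; a real plugin would use the definition from the
libextractor header, whose field names may differ. The "MYFMT" magic
bytes, the keyword text, and the helper prepend_keyword are likewise
invented for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for libextractor's keyword list type. A real
   plugin would take this definition from the libextractor header; the
   field names here are assumptions. */
typedef struct KeywordList {
  char *keyword;
  int keywordType;
  struct KeywordList *next;
} KeywordList;

/* Illustrative helper (not part of libextractor): prepend a malloc'ed
   copy of 'keyword' to 'prev' and return the new head. On allocation
   failure the list is returned unchanged. */
static KeywordList *prepend_keyword(const char *keyword, int type,
                                    KeywordList *prev)
{
  size_t len = strlen(keyword) + 1;
  KeywordList *entry = malloc(sizeof *entry);

  if (entry == NULL)
    return prev;
  entry->keyword = malloc(len);
  if (entry->keyword == NULL) {
    free(entry);
    return prev;
  }
  memcpy(entry->keyword, keyword, len);
  entry->keywordType = type;
  entry->next = prev;
  return entry;
}

/* Entry point for a library file called 'myextractor.so'. */
KeywordList *myextractor_extract(const char *filename,
                                 const char *data,
                                 size_t size,
                                 KeywordList *prev,
                                 const char *options)
{
  (void) filename; /* may be NULL, unused in this sketch */
  (void) options;

  /* Made-up format check: only handle data starting with "MYFMT". */
  if (data == NULL || size < 5 || memcmp(data, "MYFMT", 5) != 0)
    return prev;

  return prepend_keyword("application/x-myformat", 0, prev);
}
```

When the data is not recognized, the sketch returns 'prev' unchanged,
so the keywords found by other extractors are passed along untouched.
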
The library must allocate (malloc) both the entry in the keyword list
and the memory for the keyword, since both will be freed by
libextractor once the application calls freeKeywords. An example
implementation can be found in mp3extractor.c.


Notes
=====

On Mac OS X, libextractor will avoid using GCC 3.1 because of problems
compiling one of the extractors. GCC 3.3 and 2.95.2 are known to work
well; as such, libextractor will first look for 3.3 (by attempting to
run gcc-3.3, cpp-3.3, and g++-3.3) and then for 2.95.2 (by attempting
to run gcc2 and g++2).

exiv2 requires G++ 3.0 or higher. With older GCC versions (and other
broken C++ compilers), you have to manually disable exiv2 by passing
"--disable-exiv2" to "configure" in order to avoid compilation
problems.

If libextractor fails to find the plugins, a possible method of last
resort is to set the environment variable LIBEXTRACTOR_PREFIX to the
parent of the directory where the plugins are installed (for example,
if the plugins are in "/foo/bar/lib/libextractor/*.so", set the
variable to "/foo/bar/lib"). This should not be needed in any of the
following cases:

* "extract" is in "/foo/bar/bin/extract" and "/foo/bar/bin" is in the
  PATH
* you are running Linux and "libextractor.so" is in
  "/foo/bar/lib/libextractor.so"
* you are running Linux and the binary using libextractor resides in
  "/foo/bar/bin"
* you are running Windows and "GetModuleFileName" returns
  "/foo/bar/bin"

If none of these common circumstances apply, you may have to set the
environment variable.
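
If the variable has to be set from within an application rather than
from the shell, the rule above (use the parent of the plugin directory)
can be sketched as follows. The helper set_prefix_from_plugin_dir is
hypothetical and not part of libextractor. The sketch assumes a POSIX
system with setenv, a plugin directory path without a trailing slash,
and that the helper is called before libextractor looks for its
plugins.

```c
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of libextractor): given the directory
   that contains the plugins, for example "/foo/bar/lib/libextractor",
   set LIBEXTRACTOR_PREFIX to its parent ("/foo/bar/lib").
   Returns 0 on success and -1 on failure. */
int set_prefix_from_plugin_dir(const char *plugin_dir)
{
  const char *slash = strrchr(plugin_dir, '/');
  size_t len;
  char *parent;
  int rc;

  /* Reject paths with no parent component. */
  if (slash == NULL || slash == plugin_dir)
    return -1;

  len = (size_t) (slash - plugin_dir);
  parent = malloc(len + 1);
  if (parent == NULL)
    return -1;
  memcpy(parent, plugin_dir, len);
  parent[len] = '\0';

  /* setenv copies its arguments, so 'parent' can be released. */
  rc = setenv("LIBEXTRACTOR_PREFIX", parent, 1);
  free(parent);
  return rc;
}
```

Setting the variable in the shell before starting the program achieves
the same thing and needs no code changes.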