\input texinfo                  @c -*- Texinfo -*-
@c % The structure of this document is based on the
@c % Texinfo manual from libgcrypt by Werner Koch
@c % and Moritz Schulte.
@c %**start of header
@setfilename extractor.info
@include version.texi
@settitle The GNU libextractor Reference Manual
@c Unify some of the indices.
@c %**end of header
@copying
This manual is for GNU libextractor
(version @value{VERSION}, @value{UPDATED}).  GNU libextractor is a GNU package.

Copyright @copyright{} 2007, 2010 Christian Grothoff

@quotation
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3 or
any later version published by the Free Software Foundation; with no
Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.  A
copy of the license is included in the section entitled "GNU Free
Documentation License".
@end quotation
@end copying

@dircategory GNU Libraries
@direntry
* libextractor: (extractor).    Meta data extraction library.
@end direntry

@c
@c Titlepage
@c
@setchapternewpage odd
@titlepage
@title The GNU libextractor Reference Manual
@subtitle Version @value{VERSION}
@subtitle @value{UPDATED}
@author Christian Grothoff (@email{christian@@grothoff.org})

@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage

@summarycontents
@contents
@page

@macro gnu{}
@acronym{GNU}
@end macro

@macro gpl{}
@acronym{GPL}
@end macro

@macro api{}
@acronym{API}
@end macro

@macro cfunction{arg}
@code{\arg\()}
@end macro

@macro mynull{}
@code{NULL}
@end macro

@macro gnule{}
@acronym{GNU libextractor}
@end macro

@ifnottex
@node Top
@top The GNU libextractor Reference Manual
@insertcopying
@end ifnottex

@menu
* Introduction::                 What is @gnule{}.
* Preparation::                  What you should do before using the library.
* Generalities::                 General library functions and data types.
* Extracting meta data::         How to use @gnule{} to obtain meta data.
* Language bindings::            How to use @gnule{} from languages other than C.
* Utility functions::            Utility functions of @gnule{}.
* Existing Plugins::             What plugins are available.
* Writing new Plugins::          How to write new plugins for @gnule{}.
* Internal utility functions::   Utility functions of @gnule{} for writing plugins.
* Reporting bugs::               How to report bugs or request new features.

Appendices

* Copying::                      The GNU General Public License says how you
                                 can copy and share some parts of @gnule{}.

Indices

* Concept Index::                Index of concepts and programs.
* Function and Data Index::      Index of functions, variables and data types.
* Type Index::                   Index of data types.
@end menu

@c **********************************************************
@c *******************  Introduction  ***********************
@c **********************************************************
@node Introduction
@chapter Introduction

@cindex error handling
@gnule{} is GNU's library for extracting meta data from files.  Meta
data includes format information (such as mime type, image dimensions,
color depth, recording frequency), content descriptions (such as
document title or document description) and copyright information
(such as license, author and contributors).  Meta data extraction is
an inherently uncertain business: a parse error can be a corrupt
file, an incompatibility in the file format version, an entirely
different file format or a bug in the parser.  As a result of this
uncertainty, @gnule{} deliberately avoids ever reporting errors.
Unexpected file contents simply result in less or possibly no meta
data being extracted.

@cindex plugin
@gnule{} uses plugins to handle various file formats.  Technically a
plugin can support multiple file formats; however, most plugins only
support one particular format.  By default, @gnule{} will use all
plugins that are available and found in the plugin installation
directory.  Applications can request the use of only specific plugins
or the exclusion of certain plugins.

@gnule{} is distributed with the @command{extract}
command@footnote{Some distributions ship @command{extract} in a
separate package.} which is a command-line tool for extracting meta
data.  @command{extract} is given a list of filenames and prints the
resulting meta data to the console.  The @command{extract} source code
also serves as an advanced example for how to use @gnule{}.

This manual focuses on providing documentation for writing software
with @gnule{}.  The only relevant parts for end-users are the chapter
on compiling and installing @gnule{} (@pxref{Preparation}).  Also, the
chapter on existing plugins may be of interest (@pxref{Existing Plugins}).
Additional documentation for end-users can be found in the man page
on @command{extract} (using @verb{|man extract|}).

@cindex license
@gnule{} is licensed under the GNU General Public License.  The
developers have frequently received requests to license GNU
libextractor under alternative terms.  However, @gnule{} borrows
plenty of GPL-licensed code from various other projects.  Hence we
cannot change the license (even if we wanted to).@footnote{It may be
possible to switch to GPLv3 in the future.  For this, an audit of the
license status of our dependencies would be required.  The new code
that was developed specifically for @gnule{} has always been licensed
under GPLv2 @emph{or any later version}.}

@node Preparation
@chapter Preparation

Compiling @gnule{} follows the standard GNU autotools build process
using @command{configure} and @command{make}.  For details, read the
@file{INSTALL} file and query @verb{|./configure --help|} for
additional options.

@gnule{} has various dependencies, some of which are optional.
Instead of specifying the names of the software packages, we will give
the list in terms of the names of the respective Debian (unstable)
packages that should be installed.

You absolutely need:

@itemize @bullet
@item
libtool
@item
gcc
@item
make
@item
g++
@item
libltdl7-dev
@item
zlib1g-dev
@item
libbz2-dev
@end itemize

Recommended dependencies are:
@itemize @bullet
@item
libgtk2.0-dev
@item
libvorbis-dev
@item
libflac-dev
@item
libgsf-1-dev
@item
libmpeg2-4-dev
@item
libqt4-dev
@item
librpm-dev
@item
libpoppler-dev
@item
libexiv2-dev
@end itemize

Optional dependencies (to make use of these, you additionally need to
specify the configure option @code{--enable-ffmpeg}) are:
@itemize @bullet
@item
libavformat-dev
@item
libswscale-dev
@end itemize

For Subversion access and compilation one also needs:
@itemize @bullet
@item
subversion
@item
autoconf
@item
automake
@end itemize

Please notify us if we missed some dependencies (note that the list is
supposed to only list direct dependencies, not transitive
dependencies).

Once you have compiled and installed @gnule{}, you should have a file
@file{extractor.h} installed in your @file{include/} directory.  This
file should be the starting point for your C and C++ development with
@gnule{}.  The build process also installs the @file{extract} binary
and man pages for @file{extract} and @gnule{}.  The @file{extract} man
page documents the @file{extract} tool.  The @gnule{} man page gives a
brief summary of the C API for @gnule{}.

@cindex packaging
@cindex directory structure
@cindex plugin
@cindex environment variables
@vindex LIBEXTRACTOR_PREFIX
When you install @gnule{}, various plugins will be installed in the
@file{lib/libextractor/} directory.  The main library will be
installed as @file{lib/libextractor.so}.  Note that @gnule{} will
attempt to find the plugins relative to the path of the main library.
Consequently, a package manager can move the library and its plugins
to a different location later, as long as the relative path between
the main library and the plugins is preserved.  As a method of last
resort, the user can specify an environment variable
@verb{|LIBEXTRACTOR_PREFIX|}.
If @gnule{} cannot locate a plugin, it will look in
@verb{|LIBEXTRACTOR_PREFIX/lib/libextractor/|}.

@section Note to package maintainers

The suggested way to package GNU libextractor is to split it into
roughly the following binary packages:@footnote{Debian policy
furthermore requires a @file{-dev} (meta) package that would depend on
all of the above packages.}

@itemize @bullet
@item
libextractor (main library only, only hard dependency for other packages depending on GNU libextractor)
@item
extract (command-line tool and man page)
@item
libextractor-dev (extractor.h header and man page)
@item
libextractor-doc (this manual)
@item
libextractor-plugins (plugins without external dependencies; recommended but not required by extract and libextractor package)
@item
libextractor-plugin-XXX (plugin with dependency on libXXX, for example for XXX=mpeg this would be @file{libextractor_mpeg.so})
@item
libextractor-plugins-all (meta package that requires all plugins)
@end itemize

This would enable minimal installations (e.g.@: for embedded systems)
to not include any plugins, as well as moderate-size installations
(that do not trigger GTK, Qt and X11) for systems that have limited
resources.

@node Generalities
@chapter Generalities

Each public symbol exported by @gnule{} has the prefix
@verb{|EXTRACTOR_|}.  All-caps names are used for constants.  For the
impatient, the minimal C code for using @gnule{} (on the file named by
the first command-line argument) looks like this:

@verbatim
#include <stdio.h>
#include <extractor.h>

int
main (int argc, char ** argv)
{
  struct EXTRACTOR_PluginList *plugins;

  if (argc < 2)
    return 1; /* no file name given */
  /* load all available plugins */
  plugins = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  /* print every meta data item found in the file to stdout */
  EXTRACTOR_extract (plugins, argv[1],
                     NULL, 0,
                     &EXTRACTOR_meta_data_print, stdout);
  EXTRACTOR_plugin_remove_all (plugins);
  return 0;
}
@end verbatim

@node Extracting meta data
@chapter Extracting meta data

In order to extract meta data with @gnule{} you first need to
load the respective plugins and then call the extraction API
with the plugins and the data to process.
This section documents how to load and unload plugins, the various
types and formats in which meta data is returned to the application
and finally the extraction API itself.

@menu
* Plugin management::  How to load and unload plugins
* Meta types::         About meta types
* Meta formats::       About meta formats
* Extracting::         How to use the extraction API
@end menu

@node Plugin management
@section Plugin management

@cindex reentrant
@cindex concurrency
@cindex threads
@cindex thread-safety
@tindex enum EXTRACTOR_Options

All of the functions for loading and unloading plugins, including
@verb{|EXTRACTOR_plugin_add_defaults|} and
@verb{|EXTRACTOR_plugin_remove_all|}, are thread-safe and reentrant.
However, using the same plugin list from multiple threads at the same
time is not safe.

Creating multiple plugin lists and using them concurrently is
supported as long as the @code{EXTRACTOR_OPTION_IN_PROCESS} option is
not used.

Generally, @gnule{} is fully thread-safe and mostly reentrant.  All
plugin code is required to be reentrant and state-less, but due to the
extensive use of 3rd party libraries this cannot be guaranteed.  Hence
plugins are executed (by default) out of process.  This also ensures
that plugins that crash do not cause the main application to fail as
well.

Plugins can be executed in-process by giving the option
@code{EXTRACTOR_OPTION_IN_PROCESS} when loading the plugin.  This
option is only recommended when debugging plugins and not for
production use.  Due to the use of shared-memory IPC the
out-of-process execution of plugins should not be a concern for
performance.

@deftp {C Struct} EXTRACTOR_PluginList
@tindex struct EXTRACTOR_PluginList

A plugin list represents a set of GNU libextractor plugins.  Most of
the GNU libextractor API is concerned with either constructing a
plugin list or using it to extract meta data.  The internal
representation of the plugin list is of no concern to users or plugin
developers.
@end deftp

@deftypefun void EXTRACTOR_plugin_remove_all (struct EXTRACTOR_PluginList *plugins)
@findex EXTRACTOR_plugin_remove_all

Unload all of the plugins in the given list.
@end deftypefun

@deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_remove (struct EXTRACTOR_PluginList *plugins, const char*name)
@findex EXTRACTOR_plugin_remove

Unloads a particular plugin.  The given name should be the short name
of the plugin, for example ``mime'' for the mime-type extractor or
``mpeg'' for the MPEG extractor.
@end deftypefun

@deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add (struct EXTRACTOR_PluginList *plugins, const char* name,const char* options, enum EXTRACTOR_Options flags)
@findex EXTRACTOR_plugin_add

Loads a particular plugin.  The plugin is added to the existing list,
which can be NULL.  The second argument specifies the name of the
plugin (e.g.@: ``ogg'').  The third argument can be NULL and specifies
plugin-specific options.  Finally, the last argument specifies if the
plugin should be executed out-of-process
(@code{EXTRACTOR_OPTION_DEFAULT_POLICY}) or not.
@end deftypefun

@deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add_config (struct EXTRACTOR_PluginList *plugins, const char* config, enum EXTRACTOR_Options flags)
@findex EXTRACTOR_plugin_add_config

Loads and unloads plugins based on a configuration string, modifying
the existing list, which can be NULL.  The string has the format
``[-]NAME(OPTIONS)@{:[-]NAME(OPTIONS)@}*''.  Prefixing the plugin name
with a ``-'' means that the plugin should be unloaded.
@end deftypefun

@deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add_defaults (enum EXTRACTOR_Options flags)
@findex EXTRACTOR_plugin_add_defaults

Loads all of the plugins in the plugin directory.  This function is
what most @gnule{} applications should use to set up the plugins.
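
Applications that start from the default set and then adjust it pass a
configuration string to @code{EXTRACTOR_plugin_add_config}.  To
illustrate the string format documented for that function, the
following sketch splits such a string into its entries.  It is only an
illustration and not the parser used by the library: the names
@code{struct plugin_entry} and @code{parse_plugin_config}, the fixed
buffer sizes, and the choice to treat the ``(OPTIONS)'' part as
optional are assumptions made for this example.

@verbatim
#include <assert.h>
#include <string.h>

/* Hypothetical record for one entry of a configuration string. */
struct plugin_entry
{
  int remove;        /* 1 if the name was prefixed with '-' */
  char name[32];     /* short plugin name, for example "ogg" */
  char options[64];  /* text between '(' and ')', or "" */
};

/* Split CONFIG into at most MAX entries.  Returns the number of
   entries stored in OUT, or -1 if the string is malformed or an
   entry does not fit.  (Assumption: "(OPTIONS)" may be omitted.) */
int
parse_plugin_config (const char *config,
                     struct plugin_entry *out,
                     int max)
{
  const char *p = config;
  int n = 0;

  assert (NULL != config);
  assert (NULL != out);
  while ('\0' != *p)
    {
      struct plugin_entry *e;
      size_t len;

      if (n == max)
        return -1;
      e = &out[n];
      e->remove = 0;
      e->options[0] = '\0';
      if ('-' == *p)
        {
          e->remove = 1;  /* entry asks for the plugin to be unloaded */
          p++;
        }
      len = strcspn (p, "(:");  /* the name ends at '(', ':' or the end */
      if ( (0 == len) || (len >= sizeof (e->name)) )
        return -1;
      memcpy (e->name, p, len);
      e->name[len] = '\0';
      p += len;
      if ('(' == *p)
        {
          const char *end = strchr (p, ')');

          if (NULL == end)
            return -1;
          len = (size_t) (end - p - 1);
          if (len >= sizeof (e->options))
            return -1;
          memcpy (e->options, p + 1, len);
          e->options[len] = '\0';
          p = end + 1;
        }
      n++;
      if (':' == *p)
        {
          p++;
          if ('\0' == *p)
            return -1;  /* a trailing ':' has no entry after it */
        }
      else if ('\0' != *p)
        return -1;  /* unexpected text after an entry */
    }
  return n;
}
@end verbatim

With this sketch, the string ``mime:-ogg(foo)'' yields two entries:
``mime'' without options, and ``ogg'' with the options ``foo'' and the
removal flag set.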
@end deftypefun

@node Meta types
@section Meta types

@tindex enum EXTRACTOR_MetaType
@findex EXTRACTOR_metatype_get_max

@verb{|enum EXTRACTOR_MetaType|} is a C enum which defines a list of
over 100 different types of meta data.  The total number can differ
between different @gnule{} releases; the maximum value for the current
release can be obtained using the
@verb{|EXTRACTOR_metatype_get_max|} function.  All values in this
enumeration are of the form @verb{|EXTRACTOR_METATYPE_XXX|}.

@deftypefun {const char *} EXTRACTOR_metatype_to_string (enum EXTRACTOR_MetaType type)
@findex EXTRACTOR_metatype_to_string
@cindex gettext
@cindex internationalization

The function @verb{|EXTRACTOR_metatype_to_string|} can be used to
obtain a short English string @samp{s} describing the meta data type.
The string can be translated into other languages using GNU gettext
with the domain set to @gnule{}
(@verb{|dgettext("libextractor", s)|}).
@end deftypefun

@deftypefun {const char *} EXTRACTOR_metatype_to_description (enum EXTRACTOR_MetaType type)
@findex EXTRACTOR_metatype_to_description
@cindex gettext
@cindex internationalization

The function @verb{|EXTRACTOR_metatype_to_description|} can be used to
obtain a longer English string @samp{s} describing the meta data type.
The description may be empty if the short description returned by
@code{EXTRACTOR_metatype_to_string} is already comprehensive.  The
string can be translated into other languages using GNU gettext with
the domain set to @gnule{} (@verb{|dgettext("libextractor", s)|}).
@end deftypefun

@node Meta formats
@section Meta formats

@tindex enum EXTRACTOR_MetaFormat

@verb{|enum EXTRACTOR_MetaFormat|} is a C enum which defines on a high
level how the extracted meta data is represented.  Currently, the
library uses three formats: UTF-8 strings, C strings and binary data.
A fourth value, @code{EXTRACTOR_METAFORMAT_UNKNOWN}, is defined but
not used.  UTF-8 strings are 0-terminated strings that have been
converted to UTF-8.
The format code is @code{EXTRACTOR_METAFORMAT_UTF8}.  Ideally, most
text meta data will be of this format.  Some file formats fail to
specify the encoding used for the text.  In this case, the text cannot
be converted to UTF-8.  However, the meta data is still known to be
0-terminated and presumably human-readable.  In this case, the format
code used is @code{EXTRACTOR_METAFORMAT_C_STRING}; however, this
should not be understood to mean that the encoding is the same as that
used by the C compiler.  Finally, for binary data (mostly images), the
format @code{EXTRACTOR_METAFORMAT_BINARY} is used.

Naturally this is not a precise description of the meta format.
Plugins can provide a more precise description (if known) by providing
the respective mime type of the meta data.  For example, binary image
meta data could be also tagged as ``image/png'' and normal text would
typically be tagged as ``text/plain''.

@node Extracting
@section Extracting

@deftypefn {Function Pointer} int (*EXTRACTOR_MetaDataProcessor)(void *cls, const char *plugin_name, enum EXTRACTOR_MetaType type, enum EXTRACTOR_MetaFormat format, const char *data_mime_type, const char *data, size_t data_len)
@tindex EXTRACTOR_MetaDataProcessor

Type of a function that libextractor calls for each meta data item
found.

@table @var
@item cls
closure (user-defined)

@item plugin_name
name of the plugin that produced this value; special values can be
used (e.g.@: '' for zlib being used in the main libextractor library
and yielding meta data);

@item type
libextractor-type describing the meta data;

@item format
basic format information about data

@item data_mime_type
mime-type of data (not of the original file); can be NULL (if
mime-type is not known);

@item data
actual meta-data found

@item data_len
number of bytes in data
@end table

Return 0 to continue extracting, 1 to abort.
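
As an illustration of these parameters, the following sketch shows a
processor that prints text items, summarizes binary items and aborts
the extraction after a fixed number of items.  The names
@code{struct item_limit} and @code{print_limited} are hypothetical,
as is the @code{HAVE_EXTRACTOR_H} macro.  When that macro is not
defined, the example declares stand-in enums so that it compiles
without the library; their numeric values are placeholders and are not
taken from @file{extractor.h}.

@verbatim
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#ifdef HAVE_EXTRACTOR_H /* hypothetical macro: use the real header */
#include <extractor.h>
#else
/* Stand-in declarations for this example only; the values are
   placeholders, not those of the installed header. */
enum EXTRACTOR_MetaType { EXAMPLE_METATYPE_PLACEHOLDER = 0 };
enum EXTRACTOR_MetaFormat
{
  EXTRACTOR_METAFORMAT_UNKNOWN,
  EXTRACTOR_METAFORMAT_UTF8,
  EXTRACTOR_METAFORMAT_BINARY,
  EXTRACTOR_METAFORMAT_C_STRING
};
#endif

/* Hypothetical closure: stop after 'max' items have been seen. */
struct item_limit
{
  unsigned int seen;
  unsigned int max;
};

/* Matches the EXTRACTOR_MetaDataProcessor signature. */
int
print_limited (void *cls,
               const char *plugin_name,
               enum EXTRACTOR_MetaType type,
               enum EXTRACTOR_MetaFormat format,
               const char *data_mime_type,
               const char *data,
               size_t data_len)
{
  struct item_limit *limit = cls;

  assert (NULL != limit);
  switch (format)
    {
    case EXTRACTOR_METAFORMAT_UTF8:
    case EXTRACTOR_METAFORMAT_C_STRING:
      /* both text formats are 0-terminated */
      printf ("%s (type %d): %s\n", plugin_name, (int) type, data);
      break;
    case EXTRACTOR_METAFORMAT_BINARY:
      /* binary data is not printed, only described */
      printf ("%s (type %d): %lu bytes of %s\n",
              plugin_name, (int) type,
              (unsigned long) data_len,
              (NULL != data_mime_type) ? data_mime_type : "unknown type");
      break;
    default:
      break;
    }
  limit->seen++;
  /* 0 continues the extraction, 1 aborts it */
  return (limit->seen >= limit->max) ? 1 : 0;
}
@end verbatim

Such a function would be passed as the @samp{proc} argument of
@code{EXTRACTOR_extract}, with a pointer to a
@code{struct item_limit} as @samp{proc_cls}.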
@end deftypefn

@deftypefun void EXTRACTOR_extract(struct EXTRACTOR_PluginList *plugins, const char *filename, const void *data, size_t size, EXTRACTOR_MetaDataProcessor proc, void *proc_cls)
@findex EXTRACTOR_extract
@cindex reentrant
@cindex concurrency
@cindex threads
@cindex thread-safety

This is the main function for extracting keywords with @gnule{}.  The
first argument is a plugin list which specifies the set of plugins
that should be used for extracting meta data.  The @samp{filename}
argument is optional and can be used to specify the name of a file to
process.  If @samp{filename} is NULL, then the @samp{data} argument
must point to the in-memory data to extract meta data from.  If
@samp{filename} is non-NULL, @samp{data} can be NULL.  If @samp{data}
is non-NULL, then @samp{size} is the size of @samp{data} in bytes.
Otherwise @samp{size} should be zero.  For each meta data item found,
GNU libextractor will call the @samp{proc} function, passing
@samp{proc_cls} as the first argument to @samp{proc}.  The other
arguments to @samp{proc} depend on the specific meta data found.

@cindex SIGBUS
@cindex bus error
Meta data extraction should never really fail; at worst, @gnule{}
should not call @samp{proc} with any meta data.  By design, @gnule{}
should never crash or leak memory, even given corrupt files as input.
Note, however, that running @gnule{} on a corrupt file system (or
incorrectly @verb{|mmap|}ed files) can result in the operating system
sending a SIGBUS (bus error) to the process.  While @gnule{} runs
plugins out-of-process, it first maps the file into memory and then
attempts to decompress it.  During decompression it is possible to
encounter a SIGBUS.  @gnule{} will @emph{not} attempt to catch this
signal and your application is likely to crash.  Note again that this
should only happen if the file @emph{system} is corrupt (not if
individual files are corrupt).
If this is not acceptable, you might want to consider running @gnule{}
itself also out-of-process (as done, for example, by
@url{http://grothoff.org/christian/doodle/,doodle}).
@end deftypefun

@node Language bindings
@chapter Language bindings

@cindex Java
@cindex Mono
@cindex Perl
@cindex Python
@cindex PHP
@cindex Ruby

@gnule{} works immediately with C and C++ code.  Bindings for Java,
Mono, Ruby, Perl, PHP and Python are available for download from the
main @gnule{} website.  Documentation for these bindings (if
available) is part of the downloads for the respective binding.  In
all cases, a full installation of the C library is required before the
binding can be installed.

@section Java

Compiling the GNU libextractor Java binding follows the usual process
of running @command{configure} and @command{make}.  The result will be
a shared C library @file{libextractor_java.so} with the native code
and a JAR file (installed to
@file{$PREFIX/share/java/libextractor.java}).

A minimal example for using GNU libextractor's Java binding would look
like this:
@verbatim
import org.gnu.libextractor.*;
import java.util.ArrayList;

public static void main(String[] args) {
  Extractor ex = Extractor.getDefault();
  for (int i=0;i