      1 \input texinfo                  @c -*- Texinfo -*-
      2 @c % The structure of this document is based on the
      3 @c % Texinfo manual from libgcrypt by Werner Koch
      4 @c % and Moritz Schulte.
      5 @c %**start of header
      6 @setfilename libextractor.info
      7 @include version.texi
      8 @settitle The GNU libextractor Reference Manual
      9 @c Unify all the indices into concept index.
     10 @syncodeindex fn cp
     11 @syncodeindex vr cp
     12 @syncodeindex ky cp
     13 @syncodeindex pg cp
     14 @syncodeindex tp cp
     15 @c %**end of header
     16 @copying
     17 This manual is for GNU libextractor
     18 (version @value{VERSION}, @value{UPDATED}), a library for metadata
     19 extraction.
     20 
     21 Copyright @copyright{} 2007, 2010, 2012 Christian Grothoff
     22 
     23 @quotation
     24 Permission is granted to copy, distribute and/or modify this document
     25 under the terms of the GNU Free Documentation License, Version 1.3
     26 or any later version published by the Free Software Foundation;
     27 with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
     28 Texts.  A copy of the license is included in the section entitled ``GNU
     29 Free Documentation License''.
     30 @end quotation
     31 @end copying
     32 
     33 @dircategory Software libraries
     34 @direntry
     35 * Libextractor: (libextractor).    Metadata extraction library.
     36 @end direntry
     37 
     38 
     39 
     40 @c
     41 @c Titlepage
     42 @c
     43 @titlepage
     44 @title The GNU libextractor Reference Manual
     45 @subtitle Version @value{VERSION}
     46 @subtitle @value{UPDATED}
     47 @author Christian Grothoff (@email{christian@@grothoff.org})
     48 
     49 @page
     50 @vskip 0pt plus 1filll
     51 @insertcopying
     52 @end titlepage
     53 
     54 @summarycontents
     55 @contents
     56 
     57 
     58 @ifnottex
     59 @node Top
     60 @top The GNU libextractor Reference Manual
     61 @insertcopying
     62 @end ifnottex
     63 
     64 @menu
     65 * Introduction::                 What is GNU libextractor.
     66 * Preparation::                  What you should do before using the library.
     67 * Generalities::                 General library functions and data types.
     68 * Extracting meta data::         How to use GNU libextractor to obtain meta data.
     69 * Language bindings::            How to use GNU libextractor from languages other than C.
     70 * Utility functions::            Utility functions of GNU libextractor.
     71 * Existing Plugins::             What plugins are available.
     72 * Writing new Plugins::          How to write new plugins for GNU libextractor.
     73 * Internal utility functions::   Utility functions of GNU libextractor for writing plugins.
     74 * Reporting bugs::               How to report bugs or request new features.
     75 
     76 Appendices
     77 
     78 * GNU Free Documentation License::  Copying this manual.
     79 
     80 Indices
     81 
     82 * Index::                       Index
     83 @c * Function and Data Index::     Index of functions, variables and data types.
     84 @c * Type Index::                  Index of data types.
     85 
     86 @end menu
     87 
     88 
     89 
     90 @c **********************************************************
     91 @c *******************  Introduction  ***********************
     92 @c **********************************************************
     93 @node Introduction
     94 @chapter Introduction
     95 
     96 @cindex error handling
     97 GNU libextractor is GNU's library for extracting meta data from
     98 files.  Meta data includes format information (such as mime type,
     99 image dimensions, color depth, recording frequency), content
    100 descriptions (such as document title or document description) and
    101 copyright information (such as license, author and contributors).
    102 Meta data extraction is an inherently uncertain business --- a parse
    103 error can be a corrupt file, an incompatibility in the file format
    104 version, an entirely different file format or a bug in the parser.  As
    105 a result of this uncertainty, GNU libextractor deliberately
    106 avoids ever reporting errors.  Unexpected file contents simply
    107 result in less or possibly no meta data being extracted.  
    108 
    109 @cindex plugin
    110 GNU libextractor uses plugins to handle various file formats.
    111 Technically a plugin can support multiple file formats; however, most
    112 plugins only support one particular format.  By default,
    113 GNU libextractor will use all plugins that are available and found
    114 in the plugin installation directory.  Applications can
    115 request the use of only specific plugins or the exclusion of
    116 certain plugins.
    117 
    118 GNU libextractor is distributed with the @command{extract}
    119 command@footnote{Some distributions ship @command{extract} in a
    120 separate package.} which is a command-line tool for extracting
    121 meta data.  @command{extract} is given a list of filenames and
    122 prints the resulting meta data to the console.  The @command{extract}
    123 source code also serves as an advanced example of how to use
    124 GNU libextractor.
    125 
    126 This manual focuses on providing documentation for writing software
    127 with GNU libextractor.  The only parts relevant for end-users
    128 are the chapter on compiling and installing GNU libextractor
    129 (@pxref{Preparation}) and possibly the chapter on existing plugins
    130 (@pxref{Existing Plugins}).  Additional documentation for
    131 end-users can be found in the man page for @command{extract} (using
    132 @verb{|man extract|}).
    133 
    134 @cindex license
    135 GNU libextractor is licensed under the GNU General Public License.
    136 Since version 0.7, GNU libextractor is licensed under GPLv3
    137 @emph{or any later version}.
    138 
    139 @node Preparation
    140 @chapter Preparation
    141 
    142 This chapter first describes the general build instructions that
    143 should apply to all systems.  Specific instructions for known problems
    144 for particular platforms are then described in individual sections
    145 afterwards.
    146 
    147 Compiling GNU libextractor follows the standard GNU autotools build process
    148 using @command{configure} and @command{make}.  For details on the GNU
    149 autotools build process, read the @file{INSTALL} file and query
    150 @verb{|./configure --help|} for additional options.  
    151 
    152 GNU libextractor has various dependencies, most of which are optional. 
    153 Instead of specifying the names of the software packages, we
    154 will give the list in terms of the names of the respective
    155 Debian (wheezy) packages that should be installed.
    156 
    157 You absolutely need:
    158 
    159 @itemize @bullet
    160 @item
    161 libtool
    162 @item
    163 gcc
    164 @item
    165 make
    166 @item
    167 g++ 
    168 @item
    169 libltdl7-dev
    170 @end itemize
    171 
    172 Recommended dependencies are:
    173 @itemize @bullet
    174 @item
    175 zlib1g-dev
    176 @item
    177 libbz2-dev
    178 @item
    179 libgif-dev
    180 @item
    181 libvorbis-dev
    182 @item
    183 libflac-dev
    184 @item
    185 libmpeg2-4-dev
    186 @item
    187 librpm-dev
    188 @item
    189 libgtk2.0-dev or libgtk-3-dev
    190 @item
    191 libgsf-1-dev
    192 @item
    193 libqt4-dev
    194 @item
    195 libpoppler-dev
    196 @item
    197 libexiv2-dev
    198 @item
    199 libavformat-dev
    200 @item
    201 libswscale-dev
    202 @item
    203 libgstreamer1.0-dev
    204 @end itemize
    205 
    206 For Subversion access and compilation one also needs:
    207 @itemize @bullet
    208 @item
    209 subversion
    210 @item
    211 autoconf
    212 @item
    213 automake
    214 @end itemize
    215 
    216 Please notify us if we missed some dependencies (note that the list is
    217 supposed to only list direct dependencies, not transitive
    218 dependencies).
    219 
    220 Once you have compiled and installed GNU libextractor, you should have a file
    221 @file{extractor.h} installed in your @file{include/} directory.  This
    222 file should be the starting point for your C and C++ development with
    223 GNU libextractor.  The build process also installs the @file{extract} binary and
    224 man pages for @file{extract} and GNU libextractor.  The @file{extract} man page
    225 documents the @file{extract} tool.  The GNU libextractor man page gives a brief
    226 summary of the C API for GNU libextractor.
    227 
    228 @cindex packaging
    229 @cindex directory structure
    230 @cindex plugin
    231 @cindex environment variables
    232 @vindex LIBEXTRACTOR_PREFIX
    233 When you install GNU libextractor, various plugins will be
    234 installed in the @file{lib/libextractor/} directory.  The main library
    235 will be installed as @file{lib/libextractor.so}.  Note that
    236 GNU libextractor will attempt to find the plugins relative to the
    237 path of the main library.  Consequently, a package manager can move
    238 the library and its plugins to a different location later --- as long
    239 as the relative path between the main library and the plugins is
    240 preserved.  As a method of last resort, the user can specify an
    241 environment variable @verb{|LIBEXTRACTOR_PREFIX|}.  If
    242 GNU libextractor cannot locate a plugin, it will look in
    243 @verb{|LIBEXTRACTOR_PREFIX/lib/libextractor/|}.
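
The last-resort lookup described above can be pictured with the
following sketch.  It is not the library's actual lookup code: the
helper name @code{fallback_plugin_dir} is hypothetical, and only the
environment variable name and the @file{lib/libextractor/} suffix are
taken from the text.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of the libextractor API): build the
   last-resort plugin directory "$LIBEXTRACTOR_PREFIX/lib/libextractor/".
   Returns 0 on success, -1 if the variable is unset or empty, or if the
   result does not fit into buf. */
int
fallback_plugin_dir (char *buf, size_t len)
{
  const char *prefix = getenv ("LIBEXTRACTOR_PREFIX");
  int n;

  if ((NULL == prefix) || ('\0' == prefix[0]))
    return -1;
  n = snprintf (buf, len, "%s/lib/libextractor/", prefix);
  if ((n < 0) || ((size_t) n >= len))
    return -1; /* formatting error or truncated result */
  return 0;
}
```

A package manager that relocates the library would normally not need
this variable, since the relative path between the main library and the
plugins is tried first.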
    244 
    245 
    246 @section Installation on GNU/Linux
    247 
    248 Installation should work using the standard instructions without problems.
    249 
    250 
    251 @section Installation on FreeBSD
    252 
    253 Installation should work using the standard instructions without problems.
    254 
    255 
    256 @section Installation on OpenBSD
    257 
    258 OpenBSD 3.8 does not define @code{CODESET} in @file{langinfo.h}.
    259 @code{CODESET} is used in about three places in GNU libextractor, which
    260 causes problems during compilation.
    261 
    262 
    263 @section Installation on NetBSD
    264 
    265 No reports so far.
    266 
    267 
    268 @section Installation using MinGW
    269 
    270 Linking @code{-lstdc++} with the provided libtool fails on Cygwin.  This
    271 is a problem with libtool: there is unfortunately no flag to tell
    272 libtool how to do its job on Cygwin, and it seems that the library
    273 check cannot be set to @samp{pass_all} by default.  Patching libtool may
    274 help.
    275 
    276 Note: this is a rather dated report and may no longer apply.
    277 
    278 
    279 @section Installation on OS X
    280 
    281 libextractor has two installation methods on Mac OS X: it can be
    282 installed as a Mac OS X framework or with the standard
    283 @command{./configure; make; make install} shell commands. The
    284 framework package is self-contained, but currently omits some of the
    285 extractor plugins that can be compiled in if libextractor is installed
    286 with @command{./configure; make; make install} (provided that the
    287 required dependencies exist.)
    288 
    289 @subsection Installing and uninstalling the framework
    290 
    291 The binary framework is distributed as a disk image (@file{Extractor-x.x.xx.dmg}).
    292 Installation is done by opening the disk image and clicking @file{Extractor.pkg}
    293 inside it. The Mac OS X installer application will then run. The framework
    294 is installed to the root volume's @file{/Library/Frameworks} folder and installing
    295 will require admin privileges.
    296 
    297 The framework can be uninstalled by dragging @*
    298 @file{/Library/Frameworks/Extractor.framework} to the @file{Trash}.
    299 
    300 
    301 @subsection Using the framework
    302 
    303 In the framework, the @command{extract} command line tool can be found at @*
    304 @file{/Library/Frameworks/Extractor.framework/Versions/Current/bin/extract}
    305 
    306 The framework can be used in software projects as a framework or as a dynamic
    307 library. 
    308 
    309 When using the framework as a dynamic library in projects using autotools,
    310 one would most likely want to add  @*
    311 "-I/Library/Frameworks/Extractor.framework/Versions/Current/include"
    312 to CPPFLAGS and @*
    313 "-L/Library/Frameworks/Extractor.framework/Versions/Current/lib"
    314 to LDFLAGS.
    315 
    316 
    317 @subsection Example for using the framework
    318 
    319 @example
    320 @verbatim
// hello.c
#include <Extractor/extractor.h>

int
main (int argc, char **argv)
{
  struct EXTRACTOR_PluginList *el;

  /* load all default plugins using the default execution policy */
  el = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  // ...
  EXTRACTOR_plugin_remove_all (el);
  return 0;
}
    333 @end verbatim
    334 @end example
    335 
    336 You can then compile the example using
    337 
    338 @verbatim
$ gcc -o hello hello.c -framework Extractor
    340 @end verbatim
    341 
    342 @subsection Example for using the dynamic library
    343 
    344 @example
    345 @verbatim
// hello.c
#include <extractor.h>

int
main (void)
{
  struct EXTRACTOR_PluginList *el;

  /* load all default plugins using the default execution policy */
  el = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  // ...
  EXTRACTOR_plugin_remove_all (el);
  return 0;
}
    356 @end verbatim
    357 @end example
    358 
    359 You can then compile the example using
    360 
    361 @verbatim
$ gcc -I/Library/Frameworks/Extractor.framework/Versions/Current/include \
  -o hello hello.c \
  -L/Library/Frameworks/Extractor.framework/Versions/Current/lib \
  -lextractor
    366 @end verbatim
    367 
    368 Notice the difference in the @code{#include} line.
    369 
    370 
    371 
    372 
    373 
    374 
    375 @section Note to package maintainers
    376 
    377 The suggested way to package GNU libextractor is to split it into
    378 roughly the following binary packages:
    379 
    380 @itemize @bullet
    381 @item
    382 libextractor (main library only, only hard dependency for other packages depending on GNU libextractor)
    383 @item
    384 extract (command-line tool and man page extract.1)
    385 @item
    386 libextractor-dev (extractor.h header and man page libextractor.3)
    387 @item
    388 libextractor-doc (this manual)
    389 @item
    390 libextractor-plugins (plugins without external dependencies; recommended but not required by extract and libextractor package)
    391 @item
    392 libextractor-plugin-XXX (plugin with dependency on libXXX, for example for XXX=mpeg this would be @file{libextractor_mpeg.so})
    393 @item
    394 libextractor-plugins-all (meta package that requires all plugins except experimental plugins)
    395 @end itemize
    396 
    397 This would enable minimal installations (e.g. for embedded systems) to
    398 not include any plugins, as well as moderate-size installations (that
    399 do not pull in GTK and X11) for systems that have limited resources.
    400 Right now, the MP4 plugin is experimental and does nothing and should
    401 thus never be included at all; QuickTime, MP4, M4A and 3GPP files are
    402 instead handled by the @file{libextractor_qt.so} plugin, which only
    403 depends on zlib and is part of libextractor-plugins.  The gstreamer plugin is experimental
    404 but largely works with the correct version of gstreamer and can thus
    405 be packaged (especially if the dependency is available on the target
    406 system) but should probably not be part of libextractor-plugins-all.
    407 
    408 
    409 @node Generalities
    410 @chapter Generalities
    411 
    412 @section Introduction to the ``extract'' command
    413 
    414 The @command{extract} command takes a list of file names as arguments,
    415 extracts meta data from each of those files and prints the result to
    416 the console.  By default, @command{extract} will use all available
    417 plugins and print all (non-binary) meta data that is found.
    418 
    419 The set of plugins used by @command{extract} can be controlled using
    420 the ``-l'' and ``-n'' options.  Use ``-n'' to not load all of the
    421 default plugins.  Use ``-l NAME'' to specifically load a certain
    422 plugin.  For example, specify ``-n -l mime'' to only use the MIME
    423 plugin.
    424 
    425 Using the ``-p'' option the output of @command{extract} can be limited
    426 to only certain keyword types.  Similarly, using the ``-x'' option,
    427 certain keyword types can be excluded.  A list of all known keyword
    428 types can be obtained using the ``-L'' option.
    429 
    430 The output format of @command{extract} can be influenced with the
    431 ``-V'' (more verbose, lists filenames), ``-g'' (grep-friendly, all
    432 meta data on a single line per file) and ``-b'' (bibTeX style)
    433 options.
    434 
    435 @section Common usage examples for ``extract''
    436 
    437 @example
$ extract test/test.jpg
comment - (C) 2001 by Christian Grothoff, using gimp 1.2 1
mimetype - image/jpeg

$ extract -V -x comment test/test.jpg
Keywords for file test/test.jpg:
mimetype - image/jpeg

$ extract -p comment test/test.jpg
comment - (C) 2001 by Christian Grothoff, using gimp 1.2 1

$ extract -nV -l png.so -p comment test/test.jpg test/test.png
Keywords for file test/test.jpg:
Keywords for file test/test.png:
comment - Testing keyword extraction
    453 @end example
    454 
    455 
    456 @section Introduction to the libextractor library
    457 
    458 Each public symbol exported by GNU libextractor has the prefix
    459 @verb{|EXTRACTOR_|}.  All-caps names are used for constants.  For the
    460 impatient, the minimal C code for using GNU libextractor (on the
    461 file named by the first command-line argument) looks like this:
    462 
    463 @verbatim
#include <stdio.h>
#include <extractor.h>

int
main (int argc, char **argv)
{
  struct EXTRACTOR_PluginList *plugins;

  if (argc < 2)
    return 1; /* no file name given */
  plugins = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  /* print every meta data item found in argv[1] to stdout */
  EXTRACTOR_extract (plugins, argv[1],
                     NULL, 0,
                     &EXTRACTOR_meta_data_print, stdout);
  EXTRACTOR_plugin_remove_all (plugins);
  return 0;
}
    477 @end verbatim
    478 
    479 The minimal API illustrated by this example is actually sufficient for
    480 many applications.  The full external C API of GNU libextractor is described
    481 in @ref{Extracting meta data}.  Bindings for other languages
    482 are described in @ref{Language bindings}.  The API for
    483 writing new plugins is described in @ref{Writing new Plugins}.
    484 
    485 Note that it is possible for GNU libextractor to encounter a @code{SIGPIPE}
    486 during its execution.  GNU libextractor --- as it is a library and as such
    487 should not interfere with your main application --- does NOT install a 
    488 signal handler for @code{SIGPIPE}.  You thus need to install a signal
    489 handler (or at least tell your system to ignore @code{SIGPIPE}) if you
    490 want to avoid unexpected problems during calls to GNU libextractor.  
    491 @cindex SIGPIPE
    492 
    493 @node Extracting meta data
    494 @chapter Extracting meta data
    495 
    496 In order to extract meta data with GNU libextractor you first need to
    497 load the respective plugins and then call the extraction API
    498 with the plugins and the data to process.  This chapter
    499 documents how to load and unload plugins, the various types
    500 and formats in which meta data is returned to the application
    501 and finally the extraction API itself.
    502 
    503 @menu
    504 * Plugin management::   How to load and unload plugins
    505 * Meta types::          About meta types
    506 * Meta formats::        About meta formats
    507 * Extracting::          How to use the extraction API
    508 @end menu
    509 
    510 
    511 @node Plugin management
    512 @section Plugin management
    513 
    514 @cindex reentrant
    515 @cindex concurrency
    516 @cindex threads
    517 @cindex thread-safety
    518 @tindex enum EXTRACTOR_Options
    519 
    520 Using GNU libextractor from a multi-threaded parent process requires some
    521 care.  The problem is that on most platforms GNU libextractor starts
    522 sub-processes for the actual extraction work.  This is useful to
    523 isolate the parent process from potential bugs; however, it can cause
    524 problems if the parent process is multi-threaded.  The issue is that
    525 at the time of the fork, another thread of the application may hold a
    526 lock (e.g. in gettext or libc).  That lock would then never be
    527 released in the child process (as the other thread is not present in
    528 the child process).  As a result, the child process would then
    529 deadlock on trying to acquire the lock and never terminate.  This has
    530 actually been observed with a lock in GNU gettext that is triggered by
    531 the plugin startup code when it interacts with libltdl.
    532 
    533 The problem can be solved by loading the plugins using the
    534 @code{EXTRACTOR_OPTION_IN_PROCESS} option, which will run GNU libextractor
    535 in-process and thus avoid the locking issue.  In this case, all of the
    536 functions for loading and unloading plugins, including
    537 @verb{|EXTRACTOR_plugin_add_defaults|} and
    538 @verb{|EXTRACTOR_plugin_remove_all|}, are thread-safe and reentrant.
    539 However, using the same plugin list from multiple threads at the same
    540 time is not safe.  
    541 
    542 All plugin code is required to be reentrant and state-less,
    543 but due to the extensive use of 3rd party libraries this cannot
    544 be guaranteed.
    545 
    546 
    547 @deftp {C Struct} EXTRACTOR_PluginList
    548 @tindex struct EXTRACTOR_PluginList
    549 
    550 A plugin list represents a set of GNU libextractor plugins.  Most of
    551 the GNU libextractor API is concerned with either constructing a
    552 plugin list or using it to extract meta data.  The internal representation
    553 of the plugin list is of no concern to users or plugin developers.
    554 @end deftp
    555 
    556 
    557 @deftypefun void EXTRACTOR_plugin_remove_all (struct EXTRACTOR_PluginList *plugins)
    558 @findex EXTRACTOR_plugin_remove_all
    559 
    560 Unload all of the plugins in the given list.
    561 @end deftypefun
    562 
    563 @deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_remove (struct EXTRACTOR_PluginList *plugins, const char*name)
    564 @findex EXTRACTOR_plugin_remove
    565 
    566 Unloads a particular plugin.  The given name should be the short name of the plugin, for example ``mime'' for the mime-type extractor or ``mpeg'' for the MPEG extractor.
    567 @end deftypefun
    568 
    569 @deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add (struct EXTRACTOR_PluginList *plugins, const char* name,const char* options, enum EXTRACTOR_Options flags)
    570 @findex EXTRACTOR_plugin_add
    571 
    572 Loads a particular plugin.  The plugin is added to the existing list, which can be @code{NULL}.  The second argument specifies the name of the plugin (e.g. ``ogg'').  The third argument can be @code{NULL} and specifies plugin-specific options.  Finally, the last argument specifies whether the plugin should be executed out-of-process (@code{EXTRACTOR_OPTION_DEFAULT_POLICY}) or in-process (@code{EXTRACTOR_OPTION_IN_PROCESS}).
    573 @end deftypefun
    574 
    575 @deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add_config (struct EXTRACTOR_PluginList *plugins, const char* config, enum EXTRACTOR_Options flags)
    576 @findex EXTRACTOR_plugin_add_config
    577 
    578 Loads and unloads plugins based on a configuration string, modifying the existing list, which can be @code{NULL}.  The string has the format ``[-]NAME(OPTIONS)@{:[-]NAME(OPTIONS)@}*''.  Prefixing the plugin name with a ``-'' means that the plugin should be unloaded.
    579 @end deftypefun
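
To make the configuration string format more concrete, the following
sketch splits such a string into its entries.  It is a stand-alone
illustration, not the parser used inside GNU libextractor: the type
@code{struct ConfigEntry} and the function @code{next_config_entry} are
hypothetical, the buffer sizes are arbitrary, and the sketch assumes
that the parenthesised options may be omitted and do not contain a
closing parenthesis.

```c
#include <stddef.h>
#include <string.h>

/* One parsed entry of a configuration string (illustrative type only). */
struct ConfigEntry
{
  int unload;          /* 1 if the entry was prefixed with '-' */
  char name[64];       /* short plugin name, for example "mime" */
  char options[128];   /* text between '(' and ')', or "" if absent */
};

/* Parse the entry starting at *pos and advance *pos past it (and past
   a following ':').  Returns 1 if an entry was parsed, 0 at the end of
   the string or if the entry is malformed or too long. */
int
next_config_entry (const char **pos, struct ConfigEntry *e)
{
  const char *s = *pos;
  const char *opts = NULL;
  size_t nlen;
  size_t olen = 0;

  if ('\0' == *s)
    return 0;
  e->unload = ('-' == *s);
  if (e->unload)
    s++;
  nlen = strcspn (s, "(:");        /* the name ends at '(' or ':' */
  if ((0 == nlen) || (nlen >= sizeof (e->name)))
    return 0;
  memcpy (e->name, s, nlen);
  e->name[nlen] = '\0';
  s += nlen;
  if ('(' == *s)
    {
      const char *end = strchr (s, ')');

      if (NULL == end)
        return 0;                  /* unterminated option list */
      opts = s + 1;
      olen = (size_t) (end - opts);
      if (olen >= sizeof (e->options))
        return 0;
      s = end + 1;
    }
  if (olen > 0)
    memcpy (e->options, opts, olen);
  e->options[olen] = '\0';
  if (':' == *s)
    s++;                           /* skip the entry separator */
  else if ('\0' != *s)
    return 0;                      /* unexpected trailing characters */
  *pos = s;
  return 1;
}
```

Under these assumptions, calling the function repeatedly on the string
``mime:-ogg'' yields an entry that loads ``mime'' followed by an entry
that unloads ``ogg''.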
    580 
    581 @deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add_defaults (enum EXTRACTOR_Options flags)
    582 @findex EXTRACTOR_plugin_add_defaults
    583 
    584 Loads all of the plugins in the plugin directory.  This function is what most GNU libextractor applications should use to set up the plugins.
    585 @end deftypefun
    586 
    587 
    588 
    589 @node Meta types
    590 @section Meta types
    591 
    592 
    593 @tindex enum EXTRACTOR_MetaType
    594 @findex EXTRACTOR_metatype_get_max
    595 
    596 @verb{|enum EXTRACTOR_MetaType|} is a C enum which defines a list of over 100 different types of meta data.  The total number can differ between different GNU libextractor releases; the maximum value for the current release can be obtained using the @verb{|EXTRACTOR_metatype_get_max|} function.  All values in this enumeration are of the form @verb{|EXTRACTOR_METATYPE_XXX|}.
    597 
    598 @deftypefun {const char *} EXTRACTOR_metatype_to_string (enum EXTRACTOR_MetaType type)
    599 @findex EXTRACTOR_metatype_to_string
    600 @cindex gettext
    601 @cindex internationalization
    602 
    603 The function @verb{|EXTRACTOR_metatype_to_string|} can be used to obtain a short English string @samp{s} describing the meta data type.  The string can be translated into other languages using GNU gettext with the domain set to GNU libextractor (@verb{|dgettext("libextractor", s)|}).  
    604 @end deftypefun
    605 
    606 @deftypefun {const char *} EXTRACTOR_metatype_to_description (enum EXTRACTOR_MetaType type)
    607 @findex EXTRACTOR_metatype_to_description
    608 @cindex gettext
    609 @cindex internationalization
    610 
    611 The function @verb{|EXTRACTOR_metatype_to_description|} can be used to obtain a longer English string @samp{s} describing the meta data type.  The description may be empty if the short description returned by @code{EXTRACTOR_metatype_to_string} is already comprehensive.  The string can be translated into other languages using GNU gettext with the domain set to GNU libextractor (@verb{|dgettext("libextractor", s)|}).  
    612 @end deftypefun
    613 
    614 
    615 
    616 @node Meta formats
    617 @section Meta formats
    618 
    619 @tindex enum EXTRACTOR_MetaFormat
    620 
    621 @verb{|enum EXTRACTOR_MetaFormat|} is a C enum which defines at a high level how the extracted meta data is represented.  Currently, the library uses three formats: UTF-8 strings, C strings and binary data.  A fourth value, @code{EXTRACTOR_METAFORMAT_UNKNOWN}, is defined but not used.  UTF-8 strings are 0-terminated strings that have been converted to UTF-8.  The format code is @code{EXTRACTOR_METAFORMAT_UTF8}.  Ideally, most text meta data will be of this format.  Some file formats fail to specify the encoding used for the text.  In this case, the text cannot be converted to UTF-8.  However, the meta data is still known to be 0-terminated and presumably human-readable.  In this case, the format code used is @code{EXTRACTOR_METAFORMAT_C_STRING}; however, this should not be understood to mean that the encoding is the same as that used by the C compiler.  Finally, for binary data (mostly images), the format @code{EXTRACTOR_METAFORMAT_BINARY} is used.
    622 
    623 Naturally this is not a precise description of the meta format. Plugins can provide a more precise description (if known) by providing the respective mime type of the meta data.  For example, binary image meta data could be also tagged as ``image/png'' and normal text would typically be tagged as ``text/plain''.  
    624 
    625 
    626 
    627 @node Extracting
    628 @section Extracting
    629 
    630 @deftypefn {Function Pointer} int (*EXTRACTOR_MetaDataProcessor)(void *cls, const char *plugin_name, enum EXTRACTOR_MetaType type, enum EXTRACTOR_MetaFormat format, const char *data_mime_type, const char *data, size_t data_len)
    631 @tindex EXTRACTOR_MetaDataProcessor
    632 
    633 Type of a function that libextractor calls for each meta data item found.
    634 
    635 @table @var
    636 
    637 @item cls 
    638 closure (user-defined)
    639 
    640 @item plugin_name 
    641 name of the plugin that produced this value; special values can be used (i.e. '<zlib>' for zlib being used in the main libextractor library and yielding meta data);
    642 
    643 @item type 
    644 libextractor-type describing the meta data;
    645 
    646 @item format
    647 basic format information about data
    648 
    649 @item data_mime_type 
    650 mime-type of data (not of the original file); can be @code{NULL} (if mime-type is not known);
    651 
    652 @item data 
    653 actual meta-data found
    654 
    655 @item data_len 
    656 number of bytes in data
    657 
    658 @end table
    659 
    660 Return 0 to continue extracting, 1 to abort.
    661 @end deftypefn
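
A callback of this type might look like the following sketch, which
prints text items, summarises binary items and asks the library to
stop after a fixed number of items.  Please read it as an illustration
under loud assumptions: so that the sketch is self-contained, the two
enum types are replaced by stand-in declarations whose members and
values are placeholders.  Real code must remove the stand-ins and
include @file{extractor.h} instead.  The names @code{struct ItemLimit},
@code{print_limited} and @code{SKETCH_METATYPE_PLACEHOLDER} are
hypothetical.

```c
#include <stdio.h>
#include <stddef.h>

/* ASSUMPTION: stand-in declarations so that this sketch compiles
   without the library.  Real code must delete them and use
   #include <extractor.h>; the values here are placeholders. */
enum EXTRACTOR_MetaType { SKETCH_METATYPE_PLACEHOLDER };
enum EXTRACTOR_MetaFormat
{
  EXTRACTOR_METAFORMAT_UNKNOWN,
  EXTRACTOR_METAFORMAT_UTF8,
  EXTRACTOR_METAFORMAT_C_STRING,
  EXTRACTOR_METAFORMAT_BINARY
};

/* Closure for the callback (illustrative type only). */
struct ItemLimit
{
  unsigned int seen;   /* items received so far */
  unsigned int max;    /* abort once this many items were received */
};

/* Matches the EXTRACTOR_MetaDataProcessor signature documented above. */
int
print_limited (void *cls, const char *plugin_name,
               enum EXTRACTOR_MetaType type,
               enum EXTRACTOR_MetaFormat format,
               const char *data_mime_type,
               const char *data, size_t data_len)
{
  struct ItemLimit *limit = cls;

  (void) type; /* the meta type is not used in this sketch */
  switch (format)
    {
    case EXTRACTOR_METAFORMAT_UTF8:
    case EXTRACTOR_METAFORMAT_C_STRING:
      /* both text formats are 0-terminated */
      printf ("%s: %s\n", plugin_name, data);
      break;
    case EXTRACTOR_METAFORMAT_BINARY:
      printf ("%s: %zu bytes of %s\n", plugin_name, data_len,
              (NULL != data_mime_type) ? data_mime_type : "binary data");
      break;
    default:
      break;
    }
  limit->seen++;
  return (limit->seen >= limit->max) ? 1 : 0; /* 1 aborts extraction */
}
```

With the real header in place, such a function would be passed as the
@samp{proc} argument of @code{EXTRACTOR_extract}, with a pointer to a
@code{struct ItemLimit} as @samp{proc_cls}.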
    662 
    663 
    664 
    665 @deftypefun void EXTRACTOR_extract (struct EXTRACTOR_PluginList *plugins, const char *filename, const void *data, size_t size, EXTRACTOR_MetaDataProcessor proc, void *proc_cls)
    666 @findex EXTRACTOR_extract
    667 @cindex reentrant
    668 @cindex concurrency
    669 @cindex threads
    670 @cindex thread-safety
    671 
    672 This is the main function for extracting keywords with GNU libextractor.  The first argument is a plugin list which specifies the set of plugins that should be used for extracting meta data.  The @samp{filename} argument is optional and can be used to specify the name of a file to process.  If @samp{filename} is @code{NULL}, then the @samp{data} argument must point to the in-memory data to extract meta data from.  If @samp{filename} is non-@code{NULL}, @samp{data} can be @code{NULL}.  If @samp{data} is non-null, then @samp{size} is the size of @samp{data} in bytes.  Otherwise @samp{size} should be zero.  For each meta data item found, GNU libextractor will call the @samp{proc} function, passing @samp{proc_cls} as the first argument to @samp{proc}.  The other arguments to @samp{proc} depend on the specific meta data found.  
    673 
    674 @cindex SIGBUS
    675 @cindex bus error
    676 Meta data extraction should never really fail --- at worst, GNU libextractor should not call @samp{proc} with any meta data. By design, GNU libextractor should never crash or leak memory, even given corrupt files as input.  Note however, that running GNU libextractor on a corrupt file system (or incorrectly @verb{|mmap|}ed files) can result in the operating system sending a SIGBUS (bus error) to the process.  As GNU libextractor typically runs plugins out-of-process, it first maps the file into memory and then attempts to decompress it.  During decompression it is possible to encounter a SIGBUS.   GNU libextractor will @emph{not} attempt to catch this signal and your application is likely to crash.  Note again that this should only happen if the file @emph{system} is corrupt (not if individual files are corrupt).  If this is not acceptable, you might want to consider running GNU libextractor itself also out-of-process (as done, for example, by @url{http://grothoff.org/christian/doodle/,doodle}).
    677 
    678 @end deftypefun
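
The suggestion to run the extraction itself out-of-process can be
sketched with plain POSIX calls as follows.  This is a generic
isolation pattern, not code taken from GNU libextractor or doodle; the
names @code{run_isolated}, @code{enum IsolatedResult},
@code{work_ok} and @code{work_bus_error} are hypothetical.  In a real
application the work function would load the plugins and call
@code{EXTRACTOR_extract}.  Since the sketch uses @code{fork}, the
caveats about multi-threaded parent processes discussed in the section
on plugin management apply to it as well.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

enum IsolatedResult { ISOLATED_EXITED, ISOLATED_SIGNALED, ISOLATED_FAILED };

/* Run work(arg) in a child process so that a fatal signal (such as
   SIGBUS) kills only the child.  On ISOLATED_EXITED, *detail is the
   child's exit status; on ISOLATED_SIGNALED, it is the signal number. */
enum IsolatedResult
run_isolated (int (*work) (void *), void *arg, int *detail)
{
  pid_t pid = fork ();
  int status;

  if (-1 == pid)
    return ISOLATED_FAILED;
  if (0 == pid)
    _exit (work (arg) & 0xff);     /* child: never returns to the caller */
  while (-1 == waitpid (pid, &status, 0))
    if (EINTR != errno)
      return ISOLATED_FAILED;
  if (WIFSIGNALED (status))
    {
      *detail = WTERMSIG (status);
      return ISOLATED_SIGNALED;
    }
  if (WIFEXITED (status))
    {
      *detail = WEXITSTATUS (status);
      return ISOLATED_EXITED;
    }
  return ISOLATED_FAILED;
}

/* Example work functions: one succeeds, one simulates a bus error. */
int
work_ok (void *arg)
{
  (void) arg;
  return 0;
}

int
work_bus_error (void *arg)
{
  (void) arg;
  raise (SIGBUS);
  return 0;
}
```

The parent can then treat @code{ISOLATED_SIGNALED} as ``no meta data
available for this file'' and continue with the next file.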
    679 
    680 
    681 @node Language bindings
    682 @chapter Language bindings
    683 @cindex Java
    684 @cindex Mono
    685 @cindex Perl
    686 @cindex Python
    687 @cindex PHP
    688 @cindex Ruby
    689 
    690 GNU libextractor works immediately with C and C++ code. Bindings for Java, Mono, Ruby, Perl, PHP and Python are available for download from the main GNU libextractor website.  Documentation for these bindings (if available) is part of the downloads for the respective binding.  In all cases, a full installation of the C library is required before the binding can be installed.
    691 
    692 @section Java
    693 
    694 Compiling the GNU libextractor Java binding follows the usual process of
    695 running @command{configure} and @command{make}.  The result will be a
    696 shared C library @file{libextractor_java.so} with the native code and
    697 a JAR file (installed to @file{$PREFIX/share/java/libextractor.jar}).
    698 
    699 A minimal example for using GNU libextractor's Java binding would look
    700 like this:
    701 @verbatim
import org.gnu.libextractor.*;
import java.util.ArrayList;

// The class name is arbitrary; Java requires main() to live in a class.
public class ExtractDemo {
  public static void main(String[] args) {
    Extractor ex = Extractor.getDefault();
    // Treat every command-line argument as a file name to process.
    for (int i = 0; i < args.length; i++) {
      // Raw ArrayList: the binding does not use generics.
      ArrayList keywords = ex.extract(args[i]);
      System.out.println("Keywords for " + args[i] + ":");
      for (int j = 0; j < keywords.size(); j++)
        System.out.println(keywords.get(j));
    }
  }
}
    714 @end verbatim
    715 
    716 The GNU libextractor library and the @file{libextractor_java.so} JNI binding
    717 have to be in the library search path for this to work.  Furthermore, the
    718 @file{libextractor.jar} file should be on the classpath.  
    719 
    720 Note that the API does not use Java 5 style generics in order to work
    721 with older versions of Java.
    722 
    723 @section Mono
    724 
    725 This binding is undocumented at this point.
    726 
    727 @section Perl
    728 
    729 This binding is undocumented at this point.
    730 
    731 @section Python
    732 
    733 This binding is undocumented at this point.
    734 
    735 @section PHP
    736 
    737 This binding is undocumented at this point.
    738 
    739 @section Ruby
    740 
    741 This binding is undocumented at this point.
    742 
    743 
    744 
    745 @node Utility functions
    746 @chapter Utility functions
    747 
    748 @cindex reentrant
    749 @cindex concurrency
    750 @cindex threads
    751 @cindex thread-safety
    752 This chapter describes various utility functions for GNU libextractor usage. All of the functions are reentrant.
    753 
    754 @menu
    755 * Utility Constants::
    756 * Meta data printing::
    757 @end menu
    758 
    759 @node Utility Constants
    760 @section Utility Constants
    761 
    762 @findex EXTRACTOR_VERSION
    763 The constant @verb{|EXTRACTOR_VERSION|} is a hexadecimal
    764 representation of the version number of the installed libextractor
    765 header.  The hexadecimal format is 0xAABBCCDD where AA is the major
    766 version (so far always 0), BB is the minor version, CC is the revision
    767 and DD the patch number.  For example, for version 0.5.18, we would
    768 have AA=0, BB=5, CC=18 and DD=0.  Minor releases such as 0.5.18a or
    769 significant changes in unreleased versions would be marked with DD=1
    770 or higher.
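
As a rough illustration of the layout described above, the following
sketch splits such a version constant into its four fields.  It assumes
that each byte is read as a plain binary number; the
@code{struct version_parts} type, the @code{decode_version} helper and
any sample value passed to it are hypothetical and not part of the
libextractor API.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the four fields of 0xAABBCCDD. */
struct version_parts
{
  unsigned int major;    /* AA */
  unsigned int minor;    /* BB */
  unsigned int revision; /* CC */
  unsigned int patch;    /* DD */
};

/* Split a version constant into its fields.  Each byte is treated
   as a plain binary number (an assumption of this sketch). */
void
decode_version (uint32_t v, struct version_parts *out)
{
  assert (NULL != out);
  out->major = (v >> 24) & 0xFF;
  out->minor = (v >> 16) & 0xFF;
  out->revision = (v >> 8) & 0xFF;
  out->patch = v & 0xFF;
}
```
@end verbatim

An application could call @code{decode_version (EXTRACTOR_VERSION, &parts)}
at startup to report which header it was compiled against.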
    771 
    772 
    773 @node Meta data printing
    774 @section Meta data printing
    775 
    776 
    777 @findex EXTRACTOR_meta_data_print
    778 @verb{|EXTRACTOR_meta_data_print|} is a simple function that prints the meta data found with libextractor to a file.  The function is mostly useful for debugging and as an example of how to process the meta data items; it can be passed as the @samp{proc} argument to @code{EXTRACTOR_extract}.  The file to print to should be passed as @samp{proc_cls} (which must be of type @code{FILE *}), for example @code{stdout}.
    779 
    780 
    781 
    782 @node Existing Plugins
    783 @chapter Existing Plugins
    784 
    785 @itemize @bullet
    786 @item
    787 ARCHIVE (using libarchive)
    788 @item
    789 DVI
    790 @item
    791 EXIV2 (using libexiv2, 0.23 or later preferred)
    792 @item 
    793 FLAC (using libFLAC)
    794 @item
    795 GIF (using libgif)
    796 @item
    797 GSTREAMER (using libgstreamer v1.0 or later)
    798 @item
    799 HTML (using libtidy)
    800 @item
    801 IT 
    802 @item
    803 JPEG (using libjpeg v8 or later)
    804 @item
    805 MAN
    806 @item
    807 MIDI (using libsmf)
    808 @item
    809 MIME (using libmagic)
    810 @item
    811 MPEG (using libmpeg2)
    812 @item
    813 NSF
    814 @item
    815 NSFE
    816 @item
    817 ODF
    818 @item
    819 OLE2 (with libgsf)
    820 @item
    821 OGG (with libogg)
    822 @item
    823 PNG
    824 @item
    825 PS
    826 @item
    827 RIFF
    828 @item
    829 RPM (using librpm)
    830 @item
    831 S3M
    832 @item
    833 SID
    834 @item
    835 ThumbnailFFMPEG (using libavformat and related libav-libraries, including libswscale)
    836 @item
    837 ThumbnailGtk (using libgtk)
    838 @item
    839 TIFF (with libtiff, tested with v4)
    840 @item
    841 WAV
    842 @item
    843 XM
    844 @item
    845 ZIP
    846 @end itemize
    847 
    848 @file{gzip} and @file{bzip2} compressed versions of these formats are 
    849 also supported (as well as meta data embedded by @file{gzip} itself)
    850 if zlib or libbz2 are available.
    851 
    852 @node Writing new Plugins
    853 @chapter Writing new Plugins
    854 
    855 Writing a new plugin for libextractor usually requires writing or
    856 interfacing with an actual parser for a specific format.  How this
    857 can be accomplished depends on the format and cannot be specified in
    858 general.  However, care should be taken that the code is reentrant
    859 and highly fault-tolerant, especially with respect to malformed
    860 inputs.
    861 
    862 Plugins should start by verifying that the header of the data matches
    863 the specific format and immediately return if that is not the case.
    864 Even if the header matches the expected file format, plugins must not
    865 assume that the remainder of the file is well formed.
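
The following sketch illustrates the kind of defensive check this
implies.  It is not taken from any libextractor plugin: the record
layout (a one-byte length followed by that many bytes of text) and the
helper @code{copy_field} are invented purely for illustration.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record layout: one length byte followed by that many
   bytes of text.  Returns a freshly allocated, zero-terminated copy
   of the text, or NULL if the record does not fit into the buffer. */
char *
copy_field (const unsigned char *buf, size_t buf_size, size_t off)
{
  size_t len;
  char *ret;

  assert (NULL != buf);
  if (off >= buf_size)
    return NULL;                /* offset points past the data */
  len = buf[off];
  if (len > buf_size - off - 1)
    return NULL;                /* claimed length exceeds the data */
  ret = malloc (len + 1);
  if (NULL == ret)
    return NULL;                /* out of memory */
  memcpy (ret, &buf[off + 1], len);
  ret[len] = '\0';
  return ret;
}
```
@end verbatim

The point of the sketch is that every length or offset read from the
file is compared against the number of bytes actually available before
it is used.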
    866 
    867 The plugin library must be called @file{libextractor_XXX.so}, where XXX
    868 denotes the file format of the plugin.  The library must export a
    869 function @verb{|EXTRACTOR_XXX_extract_method|}, with the following
    870 signature:
    871 @verbatim
    872 void
    873 EXTRACTOR_XXX_extract_method (struct EXTRACTOR_ExtractContext *ec);
    874 @end verbatim
    875 
    876 @samp{ec} contains various information the plugin may need for its
    877 execution.  Most importantly, it contains functions for reading
    878 (``read'') and seeking (``seek'') the input data and for returning
    879 extracted data (``proc'').  The ``config'' member can contain
    880 additional configuration options.  ``proc'' should be called on
    881 each meta data item found.  If ``proc'' returns non-zero,
    882 processing should be aborted (if possible).
    883 
    884 In order to test new plugins, the @file{extract} command can be run
    885 with the options ``-ni'' and ``-l XXX''.  This will run the plugin
    886 in-process (making it easier to debug) and without any of the other
    887 plugins.
    888 
    889 
    890 @section Example for a minimal extract method
    891 
    892 The following example shows how a plugin can return the mime type of
    893 a file.
    894 @example
    895 @verbatim
#include <string.h>
#include <extractor.h>

/* Report a MIME type for ELF files; do nothing for other inputs. */
void
EXTRACTOR_mymime_extract_method (struct EXTRACTOR_ExtractContext *ec)
{
  void *data;
  ssize_t data_size;

  /* Ask for the first four bytes of the input. */
  if (-1 == (data_size = ec->read (ec->cls, &data, 4)))
    return; /* read error */
  if (data_size < 4)
    return; /* file too small */
  if (0 != memcmp (data, "\177ELF", 4))
    return; /* not ELF */
  /* The meta data value is itself plain text, hence "text/plain";
     the length passed includes the terminating zero byte. */
  if (0 != ec->proc (ec->cls,
                     "mymime",
                     EXTRACTOR_METATYPE_MIMETYPE,
                     EXTRACTOR_METAFORMAT_UTF8,
                     "text/plain",
                     "application/x-executable",
                     1 + strlen ("application/x-executable")))
    return; /* non-zero result: abort processing */
  /* more calls to 'proc' here as needed */
}
    918 @end verbatim
    919 @end example
    920 
    921 
    922 @node Internal utility functions
    923 @chapter Internal utility functions
    924 
    925 Some plugins link against the @code{libextractor_common} library which
    926 provides common abstractions needed by many plugins.  This section
    927 documents this internal API for plugin developers.  Note that the headers
    928 for this library are (intentionally) not installed: we do not consider
    929 this API stable and it should hence only be used by plugins that are
    930 built and shipped with GNU libextractor.  Third-party plugins should
    931 not use it.
    932 
    933 @file{convert_numeric.h} defines various conversion functions for
    934 numbers (in particular, byte-order conversion for floating point
    935 numbers).  
    936 
    937 @file{unzip.h} defines an API for accessing compressed files.
    938 
    939 @file{pack.h} provides an interpreter for unpacking structs of integer
    940 numbers from streams and converting from big or little endian to host
    941 byte order at the same time.
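
The idea behind such byte-order conversion can be sketched as follows.
This is not the @file{pack.h} API, which is internal and not documented
here; the helpers @code{read_be32} and @code{read_le16} are
hypothetical and merely show how integers can be assembled from a byte
stream independently of the host byte order.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble a 32-bit big-endian value from four bytes (hypothetical
   helper, not part of libextractor). */
uint32_t
read_be32 (const unsigned char *p)
{
  assert (NULL != p);
  return ((uint32_t) p[0] << 24)
         | ((uint32_t) p[1] << 16)
         | ((uint32_t) p[2] << 8)
         | (uint32_t) p[3];
}

/* Assemble a 16-bit little-endian value from two bytes (hypothetical
   helper, not part of libextractor). */
uint16_t
read_le16 (const unsigned char *p)
{
  assert (NULL != p);
  return (uint16_t) (p[0] | (p[1] << 8));
}
```
@end verbatim

Because the value is built with shifts rather than by casting the
buffer to an integer pointer, the result is the same on big and little
endian hosts and does not depend on the alignment of the input.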
    942 
    943 @file{convert.h} provides a function for character set conversion described
    944 below.
    945 
    946 @deftypefun {char *} EXTRACTOR_common_convert_to_utf8 (const char *input, size_t len, const char *charset)
    947 @cindex UTF-8
    948 @cindex character set
    949 @findex EXTRACTOR_common_convert_to_utf8
    950 Various GNU libextractor plugins make use of the internal
    951 @file{convert.h} header, which defines a function
    953 @verb{|EXTRACTOR_common_convert_to_utf8|}, which can be used to easily convert text from
    954 any character set to UTF-8.  This conversion is important since the
    955 linked list of keywords that is returned by GNU libextractor is
    956 expected to contain only UTF-8 strings.  Naturally, proper conversion
    957 may not always be possible since some file formats fail to specify the
    958 character set.  In that case, it is often better to not convert at
    959 all.
    960 
    961 The arguments to @verb{|EXTRACTOR_common_convert_to_utf8|} are the input string (which
    962 does @emph{not} have to be zero-terminated), the length of the input
    963 string, and the character set (which @emph{must} be zero-terminated).
    964 Which character sets are supported depends on the platform; a list can
    965 generally be obtained using the @command{iconv -l} command.  The
    966 return value from @verb{|EXTRACTOR_common_convert_to_utf8|} is a zero-terminated string
    967 in UTF-8 format.  The responsibility to free the string is with the
    968 caller, so storing the string in the keyword list is acceptable.
    969 @end deftypefun
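
To illustrate the calling convention only (length-delimited input,
zero-terminated result owned by the caller), here is a sketch for the
special case of ISO-8859-1 input.  It does not reproduce the actual
implementation; @code{latin1_to_utf8} is a hypothetical stand-in that
handles a single, hard-coded character set.

@verbatim
```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Convert 'len' bytes of ISO-8859-1 text (not necessarily
   zero-terminated) into a freshly allocated, zero-terminated UTF-8
   string.  Returns NULL if memory is exhausted.  The caller must
   free() the result.  Hypothetical helper for illustration. */
char *
latin1_to_utf8 (const char *input, size_t len)
{
  char *ret;
  size_t i;
  size_t o = 0;

  assert ((NULL != input) || (0 == len));
  /* Each input byte expands to at most two bytes of UTF-8. */
  ret = malloc (2 * len + 1);
  if (NULL == ret)
    return NULL;
  for (i = 0; i < len; i++)
  {
    unsigned char c = (unsigned char) input[i];

    if (c < 0x80)
    {
      ret[o++] = (char) c;      /* ASCII is unchanged */
    }
    else
    {
      ret[o++] = (char) (0xC0 | (c >> 6));
      ret[o++] = (char) (0x80 | (c & 0x3F));
    }
  }
  ret[o] = '\0';
  return ret;
}
```
@end verbatim

As with the function documented above, the input is described by a
pointer and a length, and the returned string has to be released by
the caller.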
    970 
    971 
    972 
    973 
    974 
    975 @node Reporting bugs
    976 @chapter Reporting bugs
    977 
    978 @cindex bug
    979 GNU libextractor uses the @url{https://gnunet.org/bugs/,Mantis bug tracking
    980 system}.  If possible, please report bugs there.  You can also e-mail
    981 the GNU libextractor mailing list at @email{libextractor@@gnu.org}.
    982 
    983 
    984 
    985 @c **********************************************************
    986 @c *******************  Appendices  *************************
    987 @c **********************************************************
    988 
    989 @node GNU Free Documentation License
    990 @appendix GNU Free Documentation License
    991 
    992 @include fdl-1.3.texi
    993 
    994 
    995 @node Index
    996 @unnumbered Index
    997 
    998 @printindex cp
    999 
   1000 @c @node Function and Data Index
   1001 @c @unnumbered Function and Data Index
   1002 @c @printindex fn
   1003 
   1004 @c @node Type Index
   1005 @c @unnumbered Type Index
   1006 @c @printindex tp
   1007 
   1008 @bye