libextractor.texi (36104B)


      1 \input texinfo                  @c -*- Texinfo -*-
      2 @c % The structure of this document is based on the
      3 @c % Texinfo manual from libgcrypt by Werner Koch
      4 @c % and Moritz Schulte.
      5 @c %**start of header
      6 @setfilename libextractor.info
      7 @include version.texi
      8 @settitle The GNU libextractor Reference Manual
      9 @c Unify all the indices into concept index.
     10 @syncodeindex fn cp
     11 @syncodeindex vr cp
     12 @syncodeindex ky cp
     13 @syncodeindex pg cp
     14 @syncodeindex tp cp
     15 @c %**end of header
     16 @copying
     17 This manual is for GNU libextractor
     18 (version @value{VERSION}, @value{UPDATED}), a library for metadata
     19 extraction.
     20 
     21 Copyright @copyright{} 2007, 2010, 2012 Christian Grothoff
     22 
     23 @quotation
     24 Permission is granted to copy, distribute and/or modify this document
     25 under the terms of the GNU Free Documentation License, Version 1.3
     26 or any later version published by the Free Software Foundation;
     27 with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
     28 Texts.  A copy of the license is included in the section entitled ``GNU
     29 Free Documentation License''.
     30 @end quotation
     31 @end copying
     32 
     33 @dircategory Software libraries
     34 @direntry
     35 * Libextractor: (libextractor).    Metadata extraction library.
     36 @end direntry
     37 
     38 
     39 
     40 @c
     41 @c Titlepage
     42 @c
     43 @titlepage
     44 @title The GNU libextractor Reference Manual
     45 @subtitle Version @value{VERSION}
     46 @subtitle @value{UPDATED}
     47 @author Christian Grothoff (@email{christian@@grothoff.org})
     48 
     49 @page
     50 @vskip 0pt plus 1filll
     51 @insertcopying
     52 @end titlepage
     53 
     54 @summarycontents
     55 @contents
     56 
     57 
     58 @ifnottex
     59 @node Top
     60 @top The GNU libextractor Reference Manual
     61 @insertcopying
     62 @end ifnottex
     63 
     64 @menu
     65 * Introduction::                 What is GNU libextractor.
     66 * Preparation::                  What you should do before using the library.
     67 * Generalities::                 General library functions and data types.
     68 * Extracting meta data::         How to use GNU libextractor to obtain meta data.
     69 * Language bindings::            How to use GNU libextractor from languages other than C.
     70 * Utility functions::            Utility functions of GNU libextractor.
     71 * Existing Plugins::             What plugins are available.
     72 * Writing new Plugins::          How to write new plugins for GNU libextractor.
     73 * Internal utility functions::   Utility functions of GNU libextractor for writing plugins.
     74 * Reporting bugs::               How to report bugs or request new features.
     75 
     76 Appendices
     77 
     78 * GNU Free Documentation License::  Copying this manual.
     79 
     80 Indices
     81 
     82 * Index::                       Index
     83 @c * Function and Data Index::     Index of functions, variables and data types.
     84 @c * Type Index::                  Index of data types.
     85 
     86 @end menu
     87 
     88 
     89 
     90 @c **********************************************************
     91 @c *******************  Introduction  ***********************
     92 @c **********************************************************
     93 @node Introduction
     94 @chapter Introduction
     95 
     96 @cindex error handling
     97 GNU libextractor is GNU's library for extracting meta data from
     98 files.  Meta data includes format information (such as mime type,
     99 image dimensions, color depth, recording frequency), content
    100 descriptions (such as document title or document description) and
    101 copyright information (such as license, author and contributors).
    102 Meta data extraction is an inherently uncertain business --- a parse
    103 error can be a corrupt file, an incompatibility in the file format
    104 version, an entirely different file format or a bug in the parser.  As
    105 a result of this uncertainty, GNU libextractor deliberately
    106 avoids ever reporting errors.  Unexpected file contents simply
    107 result in less, or possibly no, meta data being extracted.
    108 
    109 @cindex plugin
    110 GNU libextractor uses plugins to handle various file formats.
    111 Technically a plugin can support multiple file formats; however, most
    112 plugins only support one particular format.  By default,
    113 GNU libextractor will use all plugins that are available and found
    114 in the plugin installation directory.  Applications can
    115 request the use of only specific plugins or the exclusion of
    116 certain plugins.
    117 
    118 GNU libextractor is distributed with the @command{extract} 
    119 command@footnote{Some distributions ship @command{extract} in a
    120 separate package.} which is a command-line tool for extracting
    121 meta data.  @command{extract} is given a list of filenames and 
    122 prints the resulting meta data to the console.  The @command{extract}
    123 source code also serves as an advanced example of how to use
    124 GNU libextractor.  
    125 
    126 This manual focuses on providing documentation for writing software
    127 with GNU libextractor.  The most relevant part for end-users
    128 is the chapter on compiling and installing GNU libextractor
    129 (@pxref{Preparation}).  The chapter on existing plugins may also be of
    130 interest (@pxref{Existing Plugins}).  Additional documentation for
    131 end-users can be found in the man page on @command{extract} (using
    132 @verb{|man extract|}).
    133 
    134 @cindex license
    135 GNU libextractor is licensed under the GNU General Public License.
    136 Specifically, since version 0.7, GNU libextractor is licensed under GPLv3
    137 @emph{or any later version}.
    138 
    139 @node Preparation
    140 @chapter Preparation
    141 
    142 This chapter first describes the general build instructions that
    143 should apply to all systems.  Specific instructions for known problems
    144 for particular platforms are then described in individual sections
    145 afterwards.
    146 
    147 Compiling GNU libextractor follows the standard GNU autotools build process
    148 using @command{configure} and @command{make}.  For details on the GNU
    149 autotools build process, read the @file{INSTALL} file and query
    150 @verb{|./configure --help|} for additional options.  
    151 
    152 GNU libextractor has various dependencies, most of which are optional. 
    153 Instead of specifying the names of the software packages, we
    154 will give the list in terms of the names of the respective
    155 Debian (wheezy) packages that should be installed.
    156 
    157 You absolutely need:
    158 
    159 @itemize @bullet
    160 @item
    161 libtool
    162 @item
    163 gcc
    164 @item
    165 make
    166 @item
    167 g++ 
    168 @item
    169 libltdl7-dev
    170 @end itemize
    171 
    172 Recommended dependencies are:
    173 @itemize @bullet
    174 @item
    175 zlib1g-dev
    176 @item
    177 libbz2-dev
    178 @item
    179 libgif-dev
    180 @item
    181 libvorbis-dev
    182 @item
    183 libflac-dev
    184 @item
    185 libmpeg2-4-dev
    186 @item
    187 librpm-dev
    188 @item
    189 libgtk2.0-dev or libgtk3.0-dev
    190 @item
    191 libgsf-1-dev
    192 @item
    193 libqt4-dev
    194 @item
    195 libpoppler-dev
    196 @item
    197 libexiv2-dev
    198 @item
    199 libavformat-dev
    200 @item
    201 libswscale-dev
    202 @item
    203 libgstreamer1.0-dev
    204 @end itemize
    205 
    206 For Subversion access and compilation one also needs:
    207 @itemize @bullet
    208 @item
    209 subversion
    210 @item
    211 autoconf
    212 @item
    213 automake
    214 @end itemize
    215 
    216 Please notify us if we missed some dependencies (note that the list is
    217 supposed to only list direct dependencies, not transitive
    218 dependencies).
    219 
    220 Once you have compiled and installed GNU libextractor, you should have a file
    221 @file{extractor.h} installed in your @file{include/} directory.  This
    222 file should be the starting point for your C and C++ development with
    223 GNU libextractor.  The build process also installs the @file{extract} binary and
    224 man pages for @file{extract} and GNU libextractor.  The @file{extract} man page
    225 documents the @file{extract} tool.  The GNU libextractor man page gives a brief
    226 summary of the C API for GNU libextractor.
    227 
    228 @cindex packaging
    229 @cindex directory structure
    230 @cindex plugin
    231 @cindex environment variables
    232 @vindex LIBEXTRACTOR_PREFIX
    233 When you install GNU libextractor, various plugins will be
    234 installed in the @file{lib/libextractor/} directory.  The main library
    235 will be installed as @file{lib/libextractor.so}.  Note that
    236 GNU libextractor will attempt to find the plugins relative to the
    237 path of the main library.  Consequently, a package manager can move
    238 the library and its plugins to a different location later --- as long
    239 as the relative path between the main library and the plugins is
    240 preserved.  As a method of last resort, the user can specify an
    241 environment variable @verb{|LIBEXTRACTOR_PREFIX|}.  If
    242 GNU libextractor cannot locate a plugin, it will look in
    243 @verb{|LIBEXTRACTOR_PREFIX/lib/libextractor/|}.
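
The fallback location described above can be sketched in a few lines of
C.  This only illustrates how the directory name is formed from the
environment variable; it is not the code GNU libextractor itself uses,
and the helper name @code{fallback_plugin_dir} is hypothetical.

@verbatim
```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of GNU libextractor): writes the
   fallback plugin directory derived from LIBEXTRACTOR_PREFIX into buf.
   Returns 0 on success, -1 if the variable is unset or buf is too small. */
int
fallback_plugin_dir (char *buf, size_t buf_size)
{
  const char *prefix = getenv ("LIBEXTRACTOR_PREFIX");
  int n;

  if (NULL == prefix)
    return -1;
  /* Append the relative plugin directory named in the text above. */
  n = snprintf (buf, buf_size, "%s/lib/libextractor/", prefix);
  if ( (n < 0) || ((size_t) n >= buf_size) )
    return -1; /* formatting error or truncated output */
  return 0;
}
```
@end verbatim

With @code{LIBEXTRACTOR_PREFIX} set to @file{/opt/lx}, the helper would
yield @file{/opt/lx/lib/libextractor/}.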
    244 
    245 
    246 @section Installation on GNU/Linux
    247 
    248 Installation should work using the standard instructions without problems.
    249 
    250 
    251 @section Installation on FreeBSD
    252 
    253 Installation should work using the standard instructions without problems.
    254 
    255 
    256 @section Installation on OpenBSD
    257 
    258 OpenBSD 3.8 does not have CODESET in @file{langinfo.h}.  CODESET
    259 is used in GNU libextractor in about three places.  This causes problems
    260 during compilation.
    261 
    262 
    263 @section Installation on NetBSD
    264 
    265 No reports so far.
    266 
    267 
    268 @section Installation using MinGW
    269 
    270 Linking -lstdc++ with the provided libtool fails on Cygwin.  This
    271 is a problem with libtool: there is unfortunately no flag to tell
    272 libtool how to do its job on Cygwin, and it seems that the library
    273 check cannot be set to 'pass_all' by default.  Patching libtool may
    274 help.
    275 
    276 Note: this is a rather dated report and may no longer apply.
    277 
    278 
    279 @section Installation on OS X
    280 
    281 libextractor has two installation methods on Mac OS X: it can be
    282 installed as a Mac OS X framework or with the standard
    283 @command{./configure; make; make install} shell commands. The
    284 framework package is self-contained, but currently omits some of the
    285 extractor plugins that can be compiled in if libextractor is installed
    286 with @command{./configure; make; make install} (provided that the
    287 required dependencies exist).
    288 
    289 @subsection Installing and uninstalling the framework
    290 
    291 The binary framework is distributed as a disk image (@file{Extractor-x.x.xx.dmg}).
    292 Installation is done by opening the disk image and clicking @file{Extractor.pkg}
    293 inside it. The Mac OS X installer application will then run. The framework
    294 is installed to the root volume's @file{/Library/Frameworks} folder and installing
    295 will require admin privileges.
    296 
    297 The framework can be uninstalled by dragging @*
    298 @file{/Library/Frameworks/Extractor.framework} to the @file{Trash}.
    299 
    300 
    301 @subsection Using the framework
    302 
    303 In the framework, the @command{extract} command line tool can be found at @*
    304 @file{/Library/Frameworks/Extractor.framework/Versions/Current/bin/extract}
    305 
    306 The framework can be used in software projects as a framework or as a dynamic
    307 library. 
    308 
    309 When using the framework as a dynamic library in projects using autotools,
    310 one would most likely want to add  @*
    311 "-I/Library/Frameworks/Extractor.framework/Versions/Current/include"
    312 to CPPFLAGS and @*
    313 "-L/Library/Frameworks/Extractor.framework/Versions/Current/lib"
    314 to LDFLAGS.
    315 
    316 
    317 @subsection Example for using the framework
    318 
    319 @example
    320 @verbatim
    321 // hello.c
    322 #include <Extractor/extractor.h>
    323 
    324 int
    325 main (int argc, char **argv)
    326 {
    327   struct EXTRACTOR_PluginList *el;
    328   el = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
    329   // ...
    330   EXTRACTOR_plugin_remove_all (el);
    331   return 0;
    332 }
    333 @end verbatim
    334 @end example
    335 
    336 You can then compile the example using
    337 
    338 @verbatim
    339 $ gcc -o hello hello.c -framework Extractor
    340 @end verbatim
    341 
    342 @subsection Example for using the dynamic library
    343 
    344 @example
    345 @verbatim
    346 // hello.c
    347 #include <extractor.h>
    348 int main()
    349 {
    350   struct EXTRACTOR_PluginList *el;
    351   el = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
    352   // ...
    353   EXTRACTOR_plugin_remove_all (el);
    354   return 0;
    355 }
    356 @end verbatim
    357 @end example
    358 
    359 You can then compile the example using
    360 
    361 @verbatim
    362 $ gcc -I/Library/Frameworks/Extractor.framework/Versions/Current/include \
    363   -o hello hello.c \
    364   -L/Library/Frameworks/Extractor.framework/Versions/Current/lib \
    365   -lextractor
    366 @end verbatim
    367 
    368 Notice the difference in the @code{#include} line.
    369 
    370 
    371 
    372 
    373 
    374 
    375 @section Note to package maintainers
    376 
    377 The suggested way to package GNU libextractor is to split it into
    378 roughly the following binary packages:
    379 
    380 @itemize @bullet
    381 @item
    382 libextractor (main library only, only hard dependency for other packages depending on GNU libextractor)
    383 @item
    384 extract (command-line tool and man page extract.1)
    385 @item
    386 libextractor-dev (extractor.h header and man page libextractor.3)
    387 @item
    388 libextractor-doc (this manual)
    389 @item
    390 libextractor-plugins (plugins without external dependencies; recommended but not required by extract and libextractor package)
    391 @item
    392 libextractor-plugin-XXX (plugin with dependency on libXXX, for example for XXX=mpeg this would be @file{libextractor_mpeg.so})
    393 @item
    394 libextractor-plugins-all (meta package that requires all plugins except experimental plugins)
    395 @end itemize
    396 
    397 This would enable minimal installations (e.g. for embedded systems) to
    398 not include any plugins, as well as moderate-size installations (that
    399 do not trigger GTK and X11) for systems that have limited resources.
    400 Right now, the MP4 plugin is experimental and does nothing and should
    401 thus never be included at all.  The gstreamer plugin is experimental
    402 but largely works with the correct version of gstreamer and can thus
    403 be packaged (especially if the dependency is available on the target
    404 system) but should probably not be part of libextractor-plugins-all.
    405 
    406 
    407 @node Generalities
    408 @chapter Generalities
    409 
    410 @section Introduction to the ``extract'' command
    411 
    412 The @command{extract} command takes a list of file names as arguments,
    413 extracts meta data from each of those files and prints the result to
    414 the console.  By default, @command{extract} will use all available
    415 plugins and print all (non-binary) meta data that is found.
    416 
    417 The set of plugins used by @command{extract} can be controlled using
    418 the ``-l'' and ``-n'' options.  Use ``-n'' to not load all of the
    419 default plugins.  Use ``-l NAME'' to specifically load a certain
    420 plugin.  For example, specify ``-n -l mime'' to only use the MIME
    421 plugin.
    422 
    423 Using the ``-p'' option the output of @command{extract} can be limited
    424 to only certain keyword types.  Similarly, using the ``-x'' option,
    425 certain keyword types can be excluded.  A list of all known keyword
    426 types can be obtained using the ``-L'' option.
    427 
    428 The output format of @command{extract} can be influenced with the
    429 ``-V'' (more verbose, lists filenames), ``-g'' (grep-friendly, all
    430 meta data on a single line per file) and ``-b'' (bibTeX style)
    431 options.
    432 
    433 @section Common usage examples for ``extract''
    434 
    435 @example
    436 $ extract test/test.jpg
    437 comment - (C) 2001 by Christian Grothoff, using gimp 1.2 1
    438 mimetype - image/jpeg
    439 
    440 $ extract -V -x comment test/test.jpg
    441 Keywords for file test/test.jpg:
    442 mimetype - image/jpeg
    443 
    444 $ extract -p comment test/test.jpg
    445 comment - (C) 2001 by Christian Grothoff, using gimp 1.2 1
    446 
    447 $ extract -nV -l png.so -p comment test/test.jpg test/test.png
    448 Keywords for file test/test.jpg:
    449 Keywords for file test/test.png:
    450 comment - Testing keyword extraction
    451 @end example
    452 
    453 
    454 @section Introduction to the libextractor library
    455 
    456 Each public symbol exported by GNU libextractor has the prefix
    457 @verb{|EXTRACTOR_|}.  All-caps names are used for constants.  For the
    458 impatient, the minimal C code for using GNU libextractor (on the file
    459 named by the first command-line argument) looks like this:
    460 
    461 @verbatim
    462 #include <extractor.h>
    463 
    464 int 
    465 main (int argc, char ** argv) 
    466 {
    467   struct EXTRACTOR_PluginList *plugins
    468     = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
    469   EXTRACTOR_extract (plugins, argv[1],
    470                      NULL, 0, 
    471                      &EXTRACTOR_meta_data_print, stdout);
    472   EXTRACTOR_plugin_remove_all (plugins);
    473   return 0;
    474 }
    475 @end verbatim
    476 
    477 The minimal API illustrated by this example is actually sufficient for
    478 many applications.  The full external C API of GNU libextractor is described
    479 in @ref{Extracting meta data}.  Bindings for other languages
    480 are described in @ref{Language bindings}.  The API for
    481 writing new plugins is described in @ref{Writing new Plugins}.
    482 
    483 Note that it is possible for GNU libextractor to encounter a @code{SIGPIPE}
    484 during its execution.  GNU libextractor --- as it is a library and as such
    485 should not interfere with your main application --- does NOT install a 
    486 signal handler for @code{SIGPIPE}.  You thus need to install a signal
    487 handler (or at least tell your system to ignore @code{SIGPIPE}) if you
    488 want to avoid unexpected problems during calls to GNU libextractor.  
    489 @cindex SIGPIPE
    490 
    491 @node Extracting meta data
    492 @chapter Extracting meta data
    493 
    494 In order to extract meta data with GNU libextractor you first need to
    495 load the respective plugins and then call the extraction API
    496 with the plugins and the data to process.  This section
    497 documents how to load and unload plugins, the various types
    498 and formats in which meta data is returned to the application
    499 and finally the extraction API itself.
    500 
    501 @menu
    502 * Plugin management::   How to load and unload plugins
    503 * Meta types::          About meta types
    504 * Meta formats::        About meta formats
    505 * Extracting::          How to use the extraction API
    506 @end menu
    507 
    508 
    509 @node Plugin management
    510 @section Plugin management
    511 
    512 @cindex reentrant
    513 @cindex concurrency
    514 @cindex threads
    515 @cindex thread-safety
    516 @tindex enum EXTRACTOR_Options
    517 
    518 Using GNU libextractor from a multi-threaded parent process requires some
    519 care.  The problem is that on most platforms GNU libextractor starts
    520 sub-processes for the actual extraction work.  This is useful to
    521 isolate the parent process from potential bugs; however, it can cause
    522 problems if the parent process is multi-threaded.  The issue is that
    523 at the time of the fork, another thread of the application may hold a
    524 lock (e.g. in gettext or libc).  That lock would then never be
    525 released in the child process (as the other thread is not present in
    526 the child process).  As a result, the child process would then
    527 deadlock on trying to acquire the lock and never terminate.  This has
    528 actually been observed with a lock in GNU gettext that is triggered by
    529 the plugin startup code when it interacts with libltdl.
    530 
    531 The problem can be solved by loading the plugins using the
    532 @code{EXTRACTOR_OPTION_IN_PROCESS} option, which will run GNU libextractor
    533 in-process and thus avoid the locking issue.  In this case, all of the
    534 functions for loading and unloading plugins, including
    535 @verb{|EXTRACTOR_plugin_add_defaults|} and
    536 @verb{|EXTRACTOR_plugin_remove_all|}, are thread-safe and reentrant.
    537 However, using the same plugin list from multiple threads at the same
    538 time is not safe.  
    539 
    540 All plugin code is required to be reentrant and state-less,
    541 but due to the extensive use of 3rd party libraries this cannot
    542 be guaranteed.
    543 
    544 
    545 @deftp {C Struct} EXTRACTOR_PluginList
    546 @tindex struct EXTRACTOR_PluginList
    547 
    548 A plugin list represents a set of GNU libextractor plugins.  Most of
    549 the GNU libextractor API is concerned with either constructing a
    550 plugin list or using it to extract meta data.  The internal representation
    551 of the plugin list is of no concern to users or plugin developers.
    552 @end deftp
    553 
    554 
    555 @deftypefun void EXTRACTOR_plugin_remove_all (struct EXTRACTOR_PluginList *plugins)
    556 @findex EXTRACTOR_plugin_remove_all
    557 
    558 Unload all of the plugins in the given list.
    559 @end deftypefun
    560 
    561 @deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_remove (struct EXTRACTOR_PluginList *plugins, const char*name)
    562 @findex EXTRACTOR_plugin_remove
    563 
    564 Unloads a particular plugin.  The given name should be the short name of the plugin, for example ``mime'' for the mime-type extractor or ``mpeg'' for the MPEG extractor.
    565 @end deftypefun
    566 
    567 @deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add (struct EXTRACTOR_PluginList *plugins, const char* name,const char* options, enum EXTRACTOR_Options flags)
    568 @findex EXTRACTOR_plugin_add
    569 
    570 Loads a particular plugin.  The plugin is added to the existing list, which can be @code{NULL}.  The second argument specifies the name of the plugin (e.g. ``ogg'').  The third argument can be @code{NULL} and specifies plugin-specific options.  Finally, the last argument specifies whether the plugin should be executed out-of-process (@code{EXTRACTOR_OPTION_DEFAULT_POLICY}) or in-process (@code{EXTRACTOR_OPTION_IN_PROCESS}).
    571 @end deftypefun
    572 
    573 @deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add_config (struct EXTRACTOR_PluginList *plugins, const char* config, enum EXTRACTOR_Options flags)
    574 @findex EXTRACTOR_plugin_add_config
    575 
    576 Loads and unloads plugins based on a configuration string, modifying the existing list, which can be @code{NULL}.  The string has the format ``[-]NAME(OPTIONS)@{:[-]NAME(OPTIONS)@}*''.  Prefixing the plugin name with a ``-'' means that the plugin should be unloaded.
    577 @end deftypefun
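
To make the configuration string format more concrete, the following
sketch splits such a string into its entries.  It is an illustration
only and not the parser used inside GNU libextractor.  The names
@code{ConfigEntry} and @code{next_config_entry} are hypothetical, and
the sketch assumes that the parenthesized options are optional and
contain neither a colon nor a closing parenthesis.

@verbatim
```c
#include <string.h>

/* One parsed entry of a plugin configuration string (hypothetical type). */
struct ConfigEntry
{
  int remove;         /* 1 if the entry was prefixed with '-' */
  char name[64];      /* short plugin name */
  char options[128];  /* text between '(' and ')', "" if absent */
};

/* Parse the entry starting at *pos and advance *pos past it (and past
   the ':' separator, if any).  Returns 1 if an entry was parsed,
   0 at the end of the string, -1 on malformed input. */
int
next_config_entry (const char **pos, struct ConfigEntry *e)
{
  const char *s = *pos;
  size_t name_len;

  if ('\0' == *s)
    return 0;
  e->remove = ('-' == *s);
  if (e->remove)
    s++;
  /* The name ends at the options, the separator or the end of string. */
  name_len = strcspn (s, "(:");
  if ( (0 == name_len) || (name_len >= sizeof (e->name)) )
    return -1;
  memcpy (e->name, s, name_len);
  e->name[name_len] = '\0';
  s += name_len;
  e->options[0] = '\0';
  if ('(' == *s)
    {
      const char *close = strchr (s, ')');
      size_t opt_len;

      if (NULL == close)
        return -1; /* unterminated options */
      opt_len = (size_t) (close - (s + 1));
      if (opt_len >= sizeof (e->options))
        return -1;
      memcpy (e->options, s + 1, opt_len);
      e->options[opt_len] = '\0';
      s = close + 1;
    }
  if (':' == *s)
    s++;
  else if ('\0' != *s)
    return -1; /* unexpected text after the entry */
  *pos = s;
  return 1;
}
```
@end verbatim

Applied to the string ``mime:-html:ogg(fast)'', repeated calls would
report three entries: load ``mime'', unload ``html'', and load ``ogg''
with the options ``fast''.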
    578 
    579 @deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add_defaults (enum EXTRACTOR_Options flags)
    580 @findex EXTRACTOR_plugin_add_defaults
    581 
    582 Loads all of the plugins in the plugin directory.  This function is what most GNU libextractor applications should use to set up the plugins.
    583 @end deftypefun
    584 
    585 
    586 
    587 @node Meta types
    588 @section Meta types
    589 
    590 
    591 @tindex enum EXTRACTOR_MetaType
    592 @findex EXTRACTOR_metatype_get_max
    593 
    594 @verb{|enum EXTRACTOR_MetaType|} is a C enum which defines a list of over 100 different types of meta data.  The total number can differ between different GNU libextractor releases; the maximum value for the current release can be obtained using the @verb{|EXTRACTOR_metatype_get_max|} function.  All values in this enumeration are of the form @verb{|EXTRACTOR_METATYPE_XXX|}.
    595 
    596 @deftypefun {const char *} EXTRACTOR_metatype_to_string (enum EXTRACTOR_MetaType type)
    597 @findex EXTRACTOR_metatype_to_string
    598 @cindex gettext
    599 @cindex internationalization
    600 
    601 The function @verb{|EXTRACTOR_metatype_to_string|} can be used to obtain a short English string @samp{s} describing the meta data type.  The string can be translated into other languages using GNU gettext with the domain set to GNU libextractor (@verb{|dgettext("libextractor", s)|}).  
    602 @end deftypefun
    603 
    604 @deftypefun {const char *} EXTRACTOR_metatype_to_description (enum EXTRACTOR_MetaType type)
    605 @findex EXTRACTOR_metatype_to_description
    606 @cindex gettext
    607 @cindex internationalization
    608 
    609 The function @verb{|EXTRACTOR_metatype_to_description|} can be used to obtain a longer English string @samp{s} describing the meta data type.  The description may be empty if the short description returned by @code{EXTRACTOR_metatype_to_string} is already comprehensive.  The string can be translated into other languages using GNU gettext with the domain set to GNU libextractor (@verb{|dgettext("libextractor", s)|}).  
    610 @end deftypefun
    611 
    612 
    613 
    614 @node Meta formats
    615 @section Meta formats
    616 
    617 @tindex enum EXTRACTOR_MetaFormat
    618 
    619 @verb{|enum EXTRACTOR_MetaFormat|} is a C enum which defines on a high level how the extracted meta data is represented.  Currently, the library uses three formats: UTF-8 strings, C strings and binary data.  A fourth value, @code{EXTRACTOR_METAFORMAT_UNKNOWN} is defined but not used.  UTF-8 strings are 0-terminated strings that have been converted to UTF-8.  The format code is @code{EXTRACTOR_METAFORMAT_UTF8}. Ideally, most text meta data will be of this format.  Some file formats fail to specify the encoding used for the text.  In this case, the text cannot be converted to UTF-8.  However, the meta data is still known to be 0-terminated and presumably human-readable.  In this case, the format code used is @code{EXTRACTOR_METAFORMAT_C_STRING}; however, this should not be understood to mean that the encoding is the same as that used by the C compiler.  Finally, for binary data (mostly images), the format @code{EXTRACTOR_METAFORMAT_BINARY} is used.
    620 
    621 Naturally this is not a precise description of the meta format. Plugins can provide a more precise description (if known) by providing the respective mime type of the meta data.  For example, binary image meta data could be also tagged as ``image/png'' and normal text would typically be tagged as ``text/plain''.  
    622 
    623 
    624 
    625 @node Extracting
    626 @section Extracting
    627 
    628 @deftypefn {Function Pointer} int (*EXTRACTOR_MetaDataProcessor)(void *cls, const char *plugin_name, enum EXTRACTOR_MetaType type, enum EXTRACTOR_MetaFormat format, const char *data_mime_type, const char *data, size_t data_len)
    629 @tindex EXTRACTOR_MetaDataProcessor
    630 
    631 Type of a function that libextractor calls for each meta data item found.
    632 
    633 @table @var
    634 
    635 @item cls 
    636 closure (user-defined)
    637 
    638 @item plugin_name 
    639 name of the plugin that produced this value; special values can be used (e.g. '<zlib>' for zlib being used in the main libextractor library and yielding meta data);
    640 
    641 @item type 
    642 libextractor-type describing the meta data;
    643 
    644 @item format
    645 basic format information about data
    646 
    647 @item data_mime_type 
    648 mime-type of data (not of the original file); can be @code{NULL} (if mime-type is not known);
    649 
    650 @item data 
    651 actual meta-data found
    652 
    653 @item data_len 
    654 number of bytes in data
    655 
    656 @end table
    657 
    658 Return 0 to continue extracting, 1 to abort.
    659 @end deftypefn
    660 
    661 
    662 
    663 @deftypefun void EXTRACTOR_extract (struct EXTRACTOR_PluginList *plugins, const char *filename, const void *data, size_t size, EXTRACTOR_MetaDataProcessor proc, void *proc_cls)
    664 @findex EXTRACTOR_extract
    665 @cindex reentrant
    666 @cindex concurrency
    667 @cindex threads
    668 @cindex thread-safety
    669 
    670 This is the main function for extracting keywords with GNU libextractor.  The first argument is a plugin list which specifies the set of plugins that should be used for extracting meta data.  The @samp{filename} argument is optional and can be used to specify the name of a file to process.  If @samp{filename} is @code{NULL}, then the @samp{data} argument must point to the in-memory data to extract meta data from.  If @samp{filename} is non-@code{NULL}, @samp{data} can be @code{NULL}.  If @samp{data} is non-null, then @samp{size} is the size of @samp{data} in bytes.  Otherwise @samp{size} should be zero.  For each meta data item found, GNU libextractor will call the @samp{proc} function, passing @samp{proc_cls} as the first argument to @samp{proc}.  The other arguments to @samp{proc} depend on the specific meta data found.  
    671 
    672 @cindex SIGBUS
    673 @cindex bus error
    674 Meta data extraction should never really fail --- at worst, GNU libextractor should not call @samp{proc} with any meta data. By design, GNU libextractor should never crash or leak memory, even given corrupt files as input.  Note however, that running GNU libextractor on a corrupt file system (or incorrectly @verb{|mmap|}ed files) can result in the operating system sending a SIGBUS (bus error) to the process.  As GNU libextractor typically runs plugins out-of-process, it first maps the file into memory and then attempts to decompress it.  During decompression it is possible to encounter a SIGBUS.   GNU libextractor will @emph{not} attempt to catch this signal and your application is likely to crash.  Note again that this should only happen if the file @emph{system} is corrupt (not if individual files are corrupt).  If this is not acceptable, you might want to consider running GNU libextractor itself also out-of-process (as done, for example, by @url{http://grothoff.org/christian/doodle/,doodle}).
    675 
    676 @end deftypefun


@node Language bindings
@chapter Language bindings
@cindex Java
@cindex Mono
@cindex Perl
@cindex Python
@cindex PHP
@cindex Ruby

GNU libextractor works immediately with C and C++ code. Bindings for Java, Mono, Ruby, Perl, PHP and Python are available for download from the main GNU libextractor website.  Documentation for these bindings (if available) is part of the downloads for the respective binding.  In all cases, a full installation of the C library is required before the binding can be installed.

@section Java

Compiling the GNU libextractor Java binding follows the usual process of
running @command{configure} and @command{make}.  The result will be a
shared C library @file{libextractor_java.so} with the native code and
a JAR file (installed to @file{$PREFIX/share/java/libextractor.jar}).

A minimal example for using GNU libextractor's Java binding would look
like this:
@verbatim
import org.gnu.libextractor.*;
import java.util.ArrayList;

// The class name is arbitrary; main() has to live inside some class.
public class ExtractExample {
  public static void main(String[] args) {
    Extractor ex = Extractor.getDefault();
    for (int i = 0; i < args.length; i++) {
      // raw ArrayList: the binding does not use generics
      ArrayList keywords = ex.extract(args[i]);
      System.out.println("Keywords for " + args[i] + ":");
      for (int j = 0; j < keywords.size(); j++)
        System.out.println(keywords.get(j));
    }
  }
}
@end verbatim

The GNU libextractor library and the @file{libextractor_java.so} JNI binding
have to be in the library search path for this to work.  Furthermore, the
@file{libextractor.jar} file should be on the classpath.

Note that the API does not use Java 5 style generics in order to work
with older versions of Java.

@section Mono

This binding is undocumented at this point.

@section Perl

This binding is undocumented at this point.

@section Python

This binding is undocumented at this point.

@section PHP

This binding is undocumented at this point.

@section Ruby

This binding is undocumented at this point.



@node Utility functions
@chapter Utility functions

@cindex reentrant
@cindex concurrency
@cindex threads
@cindex thread-safety
This chapter describes various utility functions for GNU libextractor usage. All of the functions are reentrant.

@menu
* Utility Constants::
* Meta data printing::
@end menu

@node Utility Constants
@section Utility Constants

@findex EXTRACTOR_VERSION
The constant @verb{|EXTRACTOR_VERSION|} is a hexadecimal
representation of the version number of the installed libextractor
header.  The hexadecimal format is 0xAABBCCDD where AA is the major
version (so far always 0), BB is the minor version, CC is the revision
and DD the patch number.  For example, for version 0.5.18, we would
have AA=0, BB=5, CC=18 and DD=0.  Minor releases such as 0.5.18a or
significant changes in unreleased versions would be marked with DD=1
or higher.
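
As an illustration, the four fields can be pulled out of such a value with shifts and masks.  This is a sketch under the assumption that each field is stored as a plain binary byte, so that revision 18 appears as 0x12.  The helper names and @code{EXAMPLE_VERSION} are hypothetical; real code would use @code{EXTRACTOR_VERSION} from the installed header.

```c
#include <stdint.h>

/* Assumed example value for version 0.5.18 with patch 0, using the
   0xAABBCCDD layout (18 decimal is 0x12).  Not taken from any header. */
#define EXAMPLE_VERSION 0x00051200u

/* AA: major version */
static unsigned int
version_major (uint32_t v)
{
  return (v >> 24) & 0xFF;
}

/* BB: minor version */
static unsigned int
version_minor (uint32_t v)
{
  return (v >> 16) & 0xFF;
}

/* CC: revision */
static unsigned int
version_revision (uint32_t v)
{
  return (v >> 8) & 0xFF;
}

/* DD: patch number */
static unsigned int
version_patch (uint32_t v)
{
  return v & 0xFF;
}
```

Because the most significant byte holds the major version, two version values can also be compared directly as integers to decide which one is newer.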


@node Meta data printing
@section Meta data printing


@findex EXTRACTOR_meta_data_print
@verb{|EXTRACTOR_meta_data_print|} is a simple function which prints the meta data found with libextractor to a file.  The function is mostly useful for debugging and as an example of how to manipulate the keyword list.  It can be passed as the @samp{proc} argument to @code{EXTRACTOR_extract}.  The file to print to should be passed as @samp{proc_cls} (which must be of type @code{FILE *}), for example @code{stdout}.
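
A callback in the same spirit could look roughly like the following sketch.  It is not the implementation of @code{EXTRACTOR_meta_data_print}.  The argument order mirrors the call to @samp{proc} in the plugin example of this manual, while the enum types and the name @code{print_utf8_item} are stand-ins (assumptions) so that the sketch compiles without the libextractor header.

```c
#include <stdio.h>
#include <string.h>

/* Stand-ins so the sketch compiles on its own; real code would include
   <extractor.h> and use its meta type and meta format enums instead. */
enum sketch_meta_type { SKETCH_METATYPE_MIMETYPE = 1 };
enum sketch_meta_format { SKETCH_METAFORMAT_UTF8 = 1 };

/* Prints one UTF-8 meta data item to the FILE passed as closure.
   Returns 0 to continue extraction, non-zero to abort. */
static int
print_utf8_item (void *cls,
                 const char *plugin_name,
                 enum sketch_meta_type type,
                 enum sketch_meta_format format,
                 const char *data_mime_type,
                 const char *data,
                 size_t data_len)
{
  FILE *out = cls;

  (void) type;
  (void) data_mime_type;
  if (SKETCH_METAFORMAT_UTF8 != format)
    return 0; /* not text; skip */
  if ( (0 == data_len) || ('\0' != data[data_len - 1]) )
    return 0; /* not zero-terminated; skip rather than read too far */
  if (0 > fprintf (out, "%s: %s\n", plugin_name, data))
    return 1; /* write error: ask the library to stop */
  return 0;
}
```

The closure pointer plays the role of @samp{proc_cls}: passing @code{stdout} prints to the terminal, passing any other open @code{FILE *} redirects the output.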



@node Existing Plugins
@chapter Existing Plugins

@itemize @bullet
@item
ARCHIVE (using libarchive)
@item
DVI
@item
EXIV2 (using libexiv2, 0.23 or later preferred)
@item
FLAC (using libFLAC)
@item
GIF (using libgif)
@item
GSTREAMER (using libgstreamer v1.0 or later)
@item
HTML (using libtidy)
@item
IT
@item
JPEG (using libjpeg v8 or later)
@item
MAN
@item
MIDI (using libsmf)
@item
MIME (using libmagic)
@item
MPEG (using libmpeg2)
@item
NSF
@item
NSFE
@item
ODF
@item
OLE2 (with libgsf)
@item
OGG (with libogg)
@item
PNG
@item
PS
@item
RIFF
@item
RPM (using librpm)
@item
S3M
@item
SID
@item
ThumbnailFFMPEG (using libavformat and related libav-libraries, including libswscale)
@item
ThumbnailGtk (using libgtk)
@item
TIFF (with libtiff, tested with v4)
@item
WAV
@item
XM
@item
ZIP
@end itemize

@file{gzip} and @file{bzip2} compressed versions of these formats are
also supported (as well as meta data embedded by @file{gzip} itself)
if zlib or libbz2 are available.

@node Writing new Plugins
@chapter Writing new Plugins

Writing a new plugin for libextractor usually requires writing or
interfacing with an actual parser for a specific format.  How this
can be accomplished depends on the format and cannot be specified in
general.  However, care should be taken for the code to be reentrant
and highly fault-tolerant, especially with respect to malformed
inputs.

Plugins should start by verifying that the header of the data matches
the specific format and immediately return if that is not the case.
Even if the header matches the expected file format, plugins must not
assume that the remainder of the file is well formed.

The plugin library must be called libextractor_XXX.so, where XXX
denotes the file format of the plugin.  The library must export a
function @verb{|EXTRACTOR_XXX_extract_method|} with the following
signature:
@verbatim
void
EXTRACTOR_XXX_extract_method (struct EXTRACTOR_ExtractContext *ec);
@end verbatim

@samp{ec} contains various information the plugin may need for its
execution.  Most importantly, it contains functions for reading
(``read'') and seeking (``seek'') the input data and for returning
extracted data (``proc'').  The ``config'' member can contain
additional configuration options.  ``proc'' should be called on
each meta data item found.  If ``proc'' returns non-zero,
processing should be aborted (if possible).

In order to test new plugins, the @file{extract} command can be run
with the options ``-ni'' and ``-l XXX''.  This will run the plugin
in-process (making it easier to debug) and without any of the other
plugins.


@section Example for a minimal extract method

The following example shows how a plugin can return the mime type of
a file.  It only recognizes ELF executables.
@example
@verbatim
#include <string.h>
#include <sys/types.h>
#include <extractor.h>

void
EXTRACTOR_mymime_extract_method (struct EXTRACTOR_ExtractContext *ec)
{
  void *data;
  ssize_t data_size;

  if (-1 == (data_size = ec->read (ec->cls, &data, 4)))
    return; /* read error */
  if (data_size < 4)
    return; /* file too small */
  if (0 != memcmp (data, "\177ELF", 4))
    return; /* not ELF */
  if (0 != ec->proc (ec->cls,
                     "mymime",
                     EXTRACTOR_METATYPE_MIMETYPE,
                     EXTRACTOR_METAFORMAT_UTF8,
                     "text/plain",
                     "application/x-executable",
                     1 + strlen("application/x-executable")))
    return; /* caller asked us to abort */
  /* more calls to 'proc' here as needed */
}
@end verbatim
@end example
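
Since plugins must cope with truncated and malformed inputs, it can help to exercise the header check against in-memory buffers.  The following sketch shows one way to do that.  It deliberately uses a stand-in context (@code{struct sketch_context}) with only the ``read'' and ``proc'' members used above; it is not the real @code{struct EXTRACTOR_ExtractContext}, and all names in it are hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Stand-in for the parts of the extract context used by the example;
   this is NOT the real struct from the libextractor header. */
struct sketch_context
{
  void *cls;
  ssize_t (*read) (void *cls, void **data, size_t size);
  int (*proc) (void *cls, const char *plugin_name, int type, int format,
               const char *data_mime_type, const char *data, size_t data_len);
};

/* In-memory "file" plus a counter for reported items. */
struct sketch_state
{
  const char *buf;
  size_t size;
  size_t off;
  unsigned int items;
};

/* Hands out up to 'size' bytes from the buffer, like a short read. */
static ssize_t
mem_read (void *cls, void **data, size_t size)
{
  struct sketch_state *s = cls;
  size_t left = s->size - s->off;

  if (size > left)
    size = left;
  *data = (void *) (s->buf + s->off);
  s->off += size;
  return (ssize_t) size;
}

/* Counts the items instead of storing them. */
static int
mem_proc (void *cls, const char *plugin_name, int type, int format,
          const char *data_mime_type, const char *data, size_t data_len)
{
  struct sketch_state *s = cls;

  (void) plugin_name; (void) type; (void) format;
  (void) data_mime_type; (void) data; (void) data_len;
  s->items++;
  return 0;
}

/* Same logic as the ELF example, written against the stand-in context. */
static void
sketch_extract (struct sketch_context *ec)
{
  void *data;
  ssize_t n = ec->read (ec->cls, &data, 4);

  if (n < 4)
    return; /* read error or file too small */
  if (0 != memcmp (data, "\177ELF", 4))
    return; /* not ELF */
  ec->proc (ec->cls, "mymime", 0, 0, "text/plain",
            "application/x-executable",
            1 + strlen ("application/x-executable"));
}

/* Runs the plugin logic over a buffer; returns the number of items found. */
static unsigned int
count_items (const char *buf, size_t size)
{
  struct sketch_state s = { buf, size, 0, 0 };
  struct sketch_context ec = { &s, &mem_read, &mem_proc };

  sketch_extract (&ec);
  return s.items;
}
```

Feeding @code{count_items} an empty buffer, a truncated header and a header of a different format checks the early-return paths without touching the file system.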


@node Internal utility functions
@chapter Internal utility functions

Some plugins link against the @code{libextractor_common} library which
provides common abstractions needed by many plugins.  This section
documents this internal API for plugin developers.  Note that the headers
for this library are (intentionally) not installed: we do not consider
this API stable and it should hence only be used by plugins that are
built and shipped with GNU libextractor.  Third-party plugins should
not use it.

@file{convert_numeric.h} defines various conversion functions for
numbers (in particular, byte-order conversion for floating point
numbers).

@file{unzip.h} defines an API for accessing compressed files.

@file{pack.h} provides an interpreter for unpacking structs of integer
numbers from streams and converting from big or little endian to host
byte order at the same time.

@file{convert.h} provides a function for character set conversion described
below.

@deftypefun {char *} EXTRACTOR_common_convert_to_utf8 (const char *input, size_t len, const char *charset)
@cindex UTF-8
@cindex character set
@findex EXTRACTOR_common_convert_to_utf8
Various GNU libextractor plugins make use of the internal
@file{convert.h} header, which defines the function
@verb{|EXTRACTOR_common_convert_to_utf8|}.  It can be used to easily
convert text from any supported character set to UTF-8.  This
conversion is important since the
linked list of keywords that is returned by GNU libextractor is
expected to contain only UTF-8 strings.  Naturally, proper conversion
may not always be possible since some file formats fail to specify the
character set.  In that case, it is often better to not convert at
all.

The arguments to @verb{|EXTRACTOR_common_convert_to_utf8|} are the input string (which
does @emph{not} have to be zero-terminated), the length of the input
string, and the character set (which @emph{must} be zero-terminated).
Which character sets are supported depends on the platform; a list can
generally be obtained using the @command{iconv -l} command.  The
return value from @verb{|EXTRACTOR_common_convert_to_utf8|} is a zero-terminated string
in UTF-8 format.  The responsibility to free the string is with the
caller, so storing the string in the keyword list is acceptable.
@end deftypefun
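
To make the contract concrete, the following sketch implements a function with the same argument list on top of POSIX @code{iconv}.  It is an assumption that this resembles the internal implementation; the name @code{sketch_convert_to_utf8} is hypothetical, and returning @code{NULL} on failure is a choice made for this sketch only.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Converts 'len' bytes of 'input' from 'charset' to a newly allocated,
   zero-terminated UTF-8 string.  Returns NULL on failure (sketch-only
   behaviour).  The caller must free() the result. */
static char *
sketch_convert_to_utf8 (const char *input, size_t len, const char *charset)
{
  iconv_t cd = iconv_open ("UTF-8", charset);
  /* Assumed upper bound: at most 4 output bytes per input byte,
     plus one byte for the terminator. */
  size_t out_size = 4 * len + 1;
  size_t out_left = out_size - 1;
  char *in = (char *) input; /* iconv() takes a non-const pointer */
  char *result;
  char *out;

  if ((iconv_t) -1 == cd)
    return NULL; /* character set not supported on this platform */
  result = malloc (out_size);
  if (NULL == result)
    {
      iconv_close (cd);
      return NULL;
    }
  out = result;
  if ((size_t) -1 == iconv (cd, &in, &len, &out, &out_left))
    {
      free (result);
      iconv_close (cd);
      return NULL; /* invalid or incomplete input sequence */
    }
  *out = '\0'; /* the input need not be terminated, the output always is */
  iconv_close (cd);
  return result;
}
```

For example, the four bytes @code{"caf\xe9"} in ISO-8859-1 become the five bytes @code{"caf\xc3\xa9"} in UTF-8, assuming the platform's @code{iconv} knows that character set.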



@node Reporting bugs
@chapter Reporting bugs

@cindex bug
GNU libextractor uses the @url{https://gnunet.org/bugs/,Mantis bugtracking
system}.  If possible, please report bugs there.  You can also e-mail
the GNU libextractor mailing list at @email{libextractor@@gnu.org}.



@c **********************************************************
@c *******************  Appendices  *************************
@c **********************************************************

@node GNU Free Documentation License
@appendix GNU Free Documentation License

@include fdl-1.3.texi


@node Index
@unnumbered Index

@printindex cp

@c @node Function and Data Index
@c @unnumbered Function and Data Index
@c @printindex fn

@c @node Type Index
@c @unnumbered Type Index
@c @printindex tp

@bye