      1         Reading file meta-data with extract and libextractor
      2 
      3 
      4                      by Christian Grothoff
      5 
      6 
      7 INTRODUCTION
      8 
      9 Modern file formats have provisions to annotate the contents of the
     10 file with descriptive information.  This development is driven by the
     11 need to find a better way to organize data than merely using
     12 filenames.  The problem with such meta-data is that the way it is
     13 stored is not standardized across different file formats.  This makes
     14 it difficult for format-agnostic tools such as file managers or
     15 file-sharing applications to make use of the information.  It also
     16 results in a plethora of format-specific tools that are used to
     17 extract the meta-data, such as AVInfo [7], id3edit [8], jpeginfo [9]
     18 or Vocoditor [10].
     19 
     20 In this article, the libextractor library and the extract tool are
     21 introduced.  The goal of the libextractor project [1] is to provide a
     22 uniform interface for obtaining meta-data from different file formats.
     23 libextractor is currently used by evidence [3], the file-manager for
     24 the forthcoming version of Enlightenment [13], and GNUnet [4], an
     25 anonymous, censorship-resistant peer-to-peer file-sharing system.  The
     26 extract tool is a command-line interface to the library.  libextractor
     27 is licensed under the GNU General Public License [14].
     28 
     29 libextractor shares some similarities with the popular file [11] tool
     30 which uses the first bytes in a file to guess the mime-type.
     31 libextractor differs from file in that it tries to obtain much more
     32 information than just the mime-type.  Depending on the file format,
     33 libextractor can obtain additional information.  Examples of this
     34 extra information include the name of the software used to create the
     35 file, the author, descriptions, album titles, image dimensions or the
     36 duration of the movie.
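
To make the first-bytes approach concrete, the following is a minimal
sketch of signature-based mime-type guessing in C.  It is not the code
used by file or by libextractor: the names magic_entry, magic_table
and guess_mime are hypothetical, and the table covers only a handful
of well-known signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One table entry: a byte signature expected at offset 0 and the
   mime-type it suggests.  Illustrative only, not exhaustive. */
struct magic_entry {
  const char *magic;
  size_t len;
  const char *mime;
};

static const struct magic_entry magic_table[] = {
  { "\x89PNG\r\n\x1a\n", 8, "image/png" },
  { "GIF87a",            6, "image/gif" },
  { "GIF89a",            6, "image/gif" },
  { "%PDF-",             5, "application/pdf" },
  { "PK\x03\x04",        4, "application/zip" },
};

/* Return the guessed mime-type for the buffer, or NULL if no
   signature in the table matches. */
const char *guess_mime(const unsigned char *data, size_t size)
{
  size_t i;

  assert(data != NULL || size == 0);
  for (i = 0; i < sizeof(magic_table) / sizeof(magic_table[0]); i++) {
    const struct magic_entry *e = &magic_table[i];
    /* Compare only if the buffer is long enough for the signature. */
    if (size >= e->len && memcmp(data, e->magic, e->len) == 0)
      return e->mime;
  }
  return NULL;
}
```

A lookup like this yields only the mime-type.  Format-specific parsing
is still needed for everything else, such as authors or image
dimensions.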
     37 
     38 libextractor achieves this by using specific parser code for many
     39 popular formats.  The list currently includes mp3, ogg, real-media,
     40 mpeg, riff (avi), gif, jpeg, png, tiff, html, pdf, ps and zip as well
     41 as generic methods such as mime-type detection.  Many other formats
     42 exist [5], but among the more popular formats only a few proprietary
     43 formats are not supported.  Integrating support for new formats is
     44 easy since libextractor uses plugins to gather data.  libextractor
     45 plugins are shared libraries that typically provide code to parse one
     46 particular format.  At the end of the article we will show how to
     47 integrate support for new formats into the library.  libextractor
     48 gathers the meta-data obtained from the various plugins and provides
     49 clients with a list of pairs consisting of a classification and a
     50 character sequence.  The classification is used to organize the
     51 meta-data into categories like title, creator, subject, description
     52 and so on [6].
     53 
     54 
     55 [ INSTALLING LIBEXTRACTOR AND USING EXTRACT ]
     56 
     57 The simplest way to install libextractor is to use one of the binary
     58 packages which are available online for many distributions.  Note
     59 that under Debian, the extract tool is in a separate package extract
     60 [15] and headers required to compile other applications against
     61 libextractor are in libextractor0-devel [16].  If you want to compile
     62 libextractor from source you will need an unusual amount of memory:
     63 256 MB system memory is roughly the minimum, since gcc will take about
     64 200 MB to compile one of the plugins.  Otherwise, compiling by hand
     65 follows the usual sequence as shown in figure [compiling.txt].
     66 
     67 
     68 
     69 After installing libextractor, the extract tool can be used to obtain
     70 meta-data from documents.  By default, the extract tool uses the
     71 canonical set of plugins, which consists of all file-format-specific
     72 plugins supported by the current version of libextractor together with
     73 the mime-type detection plugin.  Example output, here for the Linux
     74 Journal's webpage, is shown in figure [wget_lj.txt].
     75 
     76 If you are a user of bibtex [12], the option -b is likely to come in
     77 handy to automatically create bibtex entries from documents that have
     78 been properly equipped with meta-data, as shown in figure [dmca.txt].  
     79 
     80 Another interesting option is "-B LANG".  This option loads one of the
     81 language specific (but format-agnostic) plugins.  These plugins
     82 attempt to find plaintext in a document by matching strings in the
     83 document against a dictionary.  If the need for 200 MB of memory to
     84 compile libextractor seems mysterious, the answer lies in these
     85 plugins.  In order to be able to perform a fast dictionary search, a
     86 bloomfilter [17] is created that allows fast probabilistic matching;
     87 gcc finds the resulting data structure a bit hard to swallow.  The
     88 option -B is useful for formats that are undocumented or currently
     89 unsupported.  Note that the printable plugins typically print the
     90 entire text of the document in order.  Figure [doc.txt] shows the
     91 output of extract for a Winword document.
     92 
     93 This is a rather precise description of the text for a German
     94 speaker.  The supported languages at the moment are Danish (da),
     95 German (de), English (en), Spanish (es), Italian (it) and Norwegian
     96 (no).  Supporting other languages is merely a question of adding
     97 (free) dictionaries in an appropriate character set.  Further options
     98 are described in the extract manpage (man 1 extract).
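
The probabilistic dictionary lookup mentioned above can be illustrated
with a small Bloom filter.  The sketch below shows the general
technique only and is not the filter shipped with libextractor: the
filter size, the FNV-1a-style hash, the use of two bit positions per
word and all identifiers (bloom, bloom_add, bloom_test) are
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOOM_BITS 8192u  /* filter size in bits, arbitrary for the sketch */

struct bloom {
  unsigned char bits[BLOOM_BITS / 8];
};

/* FNV-1a-style hash with a seed mixed in, so that two different
   bit positions can be derived for each word. */
static uint32_t bloom_hash(const char *word, uint32_t seed)
{
  uint32_t h = 2166136261u ^ seed;

  while (*word != '\0') {
    h ^= (unsigned char) *word++;
    h *= 16777619u;
  }
  return h % BLOOM_BITS;
}

/* Record a dictionary word by setting its two bits. */
void bloom_add(struct bloom *b, const char *word)
{
  uint32_t i = bloom_hash(word, 0);
  uint32_t j = bloom_hash(word, 0x9e3779b9u);

  assert(b != NULL);
  b->bits[i / 8] |= (unsigned char) (1u << (i % 8));
  b->bits[j / 8] |= (unsigned char) (1u << (j % 8));
}

/* Return 0 if the word is definitely not in the set and 1 if it
   probably is.  False positives are possible, false negatives are not. */
int bloom_test(const struct bloom *b, const char *word)
{
  uint32_t i = bloom_hash(word, 0);
  uint32_t j = bloom_hash(word, 0x9e3779b9u);

  assert(b != NULL);
  return ((b->bits[i / 8] >> (i % 8)) & 1) &&
         ((b->bits[j / 8] >> (j % 8)) & 1);
}
```

The filter must start out zeroed.  This sketch fills it at run time,
whereas the article suggests that libextractor compiles a
precomputed filter into the plugin, which would explain the memory
gcc needs.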
     99 
    100 
    101 [ USING LIBEXTRACTOR IN YOUR PROJECTS ]
    102 
    103 Listing [minimal.c] shows the code of a minimalistic program that uses
    104 libextractor.  Compiling minimal.c requires passing the option
    105 -lextractor to gcc.  The EXTRACTOR_KeywordList is a simple linked list
    106 containing a keyword and a keyword type.  For details and additional
    107 functions for loading plugins and manipulating the keyword list, see
    108 the libextractor manpage (man 3 libextractor).  Java programmers
    109 should note that a Java class that uses JNI to communicate with
    110 libextractor is also available.
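
As an illustration of the linked-list shape described above, here is a
minimal sketch of how a client might walk such a list.  The types and
functions (KeywordType, KeywordList, keyword_count, keyword_first) are
hypothetical stand-ins written for this example.  The real
declarations are in extractor.h and may differ in names and fields.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the types declared in extractor.h. */
typedef enum { KT_UNKNOWN, KT_MIMETYPE, KT_TITLE, KT_AUTHOR } KeywordType;

typedef struct Keywords {
  const char *keyword;    /* the extracted character sequence */
  KeywordType type;       /* its classification */
  struct Keywords *next;  /* next pair, or NULL at the end of the list */
} KeywordList;

/* Count the entries in the list. */
size_t keyword_count(const KeywordList *list)
{
  size_t n = 0;

  for (; list != NULL; list = list->next) {
    assert(list->keyword != NULL);  /* every node carries a keyword */
    n++;
  }
  return n;
}

/* Return the first keyword with the given classification, or NULL
   if the list holds no entry of that type. */
const char *keyword_first(const KeywordList *list, KeywordType type)
{
  for (; list != NULL; list = list->next)
    if (list->type == type)
      return list->keyword;
  return NULL;
}
```

A file manager could, for example, call keyword_first with the title
classification to label a file and fall back to the filename when
NULL is returned.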
    111 
    112 
    113 [ WRITING PLUGINS ]
    114 
    115 The most complicated thing when writing a new plugin for libextractor
    116 is the writing of the actual parser for a specific format.
    117 Nevertheless, the basic pattern is always the same.  The plugin
    118 library must be called libextractor_XXX.so where XXX denotes the file
    119 format of the plugin.  The library must export a method
    120 libextractor_XXX_extract with the signature shown in
    121 listing [signature.c].
    122 
    123 The argument filename specifies the name of the file that is being
    124 processed.  data is a pointer to the (typically mmapped) contents of
    125 the file, and size is the filesize.  Most plugins do not make use of
    126 the filename and just parse data directly, starting by
    127 verifying that the header of the data matches the specific format.
    128 prev is the list of keywords that have been extracted so far by other
    129 plugins for the file.  The function is expected to return an updated
    130 list of keywords.  If the format does not match the expectations of
    131 the plugin, prev is returned.  Most plugins use a function like
    132 addKeyword (listing [addkeyword.c]) to extend the list.
    133 
    134 A typical use of addKeyword is to add the mime-type once the file
    135 format has been established.  For example, the JPEG-extractor (listing
    136 [plugin.c]) checks the first bytes of the JPEG header and then either
    137 aborts or claims the file to be a JPEG.  Note that the strdup in the
    138 code is important since the string will be deallocated later,
    139 typically in EXTRACTOR_freeKeywords().  A list of supported keyword
    140 classifications (in the example EXTRACTOR_MIMETYPE) can be found in
    141 the extractor.h header file [18].
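
The pattern described in this section (check the header, return prev
on a mismatch, otherwise extend the list with a heap-allocated string)
can be sketched as follows.  This is not the code of the actual JPEG
plugin: the types, the function names (add_keyword, dup_string,
free_keywords, sketch_jpeg_extract) and the exact argument types are
assumptions for illustration, and dup_string is a portable stand-in
for strdup.  The only format fact relied on is that a JPEG stream
starts with the bytes 0xFF 0xD8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the types declared in extractor.h. */
typedef enum { KT_UNKNOWN, KT_MIMETYPE } KeywordType;

typedef struct Keywords {
  char *keyword;          /* heap-allocated, freed together with the list */
  KeywordType type;
  struct Keywords *next;
} KeywordList;

/* Portable equivalent of POSIX strdup. */
static char *dup_string(const char *s)
{
  size_t n = strlen(s) + 1;
  char *copy = malloc(n);

  if (copy != NULL)
    memcpy(copy, s, n);
  return copy;
}

/* Prepend a pair to the list.  The list takes ownership of keyword. */
static void add_keyword(KeywordList **list, char *keyword, KeywordType type)
{
  KeywordList *node;

  assert(list != NULL);
  node = malloc(sizeof(*node));
  if (node == NULL) {
    free(keyword);
    return;
  }
  node->keyword = keyword;
  node->type = type;
  node->next = *list;
  *list = node;
}

/* Free every node and the string it owns. */
void free_keywords(KeywordList *list)
{
  while (list != NULL) {
    KeywordList *next = list->next;
    free(list->keyword);
    free(list);
    list = next;
  }
}

/* Entry point of a hypothetical JPEG plugin. */
KeywordList *sketch_jpeg_extract(const char *filename,
                                 const unsigned char *data,
                                 size_t size,
                                 KeywordList *prev)
{
  char *mime;

  (void) filename;  /* unused, as in most plugins */
  if (size < 2 || data[0] != 0xFF || data[1] != 0xD8)
    return prev;    /* not a JPEG: hand back the list unchanged */
  mime = dup_string("image/jpeg");
  if (mime != NULL)
    add_keyword(&prev, mime, KT_MIMETYPE);
  return prev;
}
```

Passing a string literal to add_keyword instead of a copy would make
free_keywords call free on memory that was never allocated, which is
why the duplication step matters.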
    142 
    143 
    144 
    145 [ CONCLUSION ]
    146 
    147 libextractor is a simple, extensible C library for obtaining meta-data
    148 from documents.  Its plugin architecture and broad support for formats
    149 set it apart from format-specific tools.  The design is limited by the
    150 fact that libextractor cannot be used to update meta-data, which more
    151 specialized tools typically support.
    152