        Reading file meta-data with extract and libextractor


                     by Christian Grothoff


INTRODUCTION

Modern file formats have provisions to annotate the contents of the
file with descriptive information.  This development is driven by the
need to find a better way to organize data than merely using
filenames.  The problem with such meta-data is that the way it is
stored is not standardized across different file formats.  This makes
it difficult for format-agnostic tools such as file managers or
file-sharing applications to make use of the information.  It also
results in a plethora of format-specific tools that are used to
extract the meta-data, such as AVInfo [7], id3edit [8], jpeginfo [9]
or Vocoditor [10].

In this article, the libextractor library and the extract tool are
introduced.  The goal of the libextractor project [1] is to provide a
uniform interface for obtaining meta-data from different file formats.
libextractor is currently used by evidence [3], the file manager for
the forthcoming version of Enlightenment [13], and GNUnet [4], an
anonymous, censorship-resistant peer-to-peer file-sharing system.  The
extract tool is a command-line interface to the library.  libextractor
is licensed under the GNU General Public License [14].

libextractor shares some similarities with the popular file [11] tool,
which uses the first bytes in a file to guess the mime-type.
libextractor differs from file in that it tries to obtain much more
information than just the mime-type.  Depending on the file format,
libextractor can obtain additional information.  Examples of this
extra information include the name of the software used to create the
file, the author, descriptions, album titles, image dimensions or the
duration of the movie.

libextractor achieves this by using specific parser code for many
popular formats.  The list currently includes mp3, ogg, real-media,
mpeg, riff (avi), gif, jpeg, png, tiff, html, pdf, ps and zip as well
as generic methods such as mime-type detection.  Many other formats
exist [5], but among the more popular formats only a few proprietary
formats are not supported.  Integrating support for new formats is
easy since libextractor uses plugins to gather data.  libextractor
plugins are shared libraries that typically provide code to parse one
particular format.  At the end of the article we will show how to
integrate support for new formats into the library.  libextractor
gathers the meta-data obtained from the various plugins and provides
clients with a list of pairs consisting of a classification and a
character sequence.  The classification is used to organize the
meta-data into categories like title, creator, subject, description
and so on [6].


[ INSTALLING LIBEXTRACTOR AND USING EXTRACT ]

The simplest way to install libextractor is to use one of the binary
packages which are available online for many distributions.  Note
that under Debian, the extract tool is in a separate package extract
[15] and headers required to compile other applications against
libextractor are in libextractor0-devel [16].  If you want to compile
libextractor from source you will need an unusual amount of memory:
256 MB system memory is roughly the minimum, since gcc will take about
200 MB to compile one of the plugins.  Otherwise, compiling by hand
follows the usual sequence:

$ wget http://ovmj.org/libextractor/download/libextractor-0.3.1.tar.gz
$ tar xvfz libextractor-0.3.1.tar.gz
$ cd libextractor-0.3.1
$ ./configure --prefix=/usr/local
$ make
# make install

After installing libextractor, the extract tool can be used to obtain
meta-data from documents.  By default, the extract tool uses the
canonical set of plugins, which consists of all file-format-specific
plugins supported by the current version of libextractor together with
the mime-type detection plugin.  For example, extract returns something
like the following for the Linux Journal web page:

$ wget -q http://www.linuxjournal.com/
$ extract index.html
description - The Monthly Magazine of the Linux Community
keywords - linux, linux journal, magazine
author - Linux Journal  - The Premier Magazine of the Linux Community
title - Linux Journal  - The Premier Magazine of the Linux Community

If you are a user of BibTeX [12], the option -b is likely to come in
handy to automatically create BibTeX entries from documents that have
been properly equipped with meta-data:

$ wget -q http://www.copyright.gov/legislation/dmca.pdf
$ extract -b dmca.pdf
% BiBTeX file
@misc{ unite2001the_d,
    title = "The Digital Millennium Copyright Act of 1998",
    author = "United States Copyright Office - jmf",
    note = "digital millennium copyright act circumvention technological protection management information online service provider liability limitation computer maintenance competitiion repair ephemeral recording webcasting distance education study vessel hull",
    year = "2001",
    month = "10",
    key = "Copyright Office Summary of the DMCA",
    pages = "18"
}

Another interesting option is "-B LANG".  This option loads one of the
language-specific (but format-agnostic) plugins.  These plugins
attempt to find plaintext in a document by matching strings in the
document against a dictionary.  If the need for 200 MB of memory to
compile libextractor seems mysterious, the answer lies in these
plugins.  In order to be able to perform a fast dictionary search, a
Bloom filter [17] is created that allows fast probabilistic matching;
gcc finds the resulting data structure a bit hard to swallow.  The
option -B is useful for formats that are undocumented or currently
unsupported.  Note that these plugins, known as the printable
plugins, typically print the entire text of the document in order.
A typical use is:

$ wget -q http://www.bayern.de/HDBG/polges.doc
$ extract -B de polges.doc | head -n 4
unknown - FEE Politische Geschichte Bayerns Herausgegeben vom Haus der Geschichte als Heft der zur Geschichte und Kultur Redaktion Manfred Bearbeitung Otto Copyright Haus der Geschichte München Gestaltung fürs Internet Rudolf Inhalt im.
unknown - und das Deutsche Reich.
unknown - und seine.
unknown - Henker im Zeitalter von Reformation und Gegenreformation.

This is a rather precise description of the text for a German
speaker.  The supported languages at the moment are Danish (da),
German (de), English (en), Spanish (es), Italian (it) and Norwegian
(no).  Supporting other languages is merely a question of adding
(free) dictionaries in an appropriate character set.  Further options
are described in the extract manpage (man 1 extract).


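To make the idea of fast probabilistic matching more concrete, here is
a minimal Bloom filter sketch.  It is not the code used by the
dictionary plugins: the filter size, the number of hash functions, the
seeded FNV-1a hash and all names (Bloom, bloomInit, bloomAdd,
bloomTest) are assumptions chosen for illustration only.

```c
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   8192u  /* filter size in bits (arbitrary choice) */
#define BLOOM_HASHES 3u     /* bit positions set per word (arbitrary) */

typedef struct {
  unsigned char bits[BLOOM_BITS / 8];
} Bloom;

/* Seeded FNV-1a hash reduced to a bit index; illustrative choice only. */
static uint32_t bloomHash(const char * word, uint32_t seed) {
  uint32_t h = 2166136261u ^ seed;
  for (; *word != '\0'; word++) {
    h ^= (unsigned char) *word;
    h *= 16777619u;
  }
  return h % BLOOM_BITS;
}

/* Clear all bits: the filter then contains no words. */
static void bloomInit(Bloom * b) {
  memset(b->bits, 0, sizeof(b->bits));
}

/* Insert a word by setting BLOOM_HASHES bits derived from it. */
static void bloomAdd(Bloom * b, const char * word) {
  uint32_t i;
  for (i = 0; i < BLOOM_HASHES; i++) {
    uint32_t bit = bloomHash(word, i * 0x9E3779B9u);
    b->bits[bit / 8] |= (unsigned char) (1u << (bit % 8));
  }
}

/* Returns 0 if the word was definitely never added,
   1 if it was probably added (false positives are possible). */
static int bloomTest(const Bloom * b, const char * word) {
  uint32_t i;
  for (i = 0; i < BLOOM_HASHES; i++) {
    uint32_t bit = bloomHash(word, i * 0x9E3779B9u);
    if ( (b->bits[bit / 8] & (1u << (bit % 8))) == 0 )
      return 0;
  }
  return 1;
}
```

A lookup never misses a word that was added, but it may occasionally
accept a word that was not.  For a plugin that only has to guess
whether a string looks like text in a given language, such rare false
positives are a reasonable price for a compact table with constant-time
lookups.

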
[ USING LIBEXTRACTOR IN YOUR PROJECTS ]

The shortest program using libextractor looks roughly like this
(compilation requires passing the option -lextractor to gcc):

#include <stdio.h>
#include <extractor.h>

int main(int argc, char * argv[]) {
  EXTRACTOR_ExtractorList * plugins;
  EXTRACTOR_KeywordList   * md_list;

  if (argc != 2)
    return 1; /* expect exactly one filename */
  plugins = EXTRACTOR_loadDefaultLibraries(); /* load default plugins */
  md_list = EXTRACTOR_getKeywords(plugins, argv[1]);
  EXTRACTOR_printKeywords(stdout, md_list);
  EXTRACTOR_freeKeywords(md_list);
  EXTRACTOR_removeAll(plugins); /* unload plugins */
  return 0;
}

The EXTRACTOR_KeywordList is a simple linked list in which each entry
holds a keyword and a keyword type.  For details and additional
functions for loading plugins and manipulating the keyword list, see
the libextractor manpage (man 3 libextractor).  Java programmers
should note that a Java class that uses JNI to communicate with
libextractor is also available.


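As a hedged illustration of how a client might walk such a list, the
sketch below finds the first keyword of a given type and counts the
entries.  So that it compiles without libextractor installed, it
declares its own stand-in types: KeywordList, KeywordType and the KW_*
values are assumptions made for this sketch, not names taken from
extractor.h.  Only the field names keyword, keywordType and next
follow the addKeyword example in the plugin section of this article.
The helpers firstKeywordOfType and countKeywords are likewise
hypothetical.

```c
#include <stddef.h>

/* Stand-in declarations (assumptions); the real ones are in extractor.h. */
typedef enum { KW_UNKNOWN, KW_MIMETYPE, KW_TITLE } KeywordType;

typedef struct Keywords {
  char * keyword;            /* the extracted string */
  KeywordType keywordType;   /* its classification */
  struct Keywords * next;    /* next entry, or NULL at the end */
} KeywordList;

/* Return the first keyword of the given type, or NULL if there is none. */
static const char * firstKeywordOfType(const KeywordList * list,
                                       KeywordType type) {
  for (; list != NULL; list = list->next)
    if (list->keywordType == type)
      return list->keyword;
  return NULL;
}

/* Count the entries in the list; an empty list is a NULL pointer. */
static size_t countKeywords(const KeywordList * list) {
  size_t n = 0;
  for (; list != NULL; list = list->next)
    n++;
  return n;
}
```

With the real library, the same loops would run over the list returned
by EXTRACTOR_getKeywords before it is handed to
EXTRACTOR_freeKeywords.

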
[ WRITING PLUGINS ]

The most complicated part of writing a new plugin for libextractor
is writing the actual parser for a specific format.
Nevertheless, the basic pattern is always the same.  The plugin
library must be called libextractor_XXX.so where XXX denotes the file
format of the plugin.  The library must export a function
libextractor_XXX_extract with the following signature:

struct EXTRACTOR_Keywords *
libextractor_XXX_extract
   (char * filename,
    char * data,
    size_t size,
    struct EXTRACTOR_Keywords * prev);

The argument filename specifies the name of the file that is being
processed.  data is a pointer to the (typically mmapped) contents of
the file, and size is the filesize.  Most plugins do not make use of
the filename and just parse data directly, starting by
verifying that the header of the data matches the specific format.
prev is the list of keywords that have been extracted so far by other
plugins for the file.  The function is expected to return an updated
list of keywords.  If the format does not match the expectations of
the plugin, prev is returned.  Most plugins use a function like
addKeyword to extend the list:

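static void addKeyword
   (struct EXTRACTOR_Keywords ** list,
    char * keyword,
    EXTRACTOR_KeywordType type)
{
  EXTRACTOR_KeywordList * next;

  if (keyword == NULL)
    return; /* nothing to add, e.g. strdup failed */
  next = malloc(sizeof(EXTRACTOR_KeywordList));
  if (next == NULL) {
    free(keyword); /* the list would have owned this string */
    return;
  }
  next->next = *list;      /* prepend to the existing list */
  next->keyword = keyword; /* the list takes ownership */
  next->keywordType = type;
  *list = next;
}
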
A typical use of addKeyword is to add the mime-type once the file format
has been established.  For example, the JPEG-extractor checks the first
bytes of the JPEG header and then either aborts or claims the file to be
a JPEG:

/* compare as unsigned char: plain char may be signed */
if ( (size < 2) ||
     ((unsigned char) data[0] != 0xFF) ||
     ((unsigned char) data[1] != 0xD8) )
  return prev; /* too short or wrong magic bytes: not a JPEG */
addKeyword(&prev,
           strdup("image/jpeg"),
           EXTRACTOR_MIMETYPE);
/* ... more parsing code here ... */
return prev;

Note that the strdup here is important since the string will be
deallocated later, typically in EXTRACTOR_freeKeywords().  A list of
supported keyword classifications (EXTRACTOR_XXXX) can be found in the
extractor.h header file [18].



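Putting the pieces of this section together, the following sketch
shows the overall shape of an extract function for a made-up format
whose files begin with the four bytes "XYZ1".  Everything specific
here is an assumption made for illustration: the format, its
mime-type string, the stand-in types KeywordList and KeywordType, and
the helper names dupString, xyzExtract and freeKeywords.  A real
plugin would include extractor.h and export libextractor_XXX_extract
instead.  dupString merely stands in for strdup so that the sketch
stays within ISO C.

```c
#include <stdlib.h>
#include <string.h>

/* Stand-in declarations (assumptions); the real ones are in extractor.h. */
typedef enum { KW_UNKNOWN, KW_MIMETYPE } KeywordType;

typedef struct Keywords {
  char * keyword;
  KeywordType keywordType;
  struct Keywords * next;
} KeywordList;

/* Heap-allocated copy of s, or NULL if allocation fails. */
static char * dupString(const char * s) {
  size_t n = strlen(s) + 1;
  char * copy = malloc(n);
  if (copy != NULL)
    memcpy(copy, s, n);
  return copy;
}

/* Prepend a copy of keyword; the list is left unchanged on failure. */
static void addKeyword(KeywordList ** list, const char * keyword,
                       KeywordType type) {
  KeywordList * next = malloc(sizeof(KeywordList));
  char * copy = dupString(keyword);
  if ( (next == NULL) || (copy == NULL) ) {
    free(next);
    free(copy);
    return;
  }
  next->keyword = copy;
  next->keywordType = type;
  next->next = *list;
  *list = next;
}

/* Check the header first; return prev untouched if it does not match. */
static KeywordList * xyzExtract(const char * filename, const char * data,
                                size_t size, KeywordList * prev) {
  (void) filename; /* unused, as in most plugins */
  if ( (size < 4) || (memcmp(data, "XYZ1", 4) != 0) )
    return prev;
  addKeyword(&prev, "application/x-xyz", KW_MIMETYPE);
  /* ... format-specific parsing would go here ... */
  return prev;
}

/* Release every node together with the string it owns. */
static void freeKeywords(KeywordList * list) {
  while (list != NULL) {
    KeywordList * next = list->next;
    free(list->keyword);
    free(list);
    list = next;
  }
}
```

Checking size before touching data keeps the function safe on files
shorter than the signature, and copying every string means the code
that frees the list never has to know where a keyword came from.

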
[ CONCLUSION ]

libextractor is a simple extensible C library for obtaining meta-data
from documents.  Its plugin architecture and broad support for formats
set it apart from format-specific tools.  The design is limited by the
fact that libextractor cannot be used to update meta-data, which more
specialized tools typically support.


[ REFERENCES ]

[1] http://ovmj.org/libextractor/
[2] http://getid3.sf.net/
[3] http://evidence.sf.net/
[4] http://ovmj.org/GNUnet/
[5] http://www.wotsit.org/
[6] http://dublincore.org/documents/dcmi-terms/
[7] http://freshmeat.net/projects/aviinfo/
[8] http://freshmeat.net/projects/id3edit/
[9] http://freshmeat.net/projects/jpeginfo/
[10] http://freshmeat.net/projects/vocoditor/
[11] http://freshmeat.net/projects/file/
[12] http://dmoz.org/Computers/Software/Typesetting/TeX/BibTeX/
[13] http://enlightenment.org/
[14] http://www.gnu.org/licenses/gpl.html
[15] http://packages.debian.org/extract
[16] http://packages.debian.org/libextractor0-devel
[17] http://ovmj.org/GNUnet/download/bloomfilter.ps
[18] http://ovmj.org/libextractor/doxygen/html/extractor_8h-source.html