Reading file meta-data with extract and libextractor


                     by Christian Grothoff


[ INTRODUCTION ]

Modern file formats have provisions to annotate the contents of the
file with descriptive information.  This development is driven by the
need to organize data better than merely by using filenames.  The
problem with such meta-data is that the way it is stored is not
standardized.  This makes it difficult for format-agnostic tools such
as file-managers or file-sharing applications to make use of the
information.  Also it results in a plethora of format-specific tools
that are used to extract the meta-data, such as AVInfo [7], id3edit
[8], jpeginfo [9], ldd [10] or Vocoditor [11].

In this article the libextractor library and the extract tool are
introduced.  The goal of the libextractor project [1] is to provide a
uniform interface for obtaining meta-data from different file-formats.
libextractor is currently used by evidence [3], the file-manager for
the forthcoming version of Enlightenment [13], and GNUnet [4], an
anonymous, censorship-resistant peer-to-peer file-sharing system.  The
extract tool is a simple command line interface to the library.
libextractor is licensed under the GNU General Public License [14].

libextractor is somewhat similar to the popular file [12] tool, which
uses the first bytes in a file to guess the mime-type.  libextractor
differs from file in that it tries to obtain much more information
than just the mime-type.  Depending on the file-format, libextractor
can obtain additional information such as the software used to create
the file, the author, a description, the album, the image dimensions
or the duration of the movie.

libextractor achieves all of this by using both file-format specific
code for many popular formats and generic methods such as mime-type
detection.  The list of supported formats currently includes mp3, ogg,
real-media, mpeg, riff (avi), gif, jpeg, png, tiff, html, pdf, ps and
zip.  Many other formats exist [5], but among the more popular formats
only various proprietary formats are not supported.  Support for new
formats is easy to integrate since libextractor uses plugins to gather
data, as the end of this article will show.  libextractor plugins
are shared libraries that typically provide code to parse one
particular format.  libextractor gathers the meta-data obtained from
the various plugins and provides clients with a list of pairs
consisting of a classification and a character sequence.  The
classification is used to organize the meta-data into categories like
title, creator, subject, description and so on [6].


[ INSTALLING LIBEXTRACTOR AND USING EXTRACT ]

The simplest way to install libextractor is to use one of the binary
packages which are available on-line for many distributions.  Note
that under Debian the extract tool is in a separate package, extract
[15], and the headers required to compile other applications against
libextractor are in libextractor0-devel [16].  If you want to compile
libextractor from source you will need an unusual amount of memory:
256 MB of system memory is roughly the minimum, since gcc will take
about 200 MB to compile one of the plugins.  Otherwise compiling by
hand follows the usual sequence of ./configure, make and make install.

After installing libextractor the extract tool can be used to obtain
meta-data from documents.  By default the extract tool uses the
canonical set of plugins, which consists of all file-format specific
plugins supported by the current version of libextractor together with
the mime-type detection plugin.  For example, extract returns for the
webpage of the Linux Journal something like this:

If you are a user of bibtex [17] the option -b is likely to come in
handy to automatically create bibtex entries from documents that have
been properly equipped with meta-data:

Another interesting option is "-B LANG".  This option loads one of the
language specific but format agnostic plugins.  These plugins attempt
to find plaintext in a document by matching strings in the document
against a dictionary.  If you wondered why libextractor takes 200 MB
to compile, the answer lies in these plugins.  In order to be able to
perform a fast dictionary search a bloomfilter [18] is created that
allows fast probabilistic matching, and gcc finds the resulting
data structure a bit hard to swallow.  The option -B is useful for
formats that are undocumented or just currently unsupported.  Note
that the printable plugins typically print the entire text of the
document in order.  A typical use is:

This is a rather precise description of the text for a German
speaker.  The supported languages at the moment are Danish (da),
German (de), English (en), Spanish (es), Italian (it) and Norwegian
(no).  Supporting other languages is merely a question of adding
(free) dictionaries in an appropriate character set.  Further options
are described in the extract manpage (man 1 extract).
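
To make the idea of probabilistic dictionary matching more concrete,
here is a minimal bloomfilter sketch in C.  It is not the code used by
the printable plugins: the filter size, the number of hash functions
and the FNV-1a hash are all assumptions chosen for illustration, as
are the bloom_* names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   8192u  /* size of the bit array (assumed) */
#define BLOOM_HASHES 3u     /* bits set per word (assumed) */

typedef struct {
  uint8_t bits[BLOOM_BITS / 8];
} BloomFilter;

/* Seeded FNV-1a hash reduced to a bit index; a stand-in for
   whatever hash functions a real dictionary plugin would use. */
static uint32_t bloom_hash(const char *word, uint32_t seed) {
  uint32_t h = 2166136261u ^ seed;
  for (; *word != '\0'; word++) {
    h ^= (uint8_t) *word;
    h *= 16777619u;
  }
  return h % BLOOM_BITS;
}

/* Start with an empty filter: no bits set. */
static void bloom_init(BloomFilter *bf) {
  assert(bf != NULL);
  memset(bf->bits, 0, sizeof bf->bits);
}

/* Record a dictionary word by setting BLOOM_HASHES bits. */
static void bloom_add(BloomFilter *bf, const char *word) {
  for (uint32_t i = 0; i < BLOOM_HASHES; i++) {
    uint32_t bit = bloom_hash(word, i);
    bf->bits[bit / 8] |= (uint8_t) (1u << (bit % 8));
  }
}

/* Returns 0 if the word is definitely not in the dictionary and
   1 if it probably is (false positives are possible). */
static int bloom_test(const BloomFilter *bf, const char *word) {
  for (uint32_t i = 0; i < BLOOM_HASHES; i++) {
    uint32_t bit = bloom_hash(word, i);
    if ((bf->bits[bit / 8] & (1u << (bit % 8))) == 0)
      return 0;
  }
  return 1;
}
```

A word that was added is always reported as present; a word that was
never added is usually, but not always, rejected.  That trade-off is
what keeps the lookup fast and the table compact.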


[ USING LIBEXTRACTOR IN YOUR PROJECTS ]

The shortest program using libextractor looks roughly like this
(compilation requires passing the option -lextractor to gcc):

The EXTRACTOR_KeywordList is a simple linked list containing
a keyword and a keyword type.  For details and additional functions
for loading plugins and manipulating the keyword list see the
libextractor manpage (man 3 libextractor).  Java programmers
should note that a Java class that uses JNI to communicate with
libextractor is also available.
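
As an illustration of what working with such a linked list involves,
the following sketch walks a list of keyword and type pairs.  The
SketchKeywordList type, the SKETCH_* constants and both helper
functions are hypothetical stand-ins defined here so that the example
is self-contained; the real declarations are in extractor.h and may
differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the types declared in extractor.h. */
typedef enum {
  SKETCH_UNKNOWN = 0,
  SKETCH_MIMETYPE,
  SKETCH_TITLE,
  SKETCH_AUTHOR
} SketchKeywordType;

typedef struct SketchKeywordList {
  const char *keyword;             /* the extracted string */
  SketchKeywordType type;          /* its classification */
  struct SketchKeywordList *next;  /* NULL at the end of the list */
} SketchKeywordList;

/* Return the first keyword of the given type, or NULL if none. */
static const char *sketch_first_of_type(const SketchKeywordList *list,
                                        SketchKeywordType type) {
  for (; list != NULL; list = list->next)
    if (list->type == type)
      return list->keyword;
  return NULL;
}

/* Print one "type: keyword" line per entry and return the number
   of entries printed. */
static size_t sketch_print(FILE *out, const SketchKeywordList *list) {
  size_t n = 0;
  assert(out != NULL);
  for (; list != NULL; list = list->next) {
    fprintf(out, "%d: %s\n", (int) list->type, list->keyword);
    n++;
  }
  return n;
}
```

A client interested only in, say, the mime-type would call a helper
like sketch_first_of_type with the matching classification and ignore
the remaining entries.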


[ WRITING PLUGINS ]

The hardest part of writing a new plugin for libextractor is writing
the actual parser for the specific format.  Nevertheless, the basic
pattern is always the same.  The plugin library must be called
libextractor_XXX.so where XXX denotes the file format of the plugin or
otherwise identifies its purpose.  The library must export a function
libextractor_XXX_extract with the following signature:

The argument filename specifies the name of the file that is being
processed.  data is a pointer to the (typically mmapped) contents of
the file and size is the filesize.  Most plugins do not make use of
the filename and just directly parse data, starting by checking if the
header of data matches the specific format.  prev is the list of
keywords that have been extracted so far by other plugins for the
file.  The function is expected to return an updated list of keywords
and typically prev is returned if the format does not match the
expectations of the plugin.  Most plugins use a function like
addKeyword to extend the list:
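
A minimal sketch of such a helper, together with the shape of the
entry point described above, might look as follows.  The
SketchKeywords type and the SKETCH_* constants are hypothetical
stand-ins for the declarations in extractor.h, the parameter types of
the entry point are inferred from the description, and prepending new
entries to the list is an assumption rather than a statement about
the library's own code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the real types are in extractor.h. */
typedef enum {
  SKETCH_UNKNOWN = 0,
  SKETCH_MIMETYPE = 1
} SketchKeywordType;

typedef struct SketchKeywords {
  char *keyword;                  /* heap-allocated, owned by the list */
  SketchKeywordType keywordType;
  struct SketchKeywords *next;
} SketchKeywords;

/* Shape of the exported entry point as the text describes it: it
   receives the file name, the file contents and size, and the list
   built so far, and returns the (possibly extended) list.
   Declaration only; each plugin supplies its own definition. */
SketchKeywords *libextractor_XXX_extract(const char *filename,
                                         const char *data,
                                         size_t size,
                                         SketchKeywords *prev);

/* Prepend a keyword to *list.  The list takes ownership of the
   keyword string, so it must have been allocated on the heap. */
static void addKeyword(SketchKeywords **list,
                       char *keyword,
                       SketchKeywordType type) {
  SketchKeywords *next;
  assert(list != NULL);
  next = malloc(sizeof *next);
  if (next == NULL) {
    free(keyword);  /* out of memory: drop the keyword, keep the list */
    return;
  }
  next->keyword = keyword;
  next->keywordType = type;
  next->next = *list;
  *list = next;
}

/* Release every node together with the string it owns. */
static void freeKeywords(SketchKeywords *list) {
  while (list != NULL) {
    SketchKeywords *next = list->next;
    free(list->keyword);
    free(list);
    list = next;
  }
}
```

Because the helper only touches the head of the list, a plugin that
finds nothing simply returns prev untouched.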

A typical use of addKeyword is to add the mime-type once the file-format
has been established.  For example, the JPEG-extractor checks the first
bytes of the JPEG header and then either aborts or claims the file to be
a JPEG:
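
The header check can be sketched as follows.  This is not the code of
the JPEG plugin; it only relies on the fact that a JPEG stream begins
with the two-byte start-of-image marker FF D8.  The function names
are hypothetical, and copy_string is spelled out here in place of
strdup so that the example stays within ISO C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Heap-allocated copy of s, equivalent in effect to strdup. */
static char *copy_string(const char *s) {
  size_t len = strlen(s) + 1;
  char *copy = malloc(len);
  if (copy != NULL)
    memcpy(copy, s, len);
  return copy;
}

/* A JPEG stream starts with the start-of-image marker FF D8.
   Using unsigned char avoids sign problems when comparing bytes. */
static int looks_like_jpeg(const unsigned char *data, size_t size) {
  assert(data != NULL || size == 0);
  return size >= 2 && data[0] == 0xFF && data[1] == 0xD8;
}

/* Returns a freshly allocated "image/jpeg" string for the keyword
   list to own, or NULL if the data is not a JPEG (in which case a
   plugin would return prev unchanged). */
static char *jpeg_mime_type(const unsigned char *data, size_t size) {
  if (!looks_like_jpeg(data, size))
    return NULL;
  return copy_string("image/jpeg");
}
```

A plugin would hand the returned string to its addKeyword-style
helper with the mime-type classification and then continue parsing
the rest of the file.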

Note that the strdup here is important since the string will be
deallocated later, typically in EXTRACTOR_freeKeywords().  A list of
supported keyword classifications (EXTRACTOR_XXXX) can be found in the
extractor.h header file [19].


[ CONCLUSION ]

libextractor is a simple extensible C library for obtaining meta-data
from documents.  Its plugin architecture and broad support for formats
set it apart from format-specific tools.  One limitation of the design
is that libextractor cannot be used to update meta-data, which more
specialized tools often support.


[ REFERENCES ]

[1] http://ovmj.org/libextractor/  
[2] http://getid3.sf.net/ 
[3] http://evidence.sf.net/
[4] http://ovmj.org/GNUnet/
[5] http://www.wotsit.org/
[6] http://dublincore.org/documents/dcmi-terms/
[7] http://freshmeat.net/projects/aviinfo/
[8] http://freshmeat.net/projects/id3edit/
[9] http://freshmeat.net/projects/jpeginfo/
[10] http://freshmeat.net/projects/ldt/
[11] http://freshmeat.net/projects/vocoditor/
[12] http://freshmeat.net/projects/file/
[13] http://enlightenment.org/
[14] http://www.gnu.org/licenses/gpl.html
[15] http://packages.debian.org/extract
[16] http://packages.debian.org/libextractor0-devel
[17] http://dmoz.org/Computers/Software/Typesetting/TeX/BibTeX/
[18] http://ovmj.org/GNUnet/download/bloomfilter.ps
[19] http://ovmj.org/libextractor/doxygen/html/extractor_8h-source.html