Reading file meta-data with extract and libextractor


by Christian Grothoff


INTRODUCTION

Modern file formats have provisions to annotate the contents of the
file with descriptive information. This development is driven by the
need to organize data better than merely by using filenames. The
problem with such meta-data is that the way it is stored is not
standardized. This makes it difficult for format-agnostic tools such
as file-managers or file-sharing applications to make use of the
information. It also results in a plethora of format-specific tools
that are used to extract the meta-data, such as AVInfo [7], id3edit
[8], jpeginfo [9], ldd [10] or Vocoditor [11].

This article introduces the libextractor library and the extract
tool. The goal of the libextractor project [1] is to provide a
uniform interface for obtaining meta-data from different file-formats.
libextractor is currently used by evidence [3], the file-manager for
the forthcoming version of Enlightenment [13], and by GNUnet [4], an
anonymous, censorship-resistant peer-to-peer file-sharing system. The
extract tool is a simple command line interface to the library.
libextractor is licensed under the GNU General Public License [14].

libextractor is somewhat similar to the popular file [12] tool, which
uses the first bytes in a file to guess the mime-type. The main
difference is that libextractor tries to obtain much more information
than just the mime-type. Depending on the file-format, libextractor
can obtain additional information. Examples include the software used
to create the file, the author, a description, the album, the image
dimensions or the duration of the movie.

libextractor achieves all of this by using file-format specific code
for many popular formats.
The list currently includes mp3, ogg, real-media, mpeg, riff (avi),
gif, jpeg, png, tiff, html, pdf, ps and zip. Generic methods such as
mime-type detection are used as well. Many other formats exist [5],
but among the more popular formats only various proprietary formats
are not supported. The end of this article shows how to integrate
support for new formats into the library. This is easy because
libextractor uses plugins to gather data. libextractor plugins are
shared libraries that typically provide code to parse one particular
format. libextractor gathers the meta-data obtained from the various
plugins and provides clients with a list of pairs, each consisting of
a classification and a character sequence. The classification is used
to organize the meta-data into categories like title, creator,
subject, description and so on [6].


[ INSTALLING LIBEXTRACTOR AND USING EXTRACT ]

The simplest way to install libextractor is to use one of the binary
packages which are available on-line for many distributions. Note
that under Debian the extract tool is in a separate package, extract
[15], and the headers required to compile other applications against
libextractor are in libextractor0-devel [16]. If you want to compile
libextractor from source you will need an unusual amount of memory:
256 MB of system memory is roughly the minimum, since gcc will take
about 200 MB to compile one of the plugins. Otherwise compiling by
hand follows the usual sequence of ./configure, make and make install.

After installing libextractor, the extract tool can be used to obtain
meta-data from documents. By default the extract tool uses the
canonical set of plugins, which consists of all file-format specific
plugins supported by the current version of libextractor together with
the mime-type detection plugin.
For example, for the webpage of the Linux Journal, extract returns
something like this:

If you are a user of bibtex [17], the option -b is likely to come in
handy for automatically creating bibtex entries from documents that
have been properly equipped with meta-data:

Another interesting option is "-B LANG". This option loads one of the
language-specific but format-agnostic plugins. These plugins attempt
to find plaintext in a document by matching strings in the document
against a dictionary. If you wondered why libextractor takes 200 MB
to compile, the answer lies in these plugins. In order to be able to
perform a fast dictionary search, a Bloom filter [18] is created that
allows fast probabilistic matching, and gcc finds the resulting data
structure a bit hard to swallow. The option -B is useful for formats
that are undocumented or just currently unsupported. Note that the
printable plugins typically print the entire text of the document in
order. A typical use is:

For a German speaker, the output is a rather precise description of
the text. The supported languages at the moment are Danish (da),
German (de), English (en), Spanish (es), Italian (it) and Norwegian
(no). Supporting other languages is merely a question of adding
(free) dictionaries in an appropriate character set. Further options
are described in the extract manpage (man 1 extract).


[ USING LIBEXTRACTOR IN YOUR PROJECTS ]

The shortest program using libextractor looks roughly like this
(compilation requires passing the option -lextractor to gcc):

The EXTRACTOR_KeywordList is a simple linked list in which each entry
contains a keyword and a keyword type. For details and additional
functions for loading plugins and manipulating the keyword list see
the libextractor manpage (man 3 libextractor). Java programmers
should note that a Java class that uses JNI to communicate with
libextractor is also available.
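
As a rough illustration of the keyword list described above, the
following self-contained C sketch models a linked list of
(classification, keyword) pairs. The names KeywordType, KeywordNode,
add_keyword, count_keywords, print_keywords and free_keywords are
hypothetical stand-ins invented for this example. They are not the
libextractor API, whose actual declarations are in extractor.h, and
the sketch does not link against the library.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical classifications; the real constants are in extractor.h. */
typedef enum {
  KW_UNKNOWN = 0,
  KW_MIMETYPE,
  KW_TITLE,
  KW_AUTHOR
} KeywordType;

/* Hypothetical list node: one classification plus one owned string. */
typedef struct KeywordNode {
  char *keyword;
  KeywordType type;
  struct KeywordNode *next;
} KeywordNode;

/* Prepend a private copy of `keyword` to the list.
   Returns the new head, or `prev` unchanged if allocation fails. */
KeywordNode *add_keyword(KeywordNode *prev, const char *keyword,
                         KeywordType type)
{
  size_t len;
  KeywordNode *node;

  assert(keyword != NULL);
  len = strlen(keyword) + 1;
  node = malloc(sizeof *node);
  if (node == NULL)
    return prev;
  node->keyword = malloc(len);
  if (node->keyword == NULL) {
    free(node);
    return prev;
  }
  memcpy(node->keyword, keyword, len);
  node->type = type;
  node->next = prev;
  return node;
}

/* Count the entries in the list. */
size_t count_keywords(const KeywordNode *list)
{
  size_t n = 0;

  for (; list != NULL; list = list->next)
    n++;
  return n;
}

/* Print one "classification - keyword" line per entry. */
void print_keywords(FILE *out, const KeywordNode *list)
{
  static const char *names[] = { "unknown", "mimetype", "title", "author" };

  for (; list != NULL; list = list->next)
    fprintf(out, "%s - %s\n", names[list->type], list->keyword);
}

/* Release every node together with the string it owns. */
void free_keywords(KeywordNode *list)
{
  while (list != NULL) {
    KeywordNode *next = list->next;

    free(list->keyword);
    free(list);
    list = next;
  }
}
```

A client of such a list would typically walk it once to display or
index the pairs and then release it with a single call. That is why
each node in this sketch owns a private copy of its string.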


[ WRITING PLUGINS ]

The hardest part of writing a new plugin for libextractor is writing
the actual parser for the specific format. Nevertheless, the basic
pattern is always the same. The plugin library must be called
libextractor_XXX.so where XXX denotes the file format of the plugin or
otherwise identifies its purpose. The library must export a function
libextractor_XXX_extract with the following signature:

The argument filename specifies the name of the file that is being
processed. data is a pointer to the (typically mmapped) contents of
the file and size is the filesize. Most plugins do not make use of
the filename and just directly parse data, starting by checking if the
header of data matches the specific format. prev is the list of
keywords that have been extracted so far by other plugins for the
file. The function is expected to return an updated list of keywords;
typically prev is returned unchanged if the format does not match the
expectations of the plugin. Most plugins use a function like
addKeyword to extend the list:

A typical use of addKeyword is to add the mime-type once the
file-format has been established. For example, the JPEG-extractor
checks the first bytes of the JPEG header and then either aborts or
claims the file to be a JPEG:

Note that the strdup here is important since the string will be
deallocated later, typically in EXTRACTOR_freeKeywords(). A list of
supported keyword classifications (EXTRACTOR_XXXX) can be found in the
extractor.h header file [19].


[ CONCLUSION ]

libextractor is a simple, extensible C library for obtaining meta-data
from documents. Its plugin architecture and broad support for formats
set it apart from format-specific tools. The main limitation of the
design is that libextractor cannot be used to update meta-data, a
feature that more specialized tools often support.


[ REFERENCES ]

[1] http://ovmj.org/libextractor/
[2] http://getid3.sf.net/
[3] http://evidence.sf.net/
[4] http://ovmj.org/GNUnet/
[5] http://www.wotsit.org/
[6] http://dublincore.org/documents/dcmi-terms/
[7] http://freshmeat.net/projects/aviinfo/
[8] http://freshmeat.net/projects/id3edit/
[9] http://freshmeat.net/projects/jpeginfo/
[10] http://freshmeat.net/projects/ldt/
[11] http://freshmeat.net/projects/vocoditor/
[12] http://freshmeat.net/projects/file/
[13] http://enlightenment.org/
[14] http://www.gnu.org/licenses/gpl.html
[15] http://packages.debian.org/extract
[16] http://packages.debian.org/libextractor0-devel
[17] http://dmoz.org/Computers/Software/Typesetting/TeX/BibTeX/
[18] http://ovmj.org/GNUnet/download/bloomfilter.ps
[19] http://ovmj.org/libextractor/doxygen/html/extractor_8h-source.html