Reading file meta-data with extract and libextractor


by Christian Grothoff


INTRODUCTION

Modern file formats have provisions to annotate the contents of the
file with descriptive information. This development is driven by the
need to find a better way to organize data than merely using
filenames. The problem with such meta-data is that the way it is
stored is not standardized across different file formats. This makes
it difficult for format-agnostic tools such as file managers or
file-sharing applications to make use of the information. It also
results in a plethora of format-specific tools that are used to
extract the meta-data, such as AVInfo [7], id3edit [8], jpeginfo [9]
or Vocoditor [10].

In this article, the libextractor library and the extract tool are
introduced. The goal of the libextractor project [1] is to provide a
uniform interface for obtaining meta-data from different file formats.
libextractor is currently used by evidence [3], the file manager for
the forthcoming version of Enlightenment [13], and GNUnet [4], an
anonymous, censorship-resistant peer-to-peer file-sharing system. The
extract tool is a command-line interface to the library. libextractor
is licensed under the GNU General Public License [14].

libextractor shares some similarities with the popular file [11] tool,
which uses the first bytes in a file to guess the mime-type.
libextractor differs from file in that it tries to obtain much more
information than just the mime-type. Depending on the file format,
this extra information can include the name of the software used to
create the file, the author, descriptions, album titles, image
dimensions or the duration of a movie.

libextractor achieves this by using specific parser code for many
popular formats.
The list currently includes mp3, ogg, real-media,
mpeg, riff (avi), gif, jpeg, png, tiff, html, pdf, ps and zip as well
as generic methods such as mime-type detection. Many other formats
exist [5], but among the more popular formats only a few proprietary
formats are not supported. Integrating support for new formats is
easy since libextractor uses plugins to gather data. libextractor
plugins are shared libraries that typically provide code to parse one
particular format. The final section of this article shows how to
integrate support for new formats into the library. libextractor
gathers the meta-data obtained from the various plugins and provides
clients with a list of pairs consisting of a classification and a
character sequence. The classification is used to organize the
meta-data into categories like title, creator, subject, description
and so on [6].


[ INSTALLING LIBEXTRACTOR AND USING EXTRACT ]

The simplest way to install libextractor is to use one of the binary
packages which are available online for many distributions. Note
that under Debian, the extract tool is in a separate package, extract
[15], and the headers required to compile other applications against
libextractor are in libextractor0-devel [16]. If you want to compile
libextractor from source you will need an unusual amount of memory:
256 MB of system memory is roughly the minimum, since gcc will take
about 200 MB to compile one of the plugins. Otherwise, compiling by
hand follows the usual sequence:

  $ wget http://ovmj.org/libextractor/download/libextractor-0.3.1.tar.gz
  $ tar xvfz libextractor-0.3.1.tar.gz
  $ cd libextractor-0.3.1
  $ ./configure --prefix=/usr/local
  $ make
  # make install

After installing libextractor, the extract tool can be used to obtain
meta-data from documents.
By default, the extract tool uses the
canonical set of plugins, which consists of all file-format-specific
plugins supported by the current version of libextractor together with
the mime-type detection plugin. For example, extract returns something
like the following for the Linux Journal web page:

  $ wget -q http://www.linuxjournal.com/
  $ extract index.html
  description - The Monthly Magazine of the Linux Community
  keywords - linux, linux journal, magazine
  author - Linux Journal - The Premier Magazine of the Linux Community
  title - Linux Journal - The Premier Magazine of the Linux Community

If you are a user of bibtex [12], the option -b is likely to come in
handy: it automatically creates bibtex entries from documents that
have been properly equipped with meta-data:

  $ wget -q http://www.copyright.gov/legislation/dmca.pdf
  $ extract -b dmca.pdf
  % BiBTeX file
  @misc{ unite2001the_d,
      title = "The Digital Millennium Copyright Act of 1998",
      author = "United States Copyright Office - jmf",
      note = "digital millennium copyright act circumvention technological protection management information online service provider liability limitation computer maintenance competitiion repair ephemeral recording webcasting distance education study vessel hull",
      year = "2001",
      month = "10",
      key = "Copyright Office Summary of the DMCA",
      pages = "18"
  }

Another interesting option is "-B LANG". This option loads one of the
language-specific (but format-agnostic) plugins. These plugins
attempt to find plaintext in a document by matching strings in the
document against a dictionary. If the need for 200 MB of memory to
compile libextractor seems mysterious, the answer lies in these
plugins. In order to be able to perform a fast dictionary search, a
bloomfilter [17] is created that allows fast probabilistic matching;
gcc finds the resulting data structure a bit hard to swallow.
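
To picture the probabilistic matching just described, here is a minimal
Bloom filter sketch. It is not the code used by libextractor: the
filter size, the number of probes, the two hash functions and all
names (BloomFilter, bloomAdd, bloomTest) are assumptions chosen for
illustration. A lookup that returns 0 means the word is definitely not
in the dictionary; a lookup that returns 1 means it probably is.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS 8192u  /* filter size in bits (arbitrary for this sketch) */
#define BLOOM_HASHES 4u   /* number of bit positions probed per word */

typedef struct {
  unsigned char bits[BLOOM_BITS / 8];
} BloomFilter;

/* Two simple string hashes (FNV-1a and djb2), combined below. */
static uint32_t fnv1a(const char * s) {
  uint32_t h = 2166136261u;
  while (*s != '\0') {
    h ^= (unsigned char) *s++;
    h *= 16777619u;
  }
  return h;
}

static uint32_t djb2(const char * s) {
  uint32_t h = 5381u;
  while (*s != '\0')
    h = h * 33u + (unsigned char) *s++;
  return h;
}

/* Derive the i-th bit position for a word from the two hashes. */
static uint32_t bloomIndex(const char * word, uint32_t i) {
  return (fnv1a(word) + i * djb2(word)) % BLOOM_BITS;
}

static void bloomInit(BloomFilter * bf) {
  assert(bf != NULL);
  memset(bf->bits, 0, sizeof(bf->bits));
}

/* Add a dictionary word by setting all of its bit positions. */
static void bloomAdd(BloomFilter * bf, const char * word) {
  uint32_t i;
  assert(bf != NULL && word != NULL);
  for (i = 0; i < BLOOM_HASHES; i++) {
    uint32_t idx = bloomIndex(word, i);
    bf->bits[idx / 8] |= (unsigned char) (1u << (idx % 8));
  }
}

/* Return 0 if the word is definitely absent, 1 if it is probably present. */
static int bloomTest(const BloomFilter * bf, const char * word) {
  uint32_t i;
  assert(bf != NULL && word != NULL);
  for (i = 0; i < BLOOM_HASHES; i++) {
    uint32_t idx = bloomIndex(word, i);
    if ((bf->bits[idx / 8] & (1u << (idx % 8))) == 0)
      return 0;
  }
  return 1;
}
```

A word that was added is always reported as present; false positives
are possible, which is acceptable when the goal is only to decide
whether a byte sequence looks like a word in the chosen language.
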
The option -B is useful for formats that are undocumented or currently
unsupported. Note that these plugins, called the printable plugins,
typically print the entire text of the document in order. A typical
use is:

  $ wget -q http://www.bayern.de/HDBG/polges.doc
  $ extract -B de polges.doc | head -n 4
  unknown - FEE Politische Geschichte Bayerns Herausgegeben vom Haus der Geschichte als Heft der zur Geschichte und Kultur Redaktion Manfred Bearbeitung Otto Copyright Haus der Geschichte München Gestaltung fürs Internet Rudolf Inhalt im.
  unknown - und das Deutsche Reich.
  unknown - und seine.
  unknown - Henker im Zeitalter von Reformation und Gegenreformation.

To a German speaker, this is a rather precise description of the
text. The supported languages at the moment are Danish (da),
German (de), English (en), Spanish (es), Italian (it) and Norwegian
(no). Supporting other languages is merely a question of adding
(free) dictionaries in an appropriate character set. Further options
are described in the extract manpage (man 1 extract).


[ USING LIBEXTRACTOR IN YOUR PROJECTS ]

The shortest program using libextractor looks roughly like this
(linking requires passing the option -lextractor to gcc):

  #include <stdio.h>
  #include <extractor.h>

  int main(int argc, char * argv[]) {
    EXTRACTOR_ExtractorList * plugins;
    EXTRACTOR_KeywordList * md_list;

    if (argc != 2) {
      fprintf(stderr, "usage: %s FILENAME\n", argv[0]);
      return 1;
    }
    plugins = EXTRACTOR_loadDefaultLibraries(); /* load default plugins */
    md_list = EXTRACTOR_getKeywords(plugins, argv[1]);
    EXTRACTOR_printKeywords(stdout, md_list);
    EXTRACTOR_freeKeywords(md_list);
    EXTRACTOR_removeAll(plugins); /* unload plugins */
    return 0;
  }

The EXTRACTOR_KeywordList is a simple linked list in which each entry
holds a keyword and a keyword type. For details and additional
functions for loading plugins and manipulating the keyword list, see
the libextractor manpage (man 3 libextractor).
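
As a rough illustration of working with such a list directly, the
following sketch walks a linked list of keywords, counts the entries
and looks up the first entry of a given type. So that it compiles
without libextractor installed, it uses stand-in declarations
(DemoKeywordList, DemoKeywordType and the DEMO_ constants are
hypothetical names); they assume the same three fields that the
addKeyword helper later in this article uses, namely keyword,
keywordType and next. Real code would include extractor.h and use the
library's own types instead.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in declarations (an assumption, not the real header): they mirror
   the keyword, keywordType and next fields used in this article. */
typedef enum { DEMO_UNKNOWN, DEMO_MIMETYPE, DEMO_TITLE } DemoKeywordType;

typedef struct DemoKeywords {
  const char * keyword;
  DemoKeywordType keywordType;
  struct DemoKeywords * next;
} DemoKeywordList;

/* Return the first keyword with the given type, or NULL if none exists. */
static const char * findFirstKeyword(const DemoKeywordList * list,
                                     DemoKeywordType type) {
  for (; list != NULL; list = list->next)
    if (list->keywordType == type)
      return list->keyword;
  return NULL;
}

/* Count all entries in the list. */
static unsigned int countKeywords(const DemoKeywordList * list) {
  unsigned int count = 0;
  for (; list != NULL; list = list->next)
    count++;
  return count;
}

/* Build a two-element list by hand and query it. */
static void demo(void) {
  DemoKeywordList title = { "Linux Journal", DEMO_TITLE, NULL };
  DemoKeywordList mime = { "text/html", DEMO_MIMETYPE, &title };
  const char * found = findFirstKeyword(&mime, DEMO_TITLE);

  assert(countKeywords(&mime) == 2);
  assert(found != NULL && strcmp(found, "Linux Journal") == 0);
  printf("title - %s\n", found);
}
```

Because the list is singly linked, every lookup is a linear scan; for
the handful of keywords a typical file yields, that is unlikely to
matter.
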
Java programmers should note that a Java class that uses JNI to
communicate with libextractor is also available.


[ WRITING PLUGINS ]

The most complicated part of writing a new plugin for libextractor
is writing the actual parser for a specific format.
Nevertheless, the basic pattern is always the same. The plugin
library must be called libextractor_XXX.so, where XXX denotes the file
format of the plugin. The library must export a function
libextractor_XXX_extract with the following signature:

  struct EXTRACTOR_Keywords *
  libextractor_XXX_extract
    (char * filename,
     char * data,
     size_t size,
     struct EXTRACTOR_Keywords * prev);

The argument filename specifies the name of the file that is being
processed. data is a pointer to the (typically mmapped) contents of
the file, and size is the file size. Most plugins do not make use of
the filename and just parse data directly, starting by
verifying that the header of the data matches the specific format.
prev is the list of keywords that have been extracted so far by other
plugins for the file. The function is expected to return an updated
list of keywords. If the format does not match the expectations of
the plugin, prev is returned. Most plugins use a function like
addKeyword to extend the list:

  static void addKeyword
    (struct EXTRACTOR_Keywords ** list,
     char * keyword,
     EXTRACTOR_KeywordType type)
  {
    EXTRACTOR_KeywordList * next;

    next = malloc(sizeof(EXTRACTOR_KeywordList));
    if (next == NULL)
      return;                /* out of memory: leave the list unchanged */
    next->next = *list;      /* prepend the new entry */
    next->keyword = keyword; /* the list now owns the string */
    next->keywordType = type;
    *list = next;
  }

A typical use of addKeyword is to add the mime-type once the file format
has been established.
For example, the JPEG extractor checks the first
bytes of the JPEG header and then either aborts or claims the file to be
a JPEG:

  /* data is a char pointer, so compare as unsigned char; with a
     signed char, data[0] could never equal 0xFF */
  if ( (size < 2) ||
       ((unsigned char) data[0] != 0xFF) ||
       ((unsigned char) data[1] != 0xD8) )
    return prev; /* not a JPEG */
  addKeyword(&prev,
             strdup("image/jpeg"),
             EXTRACTOR_MIMETYPE);
  /* ... more parsing code here ... */
  return prev;

Note that the strdup here is important since the string will be
deallocated later, typically in EXTRACTOR_freeKeywords(). A list of
supported keyword classifications (EXTRACTOR_XXXX) can be found in the
extractor.h header file [18].


[ CONCLUSION ]

libextractor is a simple, extensible C library for obtaining meta-data
from documents. Its plugin architecture and broad support for formats
set it apart from format-specific tools. The design is limited by the
fact that libextractor cannot be used to update meta-data, which more
specialized tools typically support.


[ REFERENCES ]

[1] http://ovmj.org/libextractor/
[2] http://getid3.sf.net/
[3] http://evidence.sf.net/
[4] http://ovmj.org/GNUnet/
[5] http://www.wotsit.org/
[6] http://dublincore.org/documents/dcmi-terms/
[7] http://freshmeat.net/projects/aviinfo/
[8] http://freshmeat.net/projects/id3edit/
[9] http://freshmeat.net/projects/jpeginfo/
[10] http://freshmeat.net/projects/vocoditor/
[11] http://freshmeat.net/projects/file/
[12] http://dmoz.org/Computers/Software/Typesetting/TeX/BibTeX/
[13] http://enlightenment.org/
[14] http://www.gnu.org/licenses/gpl.html
[15] http://packages.debian.org/extract
[16] http://packages.debian.org/libextractor0-devel
[17] http://ovmj.org/GNUnet/download/bloomfilter.ps
[18] http://ovmj.org/libextractor/doxygen/html/extractor_8h-source.html