\input texinfo                  @c -*- Texinfo -*-
@c % The structure of this document is based on the
@c % Texinfo manual from libgcrypt by Werner Koch
@c % and Moritz Schulte.
@c %**start of header
@setfilename libextractor.info
@include version.texi
@settitle The GNU libextractor Reference Manual
@c Unify all the indices into concept index.
@syncodeindex fn cp
@syncodeindex vr cp
@syncodeindex ky cp
@syncodeindex pg cp
@syncodeindex tp cp
@c %**end of header
@copying
This manual is for GNU libextractor
(version @value{VERSION}, @value{UPDATED}), a library for metadata
extraction.

Copyright @copyright{} 2007, 2010, 2012 Christian Grothoff

@quotation
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
Texts.  A copy of the license is included in the section entitled ``GNU
Free Documentation License''.
@end quotation
@end copying

@dircategory Software libraries
@direntry
* Libextractor: (libextractor).       Metadata extraction library.
@end direntry



@c
@c Titlepage
@c
@titlepage
@title The GNU libextractor Reference Manual
@subtitle Version @value{VERSION}
@subtitle @value{UPDATED}
@author Christian Grothoff (@email{christian@@grothoff.org})

@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage

@summarycontents
@contents


@ifnottex
@node Top
@top The GNU libextractor Reference Manual
@insertcopying
@end ifnottex

@menu
* Introduction::                 What is GNU libextractor.
* Preparation::                  What you should do before using the library.
* Generalities::                 General library functions and data types.
* Extracting meta data::         How to use GNU libextractor to obtain meta data.
* Language bindings::            How to use GNU libextractor from languages other than C.
* Utility functions::            Utility functions of GNU libextractor.
* Existing Plugins::             What plugins are available.
* Writing new Plugins::          How to write new plugins for GNU libextractor.
* Internal utility functions::   Utility functions of GNU libextractor for writing plugins.
* Reporting bugs::               How to report bugs or request new features.

Appendices

* GNU Free Documentation License::  Copying this manual.

Indices

* Index::                        Index
@c * Function and Data Index::      Index of functions, variables and data types.
@c * Type Index::                   Index of data types.

@end menu



@c **********************************************************
@c *******************  Introduction  ***********************
@c **********************************************************
@node Introduction
@chapter Introduction

@cindex error handling
GNU libextractor is GNU's library for extracting meta data from
files.  Meta data includes format information (such as mime type,
image dimensions, color depth, recording frequency), content
descriptions (such as document title or document description) and
copyright information (such as license, author and contributors).
Meta data extraction is an inherently uncertain business: a parse
error can be caused by a corrupt file, an incompatibility in the file
format version, an entirely different file format or a bug in the
parser.  As a result of this uncertainty, GNU libextractor deliberately
avoids ever reporting any errors.  Unexpected file contents simply
result in less or possibly no meta data being extracted.

@cindex plugin
GNU libextractor uses plugins to handle various file formats.
Technically a plugin can support multiple file formats; however, most
plugins only support one particular format.
By default,
GNU libextractor will use all plugins that are available and found
in the plugin installation directory.  Applications can
request the use of only specific plugins or the exclusion of
certain plugins.

GNU libextractor is distributed with the @command{extract}
command@footnote{Some distributions ship @command{extract} in a
separate package.} which is a command-line tool for extracting
meta data.  @command{extract} is given a list of filenames and
prints the resulting meta data to the console.  The @command{extract}
source code also serves as an advanced example for how to use
GNU libextractor.

This manual focuses on providing documentation for writing software
with GNU libextractor.  The only relevant parts for end-users
are the chapter on compiling and installing GNU libextractor
(@pxref{Preparation}).  The chapter on existing plugins may also be of
interest (@pxref{Existing Plugins}).  Additional documentation for
end-users can be found in the man page for @command{extract} (using
@verb{|man extract|}).

@cindex license
GNU libextractor is licensed under the GNU General Public License.
Specifically, since version 0.7, GNU libextractor is licensed under GPLv3
@emph{or any later version}.

@node Preparation
@chapter Preparation

This chapter first describes the general build instructions that
should apply to all systems.  Specific instructions for known problems
on particular platforms are then described in individual sections
afterwards.

Compiling GNU libextractor follows the standard GNU autotools build process
using @command{configure} and @command{make}.  For details on the GNU
autotools build process, read the @file{INSTALL} file and query
@verb{|./configure --help|} for additional options.

GNU libextractor has various dependencies, most of which are optional.
Instead of specifying the names of the software packages, we
will give the list in terms of the names of the respective
Debian (wheezy) packages that should be installed.

You absolutely need:

@itemize @bullet
@item
libtool
@item
gcc
@item
make
@item
g++
@item
libltdl7-dev
@end itemize

Recommended dependencies are:
@itemize @bullet
@item
zlib1g-dev
@item
libbz2-dev
@item
libgif-dev
@item
libvorbis-dev
@item
libflac-dev
@item
libmpeg2-4-dev
@item
librpm-dev
@item
libgtk2.0-dev or libgtk3.0-dev
@item
libgsf-1-dev
@item
libqt4-dev
@item
libpoppler-dev
@item
libexiv2-dev
@item
libavformat-dev
@item
libswscale-dev
@item
libgstreamer1.0-dev
@end itemize

For Subversion access and compilation one also needs:
@itemize @bullet
@item
subversion
@item
autoconf
@item
automake
@end itemize

Please notify us if we missed some dependencies (note that the list is
supposed to only list direct dependencies, not transitive
dependencies).

Once you have compiled and installed GNU libextractor, you should have a file
@file{extractor.h} installed in your @file{include/} directory.  This
file should be the starting point for your C and C++ development with
GNU libextractor.  The build process also installs the @file{extract} binary and
man pages for @file{extract} and GNU libextractor.  The @file{extract} man page
documents the @file{extract} tool.  The GNU libextractor man page gives a brief
summary of the C API for GNU libextractor.

@cindex packaging
@cindex directory structure
@cindex plugin
@cindex environment variables
@vindex LIBEXTRACTOR_PREFIX
When you install GNU libextractor, various plugins will be
installed in the @file{lib/libextractor/} directory.
The main library
will be installed as @file{lib/libextractor.so}.  Note that
GNU libextractor will attempt to find the plugins relative to the
path of the main library.  Consequently, a package manager can move
the library and its plugins to a different location later, as long
as the relative path between the main library and the plugins is
preserved.  As a method of last resort, the user can specify an
environment variable @verb{|LIBEXTRACTOR_PREFIX|}.  If
GNU libextractor cannot locate a plugin, it will look in
@verb{|LIBEXTRACTOR_PREFIX/lib/libextractor/|}.


@section Installation on GNU/Linux

Installation should work using the standard instructions without problems.


@section Installation on FreeBSD

Installation should work using the standard instructions without problems.


@section Installation on OpenBSD

OpenBSD 3.8 does not have CODESET in @file{langinfo.h}.  CODESET
is used in GNU libextractor in about three places.  This causes problems
during compilation.


@section Installation on NetBSD

No reports so far.


@section Installation using MinGW

Linking -lstdc++ with the provided libtool fails on Cygwin.  This
is a problem with libtool: there is unfortunately no flag to tell
libtool how to do its job on Cygwin, and it seems that the library
check cannot be set to 'pass_all' by default.  Patching libtool may
help.

Note: this is a rather dated report and may no longer apply.


@section Installation on OS X

libextractor has two installation methods on Mac OS X: it can be
installed as a Mac OS X framework or with the standard
@command{./configure; make; make install} shell commands.
The
framework package is self-contained, but currently omits some of the
extractor plugins that can be compiled in if libextractor is installed
with @command{./configure; make; make install} (provided that the
required dependencies exist).

@subsection Installing and uninstalling the framework

The binary framework is distributed as a disk image (@file{Extractor-x.x.xx.dmg}).
Installation is done by opening the disk image and clicking @file{Extractor.pkg}
inside it.  The Mac OS X installer application will then run.  The framework
is installed to the root volume's @file{/Library/Frameworks} folder and installing
will require admin privileges.

The framework can be uninstalled by dragging @*
@file{/Library/Frameworks/Extractor.framework} to the @file{Trash}.


@subsection Using the framework

In the framework, the @command{extract} command line tool can be found at @*
@file{/Library/Frameworks/Extractor.framework/Versions/Current/bin/extract}

The framework can be used in software projects as a framework or as a dynamic
library.

When using the framework as a dynamic library in projects using autotools,
one would most likely want to add @*
"-I/Library/Frameworks/Extractor.framework/Versions/Current/include"
to CPPFLAGS and @*
"-L/Library/Frameworks/Extractor.framework/Versions/Current/lib"
to LDFLAGS.


@subsection Example for using the framework

@example
@verbatim
// hello.c
#include <Extractor/extractor.h>

int
main (int argc, char **argv)
{
  struct EXTRACTOR_PluginList *el;
  el = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  // ...
  EXTRACTOR_plugin_remove_all (el);
  return 0;
}
@end verbatim
@end example

You can then compile the example using

@verbatim
$ gcc -o hello hello.c -framework Extractor
@end verbatim

@subsection Example for using the dynamic library

@example
@verbatim
// hello.c
#include <extractor.h>

int
main (void)
{
  struct EXTRACTOR_PluginList *el;
  el = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  // ...
  EXTRACTOR_plugin_remove_all (el);
  return 0;
}
@end verbatim
@end example

You can then compile the example using

@verbatim
$ gcc -I/Library/Frameworks/Extractor.framework/Versions/Current/include \
  -o hello hello.c \
  -L/Library/Frameworks/Extractor.framework/Versions/Current/lib \
  -lextractor
@end verbatim

Notice the difference in the @code{#include} line.


@section Note to package maintainers

The suggested way to package GNU libextractor is to split it into
roughly the following binary packages:

@itemize @bullet
@item
libextractor (main library only, only hard dependency for other packages depending on GNU libextractor)
@item
extract (command-line tool and man page extract.1)
@item
libextractor-dev (extractor.h header and man page libextractor.3)
@item
libextractor-doc (this manual)
@item
libextractor-plugins (plugins without external dependencies; recommended but not required by extract and libextractor package)
@item
libextractor-plugin-XXX (plugin with dependency on libXXX, for example for XXX=mpeg this would be @file{libextractor_mpeg.so})
@item
libextractor-plugins-all (meta package that requires all plugins except experimental plugins)
@end itemize

This would enable minimal installations (e.g.
for embedded systems) to
not include any plugins, as well as moderate-size installations (that
do not pull in GTK and X11) for systems that have limited resources.
Right now, the MP4 plugin is experimental and does nothing and should
thus never be included at all.  The gstreamer plugin is experimental
but largely works with the correct version of gstreamer and can thus
be packaged (especially if the dependency is available on the target
system) but should probably not be part of libextractor-plugins-all.


@node Generalities
@chapter Generalities

@section Introduction to the ``extract'' command

The @command{extract} command takes a list of file names as arguments,
extracts meta data from each of those files and prints the result to
the console.  By default, @command{extract} will use all available
plugins and print all (non-binary) meta data that is found.

The set of plugins used by @command{extract} can be controlled using
the ``-l'' and ``-n'' options.  Use ``-n'' to not load all of the
default plugins.  Use ``-l NAME'' to specifically load a certain
plugin.  For example, specify ``-n -l mime'' to only use the MIME
plugin.

Using the ``-p'' option the output of @command{extract} can be limited
to only certain keyword types.  Similarly, using the ``-x'' option,
certain keyword types can be excluded.  A list of all known keyword
types can be obtained using the ``-L'' option.

The output format of @command{extract} can be influenced with the
``-V'' (more verbose, lists filenames), ``-g'' (grep-friendly, all
meta data on a single line per file) and ``-b'' (bibTeX style)
options.

@section Common usage examples for ``extract''

@example
$ extract test/test.jpg
comment - (C) 2001 by Christian Grothoff, using gimp 1.2 1
mimetype - image/jpeg

$ extract -V -x comment test/test.jpg
Keywords for file test/test.jpg:
mimetype - image/jpeg

$ extract -p comment test/test.jpg
comment - (C) 2001 by Christian Grothoff, using gimp 1.2 1

$ extract -nV -l png.so -p comment test/test.jpg test/test.png
Keywords for file test/test.jpg:
Keywords for file test/test.png:
comment - Testing keyword extraction
@end example


@section Introduction to the libextractor library

Each public symbol exported by GNU libextractor has the prefix
@verb{|EXTRACTOR_|}.  All-caps names are used for constants.  For the
impatient, the minimal C code for using GNU libextractor (on the
file named by the first command-line argument) looks like this:

@verbatim
#include <extractor.h>

int
main (int argc, char ** argv)
{
  struct EXTRACTOR_PluginList *plugins
    = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  EXTRACTOR_extract (plugins, argv[1],
                     NULL, 0,
                     &EXTRACTOR_meta_data_print, stdout);
  EXTRACTOR_plugin_remove_all (plugins);
  return 0;
}
@end verbatim

The minimal API illustrated by this example is actually sufficient for
many applications.  The full external C API of GNU libextractor is described
in @ref{Extracting meta data}.  Bindings for other languages
are described in @ref{Language bindings}.  The API for
writing new plugins is described in @ref{Writing new Plugins}.

Note that it is possible for GNU libextractor to encounter a @code{SIGPIPE}
during its execution.  GNU libextractor (being a library, which as such
should not interfere with your main application) does NOT install a
signal handler for @code{SIGPIPE}.
You thus need to install a signal
handler (or at least tell your system to ignore @code{SIGPIPE}) if you
want to avoid unexpected problems during calls to GNU libextractor.
@cindex SIGPIPE

@node Extracting meta data
@chapter Extracting meta data

In order to extract meta data with GNU libextractor you first need to
load the respective plugins and then call the extraction API
with the plugins and the data to process.  This chapter
documents how to load and unload plugins, the various types
and formats in which meta data is returned to the application
and finally the extraction API itself.

@menu
* Plugin management::      How to load and unload plugins
* Meta types::             About meta types
* Meta formats::           About meta formats
* Extracting::             How to use the extraction API
@end menu


@node Plugin management
@section Plugin management

@cindex reentrant
@cindex concurrency
@cindex threads
@cindex thread-safety
@tindex enum EXTRACTOR_Options

Using GNU libextractor from a multi-threaded parent process requires some
care.  The problem is that on most platforms GNU libextractor starts
sub-processes for the actual extraction work.  This is useful to
isolate the parent process from potential bugs; however, it can cause
problems if the parent process is multi-threaded.  The issue is that
at the time of the fork, another thread of the application may hold a
lock (e.g. in gettext or libc).  That lock would then never be
released in the child process (as the other thread is not present in
the child process).  As a result, the child process would then
deadlock on trying to acquire the lock and never terminate.  This has
actually been observed with a lock in GNU gettext that is triggered by
the plugin startup code when it interacts with libltdl.

The problem can be solved by loading the plugins using the
@code{EXTRACTOR_OPTION_IN_PROCESS} option, which will run GNU libextractor
in-process and thus avoid the locking issue.  In this case, all of the
functions for loading and unloading plugins, including
@verb{|EXTRACTOR_plugin_add_defaults|} and
@verb{|EXTRACTOR_plugin_remove_all|}, are thread-safe and reentrant.
However, using the same plugin list from multiple threads at the same
time is not safe.

All plugin code is expected to be reentrant and state-less,
but due to the extensive use of 3rd party libraries this cannot
be guaranteed.


@deftp {C Struct} EXTRACTOR_PluginList
@tindex struct EXTRACTOR_PluginList

A plugin list represents a set of GNU libextractor plugins.  Most of
the GNU libextractor API is concerned with either constructing a
plugin list or using it to extract meta data.  The internal representation
of the plugin list is of no concern to users or plugin developers.
@end deftp


@deftypefun void EXTRACTOR_plugin_remove_all (struct EXTRACTOR_PluginList *plugins)
@findex EXTRACTOR_plugin_remove_all

Unload all of the plugins in the given list.
@end deftypefun

@deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_remove (struct EXTRACTOR_PluginList *plugins, const char *name)
@findex EXTRACTOR_plugin_remove

Unloads a particular plugin.  The given name should be the short name
of the plugin, for example ``mime'' for the mime-type extractor or
``mpeg'' for the MPEG extractor.
@end deftypefun

@deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add (struct EXTRACTOR_PluginList *plugins, const char *name, const char *options, enum EXTRACTOR_Options flags)
@findex EXTRACTOR_plugin_add

Loads a particular plugin.  The plugin is added to the existing list, which can be @code{NULL}.
The second argument specifies the name of the plugin (e.g. ``ogg'').
The third argument can be @code{NULL} and specifies plugin-specific
options.  Finally, the last argument specifies if the plugin should be
executed out-of-process (@code{EXTRACTOR_OPTION_DEFAULT_POLICY}) or not.
@end deftypefun

@deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add_config (struct EXTRACTOR_PluginList *plugins, const char *config, enum EXTRACTOR_Options flags)
@findex EXTRACTOR_plugin_add_config

Loads and unloads plugins based on a configuration string, modifying
the existing list, which can be @code{NULL}.  The string has the
format ``[-]NAME(OPTIONS)@{:[-]NAME(OPTIONS)@}*''.  Prefixing the
plugin name with a ``-'' means that the plugin should be unloaded.
@end deftypefun

@deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add_defaults (enum EXTRACTOR_Options flags)
@findex EXTRACTOR_plugin_add_defaults

Loads all of the plugins in the plugin directory.  This function is
what most GNU libextractor applications should use to set up the plugins.
@end deftypefun



@node Meta types
@section Meta types


@tindex enum EXTRACTOR_MetaType
@findex EXTRACTOR_metatype_get_max

@verb{|enum EXTRACTOR_MetaType|} is a C enum which defines a list of
over 100 different types of meta data.  The total number can differ
between different GNU libextractor releases; the maximum value for the
current release can be obtained using the
@verb{|EXTRACTOR_metatype_get_max|} function.  All values in this
enumeration are of the form @verb{|EXTRACTOR_METATYPE_XXX|}.

@deftypefun {const char *} EXTRACTOR_metatype_to_string (enum EXTRACTOR_MetaType type)
@findex EXTRACTOR_metatype_to_string
@cindex gettext
@cindex internationalization

The function @verb{|EXTRACTOR_metatype_to_string|} can be used to
obtain a short English string @samp{s} describing the meta data type.
The string can be translated into other languages using GNU gettext
with the domain set to GNU libextractor
(@verb{|dgettext("libextractor", s)|}).
@end deftypefun

@deftypefun {const char *} EXTRACTOR_metatype_to_description (enum EXTRACTOR_MetaType type)
@findex EXTRACTOR_metatype_to_description
@cindex gettext
@cindex internationalization

The function @verb{|EXTRACTOR_metatype_to_description|} can be used to
obtain a longer English string @samp{s} describing the meta data type.
The description may be empty if the short description returned by
@code{EXTRACTOR_metatype_to_string} is already comprehensive.  The
string can be translated into other languages using GNU gettext with
the domain set to GNU libextractor
(@verb{|dgettext("libextractor", s)|}).
@end deftypefun



@node Meta formats
@section Meta formats

@tindex enum EXTRACTOR_MetaFormat

@verb{|enum EXTRACTOR_MetaFormat|} is a C enum which defines on a high
level how the extracted meta data is represented.  Currently, the
library uses three formats: UTF-8 strings, C strings and binary data.
A fourth value, @code{EXTRACTOR_METAFORMAT_UNKNOWN}, is defined but not
used.

UTF-8 strings are 0-terminated strings that have been converted to
UTF-8.  The format code is @code{EXTRACTOR_METAFORMAT_UTF8}.  Ideally,
most text meta data will be of this format.

Some file formats fail to specify the encoding used for the text.  In
this case, the text cannot be converted to UTF-8.  However, the meta
data is still known to be 0-terminated and presumably human-readable.
The format code used for such data is
@code{EXTRACTOR_METAFORMAT_C_STRING}; however, this should not be
understood to mean that the encoding is the same as that used by the C
compiler.

Finally, for binary data (mostly images), the format
@code{EXTRACTOR_METAFORMAT_BINARY} is used.

Naturally this is not a precise description of the meta format.
Plugins can provide a more precise description (if known) by providing
the respective mime type of the meta data.  For example, binary image
meta data could be also tagged as ``image/png'' and normal text would
typically be tagged as ``text/plain''.



@node Extracting
@section Extracting

@deftypefn {Function Pointer} int (*EXTRACTOR_MetaDataProcessor)(void *cls, const char *plugin_name, enum EXTRACTOR_MetaType type, enum EXTRACTOR_MetaFormat format, const char *data_mime_type, const char *data, size_t data_len)
@tindex EXTRACTOR_MetaDataProcessor

Type of a function that libextractor calls for each meta data item found.

@table @var

@item cls
closure (user-defined)

@item plugin_name
name of the plugin that produced this value; special values can be
used (e.g. '<zlib>' for zlib being used in the main libextractor
library and yielding meta data);

@item type
libextractor-type describing the meta data;

@item format
basic format information about data

@item data_mime_type
mime-type of data (not of the original file); can be @code{NULL} (if mime-type is not known);

@item data
actual meta-data found

@item data_len
number of bytes in data

@end table

Return 0 to continue extracting, 1 to abort.
@end deftypefn



@deftypefun void EXTRACTOR_extract (struct EXTRACTOR_PluginList *plugins, const char *filename, const void *data, size_t size, EXTRACTOR_MetaDataProcessor proc, void *proc_cls)
@findex EXTRACTOR_extract
@cindex reentrant
@cindex concurrency
@cindex threads
@cindex thread-safety

This is the main function for extracting keywords with GNU
libextractor.  The first argument is a plugin list which specifies the
set of plugins that should be used for extracting meta data.  The
@samp{filename} argument is optional and can be used to specify the
name of a file to process.
If @samp{filename} is @code{NULL}, then the @samp{data} argument must
point to the in-memory data to extract meta data from.  If
@samp{filename} is non-@code{NULL}, @samp{data} can be @code{NULL}.
If @samp{data} is non-@code{NULL}, then @samp{size} is the size of
@samp{data} in bytes.  Otherwise @samp{size} should be zero.  For each
meta data item found, GNU libextractor will call the @samp{proc}
function, passing @samp{proc_cls} as the first argument to
@samp{proc}.  The other arguments to @samp{proc} depend on the
specific meta data found.

@cindex SIGBUS
@cindex bus error
Meta data extraction should never really fail; at worst, GNU
libextractor should not call @samp{proc} with any meta data.  By
design, GNU libextractor should never crash or leak memory, even given
corrupt files as input.  Note, however, that running GNU libextractor
on a corrupt file system (or incorrectly @verb{|mmap|}ed files) can
result in the operating system sending a SIGBUS (bus error) to the
process.  As GNU libextractor typically runs plugins out-of-process,
it first maps the file into memory and then attempts to decompress it.
During decompression it is possible to encounter a SIGBUS.  GNU
libextractor will @emph{not} attempt to catch this signal and your
application is likely to crash.  Note again that this should only
happen if the file @emph{system} is corrupt (not if individual files
are corrupt).  If this is not acceptable, you might want to consider
running GNU libextractor itself also out-of-process (as done, for
example, by @url{http://grothoff.org/christian/doodle/,doodle}).

@end deftypefun


@node Language bindings
@chapter Language bindings
@cindex Java
@cindex Mono
@cindex Perl
@cindex Python
@cindex PHP
@cindex Ruby

GNU libextractor works immediately with C and C++ code.  Bindings for
Java, Mono, Ruby, Perl, PHP and Python are available for download from
the main GNU libextractor website.
Documentation for these bindings (if available) is part of the
downloads for the respective binding.  In all cases, a full
installation of the C library is required before the binding can be
installed.

@section Java

Compiling the GNU libextractor Java binding follows the usual process of
running @command{configure} and @command{make}.  The result will be a
shared C library @file{libextractor_java.so} with the native code and
a JAR file (installed to @file{$PREFIX/share/java/libextractor.jar}).

A minimal example for using GNU libextractor's Java binding would look
like this:
@verbatim
import org.gnu.libextractor.*;
import java.util.ArrayList;

// The class name is arbitrary.
public class ExtractExample {
  public static void main(String[] args) {
    Extractor ex = Extractor.getDefault();
    for (int i=0;i<args.length;i++) {
      ArrayList keywords = ex.extract(args[i]);
      System.out.println("Keywords for " + args[i] + ":");
      for (int j=0;j<keywords.size();j++)
        System.out.println(keywords.get(j));
    }
  }
}
@end verbatim

The GNU libextractor library and the @file{libextractor_java.so} JNI binding
have to be in the library search path for this to work.  Furthermore, the
@file{libextractor.jar} file should be on the classpath.

Note that the API does not use Java 5 style generics in order to work
with older versions of Java.

@section Mono

This binding is undocumented at this point.

@section Perl

This binding is undocumented at this point.

@section Python

This binding is undocumented at this point.

@section PHP

This binding is undocumented at this point.

@section Ruby

This binding is undocumented at this point.



@node Utility functions
@chapter Utility functions

@cindex reentrant
@cindex concurrency
@cindex threads
@cindex thread-safety
This chapter describes various utility functions for GNU libextractor usage.
All of the functions are reentrant.

@menu
* Utility Constants::
* Meta data printing::
@end menu

@node Utility Constants
@section Utility Constants

@findex EXTRACTOR_VERSION
The constant @verb{|EXTRACTOR_VERSION|} is a hexadecimal
representation of the version number of the installed libextractor
header.  The hexadecimal format is 0xAABBCCDD where AA is the major
version (so far always 0), BB is the minor version, CC is the revision
and DD the patch number.  For example, for version 0.5.18, we would
have AA=0, BB=5, CC=18 and DD=0.  Minor releases such as 0.5.18a or
significant changes in unreleased versions would be marked with DD=1
or higher.


@node Meta data printing
@section Meta data printing


@findex EXTRACTOR_meta_data_print
@verb{|EXTRACTOR_meta_data_print|} is a simple function which prints
the meta data found with libextractor to a file.  The function is
mostly useful for debugging and as an example for how to process
meta data items.  It can be passed as the @samp{proc} argument to
@code{EXTRACTOR_extract}.  The file to print to should be passed as
@samp{proc_cls} (which must be of type @code{FILE *}), for example
@code{stdout}.
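
Applications that need different output can supply their own
processor in place of @verb{|EXTRACTOR_meta_data_print|}.  The
following is only a sketch: the names @code{PrintState},
@code{print_text_item} and @code{print_text_meta_data} are
hypothetical and not part of GNU libextractor, and the code assumes
that @samp{data_len} includes the terminating 0 byte for string
formats (as in the plugin example in this manual).  It prints text
items only and counts how many were printed.

@example
@verbatim
#include <stdio.h>
#include <extractor.h>

/* Hypothetical closure: where to print and how many items were printed. */
struct PrintState
{
  FILE *out;
  unsigned int printed;
};

/* Has the signature of EXTRACTOR_MetaDataProcessor. */
static int
print_text_item (void *cls,
                 const char *plugin_name,
                 enum EXTRACTOR_MetaType type,
                 enum EXTRACTOR_MetaFormat format,
                 const char *data_mime_type,
                 const char *data,
                 size_t data_len)
{
  struct PrintState *state = cls;
  const char *name;

  (void) data_mime_type; /* unused in this sketch */
  if ( (EXTRACTOR_METAFORMAT_UTF8 != format) &&
       (EXTRACTOR_METAFORMAT_C_STRING != format) )
    return 0; /* skip binary data, continue extracting */
  if ( (0 == data_len) || ('\0' != data[data_len - 1]) )
    return 0; /* not 0-terminated: do not treat as a string */
  name = EXTRACTOR_metatype_to_string (type);
  if (NULL == name)
    name = "unknown";
  fprintf (state->out, "%s - %s (from %s)\n", name, data, plugin_name);
  state->printed++;
  return 0; /* 0 means: continue extracting */
}

/* Run all default plugins on 'filename'; returns number of items printed. */
unsigned int
print_text_meta_data (const char *filename, FILE *out)
{
  struct PrintState state = { out, 0 };
  struct EXTRACTOR_PluginList *plugins
    = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);

  EXTRACTOR_extract (plugins, filename, NULL, 0,
                     &print_text_item, &state);
  EXTRACTOR_plugin_remove_all (plugins);
  return state.printed;
}
@end verbatim
@end example

A caller would invoke @code{print_text_meta_data (argv[1], stdout)}
and link with @code{-lextractor}.  Returning a non-zero value from the
processor instead would ask GNU libextractor to abort the extraction.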



@node Existing Plugins
@chapter Existing Plugins

@itemize @bullet
@item
ARCHIVE (using libarchive)
@item
DVI
@item
EXIV2 (using libexiv2, 0.23 or later preferred)
@item
FLAC (using libFLAC)
@item
GIF (using libgif)
@item
GSTREAMER (using libgstreamer v1.0 or later)
@item
HTML (using libtidy)
@item
IT
@item
JPEG (using libjpeg v8 or later)
@item
MAN
@item
MIDI (using libsmf)
@item
MIME (using libmagic)
@item
MPEG (using libmpeg2)
@item
NSF
@item
NSFE
@item
ODF
@item
OLE2 (with libgsf)
@item
OGG (with libogg)
@item
PNG
@item
PS
@item
RIFF
@item
RPM (using librpm)
@item
S3M
@item
SID
@item
ThumbnailFFMPEG (using libavformat and related libav-libraries, including libswscale)
@item
ThumbnailGtk (using libgtk)
@item
TIFF (with libtiff, tested with v4)
@item
WAV
@item
XM
@item
ZIP
@end itemize

@file{gzip} and @file{bzip2} compressed versions of these formats are
also supported (as well as meta data embedded by @file{gzip} itself)
if zlib or libbz2 are available.

@node Writing new Plugins
@chapter Writing new Plugins

Writing a new plugin for libextractor usually requires writing or
interfacing with an actual parser for a specific format.  How this
can be accomplished depends on the format and cannot be specified in
general.  However, care should be taken for the code to be reentrant
and highly fault-tolerant, especially with respect to malformed
inputs.

Plugins should start by verifying that the header of the data matches
the specific format and immediately return if that is not the case.
Even if the header matches the expected file format, plugins must not
assume that the remainder of the file is well formed.
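
The defensive style described above can be sketched in plain C,
independent of the plugin API.  The example below is an illustration
only and is not taken from any GNU libextractor plugin; the function
names @code{read_be32} and @code{png_dimensions} are hypothetical.  It
checks the PNG signature first and then validates every length before
reading the image dimensions from the leading IHDR chunk.

@example
@verbatim
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a big-endian 32-bit integer; the caller has checked the bounds. */
static uint32_t
read_be32 (const unsigned char *p)
{
  return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16)
    | ((uint32_t) p[2] << 8) | (uint32_t) p[3];
}

/* Returns 1 and sets width/height if 'buf' starts with a plausible
   PNG signature and IHDR chunk; returns 0 otherwise. */
int
png_dimensions (const unsigned char *buf, size_t len,
                uint32_t *width, uint32_t *height)
{
  static const unsigned char magic[8]
    = { 0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n' };
  uint32_t w;
  uint32_t h;

  if ( (NULL == buf) || (len < sizeof (magic)) )
    return 0; /* too small to hold the signature */
  if (0 != memcmp (buf, magic, sizeof (magic)))
    return 0; /* header mismatch: not our format, return at once */
  /* The header matched, but the rest may be truncated or corrupt.
     Needed: signature (8) + chunk length (4) + chunk type (4)
     + width (4) + height (4) = 24 bytes. */
  if (len < 24)
    return 0;
  if (13 != read_be32 (buf + 8))
    return 0; /* the IHDR chunk always has 13 data bytes */
  if (0 != memcmp (buf + 12, "IHDR", 4))
    return 0; /* first chunk must be IHDR */
  w = read_be32 (buf + 16);
  h = read_be32 (buf + 20);
  if ( (0 == w) || (0 == h) )
    return 0; /* zero dimensions are invalid */
  *width = w;
  *height = h;
  return 1;
}
@end verbatim
@end example

Each early return corresponds to one way in which the input can be
malformed; the output parameters are only written once all checks have
passed, so a caller never sees partially validated values.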

The plugin library must be called @file{libextractor_XXX.so}, where XXX
denotes the file format of the plugin.  The library must export a
method @verb{|EXTRACTOR_XXX_extract_method|}, with the following
signature:
@verbatim
void
EXTRACTOR_XXX_extract_method (struct EXTRACTOR_ExtractContext *ec);
@end verbatim

@samp{ec} contains various information the plugin may need for its
execution.  Most importantly, it contains functions for reading
(``read'') and seeking (``seek'') the input data and for returning
extracted data (``proc'').  The ``config'' member can contain
additional configuration options.  ``proc'' should be called on
each meta data item found.  If ``proc'' returns non-zero,
processing should be aborted (if possible).

In order to test new plugins, the @file{extract} command can be run
with the options ``-ni'' and ``-l XXX''.  This will run the plugin
in-process (making it easier to debug) and without any of the other
plugins.


@section Example for a minimal extract method

The following example shows how a plugin can return the mime type of
a file.
@example
@verbatim
#include <sys/types.h>
#include <string.h>
#include <extractor.h>

void
EXTRACTOR_mymime_extract_method (struct EXTRACTOR_ExtractContext *ec)
{
  void *data;
  ssize_t data_size;

  /* ask for the first four bytes of the input */
  if (-1 == (data_size = ec->read (ec->cls, &data, 4)))
    return; /* read error */
  if (data_size < 4)
    return; /* file too small */
  if (0 != memcmp (data, "\177ELF", 4))
    return; /* not ELF */
  /* "text/plain" describes the meta data value, not the input file */
  if (0 != ec->proc (ec->cls,
                     "mymime",
                     EXTRACTOR_METATYPE_MIMETYPE,
                     EXTRACTOR_METAFORMAT_UTF8,
                     "text/plain",
                     "application/x-executable",
                     1 + strlen ("application/x-executable")))
    return; /* non-zero result: stop processing */
  /* more calls to 'proc' here as needed */
}
@end verbatim
@end example


@node Internal utility functions
@chapter Internal utility functions

Some plugins link against the @code{libextractor_common} library which
provides common abstractions needed by many plugins.  This chapter
documents this internal API for plugin developers.  Note that the headers
for this library are (intentionally) not installed: we do not consider
this API stable and it should hence only be used by plugins that are
built and shipped with GNU libextractor.  Third-party plugins should
not use it.

@file{convert_numeric.h} defines various conversion functions for
numbers (in particular, byte-order conversion for floating point
numbers).

@file{unzip.h} defines an API for accessing compressed files.

@file{pack.h} provides an interpreter for unpacking structs of integer
numbers from streams and converting from big or little endian to host
byte order at the same time.

@file{convert.h} provides a function for character set conversion described
below.

@deftypefun {char *} EXTRACTOR_common_convert_to_utf8 (const char *input, size_t len, const char *charset)
@cindex UTF-8
@cindex character set
@findex EXTRACTOR_common_convert_to_utf8
Various GNU libextractor plugins make use of the internal
@file{convert.h} header which defines a function
@verb{|EXTRACTOR_common_convert_to_utf8|} which can be used to easily
convert text from any character set to UTF-8.  This conversion is
important since the linked list of keywords that is returned by GNU
libextractor is expected to contain only UTF-8 strings.  Naturally,
proper conversion may not always be possible since some file formats
fail to specify the character set.  In that case, it is often better
to not convert at all.

The arguments to @verb{|EXTRACTOR_common_convert_to_utf8|} are the input string (which
does @emph{not} have to be zero-terminated), the length of the input
string, and the character set (which @emph{must} be zero-terminated).
Which character sets are supported depends on the platform; a list can
generally be obtained using the @command{iconv -l} command.  The
return value from @verb{|EXTRACTOR_common_convert_to_utf8|} is a zero-terminated string
in UTF-8 format.  The responsibility to free the string is with the
caller, so storing the string in the keyword list is acceptable.
@end deftypefun





@node Reporting bugs
@chapter Reporting bugs

@cindex bug
GNU libextractor uses the @url{https://gnunet.org/bugs/,Mantis bug tracking
system}.  If possible, please report bugs there.  You can also e-mail
the GNU libextractor mailing list at @email{libextractor@@gnu.org}.



@c **********************************************************
@c *******************  Appendices  *************************
@c **********************************************************

@node GNU Free Documentation License
@appendix GNU Free Documentation License

@include fdl-1.3.texi


@node Index
@unnumbered Index

@printindex cp

@c @node Function and Data Index
@c @unnumbered Function and Data Index
@c @printindex fn

@c @node Type Index
@c @unnumbered Type Index
@c @printindex tp

@bye