libextractor.texi (36264B)
\input texinfo                  @c -*- Texinfo -*-
@c % The structure of this document is based on the
@c % Texinfo manual from libgcrypt by Werner Koch
@c % and Moritz Schulte.
@c %**start of header
@setfilename libextractor.info
@include version.texi
@settitle The GNU libextractor Reference Manual
@c Unify all the indices into concept index.
@syncodeindex fn cp
@syncodeindex vr cp
@syncodeindex ky cp
@syncodeindex pg cp
@syncodeindex tp cp
@c %**end of header
@copying
This manual is for GNU libextractor
(version @value{VERSION}, @value{UPDATED}), a library for metadata
extraction.

Copyright @copyright{} 2007, 2010, 2012 Christian Grothoff

@quotation
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
Texts.  A copy of the license is included in the section entitled ``GNU
Free Documentation License''.
@end quotation
@end copying

@dircategory Software libraries
@direntry
* Libextractor: (libextractor).  Metadata extraction library.
@end direntry



@c
@c Titlepage
@c
@titlepage
@title The GNU libextractor Reference Manual
@subtitle Version @value{VERSION}
@subtitle @value{UPDATED}
@author Christian Grothoff (@email{christian@@grothoff.org})

@page
@vskip 0pt plus 1filll
@insertcopying
@end titlepage

@summarycontents
@contents


@ifnottex
@node Top
@top The GNU libextractor Reference Manual
@insertcopying
@end ifnottex

@menu
* Introduction::                 What is GNU libextractor.
* Preparation::                  What you should do before using the library.
* Generalities::                 General library functions and data types.
* Extracting meta data::         How to use GNU libextractor to obtain meta data.
* Language bindings::            How to use GNU libextractor from languages other than C.
* Utility functions::            Utility functions of GNU libextractor.
* Existing Plugins::             What plugins are available.
* Writing new Plugins::          How to write new plugins for GNU libextractor.
* Internal utility functions::   Utility functions of GNU libextractor for writing plugins.
* Reporting bugs::               How to report bugs or request new features.

Appendices

* GNU Free Documentation License::  Copying this manual.

Indices

* Index::                        Index
@c * Function and Data Index::   Index of functions, variables and data types.
@c * Type Index::                Index of data types.

@end menu



@c **********************************************************
@c *******************  Introduction  ***********************
@c **********************************************************
@node Introduction
@chapter Introduction

@cindex error handling
GNU libextractor is GNU's library for extracting meta data from
files.  Meta data includes format information (such as mime type,
image dimensions, color depth, recording frequency), content
descriptions (such as document title or document description) and
copyright information (such as license, author and contributors).
Meta data extraction is an inherently uncertain business: a parse
error can indicate a corrupt file, an incompatibility in the file
format version, an entirely different file format or a bug in the
parser.  Because of this uncertainty, GNU libextractor deliberately
never reports errors.  Unexpected file contents simply
result in less or possibly no meta data being extracted.

@cindex plugin
GNU libextractor uses plugins to handle various file formats.
Technically a plugin can support multiple file formats; however, most
plugins only support one particular format.
By default,
GNU libextractor will use all plugins that are available and found
in the plugin installation directory.  Applications can
request the use of only specific plugins or the exclusion of
certain plugins.

GNU libextractor is distributed with the @command{extract}
command@footnote{Some distributions ship @command{extract} in a
separate package.} which is a command-line tool for extracting
meta data.  @command{extract} is given a list of filenames and
prints the resulting meta data to the console.  The @command{extract}
source code also serves as an advanced example for how to use
GNU libextractor.

This manual focuses on providing documentation for writing software
with GNU libextractor.  The only parts relevant for end-users
are the chapter on compiling and installing GNU libextractor
(@pxref{Preparation}) and, possibly, the chapter on existing plugins
(@pxref{Existing Plugins}).  Additional documentation for
end-users can be found in the man page for @command{extract} (using
@verb{|man extract|}).

@cindex license
GNU libextractor is licensed under the GNU General Public License;
specifically, since version 0.7, it is licensed under GPLv3
@emph{or any later version}.

@node Preparation
@chapter Preparation

This chapter first describes the general build instructions that
should apply to all systems.  Specific instructions for known problems
on particular platforms are then described in individual sections
afterwards.

Compiling GNU libextractor follows the standard GNU autotools build process
using @command{configure} and @command{make}.  For details on the GNU
autotools build process, read the @file{INSTALL} file and query
@verb{|./configure --help|} for additional options.

GNU libextractor has various dependencies, most of which are optional.
Instead of specifying the names of the software packages, we
will give the list in terms of the names of the respective
Debian (wheezy) packages that should be installed.

You absolutely need:

@itemize @bullet
@item
libtool
@item
gcc
@item
make
@item
g++
@item
libltdl7-dev
@end itemize

Recommended dependencies are:
@itemize @bullet
@item
zlib1g-dev
@item
libbz2-dev
@item
libgif-dev
@item
libvorbis-dev
@item
libflac-dev
@item
libmpeg2-4-dev
@item
librpm-dev
@item
libgtk2.0-dev or libgtk3.0-dev
@item
libgsf-1-dev
@item
libqt4-dev
@item
libpoppler-dev
@item
libexiv2-dev
@item
libavformat-dev
@item
libswscale-dev
@item
libgstreamer1.0-dev
@end itemize

For Subversion access and compilation one also needs:
@itemize @bullet
@item
subversion
@item
autoconf
@item
automake
@end itemize

Please notify us if we missed some dependencies (note that the list is
supposed to only list direct dependencies, not transitive
dependencies).

Once you have compiled and installed GNU libextractor, you should have a file
@file{extractor.h} installed in your @file{include/} directory.  This
file should be the starting point for your C and C++ development with
GNU libextractor.  The build process also installs the @file{extract} binary and
man pages for @file{extract} and GNU libextractor.  The @file{extract} man page
documents the @file{extract} tool.  The GNU libextractor man page gives a brief
summary of the C API for GNU libextractor.

@cindex packaging
@cindex directory structure
@cindex plugin
@cindex environment variables
@vindex LIBEXTRACTOR_PREFIX
When you install GNU libextractor, various plugins will be
installed in the @file{lib/libextractor/} directory.
The main library
will be installed as @file{lib/libextractor.so}.  Note that
GNU libextractor will attempt to find the plugins relative to the
path of the main library.  Consequently, a package manager can move
the library and its plugins to a different location later, as long
as the relative path between the main library and the plugins is
preserved.  As a method of last resort, the user can specify an
environment variable @verb{|LIBEXTRACTOR_PREFIX|}.  If
GNU libextractor cannot locate a plugin, it will look in
@verb{|LIBEXTRACTOR_PREFIX/lib/libextractor/|}.


@section Installation on GNU/Linux

Installation should work using the standard instructions without problems.


@section Installation on FreeBSD

Installation should work using the standard instructions without problems.


@section Installation on OpenBSD

OpenBSD 3.8 does not define CODESET in @file{langinfo.h}.  CODESET
is used in about three places in GNU libextractor, so this causes
problems during compilation.


@section Installation on NetBSD

No reports so far.


@section Installation using MinGW

Linking -lstdc++ with the provided libtool fails on Cygwin.  This
is a problem with libtool: unfortunately there is no flag to tell
libtool how to do its job on Cygwin, and it seems that the library
check cannot be set to 'pass_all' by default.  Patching libtool may
help.

Note: this is a rather dated report and may no longer apply.


@section Installation on OS X

libextractor has two installation methods on Mac OS X: it can be
installed as a Mac OS X framework or with the standard
@command{./configure; make; make install} shell commands.
The
framework package is self-contained, but currently omits some of the
extractor plugins that can be compiled in if libextractor is installed
with @command{./configure; make; make install} (provided that the
required dependencies exist).

@subsection Installing and uninstalling the framework

The binary framework is distributed as a disk image (@file{Extractor-x.x.xx.dmg}).
Installation is done by opening the disk image and clicking @file{Extractor.pkg}
inside it.  The Mac OS X installer application will then run.  The framework
is installed to the root volume's @file{/Library/Frameworks} folder and installing
will require admin privileges.

The framework can be uninstalled by dragging @*
@file{/Library/Frameworks/Extractor.framework} to the @file{Trash}.


@subsection Using the framework

In the framework, the @command{extract} command line tool can be found at @*
@file{/Library/Frameworks/Extractor.framework/Versions/Current/bin/extract}

The framework can be used in software projects as a framework or as a dynamic
library.

When using the framework as a dynamic library in projects using autotools,
one would most likely want to add @*
"-I/Library/Frameworks/Extractor.framework/Versions/Current/include"
to CPPFLAGS and @*
"-L/Library/Frameworks/Extractor.framework/Versions/Current/lib"
to LDFLAGS.


@subsection Example for using the framework

@example
@verbatim
// hello.c
#include <Extractor/extractor.h>

int
main (int argc, char **argv)
{
  struct EXTRACTOR_PluginList *el;
  el = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  // ...
  EXTRACTOR_plugin_remove_all (el);
  return 0;
}
@end verbatim
@end example

You can then compile the example using

@verbatim
$ gcc -o hello hello.c -framework Extractor
@end verbatim

@subsection Example for using the dynamic library

@example
@verbatim
// hello.c
#include <extractor.h>
int main()
{
  struct EXTRACTOR_PluginList *el;
  el = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  // ...
  EXTRACTOR_plugin_remove_all (el);
  return 0;
}
@end verbatim
@end example

You can then compile the example using

@verbatim
$ gcc -I/Library/Frameworks/Extractor.framework/Versions/Current/include \
  -o hello hello.c \
  -L/Library/Frameworks/Extractor.framework/Versions/Current/lib \
  -lextractor
@end verbatim

Notice the difference in the @code{#include} line.



@section Note to package maintainers

The suggested way to package GNU libextractor is to split it into
roughly the following binary packages:

@itemize @bullet
@item
libextractor (main library only, only hard dependency for other packages depending on GNU libextractor)
@item
extract (command-line tool and man page extract.1)
@item
libextractor-dev (extractor.h header and man page libextractor.3)
@item
libextractor-doc (this manual)
@item
libextractor-plugins (plugins without external dependencies; recommended but not required by extract and libextractor package)
@item
libextractor-plugin-XXX (plugin with dependency on libXXX, for example for XXX=mpeg this would be @file{libextractor_mpeg.so})
@item
libextractor-plugins-all (meta package that requires all plugins except experimental plugins)
@end itemize

This would enable minimal installations (e.g.
for embedded systems) to
not include any plugins, as well as moderate-size installations (that
do not trigger GTK and X11) for systems that have limited resources.
Right now, the MP4 plugin is experimental and does nothing and should
thus never be included at all; QuickTime, MP4, M4A and 3GPP files are
instead handled by the @file{libextractor_qt.so} plugin, which only
depends on zlib and is part of libextractor-plugins.  The gstreamer plugin is experimental
but largely works with the correct version of gstreamer and can thus
be packaged (especially if the dependency is available on the target
system) but should probably not be part of libextractor-plugins-all.


@node Generalities
@chapter Generalities

@section Introduction to the ``extract'' command

The @command{extract} command takes a list of file names as arguments,
extracts meta data from each of those files and prints the result to
the console.  By default, @command{extract} will use all available
plugins and print all (non-binary) meta data that is found.

The set of plugins used by @command{extract} can be controlled using
the ``-l'' and ``-n'' options.  Use ``-n'' to not load all of the
default plugins.  Use ``-l NAME'' to specifically load a certain
plugin.  For example, specify ``-n -l mime'' to only use the MIME
plugin.

Using the ``-p'' option the output of @command{extract} can be limited
to only certain keyword types.  Similarly, using the ``-x'' option,
certain keyword types can be excluded.  A list of all known keyword
types can be obtained using the ``-L'' option.

The output format of @command{extract} can be influenced with the
``-V'' (more verbose, lists filenames), ``-g'' (grep-friendly, all
meta data on a single line per file) and ``-b'' (bibTeX style)
options.

@section Common usage examples for ``extract''

@example
$ extract test/test.jpg
comment - (C) 2001 by Christian Grothoff, using gimp 1.2 1
mimetype - image/jpeg

$ extract -V -x comment test/test.jpg
Keywords for file test/test.jpg:
mimetype - image/jpeg

$ extract -p comment test/test.jpg
comment - (C) 2001 by Christian Grothoff, using gimp 1.2 1

$ extract -nV -l png.so -p comment test/test.jpg test/test.png
Keywords for file test/test.jpg:
Keywords for file test/test.png:
comment - Testing keyword extraction
@end example


@section Introduction to the libextractor library

Each public symbol exported by GNU libextractor has the prefix
@verb{|EXTRACTOR_|}.  All-caps names are used for constants.  For the
impatient, the minimal C code for using GNU libextractor (on the
file named by the first command-line argument) looks like this:

@verbatim
#include <extractor.h>

int
main (int argc, char ** argv)
{
  struct EXTRACTOR_PluginList *plugins;

  if (argc < 2)
    return 1; /* no file name given */
  plugins
    = EXTRACTOR_plugin_add_defaults (EXTRACTOR_OPTION_DEFAULT_POLICY);
  EXTRACTOR_extract (plugins, argv[1],
                     NULL, 0,
                     &EXTRACTOR_meta_data_print, stdout);
  EXTRACTOR_plugin_remove_all (plugins);
  return 0;
}
@end verbatim

The minimal API illustrated by this example is actually sufficient for
many applications.  The full external C API of GNU libextractor is described
in @ref{Extracting meta data}.  Bindings for other languages
are described in @ref{Language bindings}.  The API for
writing new plugins is described in @ref{Writing new Plugins}.

Note that it is possible for GNU libextractor to encounter a @code{SIGPIPE}
during its execution.  GNU libextractor, being a library that
should not interfere with your main application, does NOT install a
signal handler for @code{SIGPIPE}.
You thus need to install a signal
handler (or at least tell your system to ignore @code{SIGPIPE}) if you
want to avoid unexpected problems during calls to GNU libextractor.
@cindex SIGPIPE

@node Extracting meta data
@chapter Extracting meta data

In order to extract meta data with GNU libextractor you first need to
load the respective plugins and then call the extraction API
with the plugins and the data to process.  This chapter
documents how to load and unload plugins, the various types
and formats in which meta data is returned to the application
and finally the extraction API itself.

@menu
* Plugin management::    How to load and unload plugins
* Meta types::           About meta types
* Meta formats::         About meta formats
* Extracting::           How to use the extraction API
@end menu


@node Plugin management
@section Plugin management

@cindex reentrant
@cindex concurrency
@cindex threads
@cindex thread-safety
@tindex enum EXTRACTOR_Options

Using GNU libextractor from a multi-threaded parent process requires some
care.  The problem is that on most platforms GNU libextractor starts
sub-processes for the actual extraction work.  This is useful to
isolate the parent process from potential bugs; however, it can cause
problems if the parent process is multi-threaded.  The issue is that
at the time of the fork, another thread of the application may hold a
lock (e.g. in gettext or libc).  That lock would then never be
released in the child process (as the other thread is not present in
the child process).  As a result, the child process would then
deadlock on trying to acquire the lock and never terminate.  This has
actually been observed with a lock in GNU gettext that is triggered by
the plugin startup code when it interacts with libltdl.

The problem can be solved by loading the plugins using the
@code{EXTRACTOR_OPTION_IN_PROCESS} option, which will run GNU libextractor
in-process and thus avoid the locking issue.  In this case, all of the
functions for loading and unloading plugins, including
@verb{|EXTRACTOR_plugin_add_defaults|} and
@verb{|EXTRACTOR_plugin_remove_all|}, are thread-safe and reentrant.
However, using the same plugin list from multiple threads at the same
time is not safe.

All plugin code is expected to be reentrant and state-less,
but due to the extensive use of 3rd party libraries this cannot
be guaranteed.


@deftp {C Struct} EXTRACTOR_PluginList
@tindex struct EXTRACTOR_PluginList

A plugin list represents a set of GNU libextractor plugins.  Most of
the GNU libextractor API is concerned with either constructing a
plugin list or using it to extract meta data.  The internal representation
of the plugin list is of no concern to users or plugin developers.
@end deftp


@deftypefun void EXTRACTOR_plugin_remove_all (struct EXTRACTOR_PluginList *plugins)
@findex EXTRACTOR_plugin_remove_all

Unload all of the plugins in the given list.
@end deftypefun

@deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_remove (struct EXTRACTOR_PluginList *plugins, const char *name)
@findex EXTRACTOR_plugin_remove

Unloads a particular plugin.  The given name should be the short name of the plugin, for example ``mime'' for the mime-type extractor or ``mpeg'' for the MPEG extractor.
@end deftypefun

@deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add (struct EXTRACTOR_PluginList *plugins, const char *name, const char *options, enum EXTRACTOR_Options flags)
@findex EXTRACTOR_plugin_add

Loads a particular plugin.  The plugin is added to the existing list, which can be @code{NULL}.
The second argument specifies the name of the plugin (e.g. ``ogg'').  The third argument can be @code{NULL} and specifies plugin-specific options.  Finally, the last argument specifies if the plugin should be executed out-of-process (@code{EXTRACTOR_OPTION_DEFAULT_POLICY}) or not.
@end deftypefun

@deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add_config (struct EXTRACTOR_PluginList *plugins, const char *config, enum EXTRACTOR_Options flags)
@findex EXTRACTOR_plugin_add_config

Loads and unloads plugins based on a configuration string, modifying the existing list, which can be @code{NULL}.  The string has the format ``[-]NAME(OPTIONS)@{:[-]NAME(OPTIONS)@}*''.  Prefixing the plugin name with a ``-'' means that the plugin should be unloaded.
@end deftypefun

@deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add_defaults (enum EXTRACTOR_Options flags)
@findex EXTRACTOR_plugin_add_defaults

Loads all of the plugins in the plugin directory.  This function is what most GNU libextractor applications should use to set up the plugins.
@end deftypefun



@node Meta types
@section Meta types


@tindex enum EXTRACTOR_MetaType
@findex EXTRACTOR_metatype_get_max

@verb{|enum EXTRACTOR_MetaType|} is a C enum which defines a list of over 100 different types of meta data.  The total number can differ between different GNU libextractor releases; the maximum value for the current release can be obtained using the @verb{|EXTRACTOR_metatype_get_max|} function.  All values in this enumeration are of the form @verb{|EXTRACTOR_METATYPE_XXX|}.

@deftypefun {const char *} EXTRACTOR_metatype_to_string (enum EXTRACTOR_MetaType type)
@findex EXTRACTOR_metatype_to_string
@cindex gettext
@cindex internationalization

The function @verb{|EXTRACTOR_metatype_to_string|} can be used to obtain a short English string @samp{s} describing the meta data type.
The string can be translated into other languages using GNU gettext with the domain set to GNU libextractor (@verb{|dgettext("libextractor", s)|}).
@end deftypefun

@deftypefun {const char *} EXTRACTOR_metatype_to_description (enum EXTRACTOR_MetaType type)
@findex EXTRACTOR_metatype_to_description
@cindex gettext
@cindex internationalization

The function @verb{|EXTRACTOR_metatype_to_description|} can be used to obtain a longer English string @samp{s} describing the meta data type.  The description may be empty if the short description returned by @code{EXTRACTOR_metatype_to_string} is already comprehensive.  The string can be translated into other languages using GNU gettext with the domain set to GNU libextractor (@verb{|dgettext("libextractor", s)|}).
@end deftypefun



@node Meta formats
@section Meta formats

@tindex enum EXTRACTOR_MetaFormat

@verb{|enum EXTRACTOR_MetaFormat|} is a C enum which defines on a high level how the extracted meta data is represented.  Currently, the library uses three formats: UTF-8 strings, C strings and binary data.  A fourth value, @code{EXTRACTOR_METAFORMAT_UNKNOWN} is defined but not used.  UTF-8 strings are 0-terminated strings that have been converted to UTF-8.  The format code is @code{EXTRACTOR_METAFORMAT_UTF8}.  Ideally, most text meta data will be of this format.  Some file formats fail to specify the encoding used for the text.  In this case, the text cannot be converted to UTF-8.  However, the meta data is still known to be 0-terminated and presumably human-readable.  In this case, the format code used is @code{EXTRACTOR_METAFORMAT_C_STRING}; however, this should not be understood to mean that the encoding is the same as that used by the C compiler.  Finally, for binary data (mostly images), the format @code{EXTRACTOR_METAFORMAT_BINARY} is used.

Naturally this is not a precise description of the meta format.
Plugins can provide a more precise description (if known) by providing the respective mime type of the meta data.  For example, binary image meta data could also be tagged as ``image/png'' and normal text would typically be tagged as ``text/plain''.



@node Extracting
@section Extracting

@deftypefn {Function Pointer} int (*EXTRACTOR_MetaDataProcessor)(void *cls, const char *plugin_name, enum EXTRACTOR_MetaType type, enum EXTRACTOR_MetaFormat format, const char *data_mime_type, const char *data, size_t data_len)
@tindex EXTRACTOR_MetaDataProcessor

Type of a function that libextractor calls for each meta data item found.

@table @var

@item cls
closure (user-defined)

@item plugin_name
name of the plugin that produced this value; special values can be used (e.g. '<zlib>' for zlib being used in the main libextractor library and yielding meta data);

@item type
libextractor-type describing the meta data;

@item format
basic format information about data

@item data_mime_type
mime-type of data (not of the original file); can be @code{NULL} (if mime-type is not known);

@item data
actual meta-data found

@item data_len
number of bytes in data

@end table

Return 0 to continue extracting, 1 to abort.
@end deftypefn



@deftypefun void EXTRACTOR_extract (struct EXTRACTOR_PluginList *plugins, const char *filename, const void *data, size_t size, EXTRACTOR_MetaDataProcessor proc, void *proc_cls)
@findex EXTRACTOR_extract
@cindex reentrant
@cindex concurrency
@cindex threads
@cindex thread-safety

This is the main function for extracting keywords with GNU libextractor.  The first argument is a plugin list which specifies the set of plugins that should be used for extracting meta data.  The @samp{filename} argument is optional and can be used to specify the name of a file to process.
If @samp{filename} is @code{NULL}, then the @samp{data} argument must point to the in-memory data to extract meta data from.  If @samp{filename} is non-@code{NULL}, @samp{data} can be @code{NULL}.  If @samp{data} is non-null, then @samp{size} is the size of @samp{data} in bytes.  Otherwise @samp{size} should be zero.  For each meta data item found, GNU libextractor will call the @samp{proc} function, passing @samp{proc_cls} as the first argument to @samp{proc}.  The other arguments to @samp{proc} depend on the specific meta data found.

@cindex SIGBUS
@cindex bus error
Meta data extraction should never really fail --- at worst, GNU libextractor should not call @samp{proc} with any meta data.  By design, GNU libextractor should never crash or leak memory, even given corrupt files as input.  Note however, that running GNU libextractor on a corrupt file system (or incorrectly @verb{|mmap|}ed files) can result in the operating system sending a SIGBUS (bus error) to the process.  As GNU libextractor typically runs plugins out-of-process, it first maps the file into memory and then attempts to decompress it.  During decompression it is possible to encounter a SIGBUS.  GNU libextractor will @emph{not} attempt to catch this signal and your application is likely to crash.  Note again that this should only happen if the file @emph{system} is corrupt (not if individual files are corrupt).  If this is not acceptable, you might want to consider running GNU libextractor itself also out-of-process (as done, for example, by @url{http://grothoff.org/christian/doodle/,doodle}).

@end deftypefun


@node Language bindings
@chapter Language bindings
@cindex Java
@cindex Mono
@cindex Perl
@cindex Python
@cindex PHP
@cindex Ruby

GNU libextractor works immediately with C and C++ code.  Bindings for Java, Mono, Ruby, Perl, PHP and Python are available for download from the main GNU libextractor website.
Documentation for these bindings (if available) is part of the downloads for the respective binding.  In all cases, a full installation of the C library is required before the binding can be installed.

@section Java

Compiling the GNU libextractor Java binding follows the usual process of
running @command{configure} and @command{make}.  The result will be a
shared C library @file{libextractor_java.so} with the native code and
a JAR file (installed to @file{$PREFIX/share/java/libextractor.jar}).

A minimal example for using GNU libextractor's Java binding would look
like this:
@verbatim
import org.gnu.libextractor.*;
import java.util.ArrayList;

// The class name is arbitrary; Java requires main to live in a class.
public class ExtractExample {
  public static void main(String[] args) {
    Extractor ex = Extractor.getDefault();
    for (int i=0;i<args.length;i++) {
      ArrayList keywords = ex.extract(args[i]);
      System.out.println("Keywords for " + args[i] + ":");
      for (int j=0;j<keywords.size();j++)
        System.out.println(keywords.get(j));
    }
  }
}
@end verbatim

The GNU libextractor library and the @file{libextractor_java.so} JNI binding
have to be in the library search path for this to work.  Furthermore, the
@file{libextractor.jar} file should be on the classpath.

Note that the API does not use Java 5 style generics in order to work
with older versions of Java.

@section Mono

This binding is undocumented at this point.

@section Perl

This binding is undocumented at this point.

@section Python

This binding is undocumented at this point.

@section PHP

This binding is undocumented at this point.

@section Ruby

This binding is undocumented at this point.



@node Utility functions
@chapter Utility functions

@cindex reentrant
@cindex concurrency
@cindex threads
@cindex thread-safety
This chapter describes various utility functions for GNU libextractor usage.
All of the functions are reentrant.

@menu
* Utility Constants::
* Meta data printing::
@end menu

@node Utility Constants
@section Utility Constants

@findex EXTRACTOR_VERSION
The constant @verb{|EXTRACTOR_VERSION|} is a hexadecimal
representation of the version number of the installed libextractor
header.  The hexadecimal format is 0xAABBCCDD where AA is the major
version (so far always 0), BB is the minor version, CC is the revision
and DD the patch number.  For example, for version 0.5.18, we would
have AA=0, BB=5, CC=18 and DD=0.  Minor releases such as 0.5.18a or
significant changes in unreleased versions would be marked with DD=1
or higher.


@node Meta data printing
@section Meta data printing


@findex EXTRACTOR_meta_data_print
@verb{|EXTRACTOR_meta_data_print|} is a simple function which prints the meta data found with libextractor to a file.  The function is mostly useful for debugging and as an example for how to manipulate the keyword list.  It can be passed as the @samp{proc} argument to @code{EXTRACTOR_extract}.  The file to print to should be passed as @samp{proc_cls} (which must be of type @code{FILE *}), for example @code{stdout}.
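
The 0xAABBCCDD encoding described under Utility Constants can be unpacked with shifts and masks.  The following is a minimal sketch and not part of the libextractor API: @code{struct version_parts} and @code{decode_version} are hypothetical names chosen for this illustration.  It takes the version word as a plain argument so that it compiles without @file{extractor.h}; real code would presumably pass @code{EXTRACTOR_VERSION}.

```c
#include <stdint.h>

/* Hypothetical helper type for this sketch; not declared in extractor.h. */
struct version_parts
{
  unsigned int major;    /* AA */
  unsigned int minor;    /* BB */
  unsigned int revision; /* CC */
  unsigned int patch;    /* DD */
};

/* Split a 0xAABBCCDD version word into its four 8-bit fields. */
struct version_parts
decode_version (uint32_t v)
{
  struct version_parts p;

  p.major = (v >> 24) & 0xFF;
  p.minor = (v >> 16) & 0xFF;
  p.revision = (v >> 8) & 0xFF;
  p.patch = v & 0xFF;
  return p;
}
```

For the documented example of version 0.5.18, the version word would be 0x00051200 (18 is 0x12 in hexadecimal), and @code{decode_version} yields major 0, minor 5, revision 18 and patch 0.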



@node Existing Plugins
@chapter Existing Plugins

@itemize @bullet
@item
ARCHIVE (using libarchive)
@item
DVI
@item
EXIV2 (using libexiv2, 0.23 or later preferred)
@item
FLAC (using libFLAC)
@item
GIF (using libgif)
@item
GSTREAMER (using libgstreamer v1.0 or later)
@item
HTML (using libtidy)
@item
IT
@item
JPEG (using libjpeg v8 or later)
@item
MAN
@item
MIDI (using libsmf)
@item
MIME (using libmagic)
@item
MPEG (using libmpeg2)
@item
NSF
@item
NSFE
@item
ODF
@item
OLE2 (with libgsf)
@item
OGG (with libogg)
@item
PNG
@item
PS
@item
RIFF
@item
RPM (using librpm)
@item
S3M
@item
SID
@item
ThumbnailFFMPEG (using libavformat and related libav-libraries, including libswscale)
@item
ThumbnailGtk (using libgtk)
@item
TIFF (with libtiff, tested with v4)
@item
WAV
@item
XM
@item
ZIP
@end itemize

@file{gzip} and @file{bzip2} compressed versions of these formats are
also supported (as well as meta data embedded by @file{gzip} itself)
if zlib or libbz2 are available.

@node Writing new Plugins
@chapter Writing new Plugins

Writing a new plugin for libextractor usually requires writing or
interfacing with an actual parser for a specific format.  How this
can be accomplished depends on the format and cannot be specified in
general.  However, care should be taken for the code to be reentrant
and highly fault-tolerant, especially with respect to malformed
inputs.

Plugins should start by verifying that the header of the data matches
the specific format and immediately return if that is not the case.
Even if the header matches the expected file format, plugins must not
assume that the remainder of the file is well formed.
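
The two rules above (check the header first, then keep distrusting the
rest of the input) can be sketched as follows.  This is a minimal,
self-contained illustration that works on a plain memory buffer rather
than on the plugin interface; the function names
@code{has_png_signature} and @code{read_be32} are hypothetical and do
not exist in libextractor.  The PNG signature is used only as a
familiar example of a fixed file header.

```c
#include <stddef.h>
#include <string.h>

/* The fixed 8-byte signature that starts every PNG file. */
static const unsigned char png_signature[8] =
  { 0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n' };

/* Hypothetical helper: return 1 if 'buf' starts with the PNG
   signature, 0 otherwise (including when the buffer is too short). */
static int
has_png_signature (const unsigned char *buf, size_t len)
{
  if ( (NULL == buf) || (len < sizeof (png_signature)) )
    return 0;
  return 0 == memcmp (buf, png_signature, sizeof (png_signature));
}

/* Hypothetical helper: read a big-endian 32-bit value at offset 'off'.
   Returns 1 and stores the value in '*out' on success; returns 0 if
   the four bytes are not fully inside the buffer. */
static int
read_be32 (const unsigned char *buf, size_t len, size_t off,
           unsigned long *out)
{
  if ( (off > len) || (len - off < 4) )
    return 0; /* truncated or malformed input */
  *out = ((unsigned long) buf[off] << 24)
    | ((unsigned long) buf[off + 1] << 16)
    | ((unsigned long) buf[off + 2] << 8)
    | (unsigned long) buf[off + 3];
  return 1;
}
```

Writing the bounds check as @code{len - off < 4} after testing
@code{off > len} avoids the integer overflow that a check of the form
@code{off + 4 > len} could hit when the offset comes from untrusted
file contents.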

The plugin library must be called libextractor_XXX.so, where XXX
denotes the file format of the plugin.  The library must export a
method @verb{|EXTRACTOR_XXX_extract_method|}, with the following
signature:
@verbatim
void
EXTRACTOR_XXX_extract_method (struct EXTRACTOR_ExtractContext *ec);
@end verbatim

@samp{ec} contains various information the plugin may need for its
execution.  Most importantly, it contains functions for reading
(``read'') and seeking (``seek'') the input data and for returning
extracted data (``proc'').  The ``config'' member can contain
additional configuration options.  ``proc'' should be called on
each meta data item found.  If ``proc'' returns non-zero,
processing should be aborted (if possible).

In order to test new plugins, the @file{extract} command can be run
with the options ``-ni'' and ``-l XXX''.  This will run the plugin
in-process (making it easier to debug) and without any of the other
plugins.


@section Example for a minimal extract method

The following example shows how a plugin can return the mime type of
a file.
@example
@verbatim
#include <string.h>     /* memcmp, strlen */
#include <sys/types.h>  /* ssize_t */
#include <extractor.h>

void
EXTRACTOR_mymime_extract_method (struct EXTRACTOR_ExtractContext *ec)
{
  void *data;
  ssize_t data_size;

  /* ask for the first 4 bytes of the input */
  if (-1 == (data_size = ec->read (ec->cls, &data, 4)))
    return; /* read error */
  if (data_size < 4)
    return; /* file too small */
  if (0 != memcmp (data, "\177ELF", 4))
    return; /* not ELF */
  /* report the mime type; "text/plain" describes the meta data value */
  if (0 != ec->proc (ec->cls,
                     "mymime",
                     EXTRACTOR_METATYPE_MIMETYPE,
                     EXTRACTOR_METAFORMAT_UTF8,
                     "text/plain",
                     "application/x-executable",
                     1 + strlen("application/x-executable")))
    return; /* 'proc' asked us to abort */
  /* more calls to 'proc' here as needed */
}
@end verbatim
@end example


@node Internal utility functions
@chapter Internal utility functions

Some plugins link against the @code{libextractor_common} library which
provides common abstractions needed by many plugins.  This chapter
documents this internal API for plugin developers.  Note that the headers
for this library are (intentionally) not installed: we do not consider
this API stable and it should hence only be used by plugins that are
built and shipped with GNU libextractor.  Third-party plugins should
not use it.

@file{convert_numeric.h} defines various conversion functions for
numbers (in particular, byte-order conversion for floating point
numbers).

@file{unzip.h} defines an API for accessing compressed files.

@file{pack.h} provides an interpreter for unpacking structs of integer
numbers from streams and converting from big or little endian to host
byte order at the same time.

@file{convert.h} provides a function for character set conversion described
below.

@deftypefun {char *} EXTRACTOR_common_convert_to_utf8 (const char *input, size_t len, const char *charset)
@cindex UTF-8
@cindex character set
@findex EXTRACTOR_common_convert_to_utf8
Various GNU libextractor plugins make use of the internal
@file{convert.h} header, which defines the function
@verb{|EXTRACTOR_common_convert_to_utf8|}.  It can be used to easily
convert text from any character set to UTF-8.  This conversion is
important since the linked list of keywords that is returned by GNU
libextractor is expected to contain only UTF-8 strings.  Naturally,
proper conversion may not always be possible since some file formats
fail to specify the character set.  In that case, it is often better
to not convert at all.

The arguments to @verb{|EXTRACTOR_common_convert_to_utf8|} are the input string (which
does @emph{not} have to be zero-terminated), the length of the input
string, and the character set (which @emph{must} be zero-terminated).
Which character sets are supported depends on the platform; a list can
generally be obtained using the @command{iconv -l} command.  The
return value from @verb{|EXTRACTOR_common_convert_to_utf8|} is a zero-terminated string
in UTF-8 format.  The responsibility to free the string is with the
caller, so storing the string in the keyword list is acceptable.
@end deftypefun





@node Reporting bugs
@chapter Reporting bugs

@cindex bug
GNU libextractor uses the @url{https://gnunet.org/bugs/,Mantis bug tracking
system}.  If possible, please report bugs there.  You can also e-mail
the GNU libextractor mailing list at @email{libextractor@@gnu.org}.



@c **********************************************************
@c *******************  Appendices  *************************
@c **********************************************************

@node GNU Free Documentation License
@appendix GNU Free Documentation License

@include fdl-1.3.texi


@node Index
@unnumbered Index

@printindex cp

@c @node Function and Data Index
@c @unnumbered Function and Data Index
@c @printindex fn

@c @node Type Index
@c @unnumbered Type Index
@c @printindex tp

@bye