Diffstat (limited to 'doc/extractor.texi')
-rw-r--r-- | doc/extractor.texi | 212 |
1 files changed, 164 insertions, 48 deletions
diff --git a/doc/extractor.texi b/doc/extractor.texi index d382aed..4bf6743 100644 --- a/doc/extractor.texi +++ b/doc/extractor.texi | |||
@@ -10,8 +10,10 @@ | |||
10 | @c %**end of header | 10 | @c %**end of header |
11 | @copying | 11 | @copying |
12 | This manual is for GNU libextractor | 12 | This manual is for GNU libextractor |
13 | (version @value{VERSION}, @value{UPDATED}), | 13 | (version @value{VERSION}, @value{UPDATED}). |
14 | which is GNU's library for meta data extraction. | 14 | |
15 | GNU libextractor is a GNU package. | ||
16 | |||
15 | 17 | ||
16 | Copyright @copyright{} 2007, 2010 Christian Grothoff | 18 | Copyright @copyright{} 2007, 2010 Christian Grothoff |
17 | 19 | ||
@@ -73,7 +75,7 @@ Free Documentation License". | |||
73 | @code{NULL} | 75 | @code{NULL} |
74 | @end macro | 76 | @end macro |
75 | 77 | ||
76 | @macro le{} | 78 | @macro gnule{} |
77 | @acronym{GNU libextractor} | 79 | @acronym{GNU libextractor} |
78 | @end macro | 80 | @end macro |
79 | 81 | ||
@@ -84,24 +86,22 @@ Free Documentation License". | |||
84 | @insertcopying | 86 | @insertcopying |
85 | @end ifnottex | 87 | @end ifnottex |
86 | 88 | ||
87 | GNU libextractor is a GNU package. | ||
88 | |||
89 | @menu | 89 | @menu |
90 | * Introduction:: What is @le{}. | 90 | * Introduction:: What is @gnule{}. |
91 | * Preparation:: What you should do before using the library. | 91 | * Preparation:: What you should do before using the library. |
92 | * Generalities:: General library functions and data types. | 92 | * Generalities:: General library functions and data types. |
93 | * Extracting meta data:: How to use @le{} to obtain meta data. | 93 | * Extracting meta data:: How to use @gnule{} to obtain meta data. |
94 | * Language bindings:: How to use @le{} from languages other than C. | 94 | * Language bindings:: How to use @gnule{} from languages other than C. |
95 | * Utility functions:: Utility functions of @le{}. | 95 | * Utility functions:: Utility functions of @gnule{}. |
96 | * Existing Plugins:: What plugins are available. | 96 | * Existing Plugins:: What plugins are available. |
97 | * Writing new Plugins:: How to write new plugins for @le{}. | 97 | * Writing new Plugins:: How to write new plugins for @gnule{}. |
98 | * Internal utility functions:: Utility functions of @le{} for writing plugins. | 98 | * Internal utility functions:: Utility functions of @gnule{} for writing plugins. |
99 | * Reporting bugs:: How to report bugs or request new features. | 99 | * Reporting bugs:: How to report bugs or request new features. |
100 | 100 | ||
101 | Appendices | 101 | Appendices |
102 | 102 | ||
103 | * Copying:: The GNU General Public License says how you | 103 | * Copying:: The GNU General Public License says how you |
104 | can copy and share some parts of @le{}. | 104 | can copy and share some parts of @gnule{}. |
105 | 105 | ||
106 | Indices | 106 | Indices |
107 | 107 | ||
@@ -120,7 +120,7 @@ Indices | |||
120 | @chapter Introduction | 120 | @chapter Introduction |
121 | 121 | ||
122 | @cindex error handling | 122 | @cindex error handling |
123 | @le{} is GNU's library for extracting meta data from | 123 | @gnule{} is GNU's library for extracting meta data from |
124 | files. Meta data includes format information (such as mime type, | 124 | files. Meta data includes format information (such as mime type, |
125 | image dimensions, color depth, recording frequency), content | 125 | image dimensions, color depth, recording frequency), content |
126 | descriptions (such as document title or document description) and | 126 | descriptions (such as document title or document description) and |
@@ -128,55 +128,55 @@ copyright information (such as license, author and contributors). | |||
128 | Meta data extraction is an inherently uncertain business --- a parse | 128 | Meta data extraction is an inherently uncertain business --- a parse |
129 | error can be a corrupt file, an incompatibility in the file format | 129 | error can be a corrupt file, an incompatibility in the file format |
130 | version, an entirely different file format or a bug in the parser. As | 130 | version, an entirely different file format or a bug in the parser. As |
131 | a result of this uncertainty, @le{} deliberately | 131 | a result of this uncertainty, @gnule{} deliberately |
132 | avoids ever reporting any errors. Unexpected file contents simply | 132 | avoids ever reporting any errors. Unexpected file contents simply |
133 | result in less or possibly no meta data being extracted. | 133 | result in less or possibly no meta data being extracted. |
134 | 134 | ||
135 | @cindex plugin | 135 | @cindex plugin |
136 | @le{} uses plugins to handle various file formats. | 136 | @gnule{} uses plugins to handle various file formats. |
137 | Technically a plugin can support multiple file formats; however, most | 137 | Technically a plugin can support multiple file formats; however, most |
138 | plugins only support one particular format. By default, | 138 | plugins only support one particular format. By default, |
139 | @le{} will use all plugins that are available and found | 139 | @gnule{} will use all plugins that are available and found |
140 | in the plugin installation directory. Applications can | 140 | in the plugin installation directory. Applications can |
141 | request the use of only specific plugins or the exclusion of | 141 | request the use of only specific plugins or the exclusion of |
142 | certain plugins. | 142 | certain plugins. |
143 | 143 | ||
144 | @le{} is distributed with the @command{extract} | 144 | @gnule{} is distributed with the @command{extract} |
145 | command@footnote{Some distributions ship @command{extract} in a | 145 | command@footnote{Some distributions ship @command{extract} in a |
146 | separate package.} which is a command-line tool for extracting | 146 | separate package.} which is a command-line tool for extracting |
147 | meta data. @command{extract} is given a list of filenames and | 147 | meta data. @command{extract} is given a list of filenames and |
148 | prints the resulting meta data to the console. The @command{extract} | 148 | prints the resulting meta data to the console. The @command{extract} |
149 | source code also serves as an advanced example for how to use | 149 | source code also serves as an advanced example for how to use |
150 | @le{}. | 150 | @gnule{}. |
151 | 151 | ||
152 | This manual focuses on providing documentation for writing software | 152 | This manual focuses on providing documentation for writing software |
153 | with @le{}. The only relevant parts for end-users | 153 | with @gnule{}. The only relevant parts for end-users |
154 | are the chapter on compiling and installing @le{} | 154 | are the chapter on compiling and installing @gnule{} |
155 | (@xref{Preparation}.). Also, the chapter on existing plugins may be of | 155 | (@xref{Preparation}.). Also, the chapter on existing plugins may be of |
156 | interest (@xref{Existing Plugins}.). Additional documentation for | 156 | interest (@xref{Existing Plugins}.). Additional documentation for |
157 | end-users can be found in the man page on @command{extract} (using | 157 | end-users can be found in the man page on @command{extract} (using |
158 | @verb{|man extract|}). | 158 | @verb{|man extract|}). |
159 | 159 | ||
160 | @cindex license | 160 | @cindex license |
161 | @le{} is licensed under the GNU General Public License. The | 161 | @gnule{} is licensed under the GNU General Public License. The |
162 | developers have frequently received requests to license GNU | 162 | developers have frequently received requests to license GNU |
163 | libextractor under alternative terms. However, @le{} | 163 | libextractor under alternative terms. However, @gnule{} |
164 | borrows plenty of GPL-licensed code from various other projects. | 164 | borrows plenty of GPL-licensed code from various other projects. |
165 | Hence we cannot change the license (even if we wanted to).@footnote{It | 165 | Hence we cannot change the license (even if we wanted to).@footnote{It |
166 | may be possible to switch to GPLv3 in the future. For this, an audit | 166 | may be possible to switch to GPLv3 in the future. For this, an audit |
167 | of the license status of our dependencies would be required. The new | 167 | of the license status of our dependencies would be required. The new |
168 | code that was developed specifically for @le{} has | 168 | code that was developed specifically for @gnule{} has |
169 | always been licensed under GPLv2 @emph{or any later version}.} | 169 | always been licensed under GPLv2 @emph{or any later version}.} |
170 | 170 | ||
171 | @node Preparation | 171 | @node Preparation |
172 | @chapter Preparation | 172 | @chapter Preparation |
173 | 173 | ||
174 | Compiling @le{} follows the standard GNU autotools | 174 | Compiling @gnule{} follows the standard GNU autotools |
175 | build process using @command{configure} and @command{make}. For | 175 | build process using @command{configure} and @command{make}. For |
176 | details, read the @file{INSTALL} file and query | 176 | details, read the @file{INSTALL} file and query |
177 | @verb{|./configure --help|} for additional options. | 177 | @verb{|./configure --help|} for additional options. |
178 | 178 | ||
179 | @le{} has various dependencies, some of which are optional. | 179 | @gnule{} has various dependencies, some of which are optional. |
180 | Instead of specifying the names of the software packages, we | 180 | Instead of specifying the names of the software packages, we |
181 | will give the list in terms of the names of the respective | 181 | will give the list in terms of the names of the respective |
182 | Debian (unstable) packages that should be installed. | 182 | Debian (unstable) packages that should be installed. |
@@ -246,29 +246,29 @@ Please notify us if we missed some dependencies (note that the list is | |||
246 | supposed to only list direct dependencies, not transitive | 246 | supposed to only list direct dependencies, not transitive |
247 | dependencies). | 247 | dependencies). |
248 | 248 | ||
249 | Once you have compiled and installed @le{}, you should have a file | 249 | Once you have compiled and installed @gnule{}, you should have a file |
250 | @file{extractor.h} installed in your @file{include/} directory. This | 250 | @file{extractor.h} installed in your @file{include/} directory. This |
251 | file should be the starting point for your C and C++ development with | 251 | file should be the starting point for your C and C++ development with |
252 | @le{}. The build process also installs the @file{extract} binary and | 252 | @gnule{}. The build process also installs the @file{extract} binary and |
253 | man pages for @file{extract} and @le{}. The @file{extract} man page | 253 | man pages for @file{extract} and @gnule{}. The @file{extract} man page |
254 | documents the @file{extract} tool. The @le{} man page gives a brief | 254 | documents the @file{extract} tool. The @gnule{} man page gives a brief |
255 | summary of the C API for @le{}. | 255 | summary of the C API for @gnule{}. |
256 | 256 | ||
257 | @cindex packaging | 257 | @cindex packaging |
258 | @cindex directory structure | 258 | @cindex directory structure |
259 | @cindex plugin | 259 | @cindex plugin |
260 | @cindex environment variables | 260 | @cindex environment variables |
261 | @vindex LIBEXTRACTOR_PREFIX | 261 | @vindex LIBEXTRACTOR_PREFIX |
262 | When you install @le{}, various plugins will be | 262 | When you install @gnule{}, various plugins will be |
263 | installed in the @file{lib/libextractor/} directory. The main library | 263 | installed in the @file{lib/libextractor/} directory. The main library |
264 | will be installed as @file{lib/libextractor.so}. Note that | 264 | will be installed as @file{lib/libextractor.so}. Note that |
265 | @le{} will attempt to find the plugins relative to the | 265 | @gnule{} will attempt to find the plugins relative to the |
266 | path of the main library. Consequently, a package manager can move | 266 | path of the main library. Consequently, a package manager can move |
267 | the library and its plugins to a different location later --- as long | 267 | the library and its plugins to a different location later --- as long |
268 | as the relative path between the main library and the plugins is | 268 | as the relative path between the main library and the plugins is |
269 | preserved. As a method of last resort, the user can specify an | 269 | preserved. As a method of last resort, the user can specify an |
270 | environment variable @verb{|LIBEXTRACTOR_PREFIX|}. If | 270 | environment variable @verb{|LIBEXTRACTOR_PREFIX|}. If |
271 | @le{} cannot locate a plugin, it will look in | 271 | @gnule{} cannot locate a plugin, it will look in |
272 | @verb{|LIBEXTRACTOR_PREFIX/lib/libextractor/|}. | 272 | @verb{|LIBEXTRACTOR_PREFIX/lib/libextractor/|}. |
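The last-resort lookup described above can be illustrated with a small C sketch. It only shows how the fallback directory name is formed from the environment variable; the helper names `demo_build_plugin_dir` and `demo_fallback_plugin_dir` are hypothetical and are not part of the library, and the real lookup code may differ in detail.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: write "<prefix>/lib/libextractor/" into 'buf'.
   Returns 0 on success, -1 if 'prefix' is missing or 'buf' is too small. */
int
demo_build_plugin_dir (const char *prefix, char *buf, size_t buf_len)
{
  int n;

  assert (NULL != buf);
  if ( (NULL == prefix) || ('\0' == prefix[0]) )
    return -1;
  n = snprintf (buf, buf_len, "%s/lib/libextractor/", prefix);
  if ( (n < 0) || ((size_t) n >= buf_len) )
    return -1; /* output error or truncation */
  return 0;
}

/* Hypothetical helper: same as above, taking the prefix from the
   LIBEXTRACTOR_PREFIX environment variable. */
int
demo_fallback_plugin_dir (char *buf, size_t buf_len)
{
  return demo_build_plugin_dir (getenv ("LIBEXTRACTOR_PREFIX"),
                                buf, buf_len);
}
```

With `LIBEXTRACTOR_PREFIX=/opt/le` in the environment, `demo_fallback_plugin_dir` would produce `/opt/le/lib/libextractor/`; with the variable unset it reports failure.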
273 | 273 | ||
274 | @section Note to package maintainers | 274 | @section Note to package maintainers |
@@ -304,9 +304,9 @@ resources. | |||
304 | @node Generalities | 304 | @node Generalities |
305 | @chapter Generalities | 305 | @chapter Generalities |
306 | 306 | ||
307 | Each public symbol exported by @le{} has the prefix | 307 | Each public symbol exported by @gnule{} has the prefix |
308 | @verb{|EXTRACTOR_|}. All-caps names are used for constants. For the | 308 | @verb{|EXTRACTOR_|}. All-caps names are used for constants. For the |
309 | impatient, the minimal C code for using @le{} (on the | 309 | impatient, the minimal C code for using @gnule{} (on the |
310 | executing binary itself) looks like this: | 310 | executing binary itself) looks like this: |
311 | 311 | ||
312 | @verbatim | 312 | @verbatim |
@@ -326,6 +326,13 @@ int main(int argc, char ** argv) { | |||
326 | @node Extracting meta data | 326 | @node Extracting meta data |
327 | @chapter Extracting meta data | 327 | @chapter Extracting meta data |
328 | 328 | ||
329 | In order to extract meta data with @gnule{} you first need to | ||
330 | load the respective plugins and then call the extraction API | ||
331 | with the plugins and the data to process. This section | ||
332 | documents how to load and unload plugins, the various types | ||
333 | and formats in which meta data is returned to the application | ||
334 | and finally the extraction API itself. | ||
335 | |||
329 | @menu | 336 | @menu |
330 | * Plugin management:: How to load and unload plugins | 337 | * Plugin management:: How to load and unload plugins |
331 | * Meta types:: About meta types | 338 | * Meta types:: About meta types |
@@ -350,7 +357,7 @@ from multiple threads at the same time is not safe. Creating multiple | |||
350 | plugin lists and using them concurrently is supported as long as | 357 | plugin lists and using them concurrently is supported as long as |
351 | the @code{EXTRACTOR_OPTION_IN_PROCESS} option is not used. | 358 | the @code{EXTRACTOR_OPTION_IN_PROCESS} option is not used. |
352 | 359 | ||
353 | Generally, @le{} is fully thread-safe and mostly reentrant. | 360 | Generally, @gnule{} is fully thread-safe and mostly reentrant. |
354 | All plugin code is required to be reentrant and state-less, | 361 | All plugin code is required to be reentrant and state-less, |
355 | but due to the extensive use of 3rd party libraries this cannot | 362 | but due to the extensive use of 3rd party libraries this cannot |
356 | be guaranteed. Hence plugins are executed (by default) out of | 363 | be guaranteed. Hence plugins are executed (by default) out of |
@@ -402,7 +409,7 @@ Loads and unloads plugins based on a configuration string, modifying the existin | |||
402 | @deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add_defaults (enum EXTRACTOR_Options flags) | 409 | @deftypefun {struct EXTRACTOR_PluginList *} EXTRACTOR_plugin_add_defaults (enum EXTRACTOR_Options flags) |
403 | @findex EXTRACTOR_plugin_add_defaults | 410 | @findex EXTRACTOR_plugin_add_defaults |
404 | 411 | ||
405 | Loads all of the plugins in the plugin directory. This function is what most @le{} applications should use to set up the plugins. | 412 | Loads all of the plugins in the plugin directory. This function is what most @gnule{} applications should use to set up the plugins. |
406 | @end deftypefun | 413 | @end deftypefun |
407 | 414 | ||
408 | 415 | ||
@@ -414,14 +421,14 @@ Loads all of the plugins in the plugin directory. This function is what most @l | |||
414 | @tindex enum EXTRACTOR_MetaType | 421 | @tindex enum EXTRACTOR_MetaType |
415 | @findex EXTRACTOR_metatype_get_max | 422 | @findex EXTRACTOR_metatype_get_max |
416 | 423 | ||
417 | @verb{|enum EXTRACTOR_MetaType|} is a C enum which defines a list of over 100 different types of meta data. The total number can differ between different @le{} releases; the maximum value for the current release can be obtained using the @verb{|EXTRACTOR_metatype_get_max|} function. All values in this enumeration are of the form @verb{|EXTRACTOR_METATYPE_XXX|}. | 424 | @verb{|enum EXTRACTOR_MetaType|} is a C enum which defines a list of over 100 different types of meta data. The total number can differ between different @gnule{} releases; the maximum value for the current release can be obtained using the @verb{|EXTRACTOR_metatype_get_max|} function. All values in this enumeration are of the form @verb{|EXTRACTOR_METATYPE_XXX|}. |
418 | 425 | ||
419 | @deftypefun {const char *} EXTRACTOR_metatype_to_string (enum EXTRACTOR_MetaType type) | 426 | @deftypefun {const char *} EXTRACTOR_metatype_to_string (enum EXTRACTOR_MetaType type) |
420 | @findex EXTRACTOR_metatype_to_string | 427 | @findex EXTRACTOR_metatype_to_string |
421 | @cindex gettext | 428 | @cindex gettext |
422 | @cindex internationalization | 429 | @cindex internationalization |
423 | 430 | ||
424 | The function @verb{|EXTRACTOR_metatype_to_string|} can be used to obtain a short English string @samp{s} describing the meta data type. The string can be translated into other languages using GNU gettext with the domain set to @le{} (@verb{|dgettext("libextractor", s)|}). | 431 | The function @verb{|EXTRACTOR_metatype_to_string|} can be used to obtain a short English string @samp{s} describing the meta data type. The string can be translated into other languages using GNU gettext with the domain set to @gnule{} (@verb{|dgettext("libextractor", s)|}). |
425 | @end deftypefun | 432 | @end deftypefun |
426 | 433 | ||
427 | @deftypefun {const char *} EXTRACTOR_metatype_to_description (enum EXTRACTOR_MetaType type) | 434 | @deftypefun {const char *} EXTRACTOR_metatype_to_description (enum EXTRACTOR_MetaType type) |
@@ -429,7 +436,7 @@ The function @verb{|EXTRACTOR_metatype_to_string|} can be used to obtain a short | |||
429 | @cindex gettext | 436 | @cindex gettext |
430 | @cindex internationalization | 437 | @cindex internationalization |
431 | 438 | ||
432 | The function @verb{|EXTRACTOR_metatype_to_description|} can be used to obtain a longer English string @samp{s} describing the meta data type. The description may be empty if the short description returned by @code{EXTRACTOR_metatype_to_string} is already comprehensive. The string can be translated into other languages using GNU gettext with the domain set to @le{} (@verb{|dgettext("libextractor", s)|}). | 439 | The function @verb{|EXTRACTOR_metatype_to_description|} can be used to obtain a longer English string @samp{s} describing the meta data type. The description may be empty if the short description returned by @code{EXTRACTOR_metatype_to_string} is already comprehensive. The string can be translated into other languages using GNU gettext with the domain set to @gnule{} (@verb{|dgettext("libextractor", s)|}). |
433 | @end deftypefun | 440 | @end deftypefun |
434 | 441 | ||
435 | 442 | ||
@@ -490,11 +497,11 @@ Return 0 to continue extracting, 1 to abort. | |||
490 | @cindex threads | 497 | @cindex threads |
491 | @cindex thread-safety | 498 | @cindex thread-safety |
492 | 499 | ||
493 | This is the main function for extracting keywords with @le{}. The first argument is a plugin list which specifies the set of plugins that should be used for extracting meta data. The @samp{filename} argument is optional and can be used to specify the name of a file to process. If @samp{filename} is NULL, then the @samp{data} argument must point to the in-memory data to extract meta data from. If @samp{filename} is non-NULL, @samp{data} can be NULL. If @samp{data} is non-null, then @samp{size} is the size of @samp{data} in bytes. Otherwise @samp{size} should be zero. For each meta data item found, GNU libextractor will call the @samp{proc} function, passing @samp{proc_cls} as the first argument to @samp{proc}. The other arguments to @samp{proc} depend on the specific meta data found. | 500 | This is the main function for extracting keywords with @gnule{}. The first argument is a plugin list which specifies the set of plugins that should be used for extracting meta data. The @samp{filename} argument is optional and can be used to specify the name of a file to process. If @samp{filename} is NULL, then the @samp{data} argument must point to the in-memory data to extract meta data from. If @samp{filename} is non-NULL, @samp{data} can be NULL. If @samp{data} is non-null, then @samp{size} is the size of @samp{data} in bytes. Otherwise @samp{size} should be zero. For each meta data item found, GNU libextractor will call the @samp{proc} function, passing @samp{proc_cls} as the first argument to @samp{proc}. The other arguments to @samp{proc} depend on the specific meta data found. |
494 | 501 | ||
495 | @cindex SIGBUS | 502 | @cindex SIGBUS |
496 | @cindex bus error | 503 | @cindex bus error |
497 | Meta data extraction should never really fail --- at worst, @le{} should not call @samp{proc} with any meta data. By design, @le{} should never crash or leak memory, even given corrupt files as input. Note however, that running @le{} on a corrupt file system (or incorrectly @verb{|mmap|}ed files) can result in the operating system sending a SIGBUS (bus error) to the process. While @le{} runs plugins out-of-process, it first maps the file into memory and then attempts to decompress it. During decompression it is possible to encounter a SIGBUS. @le{} will @emph{not} attempt to catch this signal and your application is likely to crash. Note again that this should only happen if the file @emph{system} is corrupt (not if individual files are corrupt). If this is not acceptable, you might want to consider running @le{} itself also out-of-process (as done, for example, by @url{http://grothoff.org/christian/doodle/,doodle}). | 504 | Meta data extraction should never really fail --- at worst, @gnule{} should not call @samp{proc} with any meta data. By design, @gnule{} should never crash or leak memory, even given corrupt files as input. Note however, that running @gnule{} on a corrupt file system (or incorrectly @verb{|mmap|}ed files) can result in the operating system sending a SIGBUS (bus error) to the process. While @gnule{} runs plugins out-of-process, it first maps the file into memory and then attempts to decompress it. During decompression it is possible to encounter a SIGBUS. @gnule{} will @emph{not} attempt to catch this signal and your application is likely to crash. Note again that this should only happen if the file @emph{system} is corrupt (not if individual files are corrupt). If this is not acceptable, you might want to consider running @gnule{} itself also out-of-process (as done, for example, by @url{http://grothoff.org/christian/doodle/,doodle}). |
498 | 505 | ||
499 | @end deftypefun | 506 | @end deftypefun |
500 | 507 | ||
@@ -509,7 +516,7 @@ Meta data extraction should never really fail --- at worst, @le{} should not cal | |||
509 | @cindex PHP | 516 | @cindex PHP |
510 | @cindex Ruby | 517 | @cindex Ruby |
511 | 518 | ||
512 | @le{} works immediately with C and C++ code. Bindings for Java, Mono, Ruby, Perl, PHP and Python are available for download from the main @le{} website. Documentation for these bindings (if available) is part of the downloads for the respective binding. In all cases, a full installation of the C library is required before the binding can be installed. | 519 | @gnule{} works immediately with C and C++ code. Bindings for Java, Mono, Ruby, Perl, PHP and Python are available for download from the main @gnule{} website. Documentation for these bindings (if available) is part of the downloads for the respective binding. In all cases, a full installation of the C library is required before the binding can be installed. |
513 | 520 | ||
514 | @section Java | 521 | @section Java |
515 | 522 | ||
@@ -571,7 +578,7 @@ This binding is undocumented at this point. | |||
571 | @cindex concurrency | 578 | @cindex concurrency |
572 | @cindex threads | 579 | @cindex threads |
573 | @cindex thread-safety | 580 | @cindex thread-safety |
574 | This chapter describes various utility functions for @le{} usage. All of the functions are reentrant. | 581 | This chapter describes various utility functions for @gnule{} usage. All of the functions are reentrant. |
575 | 582 | ||
576 | @menu | 583 | @menu |
577 | * Utility Constants:: | 584 | * Utility Constants:: |
@@ -724,6 +731,115 @@ in-process (making it easier to debug) and without any of the other | |||
724 | plugins. | 731 | plugins. |
725 | 732 | ||
726 | 733 | ||
734 | @section Example for a minimal extract method | ||
735 | |||
736 | The following example shows how a plugin can return the mime type of | ||
737 | a file. | ||
738 | @example | ||
739 | |||
740 | int | ||
741 | EXTRACTOR_mymime_extract | ||
742 | (const char *data, | ||
743 | size_t data_size, | ||
744 | EXTRACTOR_MetaDataProcessor proc, | ||
745 | void *proc_cls, | ||
746 | const char * options) | ||
747 | { | ||
748 | if (data_size < 4) | ||
749 | return 0; | ||
750 | if (0 != memcmp (data, "\177ELF", 4)) | ||
751 | return 0; | ||
752 | if (0 != proc (proc_cls, | ||
753 | "mymime", | ||
754 | EXTRACTOR_METATYPE_MIMETYPE, | ||
755 | EXTRACTOR_METAFORMAT_UTF8, | ||
756 | "text/plain", | ||
757 | "application/x-executable", | ||
758 | 1 + strlen("application/x-executable"))) | ||
759 | return 1; | ||
760 | /* more calls to 'proc' here as needed */ | ||
761 | return 0; | ||
762 | } | ||
763 | |||
764 | @end example | ||
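To see how the return values of the extract method and of `proc` interact, the following sketch exercises the same control flow without linking against the library. Everything prefixed `demo_` or `DEMO_` is a hypothetical stand-in: real plugin code would use `EXTRACTOR_MetaDataProcessor` and the `EXTRACTOR_METATYPE_*` and `EXTRACTOR_METAFORMAT_*` constants from `extractor.h`, and the placeholder values below are not the library's.

```c
#include <stddef.h>
#include <string.h>

/* Placeholder values standing in for the real meta type and meta
   format constants; the numbers are arbitrary. */
#define DEMO_TYPE_MIMETYPE 1
#define DEMO_FORMAT_UTF8 1

/* Stand-in for the processor callback type. */
typedef int (*demo_processor) (void *cls,
                               const char *plugin_name,
                               int type,
                               int format,
                               const char *data_mime_type,
                               const char *data,
                               size_t data_len);

/* Same control flow as the extract method above: return 1 only if
   the callback asked to abort, 0 otherwise. */
int
demo_mymime_extract (const char *data, size_t data_size,
                     demo_processor proc, void *proc_cls)
{
  static const char mime[] = "application/x-executable";

  if (data_size < 4)
    return 0;
  if (0 != memcmp (data, "\177ELF", 4))
    return 0;
  /* sizeof (mime) counts the terminating 0, like 1 + strlen (...) */
  if (0 != proc (proc_cls, "mymime", DEMO_TYPE_MIMETYPE, DEMO_FORMAT_UTF8,
                 "text/plain", mime, sizeof (mime)))
    return 1;
  return 0;
}

/* State for a callback that counts items and requests an abort
   (non-zero return) once 'limit' items have been seen. */
struct demo_counter
{
  unsigned int seen;
  unsigned int limit;
};

int
demo_count_items (void *cls, const char *plugin_name, int type, int format,
                  const char *data_mime_type, const char *data,
                  size_t data_len)
{
  struct demo_counter *c = cls;

  (void) plugin_name; (void) type; (void) format;
  (void) data_mime_type; (void) data; (void) data_len;
  c->seen++;
  return (c->seen >= c->limit) ? 1 : 0;
}
```

Called on a buffer that starts with the ELF magic and a counter whose `limit` is 1, `demo_mymime_extract` returns 1 because the callback asks to stop; on a buffer without the magic the callback is never invoked and the result is 0.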
765 | |||
766 | @section Plugin execution options | ||
767 | |||
768 | Plugins can request that their execution be done in a particular way. | ||
769 | For this, the plugin defines a function with the following signature: | ||
770 | |||
771 | @verbatim | ||
772 | const char * | ||
773 | EXTRACTOR_XXX_options (void); | ||
774 | @end verbatim | ||
775 | |||
776 | The function should return a string with the execution options. | ||
777 | Individual options in this string should be separated by semicolons. | ||
778 | Options that are included in the string but not known to the library | ||
779 | are ignored. The following options are supported: | ||
780 | |||
781 | @itemize @bullet | ||
782 | @item | ||
783 | @code{oop-only} ensures that the plugin is only run out-of-process; if | ||
784 | this is not possible, the plugin will not be executed at all if this | ||
785 | option is set. | ||
786 | |||
787 | @item | ||
788 | @code{close-stderr} ensures that @code{stderr} is closed during the | ||
789 | execution of the plugin. This is useful if the plugin uses libraries | ||
790 | that write (error) messages to @code{stderr} and where this behavior cannot be | ||
791 | turned off. This option only works if the plugin is executed out-of-process. | ||
792 | |||
793 | @item | ||
794 | @code{close-stdout} ensures that @code{stdout} is closed during the | ||
795 | execution of the plugin. This is useful if the plugin uses libraries | ||
796 | that write messages to @code{stdout} and where this behavior cannot be | ||
797 | turned off. This option only works if the plugin is executed out-of-process. | ||
798 | |||
799 | @item | ||
800 | @code{force-kill} kills and restarts the plugin process for each | ||
801 | file that is being analyzed. This is useful if the plugin uses | ||
802 | libraries that keep global state between runs that is problematic or | ||
803 | if the plugin uses libraries that are known to have serious resource | ||
804 | leaks (such as memory leaks). | ||
805 | |||
806 | @item | ||
807 | @code{want-tail} | ||
808 | In order to limit memory consumption, limit the amount of reading from | |
809 | disk and to keep the API simple, the @samp{data} argument passed to | |
810 | the @code{EXTRACTOR_XXX_extract} method is bounded (to 32 MB of normal | |
811 | data; for compressed data, a limit of 16 MB is imposed).@footnote{If | ||
812 | @gnule{} was given a pointer to an existing, uncompressed block of | ||
813 | data in memory, no bound is imposed for plugins executing in-process; | ||
814 | for out-of-process plugins, a 32 MB limit is still imposed.} Since | ||
815 | some file formats contain meta data at the end of the file, this option | ||
816 | provides a way for plugins to access not the first 16--32 MB of a file | ||
817 | but instead the last (roughly) 32 MB. | ||
818 | |||
819 | Note that even for files larger than 32 MB, @samp{size} is not | ||
820 | guaranteed to be 32 MB since @samp{data} will be aligned to the page | ||
821 | size of the operating system. However, the last byte of @samp{data} | ||
822 | is guaranteed to be the last byte of the file. Furthermore, if the | ||
823 | file was large and compressed, unlike in the case of meta data | ||
824 | extraction from the header, the end of the file will not be | ||
825 | automatically decompressed by @gnule{}. | ||
826 | |||
827 | @end itemize | ||
828 | |||
829 | Note that using options other than @code{want-tail} is pretty much | ||
830 | always a kludge and should thus be avoided. | ||
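The page alignment mentioned for `want-tail` can be made concrete with a little arithmetic. The sketch below assumes that the start of the window is rounded up to the next page boundary, so the window never exceeds 32 MB and always ends at the last byte of the file; the text above does not say in which direction the library rounds, so treat this as one plausible reading. The names `demo_tail_window` and `demo_compute_tail_window` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 32 MB bound taken from the text above. */
#define DEMO_TAIL_LIMIT ((uint64_t) 32 * 1024 * 1024)

/* Hypothetical description of the region handed to the plugin. */
struct demo_tail_window
{
  uint64_t offset; /* first byte of the window within the file */
  uint64_t size;   /* number of bytes up to the end of the file */
};

/* Compute a page-aligned window that ends at the end of the file.
   ASSUMPTION: the start is rounded up to a page boundary. */
struct demo_tail_window
demo_compute_tail_window (uint64_t file_size, uint64_t page_size)
{
  struct demo_tail_window w = { 0, file_size };
  uint64_t start;
  uint64_t rem;

  assert (0 != page_size);
  if (file_size <= DEMO_TAIL_LIMIT)
    return w; /* small file: the whole file fits */
  start = file_size - DEMO_TAIL_LIMIT;
  rem = start % page_size;
  if (0 != rem)
    start += page_size - rem; /* round up to the next page boundary */
  w.offset = start;
  w.size = file_size - start;
  return w;
}
```

For a file of 32 MB plus 100 bytes and a 4096-byte page, this yields an offset of 4096 and a size slightly below 32 MB, which matches the remark that `size` is not guaranteed to be exactly 32 MB.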
831 | |||
832 | @section Example for an options method | ||
833 | |||
834 | The following example shows how a plugin can set some of the options listed above: | ||
835 | @example | ||
836 | const char * | ||
837 | EXTRACTOR_id3_options () | ||
838 | { | ||
839 | return "close-stderr;want-tail"; | ||
840 | } | ||
841 | @end example | ||
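A string such as the one returned above is a plain semicolon-separated list in which unknown entries are ignored. The following sketch shows one way such a string could be scanned for a given option; `demo_has_option` is a hypothetical helper written for illustration and is not the parser used by the library.

```c
#include <stddef.h>
#include <string.h>

/* Return 1 if 'name' occurs as a complete entry in the
   semicolon-separated string 'options', 0 otherwise.  Entries that
   do not match are skipped, mirroring the rule that unknown options
   are ignored. */
int
demo_has_option (const char *options, const char *name)
{
  size_t name_len;

  if ( (NULL == options) || (NULL == name) )
    return 0;
  name_len = strlen (name);
  while ('\0' != *options)
    {
      const char *end = strchr (options, ';');
      size_t tok_len = (NULL == end)
        ? strlen (options)
        : (size_t) (end - options);

      /* compare whole entries only, so "want-tailx" != "want-tail" */
      if ( (tok_len == name_len) &&
           (0 == strncmp (options, name, name_len)) )
        return 1;
      if (NULL == end)
        break;
      options = end + 1;
    }
  return 0;
}
```

For the string `"close-stderr;want-tail"`, a lookup of `want-tail` succeeds while a lookup of `oop-only` does not.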
842 | |||
727 | @node Internal utility functions | 843 | @node Internal utility functions |
728 | @chapter Internal utility functions | 844 | @chapter Internal utility functions |
729 | 845 | ||
@@ -752,12 +868,12 @@ below. | |||
752 | @cindex UTF-8 | 868 | @cindex UTF-8 |
753 | @cindex character set | 869 | @cindex character set |
754 | @findex EXTRACTOR_common_convert_to_utf8 | 870 | @findex EXTRACTOR_common_convert_to_utf8 |
755 | Various @le{} plugins make use of the internal | 871 | Various @gnule{} plugins make use of the internal |
756 | @file{convert.h} header which defines a function | 872 | @file{convert.h} header which defines a function |
757 | 873 | ||
758 | @verb{|EXTRACTOR_common_convert_to_utf8|} which can be used to easily convert text from | 874 | @verb{|EXTRACTOR_common_convert_to_utf8|} which can be used to easily convert text from |
759 | any character set to UTF-8. This conversion is important since the | 875 | any character set to UTF-8. This conversion is important since the |
760 | linked list of keywords that is returned by @le{} is | 876 | linked list of keywords that is returned by @gnule{} is |
761 | expected to contain only UTF-8 strings. Naturally, proper conversion | 877 | expected to contain only UTF-8 strings. Naturally, proper conversion |
762 | may not always be possible since some file formats fail to specify the | 878 | may not always be possible since some file formats fail to specify the |
763 | character set. In that case, it is often better to not convert at | 879 | character set. In that case, it is often better to not convert at |
@@ -781,9 +897,9 @@ caller, so storing the string in the keyword list is acceptable. | |||
781 | @chapter Reporting bugs | 897 | @chapter Reporting bugs |
782 | 898 | ||
783 | @cindex bug | 899 | @cindex bug |
784 | @le{} uses the @url{http://gnunet.org/bugs/,Mantis bugtracking | 900 | @gnule{} uses the @url{http://gnunet.org/bugs/,Mantis bugtracking |
785 | system}. If possible, please report bugs there. You can also e-mail | 901 | system}. If possible, please report bugs there. You can also e-mail |
786 | the @le{} mailing list at @url{libextractor@@gnu.org}. | 902 | the @gnule{} mailing list at @url{libextractor@@gnu.org}. |
787 | 903 | ||
788 | 904 | ||
789 | 905 | ||