libextractor
============

libextractor is a simple library for keyword extraction.  It does not
support every format, but it provides a simple plugin mechanism so
that you can quickly add extractors for additional formats, even
without recompiling libextractor.  libextractor typically ships with a
dozen helper libraries that can be used to obtain keywords from common
file types.

libextractor is a part of the GNU project (http://www.gnu.org/).



extract
=======

extract is a simple command-line interface to libextractor.



Dependencies
============

libextractor requires Python (2.3, or preferably 2.4, including the
development files) and a JNI header file (jni.h) for Java.  Further
requirements include:

* libvorbisfile
* zlib (compression library)
* C++ compiler
* libltdl (from GNU libtool)
* GNU gettext
* glib 2.6
* gtk 2.6 (for thumbnails, gdk-pixbuf)

When building libextractor binaries, please make sure all of these
dependencies are available.  Otherwise the build system may
automatically build only a subset of libextractor.



Writing plugins
===============


If you want to write your own extractor for some file type, all you
need to do is write a small library that implements a single function
with this signature:


KeywordList * <libraryname>_extract(const char * filename,
                                    char * data,
                                    size_t size,
                                    KeywordList * prev,
                                    const char * options);

where <libraryname> is the name of the library file that you will tell
libextractor to load, minus the suffix.  For example, if you link your
extractor into a file called 'myextractor.so', the function above
should be called 'myextractor_extract'.
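
The naming rule can be sketched as a small helper.  This is only an
illustration: 'extract_symbol_name' is a hypothetical function that is
not part of libextractor, and it assumes that "the suffix" means
everything from the last '.' in the file name onward.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of libextractor): derive the expected
 * function name "<libraryname>_extract" from a library file name such
 * as "myextractor.so".  Returns 0 on success, or -1 if the result does
 * not fit into 'out'. */
int extract_symbol_name(const char *libfile, char *out, size_t outlen)
{
    /* Skip any leading directory components. */
    const char *base = strrchr(libfile, '/');
    base = (base != NULL) ? base + 1 : libfile;

    /* ASSUMPTION: the suffix starts at the last '.' of the file name. */
    const char *dot = strrchr(base, '.');
    size_t stem = (dot != NULL) ? (size_t)(dot - base) : strlen(base);

    int n = snprintf(out, outlen, "%.*s_extract", (int)stem, base);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With this helper, both "myextractor.so" and
"/foo/bar/lib/libextractor/myextractor.so" map to
"myextractor_extract".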

The filename is the name of the file and may be NULL, data is a
pointer to the contents of the file, and size is the size of the file
in bytes.  The extract function must prepend any keywords that it
finds to the linked list 'prev' and return the new head.  The library
must allocate (malloc) both the entry in the keyword list and the
memory for the keyword string, since both will be freed by
libextractor once the application calls freeKeywords.

An example implementation can be found in mp3extractor.c.
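
For orientation, a minimal plugin might look like the sketch below.
Several things here are assumptions made for illustration only: the
'KeywordList' layout (a 'keyword' string plus a 'next' pointer) is a
stand-in for the definition in libextractor's own header, and the
"MYFM" magic bytes and the reported keyword are invented.  A real
plugin should include the libextractor header instead of declaring
the type itself.

```c
#include <stdlib.h>
#include <string.h>

/* ASSUMPTION: stand-in for the real keyword list type.  A real plugin
 * should use the definition from libextractor's header instead. */
typedef struct KeywordList {
    char *keyword;              /* malloc'ed keyword string */
    struct KeywordList *next;   /* remainder of the list */
} KeywordList;

/* Prepend a malloc'ed copy of 'text' to 'prev' and return the new
 * head.  If allocation fails, 'prev' is returned unchanged. */
static KeywordList *add_keyword(const char *text, KeywordList *prev)
{
    KeywordList *entry = malloc(sizeof *entry);
    if (entry == NULL)
        return prev;

    size_t len = strlen(text) + 1;
    entry->keyword = malloc(len);
    if (entry->keyword == NULL) {
        free(entry);
        return prev;
    }
    memcpy(entry->keyword, text, len);
    entry->next = prev;
    return entry;
}

/* Entry point for a library file named 'myextractor.so'. */
KeywordList *myextractor_extract(const char *filename,
                                 char *data,
                                 size_t size,
                                 KeywordList *prev,
                                 const char *options)
{
    (void)filename;   /* may be NULL, so it is not used here */
    (void)options;

    /* Hypothetical format check: the file starts with "MYFM". */
    if (data == NULL || size < 4 || memcmp(data, "MYFM", 4) != 0)
        return prev;  /* not our format: leave the list untouched */

    return add_keyword("application/x-myformat", prev);
}
```

Returning 'prev' unchanged when the file is not recognized keeps the
keywords found by earlier extractors intact.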



Notes
=====

On Mac OS X, libextractor will avoid using GCC 3.1, because of
problems compiling one of the extractors.  GCC 3.3 and 2.95.2 are
known to work well; as such, libextractor will first look for 3.3 (by
attempting to run gcc-3.3, cpp-3.3, and g++-3.3) and then 2.95.2 (by
attempting to run gcc2 and g++2).

exiv2 requires G++ 3.0 or higher.  With older GCC versions (and other
broken C++ compilers), you have to manually disable exiv2 by passing
"--disable-exiv2" to "configure" in order to avoid compilation
problems.


If libextractor fails to find the plugins, a method of last resort is
to set the environment variable LIBEXTRACTOR_PREFIX to the parent of
the directory where the plugins are installed (i.e., if the plugins
are in "/foo/bar/lib/libextractor/*.so", set the variable to
"/foo/bar/lib").  This should not be needed in any of the following
cases:

* "extract" is installed as "/foo/bar/bin/extract" and "/foo/bar/bin"
  is in the PATH
* you are running Linux and "libextractor.so" is installed as
  "/foo/bar/lib/libextractor.so"
* you are running Linux and the binary using libextractor resides in
  "/foo/bar/bin"
* you are running Windows and "GetModuleFileName" returns
  "/foo/bar/bin"

If none of these common circumstances apply, you may have to set the
environment variable.
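
The relationship between LIBEXTRACTOR_PREFIX and the plugin directory
can be sketched as follows.  This is not libextractor's actual lookup
code; the helper names are hypothetical, and the sketch only encodes
the rule stated above, namely that the plugins live in the
"libextractor" subdirectory of the prefix.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not part of libextractor): build the plugin
 * directory "<prefix>/libextractor" from a prefix such as
 * "/foo/bar/lib".  Returns 0 on success, or -1 if 'prefix' is NULL or
 * the result does not fit into 'out'. */
int plugin_dir_from_prefix(const char *prefix, char *out, size_t outlen)
{
    if (prefix == NULL)
        return -1;

    int n = snprintf(out, outlen, "%s/libextractor", prefix);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Same as above, but take the prefix from the LIBEXTRACTOR_PREFIX
 * environment variable.  Returns -1 if the variable is not set. */
int plugin_dir_from_env(char *out, size_t outlen)
{
    return plugin_dir_from_prefix(getenv("LIBEXTRACTOR_PREFIX"),
                                  out, outlen);
}
```

For the prefix "/foo/bar/lib" the helper yields
"/foo/bar/lib/libextractor", which is the directory that contains the
plugin files in the example above.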