Strigi is moving along at a nice pace. To keep you all posted, I’d like to report on the progress so far. Part of it has been released in 0.3.9. The rest is in SVN and will be in 0.3.10, which is not too far away.
The current development model of Strigi has much in common with the 2.6 kernel line: new features are being added whilst keeping stability but without fear of breaking APIs.
Uptake in KDE4 can only happen if it is easy. To make life really easy for developers, Strigi can be used over DBus. This means you can do searches from your favorite language. Normally, for C++ developers this still requires generating code from the DBus introspection XML. This is a bit of effort that can be avoided by using the new pregenerated code that comes with Strigi. Two classes are included: StrigiClient and StrigiAsyncClient. Using them is easy: create an instance and call the functions on it. StrigiAsyncClient has an internal queue and allows you to use signals and slots. It also allows you to remove queries from the queue if you do not need them anymore. This is very common if you make a search-as-you-type widget. In the unlikely event that Strigi has not performed the query between keystrokes, these queries can be cancelled.
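The queue-and-cancel idea behind a search-as-you-type widget can be sketched in a few lines of Python. This is only an illustration of the pattern: QueryQueue and fake_search are made-up names for this sketch and are not the StrigiAsyncClient API, whose real method names are not shown in this post.

```python
from collections import deque


class QueryQueue:
    """Hypothetical stand-in for a client-side query queue (not Strigi's API)."""

    def __init__(self, search):
        self._search = search    # callable that actually runs one query
        self._pending = deque()  # queries waiting to be run, oldest first

    def add(self, query):
        """Queue a query for later execution."""
        self._pending.append(query)

    def clear(self):
        """Drop every query that has not been run yet; return how many."""
        dropped = len(self._pending)
        self._pending.clear()
        return dropped

    def run_next(self):
        """Run the oldest pending query, or return None if nothing is queued."""
        if not self._pending:
            return None
        query = self._pending.popleft()
        return query, self._search(query)


def fake_search(query):
    # Placeholder backend: a document matches if it contains the query string.
    documents = ["strigi", "string", "steam"]
    return [doc for doc in documents if query in doc]


queue = QueryQueue(fake_search)
for typed in ["s", "st", "str"]:
    queue.clear()     # the shorter prefix is no longer interesting
    queue.add(typed)  # only the latest keystroke stays queued

result = queue.run_next()
```

Only the query for the last keystroke is ever executed; the two older ones are removed before the backend sees them.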
The current version of Strigi is very ambitious: it extracts all info it can all the time. This is laudable, but not always required. A good example is the Strigi program deepfind. This program works like ‘find’. It lists the paths of all files in a folder. Deepfind also lists the paths of the files contained in other files (and deeper). So the indexer code does not need to extract the full text of each file. Using this knowledge can speed up deepfind a lot.
The same holds for deepgrep (the deep version of grep). Deepgrep is not just an advanced grep; it can also serve as a good fallback for searching in directories that have not been indexed. But for this it should be as fast as possible. With the refactoring that has been done, it is now possible to add a configuration to the indexer so that it only extracts the values for which deepgrep has a search constraint.
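A rough sketch of what "only extract what is asked for" can look like, assuming each field has its own extractor that is run on demand. The field names (path, size, text) and the function extract_fields are invented for this illustration; Strigi's actual configuration mechanism is not described here.

```python
def extract_fields(path, data, wanted):
    """Return only the fields named in `wanted`; other extractors never run."""
    calls = []  # records which extractors actually ran, for illustration

    def get_path():
        calls.append("path")
        return path

    def get_size():
        calls.append("size")
        return len(data)

    def get_text():
        calls.append("text")  # the expensive step a deepfind-like tool can skip
        return data.decode("utf-8", errors="replace")

    # Hypothetical field names mapped to the code that computes them.
    extractors = {"path": get_path, "size": get_size, "text": get_text}

    values = {name: extractors[name]() for name in sorted(wanted) if name in extractors}
    return values, calls


# A deepfind-like caller only needs paths.
find_values, find_calls = extract_fields("a.txt", b"hello strigi", {"path"})

# A deepgrep-like caller needs the path and the text, but not the size.
grep_values, grep_calls = extract_fields("a.txt", b"hello strigi", {"path", "text"})
```

The caller's set of wanted fields plays the role of the search constraints: anything outside it costs nothing.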
Until recently, Strigi was not indexing non-ASCII characters properly in the CLucene database. Internally, all strings in Strigi are UTF-8, but CLucene has to store them as UCS-2 to be compatible with the Lucene index format, so the strings must be converted before they are passed on to the index. I never noticed that this was not happening properly, because I mainly use languages with a 26-letter alphabet. Migi pointed this flaw out to me and now this serious limitation has been fixed. China and Poland rejoice.
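To show what the conversion involves, here is a small Python sketch that decodes UTF-8 bytes and produces one 16-bit code unit per character, which is what UCS-2 stores. The function name utf8_to_ucs2_units is made up for this example and this is not the Strigi or CLucene code. For pure ASCII input the byte values and the code units are identical, which fits with the problem going unnoticed for English text; the post does not say what the actual bug was.

```python
def utf8_to_ucs2_units(raw):
    """Decode UTF-8 bytes and return one 16-bit code unit per character."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    units = []
    for ch in text:
        code = ord(ch)
        if code > 0xFFFF:
            # UCS-2 has no surrogate pairs, so it cannot hold these characters.
            raise ValueError(f"U+{code:04X} does not fit in UCS-2")
        units.append(code)
    return units


ascii_bytes = "deep".encode("utf-8")
polish_bytes = "Łódź".encode("utf-8")
chinese_bytes = "中".encode("utf-8")

ascii_units = utf8_to_ucs2_units(ascii_bytes)
polish_units = utf8_to_ucs2_units(polish_bytes)
chinese_units = utf8_to_ucs2_units(chinese_bytes)
```

For "deep" the units equal the raw bytes. For "Łódź" seven UTF-8 bytes become four code units, so copying bytes one by one into wide characters would produce the wrong text.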
deepfind and deepgrep
I did not announce the 0.3.9 release on my blog yet. It’s been out for a while and is the first version of Strigi to have deepfind and deepgrep, the applications I proposed at aKademy. These programs alone justify Strigi being included in Vista.
Especially deepgrep is cool. Did you ever feel like grepping through your email attachments, your PDF files or your office documents? Now you can!
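To make "deep" concrete, here is a minimal Python sketch of listing paths inside nested containers. It only understands zip archives via the standard zipfile module, and deep_paths and make_zip are names invented for this example. Strigi itself handles many more container formats, and this is not how its C++ code is written.

```python
import io
import zipfile


def deep_paths(name, data):
    """Yield `name`, then the paths of everything nested inside it if it is a zip."""
    yield name
    stream = io.BytesIO(data)
    if not zipfile.is_zipfile(stream):
        return  # a plain file: nothing deeper to list
    with zipfile.ZipFile(stream) as archive:
        for entry in archive.namelist():
            if entry.endswith("/"):
                continue  # skip directory entries
            # Recurse: an entry may itself be an archive.
            yield from deep_paths(f"{name}/{entry}", archive.read(entry))


def make_zip(entries):
    """Build an in-memory zip from a {name: bytes} mapping (test helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for entry_name, payload in entries.items():
            archive.writestr(entry_name, payload)
    return buffer.getvalue()


inner = make_zip({"notes.txt": b"hello strigi"})
outer = make_zip({"readme.txt": b"plain text", "attachment.zip": inner})
paths = list(deep_paths("mail.zip", outer))
```

The listing descends into attachment.zip and reports notes.txt with its full nested path, which is the behaviour that sets a deep find apart from a plain one.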
Xmlindexer, like deepfind and deepgrep, is another variation on the theme of exploiting Strigi’s libraries. Xmlindexer walks through a directory and outputs an XML file containing all the metadata and text it can extract from the files it encounters. This means that Strigi’s powers of data extraction are now available to any application that can parse XML, simply by calling xmlindexer and parsing the output.
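Consuming such output from another language could look roughly like the Python sketch below. The parser is deliberately generic, because the exact element and attribute names xmlindexer emits are not given in this post. The SAMPLE document is made up for illustration and should not be read as real xmlindexer output.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the real element and attribute names may differ.
SAMPLE = """<metadata>
  <file uri="/tmp/a.txt">
    <value name="size">12</value>
    <text>hello strigi</text>
  </file>
</metadata>"""


def summarize(xml_text):
    """Return one record per child of the root, without assuming a schema."""
    root = ET.fromstring(xml_text)
    records = []
    for item in root:
        fields = [
            (child.tag, dict(child.attrib), (child.text or "").strip())
            for child in item
        ]
        records.append({"attributes": dict(item.attrib), "fields": fields})
    return records


records = summarize(SAMPLE)
```

In practice the XML text would come from running xmlindexer on a directory and capturing its output, for example with subprocess.run. The command-line arguments are not spelled out here, so check the tool's own help before relying on a particular invocation.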
Freedesktop.org’s mailing list for standardization has seen some discussion about standardizing on metadata fields and search interfaces. Nothing definite has come of it yet, but the discourse is going in the right direction. Mikkel Kamstrup Erlandsen is keeping a running summary of the results.