Question
In the process of using Lucene, I am a bit disappointed. I do not see or understand how I should proceed to feed any Lucene analyzer with something that is already and directly indexable, or how I should proceed to create my own analyzer...
For example, I have a List<MyCustomToken> that already contains many tokens (and actually much more information about capitalization, etc., that I would also like to index as features on each MyCustomToken).
If I understand what I have read correctly, I need to subclass Analyzer, which will call my own tokenizer subclassing TokenStream, where I will only have to provide a public final boolean incrementToken() that does the job of inserting a TermAttribute at each position.
BTW, here is where I am confused => a Tokenizer (the TokenStream subclass that produces tokens) consumes a java.io.Reader, and is thus only capable of analyzing stream-like objects such as a file or a string...
How can I proceed to have my own document analyzer that consumes my List rather than this streamed input?
It looks like the whole Lucene API is built on the idea that analysis first starts at a very low level, from a "characters" point of view, while I need to plug in later, from already-tokenized words or even expressions (groups of words).
Typical samples of Lucene usage look like this (taken from here):
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_35);
// 1. create the index
Directory index = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_35, analyzer);
IndexWriter w = new IndexWriter(index, config);
addDoc(w, "Lucene in Action"); // BUT here i would like to have a addDoc(w, MyOwnObject)
addDoc(w, "Lucene for Dummies");
addDoc(w, "Managing Gigabytes");
addDoc(w, "The Art of Computer Science");
w.close();
[...]
private static void addDoc(IndexWriter w, String value) throws IOException {
Document doc = new Document();
doc.add(new Field("title", value, Field.Store.YES, Field.Index.ANALYZED));
  // SO that I can add my own analysis here, based on many fields built from a walk through a List or more complex structures... (see the sketch just below)
w.addDocument(doc);
}
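
For illustration, a hedged sketch of the addDoc(w, MyOwnObject) variant I have in mind could look like this (MyOwnObject and its getters are hypothetical placeholders for my own structures, not real Lucene API):

// Hypothetical sketch: one Lucene field per pre-computed annotation layer.
// MyOwnObject, getText(), getPosTagsAsText() and getCapsFeaturesAsText()
// are placeholders for my own code.
private static void addDoc(IndexWriter w, MyOwnObject value) throws IOException {
  Document doc = new Document();
  doc.add(new Field("title", value.getText(), Field.Store.YES, Field.Index.ANALYZED));
  doc.add(new Field("pos", value.getPosTagsAsText(), Field.Store.NO, Field.Index.ANALYZED));
  doc.add(new Field("caps", value.getCapsFeaturesAsText(), Field.Store.NO, Field.Index.ANALYZED));
  w.addDocument(doc);
}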
PS: (my Java/Lucene knowledge is still very poor, so I may have missed something obvious about the Reader <=> List pattern?)
This question is almost the same as mine on the Lucene mailing list.
EDIT: @Jilles van Gurp => yes, you are quite right, and that was another issue I had thought of, but I first hope to find a more elegant solution. So, continuing along those lines, I could still do some kind of serialization, feed the serialized string as a document to my own analyzer, with my own tokenizer that would then deserialize it and re-do some basic tokenization (actually, just walking through the one already done...). BTW, that would add some slow and pointless extra steps that I would have liked to avoid...
About this part => does someone have a sample of a recent (Lucene >3.6) custom tokenizer providing all the underlying data necessary for a Lucene index? I have read about emitting tokens like this (a fuller sketch follows the fragment below):
posIncrement.setPositionIncrement(increment);
char[] asCharArray = myAlreadyTokenizedString.toCharArray(); // here is my workaround
termAttribute.copyBuffer(asCharArray, 0, asCharArray.length);
//termAttribute.setTermBuffer(kept);
position++;
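
For reference, here is a minimal sketch of such a token-replaying stream, assuming the Lucene 3.6 attribute API (the class name ListTokenStream and the use of plain String tokens are my own simplifications):

import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

// Minimal sketch: replays an already-tokenized List instead of reading chars.
public final class ListTokenStream extends TokenStream {
  private final Iterator<String> tokens;
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);

  public ListTokenStream(List<String> tokens) {
    this.tokens = tokens.iterator();
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!tokens.hasNext()) {
      return false; // no more pre-built tokens to emit
    }
    clearAttributes();
    char[] chars = tokens.next().toCharArray();
    termAtt.copyBuffer(chars, 0, chars.length); // same workaround as above
    posIncrAtt.setPositionIncrement(1);
    return true;
  }
}

Note that a field can consume such a stream directly and bypass the Analyzer entirely, since Field has a constructor taking a TokenStream: doc.add(new Field("title", new ListTokenStream(myTokens)));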
As for the "why am I here" part: I use some external libraries that tokenize my texts, do part-of-speech annotation, and other analyses (think of expression recognition or named-entity recognition, possibly including special features about capitalization, etc.) that I would like to keep track of in a Lucene index (the parts that really interest me are indexing and querying, not the first analysis step, which in the Lucene library is almost only tokenizing, from what I have read).
(Also, I do not think I can do something smarter in these previous/early steps, as I use many different tools, and not all of them are Java or could easily be wrapped in Java.)
So I think it is a bit sad that Lucene, which aims at working with text, is so bound to words/tokens (sequences of chars), while text is much more than just a juxtaposition of single/isolated words/tokens...
Answer 1:
Instead of trying to implement something like addDoc(w, MyOwnObject), could you use MyOwnObject.toString() and implement an @Override String toString() in your MyOwnObject class?
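
A minimal sketch of that idea, assuming the tokens are plain strings and a space is a safe separator:

import java.util.List;

// Sketch: flattens the pre-computed tokens back into one analyzable string.
public class MyOwnObject {
  private final List<String> tokens;

  public MyOwnObject(List<String> tokens) {
    this.tokens = tokens;
  }

  @Override
  public String toString() {
    StringBuilder sb = new StringBuilder();
    for (String token : tokens) {
      if (sb.length() > 0) sb.append(' '); // separator the analyzer will split on
      sb.append(token);
    }
    return sb.toString();
  }
}

Then addDoc(w, myOwnObject.toString()) works with the existing String-based addDoc.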
Answer 2:
Lucene is designed to index text, which generally comes in the form of a sequence of chars. So, the Analyzer framework is all about analyzing text and transforming it into tokens.
Now you somehow ended up with a list of tokens and want to feed it into Lucene. That doesn't quite fit the use case Lucene is optimized for. The easiest way is simply representing the list as a string (e.g. comma-separated) and then implementing a simple TokenStream that separates on whatever you chose as the separator.
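
A minimal sketch of such a separator-splitting tokenizer, assuming Lucene 3.6's CharTokenizer, a comma as the chosen separator, and that no token contains a comma itself:

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.util.Version;

// Sketch: treats every character except the comma as part of a token,
// so a comma-joined token list round-trips through the analysis chain.
public final class CommaTokenizer extends CharTokenizer {
  public CommaTokenizer(Reader input) {
    super(Version.LUCENE_36, input);
  }

  @Override
  protected boolean isTokenChar(int c) {
    return c != ','; // only the separator ends a token
  }
}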
Now the real question is how you ended up with the list and whether you can do something smarter there, but I lack insight into your use case to make sense of that.
Source: https://stackoverflow.com/questions/11142221/lucene-indexing-already-externally-tokenized-tokens-and-defining-own-analyzing