Question
I am using Lucene (or, more specifically, Compass) to log threads in a forum, and I need a way to extract the keywords behind the discussion. That said, I don't want to index every entry someone makes; rather, I'd have a list of 'keywords' that are relevant to a certain context, and if an entry matches a keyword above a threshold, I'd add that entry to the index.
I want to be able to use the power of an analyser to strip things out and do its magic, but then get the tokens back from the analyser in order to match the keywords, and also count how many times certain words are mentioned.
Is there a way to get the tokens from an analyser without having the overhead of indexing every entry made?
I was thinking I'd have to maintain a RAMDirectory to hold all entries, then perform searches using my list of keywords, and then merge the relevant Documents into the persistence manager to actually store the relevant entries.
Answer 1:
You are on the right path. You can build an index for each document using a RAMDirectory and then search it to check whether the document contains a relevant keyword. If it doesn't, discard that document; otherwise, add it to the persistent/main index.
You don't need to hold all the documents in memory at once; that would consume a lot of memory unnecessarily.
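For illustration, here is a minimal sketch of that flow against the Lucene 2.x API of the Compass era. The class name, the field name "content", the keyword list, and the matchesKeywords helper are all made up for the example, not part of the answer:

import java.io.IOException;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.RAMDirectory;

public class KeywordGate {
    // Hypothetical keyword list; in practice this would come from your context configuration.
    private static final List<String> KEYWORDS = Arrays.asList("lucene", "compass", "analyser");

    // Index the single entry in a scratch RAMDirectory and see whether any keyword hits it.
    public static boolean matchesKeywords(String entryText) throws IOException {
        RAMDirectory scratch = new RAMDirectory();
        IndexWriter writer = new IndexWriter(scratch, new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(new Field("content", entryText, Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);
        writer.close();

        // OR together one TermQuery per keyword.
        BooleanQuery query = new BooleanQuery();
        for (String keyword : KEYWORDS) {
            query.add(new TermQuery(new Term("content", keyword)), BooleanClause.Occur.SHOULD);
        }

        IndexSearcher searcher = new IndexSearcher(scratch);
        boolean matched = searcher.search(query, null, 1).totalHits > 0;
        searcher.close();
        // If matched, hand the entry on to the persistence manager / main index as usual.
        return matched;
    }
}

The one-document RAMDirectory is cheap and reuses exactly the same analysis and query logic as the main index; the trade-off against calling the analyser directly is what the next answer addresses.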
Answer 2:
You should be able to skip the RAMDirectory entirely. You can call the StandardAnalyzer directly and get it to pass the tokens (i.e. the keywords) back to you.
import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

StandardAnalyzer analyzer = new StandardAnalyzer();
// "meaningless" is just a placeholder field name; StandardAnalyzer ignores it.
TokenStream stream = analyzer.tokenStream("meaningless", new StringReader("<text>"));
Token token;
while ((token = stream.next()) != null) {
    System.out.println(token.termText());
}
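If you also need the occurrence counts the question asks about, the same loop can feed a map. A rough sketch, using the same pre-2.9 token API as above; the class name, the field name "contents", and the keyword set are placeholders:

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class KeywordCounter {
    // Count how often each keyword appears in the analysed text.
    // StandardAnalyzer lower-cases tokens, so the keyword set should be lower-cased too.
    public static Map<String, Integer> countKeywords(String text, Set<String> keywords) throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        TokenStream stream = new StandardAnalyzer().tokenStream("contents", new StringReader(text));
        Token token;
        while ((token = stream.next()) != null) {
            String term = token.termText();
            if (keywords.contains(term)) {
                Integer previous = counts.get(term);
                counts.put(term, previous == null ? 1 : previous + 1);
            }
        }
        return counts;
    }
}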
Better yet, write your own Analyzer (they're not hard to write; have a look at the source code for the existing ones) that uses your own filter to watch for your keywords.
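As a sketch of what such a filter might look like on the pre-2.9 TokenFilter API: a filter that passes every token through unchanged but records keyword hits as a side effect. The class names and the shared "seen" set are illustrative assumptions, not part of the answer:

import java.io.IOException;
import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

class KeywordWatchFilter extends TokenFilter {
    private final Set<String> keywords;
    private final Set<String> seen;

    KeywordWatchFilter(TokenStream input, Set<String> keywords, Set<String> seen) {
        super(input);
        this.keywords = keywords;
        this.seen = seen;
    }

    public Token next() throws IOException {
        Token token = input.next();
        if (token != null && keywords.contains(token.termText())) {
            seen.add(token.termText()); // remember which keywords were mentioned
        }
        return token; // pass the token through unchanged
    }
}

class KeywordWatchAnalyzer extends Analyzer {
    private final Set<String> keywords;
    private final Set<String> seen;

    KeywordWatchAnalyzer(Set<String> keywords, Set<String> seen) {
        this.keywords = keywords;
        this.seen = seen;
    }

    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Same chain as StandardAnalyzer (minus stop words), with the watcher on the end.
        TokenStream stream = new StandardTokenizer(reader);
        stream = new StandardFilter(stream);
        stream = new LowerCaseFilter(stream);
        return new KeywordWatchFilter(stream, keywords, seen);
    }
}

Plugging an analyzer like this into the indexing step (or calling its tokenStream() directly) means keyword detection happens as a side effect of normal analysis, so the tokens are only walked once.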
Source: https://stackoverflow.com/questions/1571071/querying-lucene-tokens-without-indexing