How to get frequently occurring phrases with Lucene

本秂侑毒 提交于 2019-11-29 02:32:35

Julia, It seems what you are looking for is n-grams, specifically Bigrams (also called collocations).

Here's a chapter about finding collocations (PDF) from Manning and Schutze's Foundations of Statistical Natural Language Processing.

In order to do this with Lucene, I suggest using Solr with ShingleFilterFactory. Please see this discussion for details.

Is it possible for you to post any code that you have written?

Basically a lot depends on the way you create your fields and store documents in lucene.

Lets consider a case where I have got two fields: ID and Comments; and in my ID field I allow values like this 'finding nemo' i.e. strings with space. Whereas 'Comments' is a free flow text field i.e. I allow anything and everything which my keyboard allows and what lucene can understand.

Now in real life scenario it does not make sense to make my ID:'finding nemo' as two different searchable string. Whereas I want to index everything in Comments.

So what I will do is, I will create a document (org.apache.lucene.document.Document) object to take care of this... Something like this

Document doc = new Document();
doc.add(new Field("comments","Finding nemo was a very tough job for a clown fish ...", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("id", "finding nemo", Field.Store.YES, Field.Index.NOT_ANALYZED));

So, essentially I have created two fields:

  1. comments: Where I have preferred to analyze it by using Field.Index.ANALYZED
  2. id: Where I directed lucene to store it but do not analyze it Field.Index.NOT_ANALYZED

This is how you customize lucene for Default Tokenizer and analyser. Otherwise you can write your own Tokenizer and analyzers.

Link(s) http://darksleep.com/lucene/

Hope this will help you... :)

Abhinav Upadhyay

Well the problem of losing the context for phrases can be solved by using PhraseQuery.

An index by default contains positional information of terms, as long as you did not create pure Boolean fields by indexing with the omitTermFreqAndPositions option. PhraseQuery uses this information to locate documents where terms are within a certain distance of one another.

For example, suppose a field contained the phrase “the quick brown fox jumped over the lazy dog”. Without knowing the exact phrase, you can still find this document by searching for documents with fields having quick and fox near each other. Sure, a plain TermQuery would do the trick to locate this document knowing either of those words, but in this case we only want documents that have phrases where the words are either exactly side by side (quick fox) or have one word in between (quick [irrelevant] fox). The maximum allowable positional distance between terms to be considered a match is called slop. Distance is the number of positional moves of terms to reconstruct the phrase in order.

Check out Lucene's JavaDoc for PhraseQuery

See this example code which demonstrates how to work with various Query Objects:

You can also try to combine various query types with the help of the BooleanQuery class.

And regarding the frequency of phrases, I suppose Lucene's scoring considers the frequency of the terms occurring in the documents.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!