Sentence aware search with Lucene SpanQueries

Submitted by 青春壹個敷衍的年華 on 2019-12-21 12:29:19

Question


Is it possible to use a Lucene SpanQuery to find all occurrences where the terms "red" "green" and "blue" all appear within a single sentence?

My first (incomplete/incorrect) approach is to write an analyzer that places a special sentence-marker token at the beginning of each sentence, in the same position as the first word of the sentence, and then to query for something similar to the following:

SpanQuery termsInSentence = new SpanNearQuery(
  new SpanQuery[] {
    new SpanTermQuery( new Term (MY_SPECIAL_SENTENCE_TOKEN)),
    new SpanTermQuery( new Term ("red")),
    new SpanTermQuery( new Term ("green")),
    new SpanTermQuery( new Term ("blue")),
  },
  999999999999,
  false
);

SpanQuery nextSentence = new SpanTermQuery( new Term (MY_SPECIAL_SENTENCE_TOKEN));

SpanNotQuery notInNextSentence = new SpanNotQuery(termsInSentence, nextSentence);

The problem, of course, is that nextSentence isn't really the next sentence, it's any sentence marker, including the one in the sentence that termsInSentence matches. Therefore this won't work.

My next approach is to create an analyzer that places the token before the sentence (that is, before the first word rather than in the same position as the first word). The problem with this is that I then have to account for the extra offset introduced by MY_SPECIAL_SENTENCE_TOKEN. What's more, this will be particularly bad at first, while I'm using a naive pattern to split sentences (e.g. splitting on /\.\s+[A-Z0-9]/), because I'll have to account for all of the false sentence markers when I search for U. S. S. Enterprise.
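To illustrate the problem, here is a quick sketch (the sample text and helper name are hypothetical) showing the false breaks that naive pattern produces on abbreviations:

```java
import java.util.Arrays;

public class NaiveSplitDemo {

    // The naive rule: a period, whitespace, then a capital letter or digit
    // marks a sentence break. A lookahead keeps the capital with the
    // following fragment instead of consuming it.
    static String[] naiveSentences(String text) {
        return text.split("\\.\\s+(?=[A-Z0-9])");
    }

    public static void main(String[] args) {
        // Two real sentences, but the abbreviation triggers three false
        // breaks, yielding five fragments.
        System.out.println(Arrays.toString(
            naiveSentences("We toured the U. S. S. Enterprise. It was huge.")));
    }
}
```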

So... how should I approach this?


Answer 1:


I would index each sentence as a Lucene document, including a field that marks which source document the sentence came from. Depending on your source material, the overhead of one Lucene document per sentence may be acceptable.
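As a sketch of that approach (the field names "source" and "text", the sentence splitter, and the sample input are my own assumptions; the Lucene indexing calls are only indicated in comments):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentencePerDocument {

    // Split a source document into sentences; each (sourceId, sentence)
    // pair would become one Lucene document carrying a "source" field and
    // a "text" field.
    static List<String[]> sentenceRecords(String sourceId, String body) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(body);
        List<String[]> records = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = body.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                records.add(new String[] { sourceId, sentence });
                // With Lucene this would be roughly:
                //   Document doc = new Document();
                //   doc.add(new StringField("source", sourceId, Field.Store.YES));
                //   doc.add(new TextField("text", sentence, Field.Store.YES));
                //   writer.addDocument(doc);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        for (String[] r : sentenceRecords("doc42",
                "Red and green mix well. Blue stands apart.")) {
            System.out.println(r[0] + " -> " + r[1]);
        }
    }
}
```

A sentence-level match then reports its "source" field, so results can still be grouped back by original document.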




Answer 2:


Actually, it looks like you are quite close to the solution. Indexing an end-of-sentence flag is a good approach. The problem is that the end-of-sentence flag appears inside your SpanNearQuery, which is what is throwing you off: you are asking for a span that both contains and does not contain MY_SPECIAL_SENTENCE_TOKEN. The query contradicts itself, so of course it finds no matches. What you really need to know is that the three terms ("red", "green", and "blue") occur in a span that does not overlap with MY_SPECIAL_SENTENCE_TOKEN (that is, the sentence token does not appear between those terms).

Also, the lack of field names in the Term constructors would be a problem, but Lucene should throw an exception complaining about that, so I'm guessing it's not the real problem here. It could be that the Lucene version at the time this was written did not complain about mismatched fields in SpanNearQuerys, so it seems worth mentioning.

This appears to work for me:

SpanQuery termsInSentence = new SpanNearQuery(
    new SpanQuery[] {
        new SpanTermQuery( new Term("text", "red")),
        new SpanTermQuery( new Term("text", "green")),
        new SpanTermQuery( new Term("text", "blue")),
    },
    9999,
    false
);

SpanQuery nextSentence = new SpanTermQuery( new Term("text", MY_SPECIAL_SENTENCE_TOKEN));

SpanQuery notInNextSentence = new SpanNotQuery(termsInSentence, nextSentence);

As far as where to split sentences, instead of using the naive regex approach, I would try using java.text.BreakIterator. It's not perfect, but it does a pretty good job.
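A minimal sketch of BreakIterator-based splitting (the helper name and sample text are just for illustration):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceBoundaries {

    // Collect sentences using the JDK's rule-based boundary analysis.
    // It is not perfect on abbreviations, but it handles '?', '!', and
    // ordinary prose far better than a bare regex.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sentences("Is it red? It is blue."));
    }
}
```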



Source: https://stackoverflow.com/questions/10593715/sentence-aware-search-with-lucene-spanqueries
