Question
Given a term match in a document, what's the best way to access the words around that match? I have read this article: http://searchhub.org//2009/05/26/accessing-words-around-a-positional-match-in-lucene/, but the Lucene API has changed completely since that post (2009). Could someone show me how to do this in a newer version of Lucene, such as Lucene 4.6.1?
EDIT:
I have figured this out now. The postings APIs (TermEnum, TermDocsEnum, TermPositionsEnum) have been removed in favor of the new flexible indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum, DocsEnum, DocsAndPositionsEnum). One big difference is that fields and terms are now enumerated separately: a TermsEnum provides a BytesRef (which wraps a byte[]) per term within a single field, not a Term. Another is that when asking for a Docs/AndPositionsEnum, you now specify the skipDocs explicitly (typically this will be the deleted docs, but in general you can provide any Bits). Here is the updated code:
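import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class TermVectorFun {
    public static String[] DOCS = {
        "The quick red fox jumped over the lazy brown dogs.",
        "Mary had a little lamb whose fleece was white as snow.",
        "Moby Dick is a story of a whale and a man obsessed.",
        "The robber wore a black fleece jacket and a baseball cap.",
        "The English Springer Spaniel is the best of all dogs.",
        "The fleece was green and red",
        "History looks fondly upon the story of the golden fleece, but most people don't agree"
    };

    public static void main(String[] args) throws IOException {
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        // Index some made-up content
        IndexWriter writer = new IndexWriter(ramDir, config);
        for (int i = 0; i < DOCS.length; i++) {
            Document doc = new Document();
            // These Field constructors are deprecated in 4.x but still available in 4.6
            Field id = new Field("id", "doc_" + i, Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
            doc.add(id);
            // Store both position and offset information in the term vector
            Field text = new Field("content", DOCS[i], Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS);
            doc.add(text);
            writer.addDocument(doc);
        }
        writer.close();
        // Get a searcher
        DirectoryReader dirReader = DirectoryReader.open(ramDir);
        IndexSearcher searcher = new IndexSearcher(dirReader);
        // Do a search using SpanQuery
        SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
        TopDocs results = searcher.search(fleeceQ, 10);
        for (ScoreDoc scoreDoc : results.scoreDocs) {
            System.out.println("Score Doc: " + scoreDoc);
        }
        int window = 2; // get the words within two positions of the match
        // Spans report segment-relative doc IDs, so walk every segment and read
        // the term vectors from that segment's own reader
        for (AtomicReaderContext leaf : dirReader.leaves()) {
            Spans spans = fleeceQ.getSpans(leaf, null, new LinkedHashMap<Term, TermContext>());
            while (spans.next()) {
                int start = spans.start() - window;
                int end = spans.end() + window; // spans.end() is exclusive, so end is too
                Map<Integer, String> entries = new TreeMap<Integer, String>();
                System.out.println("Doc: " + spans.doc() + " Start: " + start + " End: " + end);
                Fields fields = leaf.reader().getTermVectors(spans.doc());
                Terms terms = (fields == null) ? null : fields.terms("content");
                if (terms == null) {
                    continue; // no term vector stored for this document
                }
                TermsEnum termsEnum = terms.iterator(null);
                BytesRef text;
                while ((text = termsEnum.next()) != null) {
                    // Could store the BytesRef here, but String is easier for this example.
                    // utf8ToString() decodes as UTF-8 instead of the platform default charset.
                    String s = text.utf8ToString();
                    DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null);
                    if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                        int freq = positionsEnum.freq();
                        for (int i = 0; i < freq; i++) {
                            int position = positionsEnum.nextPosition();
                            if (position >= start && position < end) {
                                entries.put(position, s);
                            }
                        }
                    }
                }
                System.out.println("Entries:" + entries);
            }
        }
        dirReader.close();
    }
}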
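The windowing step can also be looked at on its own, without Lucene on the classpath. The following is only a sketch under assumptions: the class name PositionWindow and the method around are made up for illustration, and the position map is hand-written rather than read from a term vector. It assumes that the match end is exclusive, as with spans.end(), and that removed stop words leave gaps in the positions.

```java
import java.util.Map;
import java.util.TreeMap;

public class PositionWindow {

    /**
     * Returns the terms whose positions lie within `window` positions of a match.
     * matchStart is inclusive and matchEnd is exclusive (hypothetical helper).
     */
    public static Map<Integer, String> around(Map<Integer, String> termsByPosition,
                                              int matchStart, int matchEnd, int window) {
        int from = matchStart - window; // inclusive lower bound, may be negative
        int to = matchEnd + window;     // exclusive upper bound
        Map<Integer, String> entries = new TreeMap<Integer, String>();
        for (Map.Entry<Integer, String> e : termsByPosition.entrySet()) {
            int position = e.getKey();
            if (position >= from && position < to) {
                entries.put(position, e.getValue());
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Assumed positions for "The fleece was green and red" after stop-word removal
        Map<Integer, String> doc = new TreeMap<Integer, String>();
        doc.put(1, "fleece");
        doc.put(3, "green");
        doc.put(5, "red");
        // The match on "fleece" covers positions [1, 2)
        System.out.println(around(doc, 1, 2, 2)); // prints {1=fleece, 3=green}
    }
}
```

Because the map is keyed by position, the TreeMap returns the surrounding terms in document order, which is the same reason the code above collects its entries in a TreeMap.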
Answer 1:
Use Highlighter. Highlighter.getBestFragment can be used to get a portion of the text containing the best match. Something like:
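TopDocs docs = searcher.search(query, maxdocs);
Document firstDoc = searcher.doc(docs.scoreDocs[0].doc);
// QueryScorer implements org.apache.lucene.search.highlight.Scorer
QueryScorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(scorer);
// The field must be stored, otherwise firstDoc.get(fieldName) returns null
String fragment = highlighter.getBestFragment(myAnalyzer, fieldName, firstDoc.get(fieldName));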
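For intuition about what a fragment around a match looks like, here is a Lucene-free sketch that works on the raw stored text. It is not how Highlighter is implemented: the class NaiveFragment and the method firstFragment are hypothetical, and whitespace splitting plus punctuation stripping is a naive stand-in for a real analyzer.

```java
import java.util.Arrays;
import java.util.Locale;

public class NaiveFragment {

    /** Returns up to `window` words on each side of the first word matching `term`, or null. */
    public static String firstFragment(String text, String term, int window) {
        String[] words = text.trim().split("\\s+");
        String wanted = term.toLowerCase(Locale.ROOT);
        for (int i = 0; i < words.length; i++) {
            // Drop punctuation and case so that "fleece," still matches "fleece"
            String normalized = words[i].replaceAll("[^\\p{L}\\p{N}]", "").toLowerCase(Locale.ROOT);
            if (normalized.equals(wanted)) {
                int from = Math.max(0, i - window);
                int to = Math.min(words.length, i + window + 1); // exclusive
                return String.join(" ", Arrays.copyOfRange(words, from, to));
            }
        }
        return null; // no match found
    }

    public static void main(String[] args) {
        String text = "Mary had a little lamb whose fleece was white as snow.";
        System.out.println(firstFragment(text, "fleece", 2)); // prints lamb whose fleece was white
    }
}
```

Unlike the term-vector approach in the question, this keeps stop words and the original casing, since it reads the stored text rather than the indexed terms.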
Source: https://stackoverflow.com/questions/25814445/accessing-words-around-a-positional-match-in-lucene