问题
I have got some direction from this question. I first make the index like below.
import lucene
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.index import IndexWriterConfig, IndexWriter, DirectoryReader
from org.apache.lucene.store import SimpleFSDirectory
from java.nio.file import Paths
from org.apache.lucene.document import Document, Field, TextField
from org.apache.lucene.util import BytesRefIterator
index_path = "./index"
lucene.initVM()
analyzer = StandardAnalyzer()
config = IndexWriterConfig(analyzer)
if len(os.listdir(index_path))>0:
config.setOpenMode(IndexWriterConfig.OpenMode.APPEND)
store = SimpleFSDirectory(Paths.get(index_path))
writer = IndexWriter(store, config)
doc = Document()
doc.add(Field("docid", "1", TextField.TYPE_STORED))
doc.add(Field("title", "qwe rty", TextField.TYPE_STORED))
doc.add(Field("description", "uio pas", TextField.TYPE_STORED))
writer.addDocument(doc)
writer.close()
store.close()
I then try to get all the terms in the index for one field like below.
store = SimpleFSDirectory(Paths.get(index_path))
reader = DirectoryReader.open(store)
Attempt 1: trying to use the next()
as used in this question which seems to be a method of BytesRefIterator
implemented by TermsEnum
.
for lrc in reader.leaves():
terms = lrc.reader().terms('title')
terms_enum = terms.iterator()
while terms_enum.next():
term = terms_enum.term()
print(term.utf8ToString())
However, I can't seem to be able to access that next()
method.
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-47-6515079843a0> in <module>
2 terms = lrc.reader().terms('title')
3 terms_enum = terms.iterator()
----> 4 while terms_enum.next():
5 term = terms_enum.term()
6 print(term.utf8ToString())
AttributeError: 'TermsEnum' object has no attribute 'next'
Attempt 2: trying to change the while loop as suggested in the comments of this question.
while next(terms_enum):
term = terms_enum.term()
print(term.utf8ToString())
However, it seems TermsEnum
is not understood to be an iterator by Python.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-48-d490ad78fb1c> in <module>
2 terms = lrc.reader().terms('title')
3 terms_enum = terms.iterator()
----> 4 while next(terms_enum):
5 term = terms_enum.term()
6 print(term.utf8ToString())
TypeError: 'TermsEnum' object is not an iterator
I am aware that my question can be answered as suggested in this question. Then I guess my question really is, how do I get all the terms in TermsEnum
?
回答1:
I found that the below works from here and from test_FieldEnumeration()
in the test_Pylucene.py
file which is in pylucene-8.6.1/test3/
.
for term in BytesRefIterator.cast_(terms_enum):
print(term.utf8ToString())
Happy to accept an answer that has more explanation than this.
来源:https://stackoverflow.com/questions/64960527/how-to-get-a-list-of-all-tokens-from-lucene-8-6-1-index-using-pylucene