Why tokenize text in Lucene?

人盡茶涼 submitted on 2020-01-06 15:11:15

Question


I'm a beginner with Lucene. Here's my source:

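    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.StringField;

    // Stored field whose value is indexed as one single token
    FieldType ft = new FieldType(StringField.TYPE_STORED);
    ft.setTokenized(false);
    ft.setStored(true);
    // Stored field whose value is split into tokens before indexing
    FieldType ftNA = new FieldType(StringField.TYPE_STORED);
    ftNA.setTokenized(true);
    ftNA.setStored(true);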

Why tokenize in Lucene? For example, take the String value "my name is lee":

  • tokenized: "my", "name", "is", "lee"
  • not tokenized: "my name is lee"

I don't understand why text is indexed as tokens. What is the difference between tokenized and not tokenized?


Answer 1:


Lucene works by finding tokens in documents which satisfy constraints expressed by a query.

For instance, if you search for lee, the query finds all documents that contain the token lee. If the field isn't tokenized, the whole value my name is lee is indexed as a single token, so you can only find it by searching for exactly my name is lee, not for lee alone.

Now suppose you search for "is lee" (with the quotes). This is a PhraseQuery, which means it matches the token is immediately followed by the token lee.
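
To make the phrase case concrete, here is a minimal sketch in plain Java of how a phrase can be matched once token positions are known. It is not Lucene code: the class PhraseSketch and its methods positions and matchesPhrase are hypothetical names made up for this illustration, and splitting on whitespace stands in for a real analyzer.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Lucene code: a positional index for one document.
class PhraseSketch {

    // Maps each token to the positions at which it occurs in the text.
    // Assumption: lowercasing and whitespace splitting replace a real analyzer.
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> index = new HashMap<>();
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            index.computeIfAbsent(tokens[pos], t -> new ArrayList<>()).add(pos);
        }
        return index;
    }

    // True if `first` occurs at some position p and `second` occurs at p + 1.
    static boolean matchesPhrase(Map<String, List<Integer>> index,
                                 String first, String second) {
        List<Integer> secondPositions =
                index.getOrDefault(second, Collections.emptyList());
        for (int p : index.getOrDefault(first, Collections.emptyList())) {
            if (secondPositions.contains(p + 1)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> index = positions("my name is lee");
        System.out.println(matchesPhrase(index, "is", "lee"));   // true
        System.out.println(matchesPhrase(index, "lee", "is"));   // false
        System.out.println(matchesPhrase(index, "name", "lee")); // false
    }
}
```

The sketch only handles two-token phrases with no gap between them, which is enough to show why both tokens and their order matter.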

Tokenization is needed because Lucene works with an inverted index, i.e., it maps tokens to the documents that contain them.
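
As a rough illustration of that mapping, the following plain-Java sketch builds a toy inverted index and compares lookups on a tokenized and an untokenized value. It is not how Lucene is implemented: the class InvertedIndexSketch, its add and search methods, and the second document "my name is kim" are all assumptions introduced for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch, not Lucene code: a toy inverted index.
class InvertedIndexSketch {
    // token -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Adds a document. When tokenized is false, the whole value is one token.
    // Assumption: whitespace splitting stands in for a real analyzer.
    void add(int docId, String value, boolean tokenized) {
        String[] tokens = tokenized ? value.split("\\s+") : new String[] { value };
        for (String token : tokens) {
            postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
        }
    }

    // Returns the ids of the documents indexed under exactly this token.
    Set<Integer> search(String token) {
        return postings.getOrDefault(token, Collections.emptySet());
    }

    public static void main(String[] args) {
        InvertedIndexSketch tokenized = new InvertedIndexSketch();
        tokenized.add(1, "my name is lee", true);
        tokenized.add(2, "my name is kim", true); // made-up second document

        InvertedIndexSketch untokenized = new InvertedIndexSketch();
        untokenized.add(1, "my name is lee", false);

        System.out.println(tokenized.search("lee"));                // [1]
        System.out.println(tokenized.search("name"));               // [1, 2]
        System.out.println(untokenized.search("lee"));              // []
        System.out.println(untokenized.search("my name is lee"));   // [1]
    }
}
```

In the untokenized index the only key is the full string, which is why a search for lee alone comes back empty.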



Source: https://stackoverflow.com/questions/29457148/why-tokenize-texts-in-lucene
