Indexing multilingual words in Lucene


Question


I am trying to index in Lucene a field that could contain RDF literals in different languages. Most of the approaches I have seen so far are:

  • Use a single index, where each document has a field per each language it uses, or

  • Use M indexes, M being the number of languages in the corpus.

Lucene 2.9+ has a feature called payloads that allows attaching attributes to terms. Has anyone used this mechanism to store language information (or other attributes such as datatypes)? How does performance compare to the other two approaches? Any pointers to source code showing how it is done would help. Thanks.


Answer 1:


It depends.

  1. Do you want to allow something like "search all English text for 'foo'"? If so, then you will need one field per language.
  2. Or do you want "search all text for 'foo' and show the user which language each match was found in"? If this is what you want, then either payloads or separate fields will work.
  3. An alternative way to do it is to index all your text in one field, then have another field saying the language of the document (assuming each document is in a single language). Then your search would be something like +text:foo +language:english; see the sketch after this list.
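
Below is a minimal sketch of option 3, assuming the Lucene 3.x field and query API (the question mentions 2.9+; newer releases replace the Field.Index constants with TextField/StringField and use a BooleanQuery builder). The class name LanguageFieldExample and the sample text are invented for illustration:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause.Occur;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class LanguageFieldExample {
        public static void main(String[] args) throws Exception {
            Directory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36)));

            // One analyzed field holds the literal's text; a second,
            // non-analyzed field holds the language tag of the document.
            Document doc = new Document();
            doc.add(new Field("text", "foo bar baz", Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("language", "english", Field.Store.YES, Field.Index.NOT_ANALYZED));
            writer.addDocument(doc);
            writer.close();

            // The equivalent of +text:foo +language:english
            BooleanQuery query = new BooleanQuery();
            query.add(new TermQuery(new Term("text", "foo")), Occur.MUST);
            query.add(new TermQuery(new Term("language", "english")), Occur.MUST);

            IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
            TopDocs hits = searcher.search(query, 10);
            System.out.println("Matches: " + hits.totalHits);
            searcher.close();
        }
    }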

In terms of efficiency: you probably want to avoid payloads, since you would have to repeat the name of the language for every term, and you can't search based on payloads (at least not easily).




Answer 2:


Basically, Lucene is a ranking algorithm: it just looks at strings and compares them to other strings. They can be encoded in different character encodings, but their similarity is the same nonetheless. Just make sure you load the SnowballAnalyzer with the stemmer for the language you need, say Spanish or Chinese, and you should get results.
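
As a rough sketch of that suggestion, assuming the SnowballAnalyzer from the Lucene 3.x contrib-analyzers module (later releases drop it in favour of per-language analyzers); the class name SnowballIndexing and the sample sentence are invented for illustration:

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class SnowballIndexing {
        public static void main(String[] args) throws Exception {
            // Snowball stemmer names are plain language names, e.g. "Spanish", "English".
            Analyzer analyzer = new SnowballAnalyzer(Version.LUCENE_36, "Spanish");

            Directory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_36, analyzer));

            Document doc = new Document();
            // Inflected forms such as "corriendo" are reduced toward a common stem,
            // so queries analyzed with the same analyzer should match them.
            doc.add(new Field("text", "el perro está corriendo",
                              Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();
        }
    }

Note that Snowball stemmers cover mainly European languages; for Chinese text a dedicated analyzer such as SmartChineseAnalyzer would be the usual choice.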



Source: https://stackoverflow.com/questions/5264866/indexing-multilingual-words-in-lucene
