Solr, Special Chars, and Latin to Cyrillic char conversion

Posted by 独自空忆成欢 on 2019-12-14 03:48:49

Question


I am trying to set up a search engine using Solr (or Lucene) that could hold text in either Latin script with special characters (such as Ö or Ç) or Cyrillic script (such as Б б and Ж ж).

I am looking for a way to let users search for words containing these characters even when they do not have the corresponding keys on their keyboard.

Example would be (making up words here, hopefully won't offend anyone):

  • "BÖÖK" would be found when searching for "book"
  • "ЖRAY" would be found when searching for "XRAY"
  • "ЖRAY" would also be found when searching for "ZRAY", "ZHRAY", or "žray" (see GOST 16876-71 for information on the transliteration of Cyrillic to Latin characters)

So, how should I go about this? Some theories I have are:

  • store multiple text fields for each original string: one in its original form, one after a first pass of transliteration (which, for example, would convert Ö to plain O, and Ж to ž but also to X), and a third after a second pass (ž to z or zh). This means I will be storing a LOT of data...
  • store the text in Solr as is, and let Solr do the magic. I don't know how well this will work, and I can't see anything in Solr that does this
  • Magic bullet I have not found yet...

Any ideas? Anyone tried this before?


Answer 1:


Take a look at Solr's documentation on Analyzers, Tokenizers, and Token Filters, which gives a good introduction to the type of manipulation you're looking for.
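
As a rough illustration of what such an analysis chain might look like for the Cyrillic case, here is a minimal sketch. It assumes a Solr version that ships the ICU analysis module (analysis-extras) and that the module's jars are on the classpath. The field type name `text_translit` is made up for this example. The transform ID `Cyrillic-Latin` is an ICU identifier, and its mapping follows ICU's rules, which may differ from GOST 16876-71.

```xml
<!-- Hypothetical field type: transliterate Cyrillic, then fold to plain ASCII -->
<fieldType name="text_translit" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Assumes the ICU module is installed; should turn Ж into ž -->
    <filter class="solr.ICUTransformFilterFactory" id="Cyrillic-Latin"/>
    <!-- Strips the remaining diacritics, e.g. ž to z -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, "ЖRAY" and "žray" should both end up as the same indexed term as "zray". Alternative spellings such as "ZHRAY" or "XRAY" are not produced by this chain and would still need something extra, for example synonyms or additional fields.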




Answer 2:


You need to use an accent filter in both your index-time and query-time text analysis, which converts accented characters to their unaccented English equivalents.

You can use ISOLatin1AccentFilterFactory or ASCIIFoldingFilterFactory depending upon the Solr version you are using.

e.g.

 <filter class="solr.ASCIIFoldingFilterFactory" />

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ISOLatin1AccentFilterFactory
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory

So "BÖÖK" would be folded to "BOOK" when it is indexed in Solr (and to "book" if the analyzer also applies a lowercase filter).
Because the same analysis is applied to the query, users can search for either book or BÖÖK and still get the document back.
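
For context, the filter line above lives inside a field type definition in the schema. The following is only a sketch of one possible arrangement: the field type name `text_folded` is invented for this example, and the tokenizer and lowercase filter are assumptions that the answer itself does not specify. A single `<analyzer>` element with no `type` attribute is used for both indexing and querying.

```xml
<!-- Hypothetical field type; one analyzer shared by index and query -->
<fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Folds accented Latin characters: Ö to O, Ç to C -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- Folding does not change case, so lowercase separately: BOOK to book -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that ASCII folding only covers characters that have an ASCII equivalent, so Cyrillic letters such as Ж pass through this chain unchanged.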



Source: https://stackoverflow.com/questions/7662547/solr-special-chars-and-latin-to-cyrillic-char-conversion
