Build an index for substring search?

守給你的承諾、 posted on 2019-11-30 04:58:01

Question


I want to do general substring search over billions of strings. The requirement differs a little from general full-text search because I want a query such as "ubst" to also hit "substr".

Is Lucene or Sphinx capable of doing this? If not, what do you think is the best way to do it?


Answer 1:


The best index structure for this case is a suffix tree. Lucene does not implement this type of index, so its substring search is slow. However, Lucene does have a prefix tree index, which means searches are fast when you look up terms by their prefix.
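To illustrate why a suffix-based index helps, here is a minimal Python sketch of the closely related suffix-array idea: every substring of a string is a prefix of one of its suffixes, so a sorted list of suffixes turns substring search into a prefix lookup by binary search. The names `build_suffix_index` and `search` are hypothetical, and this naive version stores full suffix copies, so it is only an illustration and would not scale to billions of strings as written.

```python
from bisect import bisect_left


def build_suffix_index(strings):
    """Return a sorted list of (suffix, string_id) pairs covering every suffix."""
    entries = []
    for string_id, s in enumerate(strings):
        for start in range(len(s)):
            # Naive: stores a copy of each suffix (quadratic memory per string).
            entries.append((s[start:], string_id))
    entries.sort()
    return entries


def search(entries, query):
    """Return sorted ids of the strings that contain `query` as a substring."""
    # All suffixes starting with `query` form one contiguous run in sorted order.
    # The 1-tuple (query,) sorts just before any (query, id) pair.
    pos = bisect_left(entries, (query,))
    hits = set()
    while pos < len(entries) and entries[pos][0].startswith(query):
        hits.add(entries[pos][1])
        pos += 1
    return sorted(hits)


strings = ["substr", "abstract", "lucene"]
index = build_suffix_index(strings)
print(search(index, "ubst"))
print(search(index, "bstr"))
```

A production suffix array or suffix tree would store offsets into the original text rather than suffix copies; the lookup logic stays the same.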




Answer 2:


Lucene is one of the best available options. It supports substring search, so "ubst" will match "substr".

Check out http://wiki.apache.org/lucene-java/LuceneImplementations for a suitable language implementation.




Answer 3:


Sphinx has supported efficient substring search since version 2.0.1-beta (22 Apr 2011). Unfortunately, as of today this support exists only in beta versions, as mentioned here.

I tried the 2.1.1 beta version, and it seems to work correctly. See the manual entry for the dict option and read about the keywords type.

When I tried the 2.0.6 release version, it fell back to the inefficient crc dictionary and gave the following warning during indexing:

WARNING: min_infix_len is not supported yet with dict=keywords; using dict=crc

My minimal configuration file:

source sour
{
  type = xmlpipe2
  xmlpipe_command = type C:\Temp\1\sphinx\input.xml
}

index inde
{
  source = sour
  path = testpa
  enable_star = 1
  dict = keywords
  charset_type = utf-8
  min_infix_len = 1
}
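As a rough illustration of what indexing infixes means (a conceptual sketch only, not Sphinx's actual implementation), the Python snippet below maps every infix of at least `min_infix_len` characters to the documents whose words contain it. The function names `build_infix_index` and `infix_search` are hypothetical; the `min_infix_len` parameter is named after the config option above purely for readability.

```python
from collections import defaultdict


def build_infix_index(docs, min_infix_len=1):
    """Map each word infix of length >= min_infix_len to the ids of docs containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            for start in range(len(word)):
                for end in range(start + min_infix_len, len(word) + 1):
                    index[word[start:end]].add(doc_id)
    return index


def infix_search(index, infix):
    """Return sorted ids of docs that have a word containing `infix`."""
    # .get avoids creating empty entries in the defaultdict on a miss.
    return sorted(index.get(infix.lower(), ()))


docs = ["build a substr index", "full text search"]
index = build_infix_index(docs, min_infix_len=1)
print(infix_search(index, "ubst"))
print(infix_search(index, "ex"))
```

Precomputing every infix makes lookups trivial but inflates the index considerably, which is why a larger minimum infix length is a common trade-off.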


Source: https://stackoverflow.com/questions/6838690/build-an-index-for-substring-search
