How to use n-grams in Whoosh

蹲街弑〆低调 submitted on 2020-01-14 09:34:05

Question


I'm trying to use n-grams to get "autocomplete-style" searches using Whoosh. Unfortunately I'm a little confused. I have made an index like this:

import os
from whoosh.index import create_in, open_dir

if not os.path.exists("index"):
    os.mkdir("index")
ix = create_in("index", schema)

ix = open_dir("index")

writer = ix.writer()
q = MyTable.select()
for item in q:
    print('adding %s' % item.Title)
    writer.add_document(title=item.Title, content=item.content, url=item.URL)
writer.commit()

I then search it for the title field like this:

from whoosh.qparser import QueryParser

querystring = 'my search string'

parser = QueryParser("title", ix.schema)
myquery = parser.parse(querystring)

with ix.searcher() as searcher:
    results = searcher.search(myquery)
    print(len(results))

    for r in results:
        print(r)

and that works great. But I want to use this for autocomplete, and it doesn't match partial words (e.g. searching for "ant" returns "ant", but not "antelope" or "anteater"). That greatly hampers using it for autocomplete. The Whoosh documentation says to use this:

analyzer = analysis.NgramWordAnalyzer()
title_field = fields.TEXT(analyzer=analyzer, phrase=False)
schema = fields.Schema(title=title_field)

But I'm confused by that. It seems to be just "the middle" of the process: when I build my index, do I have to declare the title field as an NGRAM field (instead of TEXT)? And how do I perform the search so that searching "ant" returns ["ant", "anteater", "antelope"], etc.?
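For reference, here is a plain-Python sketch of what an n-gram analyzer does to a word; `word_ngrams` is an illustrative helper, not Whoosh's actual implementation. It shows why indexing "anteater" makes the query "ant" match:

```python
# Illustrative helper (not Whoosh's actual code): emit every substring
# of the word between minsize and maxsize characters, which is roughly
# the set of terms an n-gram analyzer would index for that word.
def word_ngrams(word, minsize=2, maxsize=4):
    grams = []
    for size in range(minsize, maxsize + 1):
        for start in range(len(word) - size + 1):
            grams.append(word[start:start + size])
    return grams

print("ant" in word_ngrams("anteater"))  # True: "ant" is among the indexed grams
```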


Answer 1:


I solved this problem by creating two separate fields: one for the actual search and one for the suggestions. The NGRAM or NGRAMWORDS field type can be used for "fuzzy search" functionality. In your case it would be something like this:

# not sure how your schema looks exactly
schema = Schema(
    title=NGRAMWORDS(minsize=2, maxsize=10, stored=True, field_boost=1.0,
                     tokenizer=None, at='start', queryor=False, sortable=False),
    content=TEXT(stored=True),
    url=ID(stored=True),
    spelling=TEXT(stored=True, spelling=True))  # typeahead field

if not os.path.exists("index"):
    os.mkdir("index")
ix = create_in("index", schema)

ix = open_dir("index")

writer = ix.writer()
q = MyTable.select()
for item in q:
    print('adding %s' % item.Title)
    writer.add_document(title=item.Title, content=item.content, url=item.URL)
    writer.add_document(spelling=item.Title)  # adding item title to typeahead field
    self.addContentToSpelling(writer, item.content)  # some method that adds some content words to the typeahead field if needed, the same way as above
writer.commit()

Then, for the search:

origQueryString = 'my search string'
words = self.splitQuery(origQueryString)  # use tokenizers / analyzers, or self-implemented
queryString = origQueryString  # would be better to actually build a query object
corrector = ix.searcher().corrector("spelling")
for word in words:
    suggestionList = corrector.suggest(word, limit=self.limit)
    for suggestion in suggestionList:
        queryString = queryString + " " + suggestion  # would be better to actually build a query object

parser = QueryParser("title", ix.schema)
myquery = parser.parse(queryString)

with ix.searcher() as searcher:
    results = searcher.search(myquery)
    print(len(results))

    for r in results:
        print(r)

Hope you get the idea.



Source: https://stackoverflow.com/questions/20040977/how-to-use-n-grams-in-whoosh
