Getting the closest string match

难免孤独 2020-11-22 10:57

I need a way to compare multiple strings to a test string and return the string that most closely resembles it:

TEST STRING: THE BROWN FOX JUMPED OVER THE RED COW         


        
13 Answers
  •  感情败类
    2020-11-22 11:28

    To query a large set of text efficiently, you can use the concept of Edit Distance / Prefix Edit Distance.

    Edit Distance ED(x,y): the minimal number of transformations needed to get from term x to term y
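
    As a minimal sketch of this definition, assuming the "transformations" are the usual Levenshtein operations (insert, delete or substitute one character, each costing 1), ED can be computed with dynamic programming. The class and method names here are hypothetical and are not taken from the linked project.

```java
final class EditDistance {
    /** Levenshtein distance between x and y (assumed unit cost per operation). */
    static int distance(String x, String y) {
        int[] prev = new int[y.length() + 1];
        int[] curr = new int[y.length() + 1];
        // Turning the empty string into the first j characters of y takes j insertions.
        for (int j = 0; j <= y.length(); j++) prev[j] = j;
        for (int i = 1; i <= x.length(); i++) {
            // Turning the first i characters of x into the empty string takes i deletions.
            curr[0] = i;
            for (int j = 1; j <= y.length(); j++) {
                int substitute = prev[j - 1] + (x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1);
                int delete = prev[j] + 1;
                int insert = curr[j - 1] + 1;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            int[] tmp = prev; prev = curr; curr = tmp;  // reuse the two rows
        }
        return prev[y.length()];
    }

    public static void main(String[] args) {
        // Delete one 'L', then substitute 'Y' with 'I'.
        System.out.println(distance("HILLARY", "HILARI"));  // prints 2
    }
}
```

    The cost is proportional to the product of the two string lengths, which is why running it against every term in a large collection gets expensive.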

    But computing ED between the query text and every term is expensive in both time and resources. So instead of calculating ED for each term, we can first extract the likely matching terms using a technique called a Qgram Index, and then apply the ED calculation only to those selected terms.

    An advantage of the Qgram index technique is that it supports Fuzzy Search.

    One possible way to adopt a Qgram index is to build an Inverted Index using Qgrams. In it, we store under each Qgram all the words that contain that particular Qgram. (Instead of storing the full string you can use a unique ID for each string.) You can use the TreeMap data structure in Java for this. Following is a small example of how terms are stored:

    col : colmbia, colombo, gancola, tacolama
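
    A rough sketch of such an inverted index, using a TreeMap from each Qgram to the IDs of the terms that contain it. The class name, the method names and the choice to use a term's list position as its ID are assumptions made for illustration. Padding is left out here so the result matches the "col" entry above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class QGramInvertedIndex {
    private final int q;
    private final List<String> terms = new ArrayList<>();
    // Qgram -> IDs of the terms that contain it, in insertion order.
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    QGramInvertedIndex(int q) {
        this.q = q;
    }

    /** Adds a term and returns its ID (its position in the term list). */
    int add(String term) {
        int id = terms.size();
        terms.add(term);
        for (int i = 0; i + q <= term.length(); i++) {
            String gram = term.substring(i, i + q);
            List<Integer> ids = postings.computeIfAbsent(gram, g -> new ArrayList<>());
            // A Qgram may occur twice in one term; record each ID only once.
            if (ids.isEmpty() || ids.get(ids.size() - 1) != id) {
                ids.add(id);
            }
        }
        return id;
    }

    /** Returns the terms stored under the given Qgram. */
    List<String> termsContaining(String gram) {
        List<String> result = new ArrayList<>();
        for (int id : postings.getOrDefault(gram, Collections.emptyList())) {
            result.add(terms.get(id));
        }
        return result;
    }
}
```

    After adding colmbia, colombo, gancola and tacolama with q = 3, termsContaining("col") returns all four terms.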

    Then when querying, we calculate the number of common Qgrams between query text and available terms.

    Example: x = HILLARY, y = HILARI (query term)
    Qgrams
    $$HILLARY$$ -> $$H, $HI, HIL, ILL, LLA, LAR, ARY, RY$, Y$$
    $$HILARI$$ -> $$H, $HI, HIL, ILA, LAR, ARI, RI$, I$$
    number of q-grams in common = 4
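
    The example above can be reproduced with a short sketch. It assumes q = 3, padding with q - 1 '$' characters on each side, and that common Qgrams are counted as distinct Qgrams (a set intersection). The names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class QGrams {
    /** Pads the term with q - 1 '$' characters on each side and lists its Qgrams in order. */
    static List<String> of(String term, int q) {
        StringBuilder pad = new StringBuilder();
        for (int i = 0; i < q - 1; i++) {
            pad.append('$');
        }
        String padded = pad.toString() + term + pad.toString();
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + q <= padded.length(); i++) {
            grams.add(padded.substring(i, i + q));
        }
        return grams;
    }

    /** Number of distinct Qgrams that occur in both x and y. */
    static int common(String x, String y, int q) {
        Set<String> shared = new HashSet<>(of(x, q));
        shared.retainAll(new HashSet<>(of(y, q)));
        return shared.size();
    }

    public static void main(String[] args) {
        System.out.println(of("HILLARY", 3));
        System.out.println(common("HILLARY", "HILARI", 3));  // prints 4
    }
}
```

    The four shared Qgrams are $$H, $HI, HIL and LAR.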

    For the terms with a high number of common Qgrams, we calculate the ED/PED against the query term and then suggest the resulting terms to the end user.
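
    Putting the two phases together, here is one way the lookup might be sketched: count common Qgrams through the inverted index, keep only terms that reach a threshold, then run ED on those candidates. The threshold parameter minCommon, the class name and the use of plain ED (rather than PED) are assumptions for illustration. This is not the code from the linked project.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class FuzzySearch {
    private static final int Q = 3;
    private final List<String> terms = new ArrayList<>();
    private final Map<String, Set<Integer>> index = new TreeMap<>();  // Qgram -> term IDs

    FuzzySearch(List<String> vocabulary) {
        for (String term : vocabulary) {
            int id = terms.size();
            terms.add(term);
            for (String gram : qgrams(term)) {
                index.computeIfAbsent(gram, g -> new HashSet<>()).add(id);
            }
        }
    }

    /** Distinct Qgrams of the term, padded with Q - 1 '$' characters on each side. */
    static Set<String> qgrams(String term) {
        String padded = "$$" + term + "$$";
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + Q <= padded.length(); i++) {
            grams.add(padded.substring(i, i + Q));
        }
        return grams;
    }

    /** Levenshtein distance with unit costs. */
    static int editDistance(String x, String y) {
        int[] prev = new int[y.length() + 1];
        int[] curr = new int[y.length() + 1];
        for (int j = 0; j <= y.length(); j++) prev[j] = j;
        for (int i = 1; i <= x.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= y.length(); j++) {
                int substitute = prev[j - 1] + (x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(substitute, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[y.length()];
    }

    /** Closest term by ED among terms sharing at least minCommon Qgrams with the query, or null. */
    String closest(String query, int minCommon) {
        // Phase 1: count common Qgrams per term via the inverted index.
        Map<Integer, Integer> hits = new HashMap<>();
        for (String gram : qgrams(query)) {
            for (int id : index.getOrDefault(gram, Collections.emptySet())) {
                hits.merge(id, 1, Integer::sum);
            }
        }
        // Phase 2: run the expensive ED only on the surviving candidates.
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (Map.Entry<Integer, Integer> entry : hits.entrySet()) {
            if (entry.getValue() < minCommon) continue;
            String candidate = terms.get(entry.getKey());
            int d = editDistance(query, candidate);
            if (d < bestDistance) {
                bestDistance = d;
                best = candidate;
            }
        }
        return best;
    }
}
```

    With the vocabulary HILLARY, HILBERT, MALLORY and COLOMBO, closest("HILARI", 2) runs ED only on HILLARY and HILBERT, since MALLORY and COLOMBO share no Qgrams with the query, and returns HILLARY. A lower threshold admits more candidates at the cost of more ED computations.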

    You can find an implementation of this theory in the following project (see "QGramIndex.java"). Feel free to ask any questions. https://github.com/Bhashitha-Gamage/City_Search

    To study more about Edit Distance, Prefix Edit Distance and the Qgram index, please watch the following video by Prof. Dr. Hannah Bast: https://www.youtube.com/embed/6pUg2wmGJRo (the lesson starts at 20:06)
