I need a way to compare multiple strings to a test string and return the string that closely resembles it:
TEST STRING: THE BROWN FOX JUMPED OVER THE RED COW
To query a large set of text in efficient manner you can use the concept of Edit Distance/ Prefix Edit Distance.
Edit Distance ED(x,y): minimal number of transfroms to get from term x to term y
But computing ED between each term and query text is resource and time intensive. Therefore instead of calculating ED for each term first we can extract possible matching terms using a technique called Qgram Index. and then apply ED calculation on those selected terms.
An advantage of Qgram index technique is it supports for Fuzzy Search.
One possible approach to adapt QGram index is build an Inverted Index using Qgrams. In there we store all the words which consists with particular Qgram, under that Qgram.(Instead of storing full string you can use unique ID for each string). You can use Tree Map data structure in Java for this. Following is a small example on storing of terms
col : colmbia, colombo, gancola, tacolama
Then when querying, we calculate the number of common Qgrams between query text and available terms.
Example: x = HILLARY, y = HILARI(query term)
Qgrams
$$HILLARY$$ -> $$H, $HI, HIL, ILL, LLA, LAR, ARY, RY$, Y$$
$$HILARI$$ -> $$H, $HI, HIL, ILA, LAR, ARI, RI$, I$$
number of q-grams in common = 4
number of q-grams in common = 4.
For the terms with high number of common Qgrams, we calculate the ED/PED against the query term and then suggest the term to the end user.
you can find an implementation of this theory in following project(See "QGramIndex.java"). Feel free to ask any questions. https://github.com/Bhashitha-Gamage/City_Search
To study more about Edit Distance, Prefix Edit Distance Qgram index please watch the following video of Prof. Dr Hannah Bast https://www.youtube.com/embed/6pUg2wmGJRo (Lesson starts from 20:06)