I need a way to compare multiple strings to a test string and return the string that most closely resembles it:
TEST STRING: THE BROWN FOX JUMPED OVER THE RED COW
To query a large set of text efficiently, you can use the concept of Edit Distance / Prefix Edit Distance.
Edit Distance ED(x, y): the minimal number of edit operations (insertions, deletions, substitutions) needed to transform term x into term y. For example, ED(HILLARY, HILARI) = 2: delete one L, then substitute Y with I.
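For illustration, here is a minimal Python sketch of the usual dynamic-programming recurrence (function names are mine, not from any library). PED can be read off the same table by taking the minimum over the last row, i.e. the best match of x against any prefix of y:

def _dp_table(x, y):
    # d[i][j] = edit distance between x[:i] and y[:j]
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete x[i-1]
                          d[i][j - 1] + 1,         # insert y[j-1]
                          d[i - 1][j - 1] + cost)  # substitute or match
    return d

def edit_distance(x, y):
    return _dp_table(x, y)[len(x)][len(y)]

def prefix_edit_distance(x, y):
    # PED(x, y) = minimum of ED(x, y') over all prefixes y' of y
    return min(_dp_table(x, y)[len(x)])

print(edit_distance("HILLARY", "HILARI"))      # -> 2
print(prefix_edit_distance("HIL", "HILLARY"))  # -> 0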
But computing the ED between every term and the query text is resource- and time-intensive. Therefore, instead of calculating the ED for each term up front, we first extract possible matching terms using a technique called a Qgram index, and then apply the ED calculation only to those selected terms.
An advantage of the Qgram index technique is that it supports fuzzy search.
One possible approach is to adopt a QGram index by building an inverted index over Qgrams: under each Qgram we store all the words that contain that Qgram (instead of storing the full string, you can use a unique ID for each string). In Java you can use a TreeMap data structure for this. Following is a small example of how terms are stored (a sketch of the index itself follows the example):
col : colmbia, colombo, gancola, tacolama
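In Python terms (the project linked below does this in Java with a TreeMap), a minimal sketch of such an index could look like this; qgrams and build_qgram_index are names I made up for the illustration:

from collections import defaultdict

def qgrams(term, q=3, pad="$"):
    # Pad the term so every character appears in q q-grams, e.g. $$col... -> $$c, $co, col, ...
    padded = pad * (q - 1) + term + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def build_qgram_index(terms, q=3):
    # Inverted index: each q-gram maps to the IDs of the terms containing it
    index = defaultdict(set)
    for term_id, term in enumerate(terms):
        for gram in qgrams(term, q):
            index[gram].add(term_id)
    return index

terms = ["colmbia", "colombo", "gancola", "tacolama"]
index = build_qgram_index(terms)
print(sorted(terms[i] for i in index["col"]))
# -> ['colmbia', 'colombo', 'gancola', 'tacolama']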
Then, when querying, we calculate the number of Qgrams the query text has in common with the available terms.
Example: x = HILLARY, y = HILARI (query term)
Qgrams
$$HILLARY$$ -> $$H, $HI, HIL, ILL, LLA, LAR, ARY, RY$, Y$$
$$HILARI$$ -> $$H, $HI, HIL, ILA, LAR, ARI, RI$, I$$
Number of q-grams in common = 4.
For the terms with a high number of common Qgrams, we calculate the ED/PED against the query term and then suggest those terms to the end user.
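Reusing qgrams(), edit_distance() and the terms/index from the sketches above, a hypothetical query step might count common q-grams first and run the more expensive ED only on the promising candidates:

from collections import Counter

def fuzzy_lookup(query, terms, index, q=3, min_common=2):
    # Count how many q-grams each indexed term shares with the query
    common = Counter()
    for gram in qgrams(query, q):
        for term_id in index.get(gram, ()):
            common[term_id] += 1
    # Compute ED only for terms with enough q-grams in common
    candidates = [tid for tid, c in common.items() if c >= min_common]
    return sorted((edit_distance(query, terms[tid]), terms[tid]) for tid in candidates)

print(fuzzy_lookup("colombia", terms, index))
# -> [(1, 'colmbia'), (2, 'colombo'), ...]  (lowest edit distance, i.e. best suggestion, first)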
You can find an implementation of this approach in the following project (see "QGramIndex.java"). Feel free to ask any questions. https://github.com/Bhashitha-Gamage/City_Search
To learn more about Edit Distance, Prefix Edit Distance, and Qgram indexes, please watch the following video by Prof. Dr. Hannah Bast: https://www.youtube.com/embed/6pUg2wmGJRo (the lesson starts at 20:06)
You might be interested in this blog post.
http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
Fuzzywuzzy is a Python library that provides easy distance measures such as Levenshtein distance for string matching. It is built on top of difflib in the standard library and will make use of the C implementation Python-levenshtein if available.
http://pypi.python.org/pypi/python-Levenshtein/
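As a rough sketch of how that might look for the strings in this question (the candidate phrases below are simply borrowed from the example data in the Go snippet further down):

from fuzzywuzzy import fuzz, process

choices = [
    "THE RED COW JUMPED OVER THE GREEN CHICKEN",
    "THE RED COW JUMPED OVER THE RED COW",
    "THE RED FOX JUMPED OVER THE BROWN COW",
]
query = "THE BROWN FOX JUMPED OVER THE RED COW"

# Similarity score in the range 0-100 between two strings
print(fuzz.ratio(query, choices[2]))

# Best match among the candidates, returned as a (choice, score) tuple
print(process.extractOne(query, choices))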
Here you have a Go proof of concept that calculates distances between the given words based on their letter frequencies. You can tune minDistance and difference for other use cases.
Playground: https://play.golang.org/p/NtrBzLdC3rE
package main
import (
"errors"
"fmt"
"log"
"math"
"strings"
)
var data string = `THE RED COW JUMPED OVER THE GREEN CHICKEN-THE RED COW JUMPED OVER THE RED COW-THE RED FOX JUMPED OVER THE BROWN COW`
const minDistance float64 = 2
const difference float64 = 1
type word struct {
data string
letters map[rune]int
}
type words struct {
words []word
}
// Print pretty-prints the data stored in the word
func (w word) Print() {
var (
length int
c int
i int
key rune
)
fmt.Printf("Data: %s\n", w.data)
length = len(w.letters) - 1
c = 0
for key, i = range w.letters {
fmt.Printf("%s:%d", string(key), i)
if c != length {
fmt.Printf(" | ")
}
c++
}
fmt.Printf("\n")
}
func (ws words) fuzzySearch(data string) ([]word, error) {
var (
w word
err error
founds []word
)
w, err = initWord(data)
if err != nil {
log.Printf("Errors: %s\n", err.Error())
return nil, err
}
// Iterating all the words
for i := range ws.words {
letters := ws.words[i].letters
// similar accumulates, for each letter of the query, the stored word's count of that letter when the two counts differ by at most minDistance
var similar float64 = 0
// Iterating the letters of the input data
for key := range w.letters {
if val, ok := letters[key]; ok {
if math.Abs(float64(val-w.letters[key])) <= minDistance {
similar += float64(val)
}
}
}
lenSimilarity := math.Abs(similar - float64(len(data)-strings.Count(data, " ")))
log.Printf("Comparing %s with %s i've found %f similar letter, with weight %f", data, ws.words[i].data, similar, lenSimilarity)
if lenSimilarity <= difference {
founds = append(founds, ws.words[i])
}
}
if len(founds) == 0 {
return nil, errors.New("no similar found for data: " + data)
}
return founds, nil
}
func initWords(data []string) []word {
var (
err error
words []word
word word
)
for i := range data {
word, err = initWord(data[i])
if err != nil {
log.Printf("Error in index [%d] for data: %s", i, data[i])
} else {
words = append(words, word)
}
}
return words
}
func initWord(data string) (word, error) {
var word word
word.data = data
word.letters = make(map[rune]int)
for _, r := range data {
if r != ' ' { // skip whitespace
word.letters[r]++
}
}
return word, nil
}
func main() {
var ws words
words := initWords(strings.Split(data, "-"))
for i := range words {
words[i].Print()
}
ws.words = words
solution, _ := ws.fuzzySearch("THE BROWN FOX JUMPED OVER THE RED COW")
fmt.Println("Possible solutions: ", solution)
}
Lua implementation, for posterity:
function levenshtein_distance(str1, str2)
local len1, len2 = #str1, #str2
local char1, char2, distance = {}, {}, {}
str1:gsub('.', function (c) table.insert(char1, c) end)
str2:gsub('.', function (c) table.insert(char2, c) end)
for i = 0, len1 do distance[i] = {} end
for i = 0, len1 do distance[i][0] = i end
for i = 0, len2 do distance[0][i] = i end
for i = 1, len1 do
for j = 1, len2 do
distance[i][j] = math.min(
distance[i-1][j ] + 1,
distance[i ][j-1] + 1,
distance[i-1][j-1] + (char1[i] == char2[j] and 0 or 1)
)
end
end
return distance[len1][len2]
end
This problem is hard to solve efficiently if the input data is too large (say, millions of strings). I used Elasticsearch to solve this.
Quick start : https://www.elastic.co/guide/en/elasticsearch/client/net-api/6.x/elasticsearch-net.html
Just insert all the input data into Elasticsearch, and you can quickly search for strings within a given edit distance. Here is a C# snippet that gives you a list of results sorted by edit distance (smallest to largest):
var res = client.Search<ClassName>(s => s
.Query(q => q
.Match(m => m
.Field(f => f.VariableName)
.Query("SAMPLE QUERY")
.Fuzziness(Fuzziness.EditDistance(2)) // Elasticsearch accepts edit distances of at most 2
)
));
This problem turns up all the time in bioinformatics. The accepted answer above (which was great by the way) is known in bioinformatics as the Needleman-Wunsch (compare two strings) and Smith-Waterman (find an approximate substring in a longer string) algorithms. They work great and have been workhorses for decades.
But what if you have a million strings to compare? That's a trillion pairwise comparisons, each of which is O(n*m)! Modern DNA sequencers easily generate a billion short DNA sequences, each about 200 DNA "letters" long. Typically, we want to find, for each such string, the best match against the human genome (3 billion letters). Clearly, the Needleman-Wunsch algorithm and its relatives will not do.
This so-called "alignment problem" is a field of active research. The most popular algorithms are currently able to find inexact matches between 1 billion short strings and the human genome in a matter of hours on reasonable hardware (say, eight cores and 32 GB RAM).
Most of these algorithms work by quickly finding short exact matches (seeds) and then extending these to the full string using a slower algorithm (for example, the Smith-Waterman). The reason this works is that we are really only interested in a few close matches, so it pays off to get rid of the 99.9...% of pairs that have nothing in common.
How does finding exact matches help finding inexact matches? Well, say we allow only a single difference between the query and the target. It is easy to see that this difference must occur in either the right or left half of the query, and so the other half must match exactly. This idea can be extended to multiple mismatches and is the basis for the ELAND algorithm commonly used with Illumina DNA sequencers.
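A toy Python sketch of that pigeonhole idea (names and test strings are mine): to find matches with at most one substitution, look up each half of the query exactly and then verify the full window.

def one_mismatch_hits(query, target):
    # Pigeonhole: with at most 1 mismatch, either the left or the right half matches exactly
    half = len(query) // 2
    hits = set()
    for seed, offset in ((query[:half], 0), (query[half:], half)):
        start = target.find(seed)
        while start != -1:
            pos = start - offset  # where the whole query would start in the target
            if 0 <= pos <= len(target) - len(query):
                mismatches = sum(a != b for a, b in zip(query, target[pos:pos + len(query)]))
                if mismatches <= 1:
                    hits.add(pos)
            start = target.find(seed, start + 1)
    return sorted(hits)

print(one_mismatch_hits("GATTACA", "xxGATTTCAxxGATTACAxx"))  # -> [2, 11]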
There are many very good algorithms for doing exact string matching. Given a query string of length 200, and a target string of length 3 billion (the human genome), we want to find any place in the target where there is a substring of length k that matches a substring of the query exactly. A simple approach is to begin by indexing the target: take all k-long substrings, put them in an array and sort them. Then take each k-long substring of the query and search the sorted index. Sorting takes O(n log n) time, and each subsequent search takes O(log n) time.
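A toy-scale Python sketch of that index (names mine; a real genome index would of course not be a Python list of tuples):

import bisect

def build_kmer_index(target, k):
    # Sorted array of (k-mer, position); building and sorting is O(n log n)
    index = [(target[i:i + k], i) for i in range(len(target) - k + 1)]
    index.sort()
    return index

def exact_seed_hits(query, index, k):
    # Look up every k-long substring of the query with a binary search: O(log n) each
    hits = []
    for j in range(len(query) - k + 1):
        kmer = query[j:j + k]
        i = bisect.bisect_left(index, (kmer, 0))
        while i < len(index) and index[i][0] == kmer:
            hits.append((j, index[i][1]))  # (offset in query, position in target)
            i += 1
    return hits

target = "THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG"
index = build_kmer_index(target, k=5)
print(exact_seed_hits("BROWNFOK", index, k=5))  # -> [(0, 8), (1, 9), (2, 10)]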
But storage can be a problem. An index of the 3 billion letter target would need to hold 3 billion pointers and 3 billion k-long words. It would seem hard to fit this in less than several tens of gigabytes of RAM. But amazingly we can greatly compress the index, using the Burrows-Wheeler transform, and it will still be efficiently queryable. An index of the human genome can fit in less than 4 GB RAM. This idea is the basis of popular sequence aligners such as Bowtie and BWA.
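Just to illustrate what the transform itself is (not how an FM-index searches it), a naive sketch: append a sentinel, sort all rotations of the text, and keep the last column. Real aligners never materialize the O(n^2) rotation matrix; they derive the BWT from a suffix array.

def bwt(text, sentinel="$"):
    # Naive Burrows-Wheeler transform: last column of the sorted rotation matrix
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("ACAACG"))  # -> 'GC$AAAC'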
Alternatively, we can use a suffix array, which stores only the pointers, yet represents a simultaneous index of all suffixes in the target string (essentially, a simultaneous index for all possible values of k; the same is true of the Burrows-Wheeler transform). A suffix array index of the human genome will take 12 GB of RAM if we use 32-bit pointers.
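A toy suffix-array sketch (naive construction purely for illustration; real implementations use linear-time construction and plain 32-bit integers). Note how the same sorted array answers queries of any length k:

def suffix_array(text):
    # Positions of all suffixes, sorted by the suffix that starts there (naive O(n^2 log n))
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(pattern, text, sa):
    # Binary search for the first suffix >= pattern, then collect suffixes starting with it
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    out = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern:
        out.append(sa[lo])
        lo += 1
    return sorted(out)

text = "THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG"
sa = suffix_array(text)
print(find_occurrences("THE", text, sa))  # -> [0, 25]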
The links above contain a wealth of information and links to primary research papers. The ELAND link goes to a PDF with useful figures illustrating the concepts involved, and shows how to deal with insertions and deletions.
Finally, while these algorithms have basically solved the problem of (re)sequencing single human genomes (a billion short strings), DNA sequencing technology improves even faster than Moore's law, and we are fast approaching trillion-letter datasets. For example, there are currently projects underway to sequence the genomes of 10,000 vertebrate species, each a billion letters long or so. Naturally, we will want to do pairwise inexact string matching on the data...