I need a way to compare multiple strings to a test string and return the string that most closely resembles it:
TEST STRING: THE BROWN FOX JUMPED OVER THE RED COW
To query a large set of text efficiently, you can use the concept of Edit Distance / Prefix Edit Distance.
Edit Distance ED(x, y): the minimal number of edit operations (insertions, deletions, substitutions) needed to transform term x into term y. For example, ED(HILLARY, HILARI) = 2: delete one L, then substitute Y with I.
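For illustration, here is a minimal Python sketch of the usual dynamic-programming recurrence (function names are mine, not from any library). PED can be read off the same table by taking the minimum over the last row, i.e. the best match of x against any prefix of y:

def _dp_table(x, y):
    # d[i][j] = edit distance between x[:i] and y[:j]
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete x[i-1]
                          d[i][j - 1] + 1,         # insert y[j-1]
                          d[i - 1][j - 1] + cost)  # substitute or match
    return d

def edit_distance(x, y):
    return _dp_table(x, y)[len(x)][len(y)]

def prefix_edit_distance(x, y):
    # PED(x, y) = minimum of ED(x, y') over all prefixes y' of y
    return min(_dp_table(x, y)[len(x)])

print(edit_distance("HILLARY", "HILARI"))      # -> 2
print(prefix_edit_distance("HIL", "HILLARY"))  # -> 0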
But computing the ED between every term and the query text is resource- and time-intensive. Therefore, instead of calculating the ED for each term up front, we first extract possible matching terms using a technique called a Qgram index, and then apply the ED calculation only to those selected terms.
An advantage of the Qgram index technique is that it supports fuzzy search.
One possible approach is to adopt a QGram index by building an inverted index over Qgrams: under each Qgram we store all the words that contain that Qgram (instead of storing the full string, you can use a unique ID for each string). In Java you can use a TreeMap data structure for this. Following is a small example of how terms are stored (a sketch of the index itself follows the example):
col : colmbia, colombo, gancola, tacolama
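In Python terms (the project linked below does this in Java with a TreeMap), a minimal sketch of such an index could look like this; qgrams and build_qgram_index are names I made up for the illustration:

from collections import defaultdict

def qgrams(term, q=3, pad="$"):
    # Pad the term so every character appears in q q-grams, e.g. $$col... -> $$c, $co, col, ...
    padded = pad * (q - 1) + term + pad * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def build_qgram_index(terms, q=3):
    # Inverted index: each q-gram maps to the IDs of the terms containing it
    index = defaultdict(set)
    for term_id, term in enumerate(terms):
        for gram in qgrams(term, q):
            index[gram].add(term_id)
    return index

terms = ["colmbia", "colombo", "gancola", "tacolama"]
index = build_qgram_index(terms)
print(sorted(terms[i] for i in index["col"]))
# -> ['colmbia', 'colombo', 'gancola', 'tacolama']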
Then, when querying, we calculate the number of Qgrams the query text has in common with the available terms.
Example: x = HILLARY, y = HILARI (query term)
Qgrams
$$HILLARY$$ -> $$H, $HI, HIL, ILL, LLA, LAR, ARY, RY$, Y$$
$$HILARI$$ -> $$H, $HI, HIL, ILA, LAR, ARI, RI$, I$$
Number of q-grams in common = 4.
For the terms with a high number of common Qgrams, we calculate the ED/PED against the query term and then suggest those terms to the end user.
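Reusing qgrams(), edit_distance() and the terms/index from the sketches above, a hypothetical query step might count common q-grams first and run the more expensive ED only on the promising candidates:

from collections import Counter

def fuzzy_lookup(query, terms, index, q=3, min_common=2):
    # Count how many q-grams each indexed term shares with the query
    common = Counter()
    for gram in qgrams(query, q):
        for term_id in index.get(gram, ()):
            common[term_id] += 1
    # Compute ED only for terms with enough q-grams in common
    candidates = [tid for tid, c in common.items() if c >= min_common]
    return sorted((edit_distance(query, terms[tid]), terms[tid]) for tid in candidates)

print(fuzzy_lookup("colombia", terms, index))
# -> [(1, 'colmbia'), (2, 'colombo'), ...]  (lowest edit distance, i.e. best suggestion, first)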
You can find an implementation of this approach in the following project (see "QGramIndex.java"). Feel free to ask any questions. https://github.com/Bhashitha-Gamage/City_Search
To learn more about Edit Distance, Prefix Edit Distance, and Qgram indexes, please watch the following video by Prof. Dr. Hannah Bast: https://www.youtube.com/embed/6pUg2wmGJRo (the lesson starts at 20:06)
You might be interested in this blog post.
http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
Fuzzywuzzy is a Python library that provides easy distance measures such as Levenshtein distance for string matching. It is built on top of difflib in the standard library and will make use of the C implementation Python-levenshtein if available.
http://pypi.python.org/pypi/python-Levenshtein/
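As a rough sketch of how that might look for the strings in this question (the candidate phrases below are simply borrowed from the example data in the Go snippet further down):

from fuzzywuzzy import fuzz, process

choices = [
    "THE RED COW JUMPED OVER THE GREEN CHICKEN",
    "THE RED COW JUMPED OVER THE RED COW",
    "THE RED FOX JUMPED OVER THE BROWN COW",
]
query = "THE BROWN FOX JUMPED OVER THE RED COW"

# Similarity score in the range 0-100 between two strings
print(fuzz.ratio(query, choices[2]))

# Best match among the candidates, returned as a (choice, score) tuple
print(process.extractOne(query, choices))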
Here you have a Go proof of concept that calculates distances between the given words based on their letter frequencies. You can tune minDistance and difference for other use cases.
Playground: https://play.golang.org/p/NtrBzLdC3rE
package main
import (
"errors"
"fmt"
"log"
"math"
"strings"
)
var data string = `THE RED COW JUMPED OVER THE GREEN CHICKEN-THE RED COW JUMPED OVER THE RED COW-THE RED FOX JUMPED OVER THE BROWN COW`
const minDistance float64 = 2
const difference float64 = 1
type word struct {
data string
letters map[rune]int
}
type words struct {
words []word
}
// Print pretty-prints the data stored in the word
func (w word) Print() {
var (
length int
c int
i int
key rune
)
fmt.Printf("Data: %s\n", w.data)
length = len(w.letters) - 1
c = 0
for key, i = range w.letters {
fmt.Printf("%s:%d", string(key), i)
if c != length {
fmt.Printf(" | ")
}
c++
}
fmt.Printf("\n")
}
func (ws words) fuzzySearch(data string) ([]word, error) {
var (
w word
err error
founds []word
)
w, err = initWord(data)
if err != nil {
log.Printf("Errors: %s\n", err.Error())
return nil, err
}
// Iterating all the words
for i := range ws.words {
letters := ws.words[i].letters
// similar accumulates, for each letter of the query, the stored word's count of that letter when the two counts differ by at most minDistance
var similar float64 = 0
// Iterating the letters of the input data
for key := range w.letters {
if val, ok := letters[key]; ok {
if math.Abs(float64(val-w.letters[key])) <= minDistance {
similar += float64(val)
}
}
}
lenSimilarity := math.Abs(similar - float64(len(data)-strings.Count(data, " ")))
log.Printf("Comparing %s with %s i've found %f similar letter, with weight %f", data, ws.words[i].data, similar, lenSimilarity)
if lenSimilarity <= difference {
founds = append(founds, ws.words[i])
}
}
if len(founds) == 0 {
return nil, errors.New("no similar found for data: " + data)
}
return founds, nil
}
func initWords(data []string) []word {
var (
err error
words []word
word word
)
for i := range data {
word, err = initWord(data[i])
if err != nil {
log.Printf("Error in index [%d] for data: %s", i, data[i])
} else {
words = append(words, word)
}
}
return words
}
func initWord(data string) (word, error) {
var word word
word.data = data
word.letters = make(map[rune]int)
for _, r := range data {
if r != ' ' { // skip whitespace
word.letters[r]++
}
}
return word, nil
}
func main() {
var ws words
words := initWords(strings.Split(data, "-"))
for i := range words {
words[i].Print()
}
ws.words = words
solution, _ := ws.fuzzySearch("THE BROWN FOX JUMPED OVER THE RED COW")
fmt.Println("Possible solutions: ", solution)
}
Lua implementation, for posterity:
function levenshtein_distance(str1, str2)
local len1, len2 = #str1, #str2
local char1, char2, distance = {}, {}, {}
str1:gsub('.', function (c) table.insert(char1, c) end)
str2:gsub('.', function (c) table.insert(char2, c) end)
for i = 0, len1 do distance[i] = {} end
for i = 0, len1 do distance[i][0] = i end
for i = 0, len2 do distance[0][i] = i end
for i = 1, len1 do
for j = 1, len2 do
distance[i][j] = math.min(
distance[i-1][j ] + 1,
distance[i ][j-1] + 1,
distance[i-1][j-1] + (char1[i] == char2[j] and 0 or 1)
)
end
end
return distance[len1][len2]
end
This problem is hard to solve efficiently if the input data is too large (say, millions of strings). I used Elasticsearch to solve this.
Quick start : https://www.elastic.co/guide/en/elasticsearch/client/net-api/6.x/elasticsearch-net.html
Just insert all the input data into Elasticsearch, and you can quickly search for strings within a given edit distance. Here is a C# snippet that gives you a list of results sorted by edit distance (smallest to largest):
var res = client.Search<ClassName>(s => s
.Query(q => q
.Match(m => m
.Field(f => f.VariableName)
.Query("SAMPLE QUERY")
.Fuzziness(Fuzziness.EditDistance(2)) // Elasticsearch accepts edit distances of at most 2
)
));
This problem turns up all the time in bioinformatics. The accepted answer above (which was great by the way) is known in bioinformatics as the Needleman-Wunsch (compare two strings) and Smith-Waterman (find an approximate substring in a longer string) algorithms. They work great and have been workhorses for decades.
But what if you have a million strings to compare? That's a trillion pairwise comparisons, each of which is O(n*m)! Modern DNA sequencers easily generate a billion short DNA sequences, each about 200 DNA "letters" long. Typically, we want to find, for each such string, the best match against the human genome (3 billion letters). Clearly, the Needleman-Wunsch algorithm and its relatives will not do.
This so-called "alignment problem" is a field of active research. The most popular algorithms are currently able to find inexact matches between 1 billion short strings and the human genome in a matter of hours on reasonable hardware (say, eight cores and 32 GB RAM).
Most of these algorithms work by quickly finding short exact matches (seeds) and then extending these to the full string using a slower algorithm (for example, the Smith-Waterman). The reason this works is that we are really only interested in a few close matches, so it pays off to get rid of the 99.9...% of pairs that have nothing in common.
How does finding exact matches help finding inexact matches? Well, say we allow only a single difference between the query and the target. It is easy to see that this difference must occur in either the right or left half of the query, and so the other half must match exactly. This idea can be extended to multiple mismatches and is the basis for the ELAND algorithm commonly used with Illumina DNA sequencers.
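A toy Python sketch of that pigeonhole idea (names and test strings are mine): to find matches with at most one substitution, look up each half of the query exactly and then verify the full window.

def one_mismatch_hits(query, target):
    # Pigeonhole: with at most 1 mismatch, either the left or the right half matches exactly
    half = len(query) // 2
    hits = set()
    for seed, offset in ((query[:half], 0), (query[half:], half)):
        start = target.find(seed)
        while start != -1:
            pos = start - offset  # where the whole query would start in the target
            if 0 <= pos <= len(target) - len(query):
                mismatches = sum(a != b for a, b in zip(query, target[pos:pos + len(query)]))
                if mismatches <= 1:
                    hits.add(pos)
            start = target.find(seed, start + 1)
    return sorted(hits)

print(one_mismatch_hits("GATTACA", "xxGATTTCAxxGATTACAxx"))  # -> [2, 11]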
There are many very good algorithms for doing exact string matching. Given a query string of length 200, and a target string of length 3 billion (the human genome), we want to find any place in the target where there is a substring of length k that matches a substring of the query exactly. A simple approach is to begin by indexing the target: take all k-long substrings, put them in an array and sort them. Then take each k-long substring of the query and search the sorted index. Sorting takes O(n log n) time, and each subsequent search takes O(log n) time.
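A toy-scale Python sketch of that index (names mine; a real genome index would of course not be a Python list of tuples):

import bisect

def build_kmer_index(target, k):
    # Sorted array of (k-mer, position); building and sorting is O(n log n)
    index = [(target[i:i + k], i) for i in range(len(target) - k + 1)]
    index.sort()
    return index

def exact_seed_hits(query, index, k):
    # Look up every k-long substring of the query with a binary search: O(log n) each
    hits = []
    for j in range(len(query) - k + 1):
        kmer = query[j:j + k]
        i = bisect.bisect_left(index, (kmer, 0))
        while i < len(index) and index[i][0] == kmer:
            hits.append((j, index[i][1]))  # (offset in query, position in target)
            i += 1
    return hits

target = "THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG"
index = build_kmer_index(target, k=5)
print(exact_seed_hits("BROWNFOK", index, k=5))  # -> [(0, 8), (1, 9), (2, 10)]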
But storage can be a problem. An index of the 3 billion letter target would need to hold 3 billion pointers and 3 billion k-long words. It would seem hard to fit this in less than several tens of gigabytes of RAM. But amazingly we can greatly compress the index, using the Burrows-Wheeler transform, and it will still be efficiently queryable. An index of the human genome can fit in less than 4 GB RAM. This idea is the basis of popular sequence aligners such as Bowtie and BWA.
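Just to illustrate what the transform itself is (not how an FM-index searches it), a naive sketch: append a sentinel, sort all rotations of the text, and keep the last column. Real aligners never materialize the O(n^2) rotation matrix; they derive the BWT from a suffix array.

def bwt(text, sentinel="$"):
    # Naive Burrows-Wheeler transform: last column of the sorted rotation matrix
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)

print(bwt("ACAACG"))  # -> 'GC$AAAC'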
Alternatively, we can use a suffix array, which stores only the pointers, yet represents a simultaneous index of all suffixes in the target string (essentially, a simultaneous index for all possible values of k; the same is true of the Burrows-Wheeler transform). A suffix array index of the human genome will take 12 GB of RAM if we use 32-bit pointers.
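A toy suffix-array sketch (naive construction purely for illustration; real implementations use linear-time construction and plain 32-bit integers). Note how the same sorted array answers queries of any length k:

def suffix_array(text):
    # Positions of all suffixes, sorted by the suffix that starts there (naive O(n^2 log n))
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(pattern, text, sa):
    # Binary search for the first suffix >= pattern, then collect suffixes starting with it
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    out = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern:
        out.append(sa[lo])
        lo += 1
    return sorted(out)

text = "THEQUICKBROWNFOXJUMPSOVERTHELAZYDOG"
sa = suffix_array(text)
print(find_occurrences("THE", text, sa))  # -> [0, 25]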
The links above contain a wealth of information and links to primary research papers. The ELAND link goes to a PDF with useful figures illustrating the concepts involved, and shows how to deal with insertions and deletions.
Finally, while these algorithms have basically solved the problem of (re)sequencing single human genomes (a billion short strings), DNA sequencing technology improves even faster than Moore's law, and we are fast approaching trillion-letter datasets. For example, there are currently projects underway to sequence the genomes of 10,000 vertebrate species, each a billion letters long or so. Naturally, we will want to do pairwise inexact string matching on the data...