Fast fuzzy/approximate search in dictionary of strings in Ruby

渐次进展 2021-02-08 23:55

I have a dictionary of 50K to 100K strings (each can be 50+ characters long), and I am trying to find whether a given string is in the dictionary with some "edit" distance tolerance.

4 answers
  • 2021-02-09 00:41

    About 15 years ago I wrote a fuzzy search that can find the N closest neighbors. It is my modification of Wilbur's trigram algorithm, which I named the "Wilbur-Khovayko algorithm".

    Basic idea: split strings into trigrams and search for the maximal intersection scores.

    For example, take the string "hello world". It generates the trigrams "hel", "ell", "llo", "lo ", "o_w", and so on. It also produces special prefix/suffix trigrams for each word, such as "$he", "$wo", "lo$", and "ld$".

    Then an index is built that records, for each trigram, which terms contain it.

    So each trigram maps to a list of term IDs.

    When the user submits a query string, it is also split into trigrams; the program then searches for the maximal intersection scores and produces a list of the N best matches.
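
    The indexing and lookup steps described above can be sketched in Ruby roughly as follows. This is a simplified illustration, not the Wilbur-Khovayko code from the archive: the names `trigrams` and `TrigramIndex` are hypothetical, trigrams that span word boundaries are ignored, and the score is a plain count of shared trigrams rather than the weighted score the original program prints.

```ruby
# Hypothetical helper: split a string into trigrams, padding every word with
# "$" so that word prefixes and suffixes get their own trigrams.
def trigrams(str)
  str.downcase.split.flat_map do |word|
    padded = "$#{word}$"
    (0..padded.size - 3).map { |i| padded[i, 3] }
  end.uniq
end

# Hypothetical inverted index: trigram -> list of term IDs.
class TrigramIndex
  def initialize
    @terms = []                                 # term_id -> term
    @postings = Hash.new { |h, k| h[k] = [] }   # trigram -> [term_id, ...]
  end

  def add(term)
    id = @terms.size
    @terms << term
    trigrams(term).each { |t| @postings[t] << id }
    id
  end

  # Returns up to n [term, shared_trigram_count] pairs, best match first.
  def search(query, n = 10)
    scores = Hash.new(0)
    trigrams(query).each do |t|
      @postings.fetch(t, []).each { |id| scores[id] += 1 }
    end
    scores.sort_by { |id, score| [-score, id] }
          .first(n)
          .map { |id, score| [@terms[id], score] }
  end
end

index = TrigramIndex.new
%w[booking bowling cooking hello].each { |w| index.add(w) }
p index.search("boking", 3)
```

    Terms that share no trigram with the query never enter `scores`, so only the posting lists of the query's own trigrams are scanned.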

    It is fast: as I remember, on an old Sun/Solaris machine with 256 MB of RAM and a 200 MHz CPU, it found the 100 closest terms in a dictionary of 5,000,000 terms in 0.25 s.

    You can get my old source from: http://olegh.ftp.sh/wilbur-khovayko.tar.gz

    UPDATE:

    I created a new archive in which the Makefile is adjusted for modern Linux/BSD make. You can download the new version here: http://olegh.ftp.sh/wilbur-khovayko.tgz

    Make a directory and extract the archive there:

    mkdir F2
    cd F2
    tar xvfz wilbur-khovayko.tgz
    make
    

    Go to the test directory, copy the term list file there (the name is fixed: termlist.txt), and build the index:

     cd test/
     cp /tmp/test/termlist.txt ./termlist.txt
     ./crefdb.exe <termlist.txt
    

    In this test, I used ~380,000 expired domain names:

    wc -l termlist.txt
    379430 termlist.txt
    

    Run the findtest application:

    ./findtest.exe
    
    boking  <-- this is the query: the word "booking", misspelled
    
    
    0001:Query: [boking]
      1:  287890 (  3.863739) [bokintheusa.com,2009-11-20,$69]
      2:  287906 (  3.569148) [bookingseu.com,2009-11-20,$69]
      3:  257170 (  3.565942) [bokitko.com,2009-11-18,$69]
      4:  302830 (  3.413791) [bookingcenters.com,2009-11-21,$69]
      5:  274658 (  3.408325) [bookingsadept.com,2009-11-19,$69]
      6:  100438 (  3.379371) [bookingresorts.com,2009-11-09,$69]
      7:  203401 (  3.363858) [bookinginternet.com,2009-11-15,$69]
      8:  221222 (  3.361689) [bobokiosk.com,2009-11-16,$69]
      . . . . 
     97:   29035 (  2.169753) [buccupbooking.com,2009-11-05,$69]
     98:  185692 (  2.169047) [box-hosting.net,2009-11-14,$69]
     99:  345394 (  2.168371) [birminghamcookinglessons.com,2009-11-25,$69]
    100:  150134 (  2.167372) [bowlingbrain.com,2009-11-12,$69]
    
  • 2021-02-09 00:43

    If you are prepared to get involved with machine learning approaches, this paper by Geoff Hinton is a good starting point:

    http://www.cs.toronto.edu/~hinton/absps/sh.pdf

    These kinds of approaches are used at places like Google.

    Essentially, you cluster your dictionary strings by similarity. When a query string comes in, instead of calculating the edit distance against the entire data set, you compare it only against the strings in the matching cluster, which reduces query time significantly.
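
    As a rough illustration of the "compare only against one group" idea, here is a small MinHash-style locality-sensitive hashing sketch in pure Ruby. It is neither the semantic hashing from the paper nor the API of the ruby-lsh gem; the names `shingles`, `signature` and `LshIndex` and the parameters (12 hash functions, bands of 2) are assumptions chosen for the example.

```ruby
require "digest"

NUM_HASHES = 12     # assumed signature length
ROWS_PER_BAND = 2   # assumed band size: 12 / 2 = 6 bands

# Character trigrams of the whole string, padded with "$" at both ends.
def shingles(str)
  padded = "$#{str.downcase}$"
  (0..padded.size - 3).map { |i| padded[i, 3] }.uniq
end

# MinHash signature: for each seeded hash function, keep the smallest hash
# value over all trigrams. Similar trigram sets tend to share these minima.
def signature(str)
  grams = shingles(str)
  (0...NUM_HASHES).map do |seed|
    grams.map { |g| Digest::MD5.hexdigest("#{seed}:#{g}")[0, 8].to_i(16) }.min
  end
end

class LshIndex
  def initialize
    @buckets = Hash.new { |h, k| h[k] = [] }   # [band_no, band_values] -> terms
  end

  def add(term)
    band_keys(term).each { |key| @buckets[key] << term }
  end

  # Terms that share at least one band with the query. Only these need an
  # exact edit-distance check afterwards.
  def candidates(query)
    band_keys(query).flat_map { |key| @buckets.fetch(key, []) }.uniq
  end

  private

  def band_keys(str)
    signature(str).each_slice(ROWS_PER_BAND).each_with_index.map do |band, i|
      [i, band]
    end
  end
end

lsh = LshIndex.new
%w[booking bookings cooking helicopter].each { |w| lsh.add(w) }
p lsh.candidates("bookin")
```

    Bucketing is probabilistic: a near match can be missed and an unrelated term can collide, so the candidate list is a pre-filter and not a final answer. More bands with fewer rows each catch more near matches at the cost of longer candidate lists.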

    P.S. I did a bit of googling and found a Ruby implementation of a similar approach, called Locality Sensitive Hashing, here: https://github.com/bbcrd/ruby-lsh

  • 2021-02-09 00:46

    Here is a raw trie-like implementation in pure Ruby. It is not optimized at all; it is just a proof of concept.

    To test it I took 100_000 words from here: http://www.infochimps.com/datasets/word-list-100000-official-crossword-words-excel-readable/downloads/195488

    Here is a gist for it: https://gist.github.com/fl00r/7542994

    class TrieDict
      attr_reader :dict
    
      def initialize
        @dict = {}
      end
    
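      # Insert a word, creating one node per character.
      # Each node is a pair: [end_of_word_flag, children_hash].
      def put(str)
        return if str.empty?  # nothing to store; avoids calling []= on nil
        node = nil
        str.each_char do |c|
          children = node ? node[1] : @dict
          node = (children[c] ||= [nil, {}])
        end
        node[0] = true
      end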
    
      def fetch(prefix, fuzzy = 0)
        storage = []
        str = ""
        error = 0
        recur_fetch(prefix, fuzzy, @dict, storage, str, error)
        storage
      end
    
      def recur_fetch(prefix, fuzzy, dict, storage, str, error)
        dict.each do |k, vals|
          e = error
          if prefix[0] != k
            e += 1
            next  if e > fuzzy
          end
          s = str + k
          storage << s  if vals[0] && (prefix.size - 1) <= (fuzzy - e)
          recur_fetch(prefix[1..-1] || "", fuzzy, vals[1], storage, s, e)
        end
      end
    end
    
    def bench
      t = Time.now.to_f
      res = nil
      10.times{ res = yield }
      e = Time.now.to_f - t
      puts "Elapsed for 10 times: #{e}"
      puts "Result: #{res}"
    end
    
    trie = TrieDict.new
    File.read("/home/petr/code/tmp/words.txt").each_line do |word|
      trie.put(word.strip)
    end; :ok
    bench{ trie.fetch "hello" }
    # Elapsed for 10 times: 0.0006465911865234375
    # Result: ["hello"]
    bench{ trie.fetch "hello", 1 }
    # Elapsed for 10 times: 0.013643741607666016
    # Result: ["cello", "hallo", "helio", "hell", "hello", "hellos", "hells", "hillo", "hollo", "hullo"]
    bench{ trie.fetch "hello", 2 }
    # Elapsed for 10 times: 0.08267641067504883
    # Result: ["bell", "belle", "bellow", "bells", "belly", "cell", "cella", "celli", "cello", "cellos", "cells", "dell", "dells", "delly", "fell", "fella", "felloe", "fellow", "fells", "felly", "hall", "hallo", "halloa", "halloo", "hallos", "hallot", "hallow", "halls", "heal", "heals", "heel", "heels", "heil", "heils", "held", "helio", "helios", "helix", "hell", "helled", "heller", "hello", "helloed", "helloes", "hellos", "hells", "helm", "helms", "helot", "help", "helps", "helve", "herl", "herls", "hill", "hillo", "hilloa", "hillos", "hills", "hilly", "holla", "hollo", "holloa", "holloo", "hollos", "hollow", "holly", "hull", "hullo", "hulloa", "hullos", "hulls", "jell", "jells", "jelly", "mell", "mellow", "mells", "sell", "selle", "sells", "tell", "tells", "telly", "well", "wells", "yell", "yellow", "yells"]
    bench{ trie.fetch "engineer", 2 }
    # Elapsed for 10 times: 0.04654884338378906
    # Result: ["engender", "engine", "engined", "engineer", "engineered", "engineers", "enginery", "engines"]
    bench{ trie.fetch "engeneer", 1 }
    # Elapsed for 10 times: 0.005484580993652344
    # Result: ["engender", "engineer"]
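
    Note that `recur_fetch` compares the query and each stored word position by position. It tolerates substituted characters and a length difference at the end of the word, but a character inserted or deleted in the middle shifts everything after it and is counted as several errors (for example, "helo" is two mismatches away from "hello"). If full Levenshtein distance is needed, one option is to carry a row of the usual dynamic-programming table down the trie. The following is a sketch of that different technique, with a hypothetical class name, not a drop-in change to the code above:

```ruby
class LevenshteinTrie
  Node = Struct.new(:children, :word)

  def initialize
    @root = Node.new({}, nil)
  end

  def put(str)
    return if str.empty?
    node = @root
    str.each_char { |c| node = (node.children[c] ||= Node.new({}, nil)) }
    node.word = str   # marks the end of a stored word
  end

  # All stored words whose Levenshtein distance to query is <= max_dist.
  def fetch(query, max_dist = 0)
    results = []
    first_row = (0..query.size).to_a   # distance from "" to each query prefix
    @root.children.each do |char, child|
      walk(child, char, query, first_row, max_dist, results)
    end
    results.sort
  end

  private

  # row[i] = distance between the trie path so far and the first i query chars.
  def walk(node, char, query, prev_row, max_dist, results)
    row = [prev_row[0] + 1]
    (1..query.size).each do |i|
      cost = query[i - 1] == char ? 0 : 1
      row << [row[i - 1] + 1, prev_row[i] + 1, prev_row[i - 1] + cost].min
    end
    results << node.word if node.word && row.last <= max_dist
    return if row.min > max_dist   # no extension of this path can match
    node.children.each do |c, child|
      walk(child, c, query, row, max_dist, results)
    end
  end
end

lt = LevenshteinTrie.new
%w[hello hell help jello ello hellos world].each { |w| lt.put(w) }
p lt.fetch("helo", 1)
```

    The early `return` is what keeps the walk cheap: once every entry of the current row exceeds `max_dist`, the whole subtree is skipped.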
    
  • 2021-02-09 00:53

    I wrote a pair of gems, fuzzily and blurrily, which do trigram-based fuzzy matching. Given your (low) volume of data, Fuzzily will be easier to integrate and about as fast; with either you'd get answers within 5-10 ms on modern hardware.

    Since both are trigram-based (which is indexable), not edit-distance-based (which isn't), you'd probably have to do this in two passes:

    • first, ask either gem for the set of best trigram matches
    • then compare those results with your input string using Levenshtein distance
    • and return the minimum for that measure.

    In Ruby (as you asked), using Fuzzily plus the Text gem, obtaining the records within the edit distance threshold would look like:

    MyRecords.find_by_fuzzy_name(input_string).select { |result|
      Text::Levenshtein.distance(input_string, result.name) < my_distance_threshold
    }
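
    For experimenting without a database or any gems, the second pass can be sketched in plain Ruby. Here `levenshtein` is an ordinary dynamic-programming implementation standing in for `Text::Levenshtein.distance`, the `candidates` array stands in for whatever the trigram search returns, and `within_distance` is a hypothetical helper name. Unlike the snippet above, it uses an inclusive limit (`<=`).

```ruby
# Plain dynamic-programming Levenshtein distance, keeping two rows in memory.
def levenshtein(a, b)
  prev = (0..b.size).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [curr[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Second pass: keep only candidates within max_distance, closest first.
def within_distance(input, candidates, max_distance)
  candidates
    .map { |c| [c, levenshtein(input, c)] }
    .select { |_, d| d <= max_distance }
    .sort_by { |c, d| [d, c] }
end

candidates = %w[booking bookings cooking bowling]   # assumed first-pass output
p within_distance("boking", candidates, 1)
```

    The first element of the returned list is the closest match, which corresponds to the "return the minimum" step.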
    

    This performs a handful of well-optimized database queries and a few

    Caveats:

    • if the "minimal" edit distance you're looking for is high, you'll still be doing lots of Levenshtein computations.
    • using trigrams assumes your input text is Latin text or close to it (basically European languages).
    • there are probably edge cases, since nothing guarantees that "number of matching trigrams" is a good general approximation of "edit distance".