Fast fuzzy/approximate search in dictionary of strings in Ruby

渐次进展 2021-02-08 23:55

I have a dictionary of 50K to 100K strings (each can be 50+ characters long) and I am trying to find whether a given string is in the dictionary with some "edit" distance tolerance...

4 answers
  •  我寻月下人不归
    2021-02-09 00:53

    I wrote a pair of gems, fuzzily and blurrily, which do trigram-based fuzzy matching. Given your (low) volume of data, Fuzzily will be easier to integrate and about as fast; with either you'd get answers within 5-10ms on modern hardware.

    Since both are trigram-based (which is indexable) rather than edit-distance-based (which isn't), you'd probably have to do this in two passes:

    • first ask either gem for a set of best matches trigrams-wise
    • then compare the results with your input string, using Levenshtein
    • and return the min for that measure.
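
    The two passes above can be sketched in plain Ruby with no gems or database. This is only an illustration of the idea, not how Fuzzily or Blurrily work internally: the `TrigramIndex` class, its method names, the space-padding scheme and the `limit` / `max_distance` parameters are all assumptions made for this example.

```ruby
require "set"

# Split a string into overlapping 3-character windows (trigrams).
# Assumption: pad with two leading spaces and one trailing space so that
# word boundaries contribute trigrams too.
def trigrams(str)
  padded = "  #{str.downcase} "
  (0..padded.length - 3).map { |i| padded[i, 3] }.to_set
end

# Dynamic-programming Levenshtein distance, keeping one row at a time.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Hypothetical in-memory index: trigram => set of words containing it.
class TrigramIndex
  def initialize(words)
    @index = Hash.new { |hash, key| hash[key] = Set.new }
    words.each { |word| trigrams(word).each { |tri| @index[tri] << word } }
  end

  # Pass 1: rank words by number of shared trigrams, keep the best `limit`.
  def candidates(query, limit: 10)
    counts = Hash.new(0)
    trigrams(query).each do |tri|
      @index.fetch(tri, []).each { |word| counts[word] += 1 }
    end
    counts.sort_by { |word, count| [-count, word] }.first(limit).map(&:first)
  end

  # Pass 2: keep candidates within the edit-distance threshold, closest first.
  def search(query, max_distance:, limit: 10)
    candidates(query, limit: limit)
      .map { |word| [word, levenshtein(query, word)] }
      .select { |_, distance| distance <= max_distance }
      .sort_by { |word, distance| [distance, word] }
  end
end

index = TrigramIndex.new(%w[levenshtein levenstein trigram diagram ruby])
p index.search("levenshtien", max_distance: 2)
# => [["levenshtein", 2]]
```

    The expensive Levenshtein step only ever runs on the `limit` best trigram matches, which is what keeps the lookup cheap on a 50K to 100K word dictionary.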

    In Ruby (as you asked), using Fuzzily + the Text gem, obtaining the records within the edit distance threshold would look like:

    require "text"

    # Pass 1: trigram matches from Fuzzily; pass 2: filter by edit distance.
    MyRecords.find_by_fuzzy_name(input_string).select { |result|
      Text::Levenshtein.distance(input_string, result.name) < my_distance_threshold
    }
    

    This performs a handful of well-optimized database queries and a few

    Caveats:

    • if the "minimal" edit distance you're looking for is high, you'll still be doing lots of Levenshteins.
    • using trigrams assumes your input text is Latin text or close to it (European languages, basically).
    • there are probably edge cases, since nothing guarantees that "number of matching trigrams" is a good general approximation of "edit distance".
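
    To make the last caveat concrete, here is a small sketch that compares edit distance with the number of shared trigrams for a few word pairs. It assumes plain, unpadded trigrams; a padded scheme gives different counts, but the same kind of mismatch can still occur. The helper names are made up for this example.

```ruby
require "set"

# Unpadded trigrams: every 3-character window of the string.
def trigrams(str)
  return Set.new if str.length < 3
  (0..str.length - 3).map { |i| str[i, 3] }.to_set
end

# Dynamic-programming Levenshtein distance, keeping one row at a time.
def levenshtein(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

# Number of trigrams the two strings have in common.
def shared_trigrams(a, b)
  (trigrams(a) & trigrams(b)).size
end

[%w[cat cot], %w[abcdef abcxef], %w[night nightshade]].each do |a, b|
  puts "#{a} / #{b}: distance=#{levenshtein(a, b)}, shared trigrams=#{shared_trigrams(a, b)}"
end
```

    Here "cat" and "cot" are one edit apart but share no trigram at all, so a trigram pass would never propose one for the other, while "night" and "nightshade" share every trigram of "night" yet are five edits apart.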
