Alternative to Levenshtein and Trigram

春和景丽 2021-02-07 09:48

Say I have the following two strings in my database:

(1) 'Levi Watkins Learning Center - Alabama State University'
(2) 'ETH Library'

My sof

6 Answers
  •  不思量自难忘°
    2021-02-07 10:32

    Keyword Counting

    You haven't really defined why you think option one is a "closer" match, at least not in any algorithmic sense. It seems like you're basing your expectations on the notion that option one has more matching keywords than option two, so why not just match based on the number of keywords in each string?

    For example, using Ruby 2.0:

    string1 = 'Levi Watkins Learning Center - Alabama State University'
    string2 = 'ETH Library'
    strings = [string1, string2]
    
    keywords  = 'Alabama University'.split
    keycount  = {}
    
    # Count matching keywords in each string.
    strings.each do |str|
      keyword_hits  = Hash.new(0)
      # Escape each keyword so regex metacharacters are matched literally.
      keywords.each { |word| keyword_hits[word] += str.scan(/#{Regexp.escape(word)}/).count }
      keyword_count = keyword_hits.values.reduce :+
      keycount[str] =  keyword_count
    end
    
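    # Sort by keyword count (highest first), and print results.
    # Note: a plain Hash#sort would order by the string keys, not the counts.
    keycount.sort_by { |_, count| -count }.each { |str, count| p "#{count}: #{str}" }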
    

    This will print:

    "2: Levi Watkins Learning Center - Alabama State University"
    "0: ETH Library"

    which matches your expectations of the corpus. You might want to make additional passes on the results using other algorithms to refine the results or to break ties, but this should at least get you pointed in the right direction.
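As a rough sketch of what such a tie-breaking pass could look like: the helper below counts case-insensitive, whole-word keyword hits and, when two strings score the same, prefers the shorter one. The method name `rank_by_keywords`, the third candidate string, and the shorter-string tie-break rule are all assumptions made for illustration; they are not part of the question or of the approach above.

```ruby
# Hypothetical helper: rank strings by keyword hits, highest first.
# Ties are broken by preferring the shorter string (an assumed rule,
# chosen only to illustrate a second pass).
def rank_by_keywords(strings, keywords)
  # Whole-word, case-insensitive patterns; keywords are escaped so that
  # regex metacharacters are matched literally.
  patterns = keywords.map { |word| /\b#{Regexp.escape(word)}\b/i }

  scored = strings.map do |str|
    hits = patterns.map { |re| str.scan(re).count }.reduce(0, :+)
    [str, hits]
  end

  # Primary key: more hits first. Secondary key: shorter string first.
  scored.sort_by { |str, hits| [-hits, str.length] }
end

candidates = [
  'Levi Watkins Learning Center - Alabama State University',
  'ETH Library',
  'University of Alabama' # made-up entry, added only to force a tie
]

ranked = rank_by_keywords(candidates, %w[Alabama University])
ranked.each { |str, hits| puts "#{hits}: #{str}" }
```

Whether shorter strings really should win a tie depends on your data; swapping the secondary sort key for an edit-distance or trigram score would be one way to combine this with the algorithms named in the title.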
