Alternative to Levenshtein and Trigram

春和景丽 2021-02-07 09:48

Say I have the following two strings in my database:

(1) 'Levi Watkins Learning Center - Alabama State University'
(2) 'ETH Library'

My sof

6 answers

不思量自难忘° (original poster)

2021-02-07 10:32
Keyword Counting

You haven't really defined why you think option one is a "closer" match, at least not in any algorithmic sense. It seems like you're basing your expectations on the notion that option one has more matching keywords than option two, so why not just match based on the number of keywords in each string?

For example, using Ruby 2.0:
```
require 'pp' # needed on Ruby 2.0; loaded automatically from Ruby 2.5 on

string1 = 'Levi Watkins Learning Center - Alabama State University'
string2 = 'ETH Library'
strings = [string1, string2]

keywords = 'Alabama University'.split
keycount = {}

# Count matching keywords in each string.
strings.each do |str|
  keyword_hits = Hash.new(0)
  # Escape each keyword so regex metacharacters are matched literally.
  keywords.each { |word| keyword_hits[word] += str.scan(/#{Regexp.escape(word)}/).count }
  keycount[str] = keyword_hits.values.reduce(0, :+)
end

# Sort by keyword count (highest first), and print results.
keycount.sort_by { |_str, count| -count }.each { |str, count| pp "#{count}: #{str}" }
```
This will print:

"2: Levi Watkins Learning Center - Alabama State University"
"0: ETH Library"

which matches your expectations of the corpus. You might want to make additional passes on the results using other algorithms to refine the results or to break ties, but this should at least get you pointed in the right direction.
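One way such a tie-breaking pass could look is sketched below. This is only an illustration, not part of the approach above: the helper `rank_by_keywords`, the "density" tie-breaker (keyword hits divided by the number of words in the string), and the extra candidate `'Alabama University'` are all assumptions made up for the example. Note that it also swaps the substring scan for case-insensitive whole-word matching.

```ruby
# Hypothetical helper: rank strings by keyword hits, breaking ties by
# preferring the string where keywords make up a larger share of its words.
def rank_by_keywords(strings, keywords)
  scored = strings.map do |str|
    # Whole-word, case-insensitive matching (an assumption of this sketch).
    words   = str.downcase.scan(/\w+/)
    hits    = keywords.map { |kw| words.count(kw.downcase) }.reduce(0, :+)
    density = words.empty? ? 0.0 : hits.to_f / words.size
    [str, hits, density]
  end
  # Primary key: hit count (descending). Tie-breaker: density (descending).
  scored.sort_by { |_str, hits, density| [-hits, -density] }
end

# The third candidate is invented here to force a tie on hit count.
candidates = [
  'Levi Watkins Learning Center - Alabama State University',
  'ETH Library',
  'Alabama University'
]

rank_by_keywords(candidates, %w[Alabama University]).each do |str, hits, _density|
  puts "#{hits}: #{str}"
end
```

With these inputs, two candidates tie on two hits each, and the shorter, denser string is ranked first. Whether density is a sensible tie-breaker depends on your data; an edit-distance or trigram score could be used as the secondary key instead.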