Ruby Anagram Using String#sum

让人想犯罪 __ submitted on 2019-12-06 11:57:39

Unfortunately I don't think String#sum is a robust way to solve this problem.

Consider:

"zaa".sum # => 316
"yab".sum # => 316

Same sum, but not anagrams.

Instead, how about grouping them by the sorted order of their characters?

words = %w[cars scar for four creams scream racs]

anagrams = words.group_by { |word| word.chars.sort }.values
# => [["cars", "scar", "racs"], ["for"], ["four"], ["creams", "scream"]]

The same grouping can also be written out with an explicit hash:

words = %w[cars scar for four creams scream racs]
res = {}

words.each do |word|
  # Anagrams share the same sorted-letter key.
  key = word.chars.sort.join
  res[key] ||= []
  res[key] << word
end

p res.values
# prints [["cars", "scar", "racs"], ["for"], ["four"], ["creams", "scream"]]
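
The sorted-characters key also gives a direct test for a single pair of words. This is only a minimal sketch: `anagram?` is a hypothetical helper name, not a built-in, and folding case before comparing is an assumption about the input.

```ruby
# Hypothetical helper: two words are anagrams when their sorted
# characters match. Case is folded first (an assumption about the input).
def anagram?(a, b)
  a.downcase.chars.sort == b.downcase.chars.sort
end

anagram?("cars", "scar")  # => true
anagram?("zaa", "yab")    # => false, although both have String#sum 316
```

Sorting costs O(n log n) per word, which is negligible for dictionary-sized words.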

Actually, I think you could use sums for anagram testing, not by summing the chars' ordinals themselves, but with something like this instead:

words = %w[cars scar for four creams scream racs]
# Use a base one larger than the longest word, so that no letter count
# can carry over into the next letter's digit:
base = words.map(&:length).max + 1
# => 7
# Assumes lowercase a-z words only.
words.group_by { |word|
  word.bytes.map { |b|
    base ** (b - 'a'.ord)
  }.inject(:+)
}
# => {1861044111897706=>["cars", "scar", "racs"], 233308737076863=>["for"], 80025575034688864=>["four"], 1861057953187308=>["creams", "scream"]}

The idea is to map every word to a base-N number in which each digit position stands for a different character, and the digit itself counts how often that character occurs. N must be larger than the length of the longest word in the input set, because with N equal to that length a letter repeated N times would carry over into the next position; hence the + 1. The keys grow quickly, but Ruby integers are arbitrary precision.
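
To see why the choice of base matters for this encoding, here is a small check on a made-up two-word input. `letter_sum` is a hypothetical helper written for this illustration, and it assumes lowercase a-z words.

```ruby
# Hypothetical helper: encode a lowercase a-z word as a sum of powers of
# `base`, one power per letter, so that letter order does not matter.
def letter_sum(word, base)
  word.bytes.sum { |b| base**(b - 'a'.ord) }
end

words = %w[aa b]
maxlen = words.map(&:length).max  # => 2

# With base == maxlen, two a's carry over into the b position.
letter_sum("aa", maxlen) == letter_sum("b", maxlen)          # => true (collision)

# With base == maxlen + 1, no letter count can overflow its digit.
letter_sum("aa", maxlen + 1) == letter_sum("b", maxlen + 1)  # => false
```

With a base strictly greater than the longest word length, each digit is an exact letter count, so equal sums imply equal letter counts.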

To get the desired output format, you just need hash.values. But note that just using the sum of the character codes in a word could fail on some inputs. It is possible for the sums of the character codes in two words to be the same by chance, when they are not anagrams.

If you used a different algorithm to combine the character codes, the chances of incorrectly identifying words as "anagrams" could be made much lower, but still not zero. Basically you need some kind of hash algorithm, but with the property that the order of the values being hashed doesn't matter. Perhaps map each character to a different random bitstring, and take the sum of the bitstrings for each character in the string?

That way, the chance of any two non-anagrams giving you a false positive would be approximately 1 in 2 ** bitstring_length.
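
The random-bitstring idea could be sketched roughly as follows. The names `BITS`, `MASK`, `CODES` and `anagram_hash` are hypothetical, the 64-bit width is an arbitrary choice, and truncating the sum to that width is one possible way to keep the hash fixed-size.

```ruby
require 'securerandom'

BITS = 64                 # assumed bitstring length
MASK = (1 << BITS) - 1

# One random BITS-bit value per character, generated on first use.
CODES = Hash.new { |h, ch| h[ch] = SecureRandom.random_number(1 << BITS) }

# Order-independent hash: sum of the per-character codes, kept to BITS bits.
def anagram_hash(word)
  word.each_char.sum { |ch| CODES[ch] } & MASK
end

words = %w[cars scar for four creams scream racs]
groups = words.group_by { |w| anagram_hash(w) }.values
```

Anagrams always land in the same group because addition is commutative. Non-anagrams can still collide with small probability, so comparing sorted characters remains the exact check where certainty is needed.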
