问题
Basically, Anagrams are like permutation of string.E.g stack
,sackt
,stakc
all are anagrams of stack
(thought above words aren't meaningful). Anyways you could have understood what I basically meant.
Now, I want a list of anagrams
given million words or simply say from a dictionary.
My basic question is Find total number of unique anagrams in a dictionary?
Sorting and comparing won't work as it's time complexity is pretty bad.
I thought of using hash table, string as key.
But the problem is what should be the hash function ? It would be helpful if some pseudocode provided. Some other approaches better than mentioned approaches would also be helpful.
Thanks.
回答1:
The obvious solution is to map each character to a prime number and multiply the prime numbers. So if 'a'' -> 2 and 'b' -> 3, then
- 'ab' -> 6
- 'ba' -> 6
- 'bab' -> 18
- 'abba' -> 36
- 'baba' -> 36
To minimise the chance of overflow, the smallest primes could be assigned to the more frequent letters (e,t,i,a,n). Note: The 26th prime is 101.
UPDATE: an implementation can be found here
回答2:
One possible hash function could be (assuming english words only) a sorted count of the number of occurrences of each letter. So for "anagram" you would generate [('a', 3), ('g', 1), ('n', 1), ('m', 1), ('r',1)].
Alternatively you could get an inexact grouping by generating a bitmask from your word where for bits 0-25 each bit represented the presence or absence of that letter (bit 0 representing 'a' through to bit 25 representining 'z'). But then you'd have to do a bit more processing to split each hashed group further to distinguish e.g. "to" from "too".
Do either of these ideas help? Any particular implementation language in mind (I could do C++, python or Scala)?
Edit: added some example Scala code and output:
OK: I'm in Scala mode at the moment, so I've knocked something up to do what you ask, but (ahem) it may not be very clear if you're not that familiar with Scala or functional programming.
Using a big list of english words from here: http://scrapmaker.com/data/wordlists/twelve-dicts/2of12.txt
I run this Scala code on them (takes about 5 seconds using Scala 2.9 in script mode, including time to compile, with a dictionary of about 40,000 words. Not the most efficient code, but the first thing that came to mind).
// Hashing function to go from a word to a sorted list of letter counts
def toHash(b:String) = b.groupBy(x=>x).map(v => (v._1, v._2.size) ).toList.sortWith(_._1 < _._1)
// Read all words from file, one word per line
val lines = scala.io.Source.fromFile("2of12.txt").getLines
// Go from list of words to list of (hashed word, word)
val hashed = lines.map( l => (toHash(l), l) ).toList
// Group all the words by hash (hence group all anagrams together)
val grouped = hashed.groupBy( x => x._1 ).map( els => (els._1, els._2.map(_._2)) )
// Sort the resultant anagram sets so the largest come first
val sorted = grouped.toList.sortWith( _._2.size > _._2.size )
for ( set <- sorted.slice(0, 10) )
{
println( set._2 )
}
This dumps out the first 10 sets of anagrams (sets with the most members first) being:
List(caret, cater, crate, react, trace)
List(reins, resin, rinse, risen, siren)
List(luster, result, rustle, sutler, ulster)
List(astir, sitar, stair, stria, tarsi)
List(latrine, ratline, reliant, retinal)
List(caper, crape, pacer, recap)
List(merit, miter, remit, timer)
List(notes, onset, steno, stone)
List(lair, liar, lira, rail)
List(drawer, redraw, reward, warder)
Note that this uses the first suggestion (list of counts of letters) not the more complicated bitmask method.
Edit 2: You can replace the hash function with a simple sort on the chars of each word (as suggested by JAB) and get the same result with clearer/faster code:
def toHash(b:String) = b.toList.sortWith(_<_)
回答3:
If you XOR the hash-code values of each character, and then XOR the result by the input length, you will get the same value regardless of the order of the word, meaning that all anagrams will produce the same hash. (XORing by the length prevents 'boss' and 'bo' from returning the same value, because the hash of the 's' against itself is always 0.)
Example:
int AnagramHash(string input)
{
int output = 0;
foreach(char c in input)
output ^= c.GetHashCode();
return output ^ input.Length;
}
You will still have to search for all words with the same AnagramHash. I would update the dictionary table with a field for the hash (regardless of your algorithm) to reduce overall computation.
EDIT: Also, as a side note, XOR is the simplest operation performed by the ALU so if you do end up using it, you should be able to generate your hashes fairly quickly.
回答4:
Sorting and comparing won't work as it's time complexity is pretty bad.
Exchanging time complexity for extra memory, just store the counts of the letters in a word in a 26-char
(or the equivalent in whatever language you're using, and assuming you're using the Roman alphabet and only alphabetic characters) array and hash the array. You're stuck with O(n) time relative to word length, but most English words aren't really that long.
e.g. stack
, sackt
, and stakc
would all have an array with the locations for s
, t
, a
, c
, k
== 1 and the rest all set to 0.
Based on your comment, which implies that you are indeed okay with sorting the characters of a word as long as you aren't sorting words themselves, you could do something even simpler than Alex's answer and just sort the characters in the word strings and hash the results. (larsmans said it first, but didn't post it as an answer, so...)
回答5:
Use a hashmap with string as key and list(string) as value where list of strings contain all anagrams of a key string.
The question is similar to "find all anagrams of a word in a file"
View algo and code here http://justprogrammng.blogspot.com/2012/06/determine-anagrams-of-word-in-file.html
来源:https://stackoverflow.com/questions/11108541/get-list-of-anagrams-from-a-dictionary