Can anyone suggest me on what data structure to use for a soundex algorithm program? The language to be used is Java. If anybody has worked on this before in Java. The program should have these features: be able to read about 50,000 words should be able to read a word and return the related words having the same soundex
I don't want the program implementation just few advices on what data structure to use.
TIP: If you use SQL as a databackend then you can let SQL handle it with the two sql-functions SOUNDEX and DIFFERENCE.
Maybe not what you wanted, but many people do not know that MSsql has those two functions.
Well soundex can be implemented in a straightforward pass over a string, so that doesn't require anything special.
After that the 4 character code can be treated as an integer key.
Then just build a dictionary that stores word sets indexed by that integer key. 50,000 words should easily fit into memory so nothing fancy is required.
Then walk the dictionary and each bucket is a group of similar sounding words.
Actually, here is the whole program in perl:
#!/usr/bin/perl
use Text::Soundex;
use Data::Dumper;
open(DICT,"</usr/share/dict/linux.words");
my %dictionary = ();
while (<DICT>) {
chomp();
chomp();
push @{$dictionary{soundex($_)}},$_;
}
close(DICT);
while (<>) {
my @words = split / +/;
foreach (@words) {
print Dumper $dictionary{soundex($_)};
}
}
I believe you just need to convert the original strings into soundex keys into a hashtable; the value for each entry in the table would be a collection of original strings mapping to that soundex.
The MultiMap collection interface (and its implementations) in Google Collections would be useful to you.
class SpellChecker
{
interface Hash {
String hash(String);
}
private final Hash hash;
private final Map<String, Set<String>> collisions;
SpellChecker(Hash hash) {
this.hash = hash;
collisions = new TreeSet<String, Set<String>>();
}
boolean addWord(String word) {
String key = hash.hash(word);
Set<String> similar = collisions.get(key);
if (similar == null)
collisions.put(key, similar = new TreeSet<String>());
return similar.add(word);
}
Set<String> similar(String word) {
Set<String> similar = collisions.get(hash.hash(word));
if (similar == null)
return Collections.emptySet();
else
return Collections.unmodifiableSet(similar);
}
}
The hash strategy could be Soundex, Metaphone, or what have you. Some strategies might be tunable (how many characters does it output, etc.)
Since soundex is a hash, I'd use a hash table, with the soundex as the key.
you want a 4-byte integer.
The soundex algorithm always returns a 4-character code, if you use ANSI inputs, you'll get 4-bytes back (represented as 4 letters).
So store the codes returned in a hashtable, convert your word to the code and look it up in the hashtable. Its really that easy.
来源:https://stackoverflow.com/questions/270771/data-structure-for-soundex-algorithm