Clustering words based on their char set

Question

Say there is a set of words and I would like to cluster them based on their char bag (multiset). For example,

{tea, eat, abba, aabb, hello}

will be clustered into

{{tea, eat}, {abba, aabb}, {hello}}.

abba and aabb are clustered together because they have the same char bag, i.e. two a's and two b's.

To make it efficient, a naive way I can think of is to convert each word into a char-count series; for example, abba and aabb will both be converted to a2b2, and tea/eat will be converted to a1e1t1. Then I can build a dictionary and group words with the same key.

Two issues here: first, I have to sort the chars to build the key; second, the string key looks awkward and its performance is not as good as that of char/int keys.

Is there a more efficient way to solve the problem?

Answer 1:

For detecting anagrams you can use a hashing scheme based on the product of prime numbers: A->2, B->3, C->5, etc. This gives "abba" == "aabb" == 36 (but a different letter-to-prime mapping will be better). See my answer here.
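The prime-product idea can be sketched roughly as below. This is only an illustration under assumptions not stated in the answer: the class name `PrimeProductKey` and method `key` are made up, the alphabet is assumed to be lowercase a-z, and `BigInteger` is used because the product overflows a `long` for longer words.

```java
import java.math.BigInteger;

class PrimeProductKey {
    // First 26 primes, one per lowercase letter 'a'..'z' (assumed alphabet).
    private static final int[] PRIMES = {
        2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
        43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101
    };

    // Multiplies the primes of all letters. By unique factorization,
    // two words get the same product exactly when they have the same char bag.
    static BigInteger key(String word) {
        BigInteger product = BigInteger.ONE;
        for (char c : word.toCharArray()) {
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z supported in this sketch: " + c);
            }
            product = product.multiply(BigInteger.valueOf(PRIMES[c - 'a']));
        }
        return product;
    }

    public static void main(String[] args) {
        System.out.println(key("abba")); // 36
        System.out.println(key("aabb")); // 36
        System.out.println(key("tea").equals(key("eat"))); // true
    }
}
```

The resulting `BigInteger` can be used directly as a `HashMap` key. The cost is that the key grows with word length, so this is not a fixed-size int key.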

Answer 2:

Since you are going to sort the characters of each word, I assume all character values are in the range 0-255. Then you can do a counting sort over each word.

The counting sort takes time proportional to the length of the input word, and reconstructing the sorted string from the counts takes O(wordLen). You cannot do better than O(wordLen): there is no predefined order, and you cannot make any assumptions about a word without iterating through all of its characters at least once. Traditional comparison-based sorts give you O(n * lg n), but non-comparison sorts such as counting sort give you O(n).

Iterate over all the words in the list and sort each one using the counting sort. Keep a map from each sorted word to the list of known words that map to it. Adding an element to a list takes constant time, so the overall complexity of the algorithm is O(n * avgWordLength), where n is the number of words.

Here is a sample implementation:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;


public class ClusterGen {

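    // Counting sort over the characters of one word.
    // Assumes every character code is in the range 0-255.
    static String sortWord(String w) {
        int[] freq = new int[256];

        // Count how often each character occurs.
        for (char c : w.toCharArray()) {
            freq[c]++;
        }
        StringBuilder sortedWord = new StringBuilder();
        // Emit each character freq[i] times in ascending order.
        // The inner loop runs w.length() times in total.
        for (int i = 0; i < freq.length; ++i) {
            for (int j = 0; j < freq[i]; ++j) {
                sortedWord.append((char)i);
            }
        }
        return sortedWord.toString();
    }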

    static Map<String, List<String>> cluster(List<String> words) {
        Map<String, List<String>> allClusters = new HashMap<String, List<String>>();

        for (String word : words) {
            String sortedWord = sortWord(word);
            List<String> cluster = allClusters.get(sortedWord);
            if (cluster == null) {
                cluster = new ArrayList<String>();
            }
            cluster.add(word);
            allClusters.put(sortedWord, cluster);
        }

        return allClusters;
    }

    public static void main(String[] args) {
        System.out.println(cluster(Arrays.asList("tea", "eat", "abba", "aabb", "hello")));
        System.out.println(cluster(Arrays.asList("moon", "bat", "meal", "tab", "male")));

    }
}

Returns

{aabb=[abba, aabb], ehllo=[hello], aet=[tea, eat]}
{abt=[bat, tab], aelm=[meal, male], mnoo=[moon]}

Answer 3:

Using an alphabet of x characters and a maximum word length of y, you can create hashes of (x + y) bits such that every anagram has a unique hash. A value of 1 for a bit means there is another of the current letter, a value of 0 means to move on to the next letter. Here's an example showing how this works:

Let's say we have a 7-letter alphabet (abcdefg) and a maximum word length of 4. Every word hash will be 11 bits. Let's hash the word "fade": 10001010100

The first bit is 1, indicating there is an a present. The second bit indicates that there are no more a's. The third bit indicates that there are no more b's, and so on. Another way to think about this is the number of ones in a row represents the number of that letter, and the total zeroes before that string of ones represents which letter it is.

Here is the hash for "dada": 11000110000

It's worth noting that distinct char bags always produce distinct hashes, which eliminates the need to check everything in your buckets when you are done hashing. The encoding is compact, although not every (x + y)-bit pattern corresponds to a valid word, so it is not strictly the smallest possible hash.

I'm well aware that using large alphabets and long words will result in a large hash size. This solution is geared towards guaranteeing unique hashes in order to avoid comparing strings. If you can design an algorithm to compute this hash in constant time (given that you know the values of x and y), then you'll be able to solve the entire grouping problem in O(n).
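The encoding described above could look something like the following sketch. The names `UnaryBagHash` and `hash`, and the choice to return the bits as a `String` padded with trailing zeros, are assumptions made for illustration. The hash is computed in O(x + y) per word, not in constant time.

```java
class UnaryBagHash {
    // alphabet: the allowed letters in order (x = alphabet.length()).
    // maxLen:   the maximum word length (y).
    // Returns a bit string of exactly x + y characters.
    static String hash(String word, String alphabet, int maxLen) {
        if (word.length() > maxLen) {
            throw new IllegalArgumentException("word longer than maxLen: " + word);
        }
        int[] counts = new int[alphabet.length()];
        for (char c : word.toCharArray()) {
            int idx = alphabet.indexOf(c);
            if (idx < 0) {
                throw new IllegalArgumentException("letter not in alphabet: " + c);
            }
            counts[idx]++;
        }
        StringBuilder bits = new StringBuilder();
        for (int count : counts) {
            // One '1' per occurrence of the current letter...
            for (int i = 0; i < count; i++) {
                bits.append('1');
            }
            // ...then a '0' to move on to the next letter.
            bits.append('0');
        }
        // Words shorter than maxLen are padded with zeros to a fixed width.
        while (bits.length() < alphabet.length() + maxLen) {
            bits.append('0');
        }
        return bits.toString();
    }

    public static void main(String[] args) {
        System.out.println(hash("fade", "abcdefg", 4)); // 10001010100
        System.out.println(hash("dada", "abcdefg", 4)); // 11000110000
    }
}
```

When x + y is at most 63, the bit string could be packed into a `long` with `Long.parseLong(bits, 2)` to get an integer key.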

Answer 4:

I would do this in two steps. First, sort all your words according to their length and work on each subset separately (this is to avoid lots of overlaps later).

The next step is harder and there are many ways to do it. One of the simplest would be to assign every letter a number (a = 1, b = 2, etc., for example) and add up the values for each word, thereby assigning each word an integer. Then you can sort the words according to this integer value, which drastically cuts the number of words you have to compare.

Depending on your data set you may still have a lot of overlaps ("bad" and "cac" would generate the same integer hash), so you may want to set a threshold: if you have too many words in one bucket, repeat the previous step with another hash (just assigning different numbers to the letters). Unless someone has looked at your code and designed a word list to mess you up, this should cut the overlaps to almost none.

Keep in mind that this approach will be efficient when you are expecting small numbers of words to be in the same char bag. If your data is a lot of long words that only go into a couple char bags, the number of comparisons you would do in the final step would be astronomical, and in this case you would be better off using an approach like the one you described - one that has no possible overlaps.
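A rough sketch of this bucket-then-compare approach is shown below. It makes several assumptions that are not in the answer: words are lowercase a-z, buckets are kept in a `HashMap` keyed on (length, letter sum) instead of being sorted, overlaps inside a bucket are resolved by comparing letter counts, and the re-hashing threshold is left out. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SumBucketClustering {
    // a = 1, b = 2, ..., z = 26, summed over the word.
    static int letterSum(String word) {
        int sum = 0;
        for (char c : word.toCharArray()) {
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z supported in this sketch: " + c);
            }
            sum += c - 'a' + 1;
        }
        return sum;
    }

    // Exact check used to separate overlaps such as "bad" and "cac".
    static boolean sameBag(String a, String b) {
        if (a.length() != b.length()) return false;
        int[] counts = new int[26];
        for (char c : a.toCharArray()) counts[c - 'a']++;
        for (char c : b.toCharArray()) counts[c - 'a']--;
        for (int n : counts) if (n != 0) return false;
        return true;
    }

    static List<List<String>> cluster(List<String> words) {
        // Coarse buckets: word length in the high bits, letter sum in the low bits.
        Map<Long, List<String>> buckets = new HashMap<>();
        for (String w : words) {
            long key = ((long) w.length() << 32) | letterSum(w);
            buckets.computeIfAbsent(key, k -> new ArrayList<>()).add(w);
        }
        // Inside each bucket, compare against one representative per cluster.
        List<List<String>> clusters = new ArrayList<>();
        for (List<String> bucket : buckets.values()) {
            List<List<String>> local = new ArrayList<>();
            for (String w : bucket) {
                List<String> home = null;
                for (List<String> candidate : local) {
                    if (sameBag(candidate.get(0), w)) { home = candidate; break; }
                }
                if (home == null) { home = new ArrayList<>(); local.add(home); }
                home.add(w);
            }
            clusters.addAll(local);
        }
        return clusters;
    }

    public static void main(String[] args) {
        System.out.println(cluster(Arrays.asList("bad", "cac", "dab", "tea")));
    }
}
```

The inner comparison loop is where the cost grows when many distinct char bags land in the same bucket.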

Answer 5:

One thing I've done that's similar to this, but allows for collisions, is to sort the letters, then get rid of duplicates. So in your example, you'd have buckets for "aet", "ab", and "ehlo".

Now, as I say, this allows for collisions. So "rod" and "door" both end up in the same bucket, which may not be what you want. However, the collisions will be a small set that is easily and quickly searched.

Once you have the string for a bucket, you'll notice you can convert it into a 32-bit integer (at least for ASCII). Each letter in the string becomes a bit in a 32-bit integer, so "a" is the first bit, "b" is the second bit, etc. All (English) words map to a bucket with a 26-bit identifier. You can then do very fast integer compares to find the bucket a new word goes into, or to find the bucket an existing word is in.
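The bucket identifier described above can be computed without building the deduplicated string first. The following is a minimal sketch that assumes lowercase a-z input; `LetterMask` and `mask` are illustrative names.

```java
class LetterMask {
    // Sets bit 0 for 'a', bit 1 for 'b', ..., bit 25 for 'z'.
    // Repeated letters set the same bit again, so duplicates are ignored.
    static int mask(String word) {
        int m = 0;
        for (char c : word.toCharArray()) {
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z supported in this sketch: " + c);
            }
            m |= 1 << (c - 'a');
        }
        return m;
    }

    public static void main(String[] args) {
        System.out.println(mask("abba"));                // 3 (bits for 'a' and 'b')
        System.out.println(mask("tea") == mask("eat"));  // true
        System.out.println(mask("rod") == mask("door")); // true: the collision noted above
    }
}
```

Because "rod" and "door" share a mask, words inside one bucket still need an exact char-bag comparison if collisions are not acceptable.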

Answer 6:

Count the frequency of the characters in each string, then build a hash table keyed on the frequency table. For example, the strings aczda and aacdz both give 20110000000000000000000001. Using a hash table, we can partition all these strings into buckets in O(N).
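One way this might be written is sketched below, assuming lowercase a-z input. Unlike the digit string in the example, the key here is produced by `Arrays.toString`, which separates the counts with commas so that a count of 11 cannot be confused with two counts of 1. The names `FrequencyKeyClustering`, `frequencyKey`, and `cluster` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FrequencyKeyClustering {
    // Builds a key such as "[2, 0, 1, 1, 0, ..., 1]" from the 26 letter counts.
    static String frequencyKey(String word) {
        int[] counts = new int[26];
        for (char c : word.toCharArray()) {
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("only a-z supported in this sketch: " + c);
            }
            counts[c - 'a']++;
        }
        return Arrays.toString(counts);
    }

    // Groups words whose frequency tables are identical.
    static Map<String, List<String>> cluster(List<String> words) {
        Map<String, List<String>> clusters = new HashMap<>();
        for (String w : words) {
            clusters.computeIfAbsent(frequencyKey(w), k -> new ArrayList<>()).add(w);
        }
        return clusters;
    }

    public static void main(String[] args) {
        System.out.println(frequencyKey("aczda").equals(frequencyKey("aacdz"))); // true
        System.out.println(cluster(Arrays.asList("tea", "eat", "abba", "aabb", "hello")).values());
    }
}
```

No sorting is needed, but each key is a string of fixed size proportional to the alphabet, regardless of word length.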

Answer 7:

26-bit integer as a hash function

If your alphabet isn't too large, for instance just lowercase English letters, you can define this particular hash function for each word: a 26-bit integer where each bit represents whether that English letter exists in the word. Note that two words with the same char set will have the same hash. So will words that use the same letters a different number of times, such as "rod" and "door", so this groups by char set and not by char bag.

Then just add them to a hash table. They will automatically be clustered by hash collisions.

It will take O(max length of a word) to calculate a hash, and insertion into a hash table is constant time. So the overall complexity is O(max length of a word * number of words).

Source: https://stackoverflow.com/questions/18167922/clustering-words-based-on-their-char-set

Tags

algorithm

anagram