I am looking for an efficient way to group about 10 million strings into clusters based on the appearance of similar word sequences.
Consider a list of strings like:
Clustering is the wrong tool for you.
For any unsupervised algorithm, the following partitioning will be as good:
the fruit hut number one
the ice cre am shop number one
jim's taco outlet number one

the ice cream shop
the fruit hut
jim's taco

ice cream shop in the corner
jim's t aco in the corner
the fruit hut in the corner
Because to a clustering algorithm, "number one" and "in the corner" are also shared phrases. The second cluster is simply the leftovers.
Use something supervised instead.
I think you are looking for near-duplicate detection, with some threshold you will tune so that you cluster not only near duplicates but also documents that are merely similar enough.
One of the well-known solutions is to use Jaccard similarity to measure how close two documents are.
Jaccard similarity is basically: take the set of words from each document, call these sets s1 and s2, and the Jaccard similarity is |s1 ∩ s2| / |s1 ∪ s2|.
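As a minimal sketch in Python (splitting on whitespace to get the word sets is an assumption here, not something prescribed above):

```python
def jaccard_similarity(doc1: str, doc2: str) -> float:
    """Jaccard similarity of the word sets of two documents."""
    s1, s2 = set(doc1.split()), set(doc2.split())
    if not (s1 | s2):
        return 1.0  # convention: two empty documents are identical
    return len(s1 & s2) / len(s1 | s2)

# word order is ignored here:
print(jaccard_similarity("the ice cream shop",
                         "ice cream shop in the corner"))  # 4/6 ≈ 0.67
```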
Usually when facing near duplicates, however, the order of words has some importance. To deal with this, when generating the sets s1 and s2 you actually generate sets of k-shingles (or k-grams) instead of sets of single words.
In your example, with k=2, the sets will be:
s2 = { the ice, ice cre, cre am, am shop, shop number, number one }  (from "the ice cre am shop number one")
s4 = { ice cream, cream shop, shop in, in the, the corner }  (from "ice cream shop in the corner")
s5 = { the ice, ice cream, cream shop }  (from "the ice cream shop")
s4 ∪ s5 = { ice cream, cream shop, shop in, in the, the corner, the ice }
s4 ∩ s5 = { ice cream, cream shop }
In the above, the Jaccard similarity of s4 and s5 will be 2/6.
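The same computation as a small Python sketch (whitespace tokenization again assumed), reproducing the 2/6 above:

```python
def shingles(doc: str, k: int = 2) -> set:
    """Word-level k-shingles (k-grams) of a document."""
    words = doc.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 1.0

s4 = shingles("ice cream shop in the corner", k=2)
s5 = shingles("the ice cream shop", k=2)
print(sorted(s4 & s5))   # ['cream shop', 'ice cream']
print(jaccard(s4, s5))   # 0.333... (= 2/6)
```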
In your case, ordinary k-shingles may perform worse than single words (1-shingles), but you will have to test both approaches.
This procedure can be scaled to deal very efficiently with huge collections, without checking all pairs or creating huge numbers of sets. More details can be found in these lecture notes (I gave this lecture ~2 years ago, based on the author's notes).
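To give a flavor of how that scaling usually works, here is a rough, plain-Python min-hashing + banding sketch. This is my own illustration, not code from the linked notes; the hash construction and the bands/rows split (20 × 5) are arbitrary choices:

```python
import hashlib
from collections import defaultdict

def minhash_signature(shingle_set, num_hashes=100):
    """Compress a shingle set into num_hashes minimum hash values.
    The fraction of positions where two signatures agree estimates
    the Jaccard similarity of the underlying sets."""
    sig = []
    for i in range(num_hashes):
        # simulate the i-th hash function by salting with i
        sig.append(min(int(hashlib.md5(f"{i}:{sh}".encode()).hexdigest(), 16)
                       for sh in shingle_set))
    return sig

def lsh_buckets(signatures, bands=20, rows=5):
    """Documents whose signatures agree on at least one band of `rows`
    values land in the same bucket; only those need an exact comparison."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    return buckets

# toy usage with the shingle sets from the example above:
sigs = {
    "doc4": minhash_signature({"ice cream", "cream shop", "shop in", "in the", "the corner"}),
    "doc5": minhash_signature({"the ice", "ice cream", "cream shop"}),
}
candidate_pairs = [ids for ids in lsh_buckets(sigs).values() if len(ids) > 1]
```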
After you are done with this procedure, you basically have a measure d(s1, s2) of the distance between every two sentences, and you can use any known clustering algorithm to cluster them.
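For example, with scikit-learn (an assumption, it isn't mentioned above) you can feed a precomputed Jaccard-distance matrix to a density-based clusterer. A full 10M × 10M matrix is obviously out of the question, so this only makes sense per LSH bucket or on a sample; the eps value below is an arbitrary choice:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def shingles(doc, k=2):
    words = doc.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard_distance(a, b):
    return 1.0 - (len(a & b) / len(a | b) if (a | b) else 1.0)

docs = ["the ice cream shop", "ice cream shop in the corner",
        "the fruit hut", "the fruit hut in the corner"]
sets = [shingles(d) for d in docs]

# pairwise distances -- only feasible for a small sample or within an LSH bucket
n = len(sets)
D = np.array([[jaccard_distance(sets[i], sets[j]) for j in range(n)] for i in range(n)])

labels = DBSCAN(eps=0.7, min_samples=1, metric="precomputed").fit(D).labels_
print(labels)  # groups the two ice-cream strings together and the two fruit-hut strings together
```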
Disclaimer: I used my answer from this thread as the basis for this one after realizing near-duplicate detection might fit here.