Clustering Strings Based on Similar Word Sequences

深忆病人 asked on 2021-01-24 04:24

I am looking for an efficient way to group about 10 million strings into clusters based on the appearance of similar word sequences.

Consider a list of strings like:

2 Answers

后悔当初 answered on 2021-01-24 05:04

    I think you are looking for near-duplicate detection, with some threshold (which you will have to tune) so that you cluster not only near duplicates but also documents that are merely similar enough.

    One of the known solutions is to use Jaccard similarity to measure how similar (or different) two documents are.

    Jaccard similarity works as follows: take the set of words from each document, and call these sets s1 and s2. The Jaccard similarity is then |s1 ∩ s2| / |s1 ∪ s2|.
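
    As a quick illustration (my own minimal sketch, not part of the original answer), the word-set version of this measure in Python, using two short sentences in the spirit of the example:

        def jaccard(s1: set, s2: set) -> float:
            """Jaccard similarity of two sets: |s1 & s2| / |s1 | s2|."""
            if not s1 and not s2:
                return 1.0  # convention: two empty documents count as identical
            return len(s1 & s2) / len(s1 | s2)

        doc_a = set("the ice cream shop number one".split())
        doc_b = set("the ice cream shop".split())
        print(jaccard(doc_a, doc_b))  # 4 shared words / 6 total -> ~0.67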

    Usually when facing near duplicates, however, the order of words matters to some degree. To deal with this, when generating the sets s1 and s2 you actually generate sets of k-shingles (or k-grams) instead of sets of single words.
    In your example, with k=2 and sentence 4 being "ice cream shop in the corner", the sets will be:

    s2 = { the ice, ice cream, cream shop, shop number, number one }
    s4 = { ice cream, cream shop, shop in, in the, the corner }
    s5 = { the ice, ice cream, cream shop }

    s4 ∪ s5 = { ice cream, cream shop, shop in, in the, the corner, the ice }
    s4 ∩ s5 = { ice cream, cream shop }
    

    In the above, the Jaccard similarity is 2/6.
    In your case ordinary k-shingles might perform worse than single words (1-shingles), but you will have to test both approaches.
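
    Here is a short sketch of that shingling step (again my own illustration, not from the original answer), which reproduces the 2/6 above:

        def shingles(text: str, k: int = 2) -> set:
            """Set of k-word shingles (k-grams) of a sentence."""
            words = text.lower().split()
            return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

        s4 = shingles("ice cream shop in the corner")
        s5 = shingles("the ice cream shop")
        print(len(s4 & s5) / len(s4 | s5))  # -> 2/6, matching the example above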

    This procedure can be scaled nicely to deal very efficiently with huge collections, without comparing all pairs or materializing huge numbers of sets. More details can be found in these lecture notes (I gave this lecture about two years ago, based on the author's notes).
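
    The standard technique for that scaling is MinHash: compress each shingle set into a short signature whose per-position agreement rate estimates the Jaccard similarity, then use locality-sensitive hashing on the signatures to find candidate pairs instead of comparing all 10 million strings pairwise. Below is a minimal signature sketch of my own (assuming simple salted hashes rather than any particular library, and not the lecture-notes implementation):

        import hashlib

        NUM_HASHES = 128  # signature length; more hash functions -> better estimate

        def _hash(shingle: str, seed: int) -> int:
            """Salted 64-bit hash of a shingle; each seed acts as one hash function."""
            data = f"{seed}:{shingle}".encode()
            return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

        def minhash_signature(shingle_set: set) -> list:
            """MinHash signature: for each hash function, the minimum hash over the set."""
            return [min(_hash(sh, seed) for sh in shingle_set) for seed in range(NUM_HASHES)]

        def estimated_jaccard(sig_a: list, sig_b: list) -> float:
            """The fraction of matching signature positions estimates the Jaccard similarity."""
            return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

        sig4 = minhash_signature({"ice cream", "cream shop", "shop in", "in the", "the corner"})
        sig5 = minhash_signature({"the ice", "ice cream", "cream shop"})
        print(estimated_jaccard(sig4, sig5))  # roughly 2/6 for a large enough NUM_HASHES

    Banding the rows of these signatures and hashing each band then gives candidate pairs directly, which is what avoids the quadratic all-pairs scan.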

    After you are done with this procedure, you basically have a measure d(s1, s2) of the distance between every two sentences, and you can use any known clustering algorithm to cluster them.
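
    One simple way to turn that measure into clusters (a sketch of my own; the 0.5 threshold and helper names are arbitrary illustrative choices, and the original answer leaves the algorithm open) is to link every pair whose similarity clears a threshold and take connected components with union-find:

        def cluster(docs: list, similarity, threshold: float = 0.5) -> dict:
            """Group docs whose pairwise similarity clears the threshold (connected components).

            Brute force over all pairs for clarity; with 10M strings you would only
            feed in the candidate pairs coming out of an LSH index.
            """
            parent = list(range(len(docs)))

            def find(i: int) -> int:
                while parent[i] != i:
                    parent[i] = parent[parent[i]]  # path halving
                    i = parent[i]
                return i

            def union(i: int, j: int) -> None:
                parent[find(i)] = find(j)

            for i in range(len(docs)):
                for j in range(i + 1, len(docs)):
                    if similarity(docs[i], docs[j]) >= threshold:
                        union(i, j)

            clusters = {}
            for i, doc in enumerate(docs):
                clusters.setdefault(find(i), []).append(doc)
            return clusters

        def word_jaccard(a: str, b: str) -> float:
            """Jaccard similarity on plain word sets, as defined earlier."""
            sa, sb = set(a.split()), set(b.split())
            return len(sa & sb) / len(sa | sb)

        sentences = ["the ice cream shop number one",
                     "ice cream shop in the corner",
                     "the ice cream shop"]
        print(cluster(sentences, word_jaccard, threshold=0.5))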


    Disclaimer: I used my answer from this thread as the basis for this one after realizing near-duplicate detection might fit here.
