Clustering Strings Based on Similar Word Sequences

深忆病人 2021-01-24 04:24

I am looking for an efficient way to group about 10 million strings into clusters based on the appearance of similar word sequences.

Consider a list of strings like:

    the ice cream shop
    the ice cre am shop number one
    ice cream shop in the corner
    the fruit hut
    the fruit hut number one
    the fruit hut in the corner
    jim's taco
    jim's taco outlet number one
    jim's t aco in the corner

2 Answers
  • 2021-01-24 05:02

    Clustering is the wrong tool for you.

    For any unsupervised algorithm, the following partitioning will be as good:

    the fruit hut number one
    the ice cre am shop number one
    jim's taco outlet number one
    
    the ice cream shop
    the fruit hut
    jim's taco
    
    ice cream shop in the corner
    jim's t aco in the corner
    the fruit hut in the corner
    

    To a clustering algorithm, "number one" and "in the corner" are shared phrases just as much as the shop names are. The second cluster is simply the leftovers.

    Use something supervised instead.

  • 2021-01-24 05:04

    I think you are looking for near-duplicate detection, with a threshold you tune so that not only near duplicates but also sufficiently similar documents get clustered together.

    One of the well-known solutions is to use Jaccard similarity to measure how close two documents are.

    Jaccard similarity works as follows: take the set of words in each document, and call these sets s1 and s2; the Jaccard similarity is then |s1 ∩ s2| / |s1 ∪ s2|.
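
    As a minimal sketch of this in Python (the helper name and the plain whitespace tokenization are my own choices):

    def jaccard(doc1, doc2):
        # Jaccard similarity over word sets; tokenization is a plain split.
        s1, s2 = set(doc1.split()), set(doc2.split())
        if not s1 or not s2:
            return 0.0  # guard against empty documents
        return len(s1 & s2) / len(s1 | s2)

    print(jaccard("the ice cream shop", "ice cream shop in the corner"))  # 4/6 ~ 0.67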

    Usually when facing near duplicates, the order of words does matter, however. To deal with that, when generating the sets s1 and s2 you actually generate sets of k-shingles (or k-grams) instead of sets of single words.
    For example, with k=2, the shingle sets for "the ice cre am shop number one", "ice cream shop in the corner", and "the ice cream shop" will be:

    s2 = { the ice, ice cre, cre am, am shop, shop number, number one }
    s4 = { ice cream, cream shop, shop in, in the, the corner }
    s5 = { the ice, ice cream, cream shop }
    
    s4 ∪ s5 = { ice cream, cream shop, shop in, in the, the corner, the ice }
    s4 ∩ s5 = { ice cream, cream shop }
    

    In the above, the Jaccard similarity of s4 and s5 will be 2/6.
    In your case, ordinary k-shingles might perform worse than single words (1-shingles), so you will have to test both approaches; see the sketch below.
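
    Here is a small sketch that reproduces the s4/s5 numbers above (word-level shingles over a whitespace split; the helper name is mine):

    def shingles(text, k=2):
        # Return the set of k-word shingles (k-grams) of the string.
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    s4 = shingles("ice cream shop in the corner")
    s5 = shingles("the ice cream shop")
    print(s4 & s5)                      # {'ice cream', 'cream shop'}
    print(len(s4 & s5) / len(s4 | s5))  # 2/6 ~ 0.33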

    This procedure scales nicely to very large collections: it can be made very efficient without checking all pairs or materializing huge numbers of sets. More details can be found in these lecture notes (I gave this lecture ~2 years ago, based on the author's notes).
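
    One standard way to get this kind of scaling is min-hashing plus locality-sensitive hashing: similar shingle sets get similar short signatures, and candidate pairs are found by bucketing signature bands instead of comparing all pairs. A rough from-scratch sketch of the min-hash part (the signature length and hash construction are illustrative choices, not a prescription):

    import random

    NUM_HASHES = 100
    PRIME = (1 << 61) - 1  # a Mersenne prime, used as the hash modulus
    rng = random.Random(42)
    # One (a, b) pair per random hash function h(x) = (a*x + b) mod PRIME.
    COEFFS = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
              for _ in range(NUM_HASHES)]

    def minhash_signature(shingle_set):
        # Map each shingle to an integer once, then keep, per hash
        # function, the minimum hashed value over the whole set.
        xs = [hash(s) for s in shingle_set]
        return [min((a * x + b) % PRIME for x in xs) for a, b in COEFFS]

    def estimate_similarity(sig1, sig2):
        # The fraction of agreeing positions estimates Jaccard similarity.
        return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)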

    After you are done with this procedure, you basically have a measure d(s1, s2) of the distance between every two sentences, and you can use any known clustering algorithm to cluster them.
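
    As a toy illustration of that last step, clustering with scikit-learn's DBSCAN over precomputed 1 - Jaccard distances (the eps threshold is an arbitrary choice, and for 10 million strings you would only fill in distances for the candidate pairs found above, never the full matrix):

    import numpy as np
    from sklearn.cluster import DBSCAN

    def shingles(text, k=2):
        words = text.split()
        return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

    docs = ["the ice cream shop", "ice cream shop in the corner",
            "the fruit hut", "the fruit hut in the corner"]
    sets_ = [shingles(d) for d in docs]
    n = len(docs)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            jac = len(sets_[i] & sets_[j]) / len(sets_[i] | sets_[j])
            dist[i, j] = dist[j, i] = 1.0 - jac  # d(s1, s2) = 1 - Jaccard

    labels = DBSCAN(eps=0.7, min_samples=2, metric="precomputed").fit_predict(dist)
    print(labels)  # [0 0 1 1]: the ice-cream strings vs. the fruit-hut strings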


    Disclaimer: I used my answer from this thread as the basis for this one, after realizing that near duplicates might fit here.
