What does the Brown clustering algorithm output mean?

暗喜 2020-12-25 15:05

I've run the Brown clustering algorithm from https://github.com/percyliang/brown-cluster and also a Python implementation, https://github.com/mheilman/tan-clustering. And th

5 Answers
  • 2020-12-25 15:21

    If I understand correctly, the algorithm gives you a tree, and you need to truncate it at some level to get clusters. In the case of those bit strings, you just take the first L characters.

    For example, cutting at the second character gives you two clusters

    10           chased     
    
    11           dog        
    11           mouse      
    11           cat        
    

    At the third character you get

    110           dog        
    
    111           mouse      
    111           cat        
    

    The cutting strategy is a different subject though.
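
    A minimal sketch of that truncation idea (my own toy example, not code from either repository): group words by the first L characters of their bit strings.

    from collections import defaultdict

    def clusters_at_depth(word_bits, depth):
        """Group words by the first `depth` characters of their bit strings."""
        groups = defaultdict(list)
        for word, bits in word_bits.items():
            groups[bits[:depth]].append(word)
        return dict(groups)

    # Toy data matching the example above
    word_bits = {"chased": "10", "dog": "110", "mouse": "111", "cat": "111"}
    print(clusters_at_depth(word_bits, 2))
    # {'10': ['chased'], '11': ['dog', 'mouse', 'cat']}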

  • My guess is:

    According to Figure 2 in Brown et al. 1992, the clustering is hierarchical, and to get from the root to each word "leaf" you have to make a series of up/down decisions. If up is 0 and down is 1, you can represent each word as a bit string.

    From https://github.com/mheilman/tan-clustering/blob/master/class_lm_cluster.py:

    # the 0/1 bit to add when walking up the hierarchy
    # from a word to the top-level cluster
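
    A toy illustration (my own, not code from either repository) of reading such a bit string as a path through a binary hierarchy; which child is labelled 0 and which 1 is an assumption here:

    # Toy hierarchy: internal nodes are {'0': child, '1': child};
    # leaves are lists of words that share a cluster bit string.
    tree = {
        '0': ['the', 'a'],              # bit string "0"
        '1': {
            '0': ['chased'],            # bit string "10"
            '1': {
                '0': ['dog'],           # bit string "110"
                '1': ['mouse', 'cat'],  # bit string "111"
            },
        },
    }

    def words_at(node, bits):
        """Follow a bit string from the root down to its cluster."""
        for bit in bits:
            node = node[bit]
        return node

    print(words_at(tree, '110'))  # ['dog']
    print(words_at(tree, '111'))  # ['mouse', 'cat']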
    
  • 2020-12-25 15:36

    In Percy Liang's implementation (https://github.com/percyliang/brown-cluster), the --c parameter lets you specify the number of word clusters. The output contains every word in the corpus, together with a bit string identifying its cluster and the word frequency, in the following format: <bit string> <word> <word frequency>. The number of distinct bit strings in the output equals the number of desired clusters, and words with the same bit string belong to the same cluster.
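
    A small sketch of reading that output back into a word-to-cluster mapping (whitespace-separated parsing is an assumption based on the format above, and the file path is only an example of where wcluster writes its paths file):

    def read_clusters(path):
        """Map each word to its cluster bit string from wcluster's output."""
        word_to_cluster = {}
        with open(path) as f:
            for line in f:
                bits, word, freq = line.split()
                word_to_cluster[word] = bits
        return word_to_cluster

    clusters = read_clusters('input-c50-p1.out/paths')  # example location
    print(clusters.get('dog'))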

  • 2020-12-25 15:38

    The integers are counts of how many times the word is seen in the document. (I have tested this in the Python implementation.)

    From the comments at the top of the Python implementation:

    Instead of using a window (e.g., as in Brown et al., sec. 4), this code computes PMI using the probability that two randomly selected clusters from the same document will be c1 and c2. Also, since the total numbers of cluster tokens and pairs are constant across pairs, this code uses counts instead of probabilities.
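
    As a rough illustration of that count-based idea (my own sketch, not the repository's code): since the totals are the same for every cluster pair, dropping them does not change how pairs compare.

    import math

    def pmi_from_counts(pair_count, c1_count, c2_count, total_pairs, total_tokens):
        """PMI = log( p(c1, c2) / (p(c1) * p(c2)) ), computed from raw counts.
        Because total_pairs and total_tokens are constant across pairs,
        comparing pair_count / (c1_count * c2_count) ranks pairs the same way.
        """
        p_pair = pair_count / total_pairs
        p_c1 = c1_count / total_tokens
        p_c2 = c2_count / total_tokens
        return math.log(p_pair / (p_c1 * p_c2))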

    From the code in the Python implementation, we see that it outputs the word, the bit string, and the word count:

    def save_clusters(self, output_path):
        with open(output_path, 'w') as f:
            for w in self.words:
                f.write("{}\t{}\t{}\n".format(w, self.get_bitstring(w),
                                              self.word_counts[w]))
    
  • 2020-12-25 15:42

    Change your command to: ./wcluster --text input.txt --c 3

    --c number

    This number sets the number of clusters; the default is 50. You can't distinguish the different clusters of words because the default input has only three sentences. Change 50 clusters to 3 and you can tell the difference.

    I entered three tweets as the input and set the cluster parameter to 3.
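
    A quick check that the run really produced 3 clusters (a sketch; the file path is only an example of where wcluster writes its paths output):

    # Count distinct bit strings in the output file.
    bitstrings = set()
    with open('input-c3-p1.out/paths') as f:
        for line in f:
            bitstrings.add(line.split()[0])
    print(len(bitstrings))  # expect 3 when run with --c 3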
