Grouping Similar Strings

说谎 2021-01-19 13:33

I'm trying to analyze a bunch of search terms, so many that individually they don't tell much. That said, I'd like to group the terms because I think similar terms shoul

1 Answer
  • 2021-01-19 14:03

    You'll want to cluster these terms, and for the similarity metric I recommend Dice's coefficient computed over character n-grams. For example, partition each string into overlapping two-character sequences and compare the resulting sets (term1 = "NB", "BA", "A ", " B", "Ba", ...).

    nltk appears to provide Dice as nltk.metrics.association.BigramAssocMeasures.dice(), but it's simple enough to implement yourself in a way that allows tuning. Here's how to compare these strings at the character level rather than the word level.

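    #!/usr/bin/env python
    import operator
    import sys
    
    def tokenize(s, glen):
        """Return the set of character grams of length glen found in s."""
        return set(s[i:i + glen] for i in range(len(s) - glen + 1))
    
    def dice_grams(g1, g2):
        """Dice's coefficient of two gram sets: 2*|g1 & g2| / (|g1| + |g2|)."""
        total = len(g1) + len(g2)
        # Strings shorter than the gram length give empty sets; avoid dividing by zero.
        return (2.0 * len(g1 & g2)) / total if total else 0.0
    
    def dice(n, s1, s2):
        return dice_grams(tokenize(s1, n), tokenize(s2, n))
    
    def main():
        GRAM_LEN = 4  # gram size; tune for your data
        terms = sys.argv[1:]
        scores = {}
        # Score every unordered pair of command-line arguments.
        for i in range(len(terms)):
            for j in range(i + 1, len(terms)):
                s1, s2 = terms[i], terms[j]
                scores[s1 + ":" + s2] = dice(GRAM_LEN, s1, s2)
        # Print the pairs from least to most similar.
        for item in sorted(scores.items(), key=operator.itemgetter(1)):
            print(item)
    
    if __name__ == "__main__":
        main()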
    

    When this program is run with your strings, it prints the following similarity scores, sorted from least to most similar (GRAM_LEN is 4, so 4-character grams are compared):

    ./dice.py "NBA Basketball" "Basketball NBA" "Basketball" "Baseball"
    
    ('NBA Basketball:Baseball', 0.125)
    ('Basketball NBA:Baseball', 0.125)
    ('Basketball:Baseball', 0.16666666666666666)
    ('NBA Basketball:Basketball NBA', 0.63636363636363635)
    ('NBA Basketball:Basketball', 0.77777777777777779)
    ('Basketball NBA:Basketball', 0.77777777777777779)
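
    To see where a score such as 0.1666... comes from, it can help to list the grams by hand. The snippet below is only an illustration, not part of the program above; char_grams is a hypothetical helper that mirrors tokenize.

```python
def char_grams(s, n):
    """Set of all length-n substrings of s (hypothetical helper mirroring tokenize)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

a = char_grams("Basketball", 4)
b = char_grams("Baseball", 4)
shared = a & b

print(sorted(a))  # the 7 distinct 4-grams of "Basketball"
print(sorted(b))  # the 5 distinct 4-grams of "Baseball"
print(shared)     # the only gram the two strings have in common: 'ball'

# Dice's coefficient: 2 * |shared| / (|a| + |b|) = 2 * 1 / (7 + 5)
score = 2.0 * len(shared) / (len(a) + len(b))
print(score)
```

    The two strings share a single 4-gram out of 12 in total, which is why the pair scores so low even though the words look alike to a human reader.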
    

    At least for this example, the margin should be sufficient for clustering the terms into separate groups: the basketball terms score between 0.64 and 0.78 against each other, while no pair involving "Baseball" scores above 0.17. Alternatively, you may be able to use the similarity scores more directly in your code by comparing them against a threshold.
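
    One possible way to turn the scores into groups, sketched under assumptions: link any two terms whose score reaches a threshold, then take the connected components of the resulting graph (single-link clustering). The threshold of 0.5, the function name cluster_terms, and the union-find bookkeeping are illustrative choices and are not part of the original answer; grams and dice are redefined here only so the snippet is self-contained.

```python
def grams(s, n):
    """Set of all length-n substrings of s."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def dice(n, s1, s2):
    """Dice's coefficient over character n-grams; 0.0 if both gram sets are empty."""
    g1, g2 = grams(s1, n), grams(s2, n)
    total = len(g1) + len(g2)
    return 2.0 * len(g1 & g2) / total if total else 0.0

def cluster_terms(terms, n=4, threshold=0.5):
    """Group terms into connected components of the 'score >= threshold' graph.

    The default threshold is an assumed value; pick one from your own score distribution.
    """
    parent = list(range(len(terms)))  # union-find forest over term indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Compare every unordered pair and merge the groups of similar terms.
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            if dice(n, terms[i], terms[j]) >= threshold:
                parent[find(j)] = find(i)

    # Collect the terms of each component, keeping input order.
    groups = {}
    for i, term in enumerate(terms):
        groups.setdefault(find(i), []).append(term)
    return list(groups.values())

terms = ["NBA Basketball", "Basketball NBA", "Basketball", "Baseball"]
clusters = cluster_terms(terms)
print(clusters)
```

    With the four example strings this should put the three basketball terms in one group and "Baseball" in another. Note that the pairwise loop is quadratic in the number of terms and that single-link grouping can chain loosely related terms together, so a very large term list may call for a proper clustering library instead.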
