I'm trying to analyze a bunch of search terms, so many that individually they don't tell much. That said, I'd like to group the terms because I think similar terms should be analyzed together (the terms include things like "NBA Basketball", "Basketball NBA", "Basketball", and "Baseball").
You'll want to cluster these terms, and for the similarity metric I recommend Dice's coefficient at the character n-gram level. For example, partition each string into overlapping two-letter sequences and compare those sets ("NBA Basketball" becomes "NB", "BA", "A ", " B", "Ba", ...).
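To make the arithmetic concrete, here is a quick check of Dice's coefficient (2 * |shared grams| / (|grams1| + |grams2|)) on two of the terms using character bigrams; note that the full script below uses 4-grams, which is stricter:

a = {"Basketball"[i:i+2] for i in range(len("Basketball") - 1)}   # 9 distinct bigrams
b = {"Baseball"[i:i+2] for i in range(len("Baseball") - 1)}       # 7 distinct bigrams
print(len(a & b))                            # 5 shared bigrams: Ba, as, ba, al, ll
print(2.0 * len(a & b) / (len(a) + len(b)))  # 2*5 / (9+7) = 0.625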
nltk appears to provide Dice as nltk.metrics.association.BigramAssocMeasures.dice() (a sketch of that route follows the sample output below), but it's simple enough to implement yourself in a way that allows tuning. Here's how to compare the strings at the character rather than the word level.
#!/usr/bin/env python3
import sys, operator

def tokenize(s, glen):
    # Collect the set of character grams of length glen found in s.
    grams = set()
    for i in range(len(s) - (glen - 1)):
        grams.add(s[i:i + glen])
    return grams

def dice_grams(g1, g2):
    # Dice's coefficient over two gram sets: 2*|A & B| / (|A| + |B|).
    return (2.0 * len(g1 & g2)) / (len(g1) + len(g2))

def dice(n, s1, s2):
    return dice_grams(tokenize(s1, n), tokenize(s2, n))

def main():
    GRAM_LEN = 4
    scores = {}
    # Score every unordered pair of command-line arguments.
    for i in range(1, len(sys.argv)):
        for j in range(i + 1, len(sys.argv)):
            s1 = sys.argv[i]
            s2 = sys.argv[j]
            score = dice(GRAM_LEN, s1, s2)
            scores[s1 + ":" + s2] = score
    # Print the pairs from least to most similar.
    for item in sorted(scores.items(), key=operator.itemgetter(1)):
        print(item)

if __name__ == "__main__":
    main()
When this program is run with your strings, the following similarity scores are produced:
./dice.py "NBA Basketball" "Basketball NBA" "Basketball" "Baseball"
('NBA Basketball:Baseball', 0.125)
('Basketball NBA:Baseball', 0.125)
('Basketball:Baseball', 0.16666666666666666)
('NBA Basketball:Basketball NBA', 0.6363636363636364)
('NBA Basketball:Basketball', 0.7777777777777778)
('Basketball NBA:Basketball', 0.7777777777777778)
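If you would rather reuse the NLTK measure mentioned above instead of the hand-rolled dice_grams(), something like the following should work. This is only a sketch: it assumes BigramAssocMeasures.dice() follows NLTK's usual bigram marginal convention of (n_ii, (n_ix, n_xi), n_xx), and it reuses tokenize() from the script above.

from nltk.metrics.association import BigramAssocMeasures

a = tokenize("Basketball", 4)
b = tokenize("Baseball", 4)
# n_ii = shared grams, (n_ix, n_xi) = each set's size; the last marginal
# does not affect Dice's coefficient, so any consistent count will do.
print(BigramAssocMeasures.dice(len(a & b), (len(a), len(b)), len(a | b)))
# 0.16666666666666666, matching the Basketball:Baseball score above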
At least for this example, the margin between the basketball scores and the baseball scores should be wide enough to cluster the terms into separate groups. Alternatively, you may be able to use the similarity scores directly in your code by applying a threshold.
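For the grouping step itself, here is a minimal sketch of greedy single-link clustering that reuses dice() from the script above; the 0.3 cutoff is an arbitrary assumption you would tune for your data:

def cluster(terms, n=4, threshold=0.3):
    # Put each term into the first existing group that contains
    # something sufficiently similar; otherwise start a new group.
    groups = []
    for term in terms:
        for group in groups:
            if any(dice(n, term, other) >= threshold for other in group):
                group.append(term)
                break
        else:
            groups.append([term])
    return groups

# cluster(["NBA Basketball", "Basketball NBA", "Basketball", "Baseball"])
# gives [['NBA Basketball', 'Basketball NBA', 'Basketball'], ['Baseball']]
# based on the scores printed above.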