Python: Jaccard Distance using word intersection but not character intersection

问题

I didn't realize the that Python set function actually separating string into individual characters. I wrote python function for Jaccard and used python intersection method. I passed two sets into this method and before passing the two sets into my jaccard function I use the set function on the setring.

example: assume I have string NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg i would call set(NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg) which will separate string into characters. So when I send it to jaccard function intersection actually look character intersection instead of word to word intersection. How can I do word to word intersection.

#implementing jaccard
def jaccard(a, b):
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))

if I don't call set function on my string NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg I get the following error:

    c = a.intersection(b)
AttributeError: 'str' object has no attribute 'intersection'

Instead of character to character intersection I want to do word to word intersection and get the jaccard similarity.

回答1:

Try splitting your string into words first:

word_set = set(your_string.split())

Example:

>>> word_set = set("NEW Fujifilm 16MP 5x".split())
>>> character_set = set("NEW Fujifilm 16MP 5x")
>>> word_set
set(['NEW', '16MP', '5x', 'Fujifilm'])
>>> character_set
set([' ', 'f', 'E', 'F', 'i', 'M', 'j', 'm', 'l', 'N', '1', 'P', 'u', 'x', 'W', '6', '5'])

回答2:

My function to calculate Jaccard distance:

def DistJaccard(str1, str2):
    str1 = set(str1.split())
    str2 = set(str2.split())
    return float(len(str1 & str2)) / len(str1 | str2)

>>> DistJaccard("hola amigo", "chao amigo")
0.333333333333

回答3:

This property is not unique to sets:

>>> list('NEW Fujifilm')
['N', 'E', 'W', ' ', 'F', 'u', 'j', 'i', 'f', 'i', 'l', 'm']

What is going on here is that the string is being treated as an iterable sequence and being processed character by character.

Same thing you are seeing with set:

>>> set('string')
set(['g', 'i', 'n', 's', 'r', 't'])

To fix, use .add() on an existing set, since .add() does not use an interable:

>>> se=set()
>>> se.add('NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg')
>>> se
set(['NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg'])

Or, use split(), a tuple, a list, or some alternate iterable so the string is not treated as an iterable:

>>> set('something'.split())
set(['something'])
>>> set(('something',))
set(['something'])
>>> set(['something'])
set(['something'])

Add more elements based on your string on a word-by-word basis:

>>> se=set(('Something',)) | set('NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg'.split())

Or, if you need a comprehension for some logic as you add to the set:

>>> se={w for w in 'NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg'.split() 
         if len(w)>3}
>>> se
set(['Shoot', 'CAMERA', 'Point', 'screen.jpg', 'Zoom', 'Fujifilm', '16MP', 'Optical'])

And it work how you expect now:

>>> 'Zoom' in se
True
>>> s1=set('NEW Fujifilm 16MP 5x Optical Zoom Point and Shoot CAMERA 2 7 screen.jpg'.split())
>>> s2=set('Fujifilm Optical Zoom CAMERA NONE'.split())
>>> s1.intersection(s2)
set(['Optical', 'CAMERA', 'Zoom', 'Fujifilm'])

回答4:

This is the one I wrote based on set function -

def jaccard(a,b):
    a=a.split()
    b=a.split()
    union = list(set(a+b))
    intersection = list(set(a) - (set(a)-set(b)))
    print "Union - %s" % union
    print "Intersection - %s" % intersection
    jaccard_coeff = float(len(intersection))/len(union)
    print "Jaccard Coefficient is = %f " % jaccard_coeff

来源：https://stackoverflow.com/questions/11911252/python-jaccard-distance-using-word-intersection-but-not-character-intersection

标签

python

set

intersection