python numpy pairwise edit-distance

Submitted by 不问归期 on 2019-12-07 05:46:31

Question


I have a NumPy array of strings, and I want to calculate the edit distance between each pair of elements using scipy.spatial.distance.pdist (http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.spatial.distance.pdist.html).

A sample of my array is as follows:

 >>> d[0:10]
 array(['TTTTT', 'ATTTT', 'CTTTT', 'GTTTT', 'TATTT', 'AATTT', 'CATTT',
   'GATTT', 'TCTTT', 'ACTTT'], 
  dtype='|S5')

However, pdist has no 'editdistance' option, so I want to supply a custom distance function. I tried the following and got this error:

 >>> import editdist
 >>> import scipy
 >>> import scipy.spatial
 >>> scipy.spatial.distance.pdist(d[0:10], lambda u,v: editdist.distance(u,v))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 1150, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 153, in _convert_to_double
    X = np.double(X)
ValueError: could not convert string to float: TTTTT

Answer 1:


If you really must use pdist, you first need to convert your strings to numeric format. If you know that all strings will be the same length, you can do this rather easily:

numeric_d = d.view(np.uint8).reshape((len(d),-1))

This simply views your array of strings as a long array of uint8 bytes, then reshapes it such that each original string is on a row by itself. In your example, this would look like:

In [18]: d.view(np.uint8).reshape((len(d),-1))
Out[18]:
array([[84, 84, 84, 84, 84],
       [65, 84, 84, 84, 84],
       [67, 84, 84, 84, 84],
       [71, 84, 84, 84, 84],
       [84, 65, 84, 84, 84],
       [65, 65, 84, 84, 84],
       [67, 65, 84, 84, 84],
       [71, 65, 84, 84, 84],
       [84, 67, 84, 84, 84],
       [65, 67, 84, 84, 84]], dtype=uint8)
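
One caveat with the view trick: it relies on each character occupying exactly one byte, as in the '|S5' array from the question. As a minimal sketch, assuming Python 3 (where NumPy stores plain str values in a unicode dtype with 4 bytes per character) and ASCII-only data, the array can be cast to a byte-string dtype first. The sample values come from the question; the variable names are illustrative only.

```python
import numpy as np

# On Python 3 a list of str becomes a unicode array (for example dtype '<U5'),
# which stores 4 bytes per character rather than 1.
d = np.array(['TTTTT', 'ATTTT', 'CTTTT'])

# Cast to fixed-width byte strings (ASCII-only data assumed): 1 byte per character.
d_bytes = d.astype('S')

# View the raw bytes and put each original string on its own row.
numeric_d = d_bytes.view(np.uint8).reshape((len(d_bytes), -1))

print(numeric_d.shape)         # (3, 5)
print(numeric_d[1].tobytes())  # b'ATTTT'
```

Without the cast, the same view on the unicode array would produce 20 columns per string instead of 5.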

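Then you can use pdist as you normally would. Just make sure that your editdist function expects arrays of integers rather than strings. You can convert the inputs back to byte strings by calling .tobytes() (named .tostring() in older NumPy releases, where it is now deprecated):

def editdist(x, y):
  # Depending on the SciPy version, pdist may pass the rows as float64
  # (see _convert_to_double in the traceback), so cast back to uint8 first.
  s1 = x.astype(np.uint8).tobytes()
  s2 = y.astype(np.uint8).tobytes()
  # ... rest of function as before ...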
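
Putting the pieces of this answer together, a minimal end-to-end sketch might look like the following. The `levenshtein` and `row_edit_distance` helpers are hypothetical, written in pure Python as a stand-in for the `editdist` module from the question, and the sample strings are a subset of the question's data.

```python
import numpy as np
from scipy.spatial.distance import pdist

def levenshtein(a, b):
    """Plain dynamic-programming edit distance (hypothetical stand-in helper)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]

def row_edit_distance(u, v):
    # Rows may arrive as uint8 or float64 depending on the SciPy version,
    # so cast back to uint8 before recovering the original bytes.
    s1 = np.asarray(u).astype(np.uint8).tobytes()
    s2 = np.asarray(v).astype(np.uint8).tobytes()
    return levenshtein(s1, s2)

# Fixed-width byte strings, one byte per character, as in the question.
d = np.array([b'TTTTT', b'ATTTT', b'TATTT', b'AATTT'], dtype='S5')
numeric_d = d.view(np.uint8).reshape((len(d), -1))

dists = pdist(numeric_d, row_edit_distance)
print(dists)  # [1. 1. 2. 2. 1. 1.]
```

The result is in pdist's condensed form; scipy.spatial.distance.squareform can expand it into a full symmetric matrix if that is more convenient.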



Answer 2:


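import numpy as np

def my_pdist(data, f):
    # Condensed pairwise distances, in the same order pdist uses.
    N = len(data)
    # Integer division: N*(N-1)/2 is a float on Python 3 and is rejected as a shape.
    matrix = np.empty(N * (N - 1) // 2)
    ind = 0
    for i in range(N):
        for j in range(i + 1, N):
            matrix[ind] = f(data[i], data[j])
            ind += 1
    return matrix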
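
Because this loop never converts the data to floats, it can be handed the strings directly. Below is a minimal usage sketch, assuming a pure-Python Levenshtein function (`levenshtein`, a hypothetical stand-in for `editdist.distance`) and a few of the question's sample strings. The pairwise function is repeated, with integer division, so that the block runs on its own.

```python
import numpy as np

def levenshtein(a, b):
    """Plain dynamic-programming edit distance (hypothetical stand-in helper)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def my_pdist(data, f):
    # Same loop as in the answer, using integer division for the output size.
    n = len(data)
    out = np.empty(n * (n - 1) // 2)
    ind = 0
    for i in range(n):
        for j in range(i + 1, n):
            out[ind] = f(data[i], data[j])
            ind += 1
    return out

d = np.array(['TTTTT', 'ATTTT', 'TATTT', 'AATTT'])
condensed = my_pdist(d, levenshtein)
print(condensed)  # [1. 1. 2. 2. 1. 1.]

# Optional: expand the condensed vector into a full symmetric matrix.
n = len(d)
upper = np.zeros((n, n))
upper[np.triu_indices(n, k=1)] = condensed  # same row-major pair order as the loop
square = upper + upper.T
```

The double Python loop costs O(N^2) calls to `f`, so for large arrays the edit-distance function itself is the part worth optimizing.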


Source: https://stackoverflow.com/questions/24089973/python-numpy-pairwise-edit-distance
