I have a NumPy array of strings, and I want to calculate the edit distance between every pair of elements using scipy.spatial.distance.pdist (http://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.spatial.distance.pdist.html).
A sample of my array is as follows:
>>> d[0:10]
array(['TTTTT', 'ATTTT', 'CTTTT', 'GTTTT', 'TATTT', 'AATTT', 'CATTT',
       'GATTT', 'TCTTT', 'ACTTT'],
      dtype='|S5')
However, pdist has no 'editdistance' option, so I want to pass it a custom distance function. I tried the following and got this error:
>>> import editdist
>>> import scipy
>>> import scipy.spatial
>>> scipy.spatial.distance.pdist(d[0:10], lambda u,v: editdist.distance(u,v))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 1150, in pdist
    [X] = _copy_arrays_if_base_present([_convert_to_double(X)])
  File "/usr/local/epd-7.3.2/lib/python2.7/site-packages/scipy/spatial/distance.py", line 153, in _convert_to_double
    X = np.double(X)
ValueError: could not convert string to float: TTTTT
If you really must use pdist, you first need to convert your strings to numeric format. If you know that all strings will be the same length, you can do this rather easily:
numeric_d = d.view(np.uint8).reshape((len(d),-1))
This simply views your array of strings as a long array of uint8 bytes, then reshapes it such that each original string is on a row by itself. In your example, this would look like:
In [18]: d.view(np.uint8).reshape((len(d),-1))
Out[18]:
array([[84, 84, 84, 84, 84],
       [65, 84, 84, 84, 84],
       [67, 84, 84, 84, 84],
       [71, 84, 84, 84, 84],
       [84, 65, 84, 84, 84],
       [65, 65, 84, 84, 84],
       [67, 65, 84, 84, 84],
       [71, 65, 84, 84, 84],
       [84, 67, 84, 84, 84],
       [65, 67, 84, 84, 84]], dtype=uint8)
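One caveat that may be worth a quick sketch: the one-byte-per-character layout relies on a byte-string dtype such as '|S5'. On Python 3, NumPy stores plain str values as '<U5' (four bytes per character), so viewing that as uint8 would give 20 columns per string instead of 5. The snippet below is only an illustration with made-up variable names; it casts to 'S5' first and then checks that the view can be turned back into the original strings.

```python
import numpy as np

# On Python 3 this array gets dtype '<U5' (4 bytes per character).
d = np.array(['TTTTT', 'ATTTT', 'CTTTT'])

# Cast to fixed-width byte strings so each character is exactly one byte.
d_bytes = d.astype('S5')

# One row per string, one uint8 column per character.
numeric_d = d_bytes.view(np.uint8).reshape((len(d_bytes), -1))

# Round trip: reinterpret each 5-byte row as a single 'S5' item again.
restored = numeric_d.view('S5').ravel()
```

This only works when every string has the same length (here 5), since the reshape needs a rectangular array.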
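Then, you can use pdist as you normally would. Just make sure that your editdist function is expecting arrays of integers, rather than strings. You could quickly convert your new inputs by calling .tobytes() (the current name for the deprecated .tostring()). Depending on the SciPy version, pdist may hand the rows over as floats, so cast them back to uint8 first:
def editdist(x, y):
    # The cast guards against pdist having converted the rows to double.
    s1 = x.astype(np.uint8).tobytes()
    s2 = y.astype(np.uint8).tobytes()
    # ... rest of function as before ...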
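To make the whole route concrete, here is a minimal end-to-end sketch under a few assumptions. The levenshtein function is a plain dynamic-programming stand-in for the third-party editdist.distance, and row_edit_distance is a hypothetical wrapper name; neither comes from the original post. The wrapper casts each row to uint8 before turning it into bytes, so it should behave the same whether or not pdist converted the rows to floats.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform


def levenshtein(a, b):
    """Dynamic-programming edit distance between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def row_edit_distance(x, y):
    # Rows may arrive as floats; cast back to uint8 before making bytes.
    s1 = np.asarray(x).astype(np.uint8).tobytes()
    s2 = np.asarray(y).astype(np.uint8).tobytes()
    return levenshtein(s1, s2)


d = np.array(['TTTTT', 'ATTTT', 'CTTTT', 'TATTT', 'AATTT'], dtype='S5')
numeric_d = d.view(np.uint8).reshape((len(d), -1))

# Condensed vector of length N*(N-1)/2, ordered (0,1), (0,2), ..., (3,4).
condensed = pdist(numeric_d, row_edit_distance)

# Full symmetric N x N matrix, if that is easier to work with.
square = squareform(condensed)
```

A Python callable is invoked once per pair, so this will be slow for large arrays; the numeric conversion only makes pdist accept the input, it does not speed anything up.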
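Alternatively, you can skip pdist and fill the condensed distance vector yourself with a plain double loop, which works on the strings directly:
import numpy as np

def my_pdist(data, f):
    N = len(data)
    # Integer division: N*(N-1)/2 would be a float on Python 3.
    matrix = np.empty(N * (N - 1) // 2)
    ind = 0
    for i in range(N):
        for j in range(i + 1, N):
            # Same pair order as pdist: (0,1), (0,2), ..., (N-2,N-1)
            matrix[ind] = f(data[i], data[j])
            ind += 1
    return matrix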
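As a usage sketch for the loop-based version: the result is a condensed vector, and one way to expand it into a full symmetric matrix is with np.triu_indices, which enumerates the upper triangle in the same row-major order as the double loop. The mismatches function below is only a placeholder metric (it counts differing positions) so that the example runs without the editdist module; in practice you would pass editdist.distance or any other two-argument function.

```python
import numpy as np


def my_pdist(data, f):
    n = len(data)
    out = np.empty(n * (n - 1) // 2)
    ind = 0
    for i in range(n):
        for j in range(i + 1, n):
            out[ind] = f(data[i], data[j])
            ind += 1
    return out


def mismatches(a, b):
    # Placeholder metric (an assumption, not a real edit distance):
    # number of positions at which the two strings differ.
    return sum(ca != cb for ca, cb in zip(a, b))


words = ['TTTTT', 'ATTTT', 'TATTT', 'AATTT']
condensed = my_pdist(words, mismatches)

# Expand the condensed vector into a symmetric matrix with a zero diagonal.
n = len(words)
square = np.zeros((n, n))
rows, cols = np.triu_indices(n, k=1)
square[rows, cols] = condensed
square[cols, rows] = condensed
```

Because my_pdist never converts its input, it accepts a plain list of strings or a NumPy string array equally well.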
Source: https://stackoverflow.com/questions/24089973/python-numpy-pairwise-edit-distance