Compute edit distance for a DataFrame which has only one column and multiple rows in Python

Question

I have a dataframe which has one column and more than 2000 rows. How can I compute the edit distance between every pair of rows in that column?

My Dataframe looks like this:

  Name
  John
  Mrinmayee
  rituja
  ritz
  divya
  priyanka
  chetna
  chetan
  mansi
  mansvi
  mani
  aliya
  shelia
  Dilip
  Dilipa

I need to calculate the distance between each and every pair of rows. How can this be done?

I have written some code, but it does not work: it prints an endless list of distances. I guess I am going wrong in the for loop. Can somebody help, please?

   import pandas as pd
   import numpy as np
   import editdistance
   data_dist =  pd.read_csv('Data_TestDescription.csv')
   df = pd.DataFrame(data_dist)
   levdist =[]
   for index, row in df.iterrows():
        levdist = editdistance.eval(row,row)
        print levdist

Answer 1:

This is a neat trick I learned courtesy of Adirio. You can use itertools.product to generate every ordered pair of names, and then calculate the edit distance for each pair in a loop.

from itertools import product

dist = np.empty(df.shape[0]**2, dtype=int) 
for i, x in enumerate(product(df.Name, repeat=2)): 
    dist[i] = editdistance.eval(*x)

dist_df = pd.DataFrame(dist.reshape(-1, df.shape[0]))

dist_df

    0   1   2   3   4   5   6   7   8   9   10  11  12  13  14
0    0   8   6   4   5   7   5   5   5   6   4   5   6   5   6
1    8   0   7   7   7   6   8   8   7   8   7   7   8   8   8
2    6   7   0   3   4   5   5   6   6   6   6   5   5   5   4
3    4   7   3   0   4   6   5   5   5   6   4   4   6   4   5
4    5   7   4   4   0   6   5   5   5   6   5   3   5   4   4
5    7   6   5   6   6   0   6   6   6   7   6   5   7   7   6
6    5   8   5   5   5   6   0   2   6   6   5   5   3   6   5
7    5   8   6   5   5   6   2   0   6   6   5   5   4   6   6
8    5   7   6   5   5   6   6   6   0   1   1   5   5   5   6
9    6   8   6   6   6   7   6   6   1   0   2   5   6   6   6
10   4   7   6   4   5   6   5   5   1   2   0   4   5   4   5
11   5   7   5   4   3   5   5   5   5   5   4   0   4   4   3
12   6   8   5   6   5   7   3   4   5   6   5   4   0   4   4
13   5   8   5   4   4   7   6   6   5   6   4   4   4   0   1
14   6   8   4   5   4   6   5   6   6   6   5   3   4   1   0

np.empty allocates an uninitialised array of the requested size (its contents are arbitrary until assigned), which you then fill in through each call to editdistance.eval.
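
To make the indexing in that loop concrete, here is a minimal sketch that uses only the standard library. It assumes a hypothetical pure-Python helper, levenshtein, standing in for editdistance.eval, and a short hypothetical list, names, standing in for df.Name. The i-th pair yielded by product belongs to row i // n and column i % n, which is why reshaping the flat result to n columns gives the distance matrix.

```python
from itertools import product


def levenshtein(a, b):
    """Dynamic-programming edit distance (hypothetical stand-in for editdistance.eval)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


names = ["mansi", "mansvi", "mani"]  # hypothetical subset of df.Name
n = len(names)

# One flat list over all n * n ordered pairs, in the order product yields them.
flat = [levenshtein(x, y) for x, y in product(names, repeat=2)]

# Slice the flat list into n rows of n values (what reshape does for the array).
matrix = [flat[r * n:(r + 1) * n] for r in range(n)]
```

For these three names the rows agree with the corresponding entries (rows and columns 8 to 10) of the table above.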

Borrowing from senderle's cartesian_product, we can achieve some speed gains:

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[...,i] = a
    return arr.reshape(-1, la)

v = np.apply_along_axis(func1d=lambda x: editdistance.eval(*x), 
           arr=cartesian_product(df.Name, df.Name), axis=1).reshape(-1, df.shape[0])

dist_df = pd.DataFrame(v)

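Alternatively, you could define a function to compute edit distance and wrap it with np.vectorize (a convenience wrapper that still loops over the elements in Python, so it tidies the call rather than guaranteeing a speed-up):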

def f(x, y):
    return editdistance.eval(x, y)

v = np.vectorize(f)

arr = cartesian_product(df.Name, df.Name).T
arr = v(arr[0, :], arr[1, :])

dist_df = pd.DataFrame(arr.reshape(-1, df.shape[0]))
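
Every variant above evaluates all ordered pairs, which for 2000 rows means 4,000,000 calls. Edit distance is symmetric and is zero on the diagonal, so one possible saving (not part of the original answer) is to compute only the pairs above the diagonal with itertools.combinations and mirror each result. The following is a minimal sketch under stated assumptions: levenshtein is a hypothetical stand-in for editdistance.eval, pairwise_distances is a hypothetical helper name, and names stands in for df.Name.

```python
from itertools import combinations


def levenshtein(a, b):
    """Dynamic-programming edit distance (hypothetical stand-in for editdistance.eval)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pairwise_distances(names, dist=levenshtein):
    """Build a symmetric n x n distance matrix from n * (n - 1) / 2 calls to dist."""
    n = len(names)
    matrix = [[0] * n for _ in range(n)]  # diagonal stays 0
    for (i, x), (j, y) in combinations(enumerate(names), 2):
        d = dist(x, y)
        matrix[i][j] = d  # upper triangle
        matrix[j][i] = d  # mirrored lower triangle
    return matrix


names = ["Dilip", "Dilipa", "chetna", "chetan"]  # hypothetical subset of df.Name
matrix = pairwise_distances(names)
```

The resulting list of lists can be passed to pd.DataFrame in the same way as the reshaped arrays above. With 2000 names this makes 1,999,000 distance calls instead of 4,000,000.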

If you need a labelled index and columns, you can add them when constructing dist_df:

dist_df = pd.DataFrame(..., index=df.Name, columns=df.Name)

dist_df

Name       John  Mrinmayee  rituja  ritz  divya  priyanka  chetna  chetan  \
Name                                                                        
John          0          8       6     4      5         7       5       5   
Mrinmayee     8          0       7     7      7         6       8       8   
rituja        6          7       0     3      4         5       5       6   
ritz          4          7       3     0      4         6       5       5   
divya         5          7       4     4      0         6       5       5   
priyanka      7          6       5     6      6         0       6       6   
chetna        5          8       5     5      5         6       0       2   
chetan        5          8       6     5      5         6       2       0   
mansi         5          7       6     5      5         6       6       6   
mansvi        6          8       6     6      6         7       6       6   
mani          4          7       6     4      5         6       5       5   
aliya         5          7       5     4      3         5       5       5   
shelia        6          8       5     6      5         7       3       4   
Dilip         5          8       5     4      4         7       6       6   
Dilipa        6          8       4     5      4         6       5       6   

Name       mansi  mansvi  mani  aliya  shelia  Dilip  Dilipa  
Name                                                          
John           5       6     4      5       6      5       6  
Mrinmayee      7       8     7      7       8      8       8  
rituja         6       6     6      5       5      5       4  
ritz           5       6     4      4       6      4       5  
divya          5       6     5      3       5      4       4  
priyanka       6       7     6      5       7      7       6  
chetna         6       6     5      5       3      6       5  
chetan         6       6     5      5       4      6       6  
mansi          0       1     1      5       5      5       6  
mansvi         1       0     2      5       6      6       6  
mani           1       2     0      4       5      4       5  
aliya          5       5     4      0       4      4       3  
shelia         5       6     5      4       0      4       4  
Dilip          5       6     4      4       4      0       1  
Dilipa         6       6     5      3       4      1       0

Source: https://stackoverflow.com/questions/47156739/compute-edit-distance-for-a-dataframe-which-has-only-column-and-multiple-rows-in

Tags

python

pandas

dataframe

edit-distance