How do I fuzzy match items in a column of an array in python?

Submitted by 佐手、 on 2019-12-02 07:58:31

The Levenshtein edit distance is one of the most common measures for fuzzy matching of strings. It is available in the python-Levenshtein package. Another popular measure is the Jaro-Winkler score, also available in the same package.
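To make the metric concrete, here is a minimal pure-Python sketch of the edit distance using the usual dynamic-programming recurrence. The name edit_distance is made up for this illustration and this is not the package's implementation; in practice lv.distance is the faster choice.

```python
def edit_distance(a, b):
    """Number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the prefix of a processed
    # so far and b[:j]; it starts as the distance from '' to b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance('str', 'stum'))    # 2
print(edit_distance('str', 'string'))  # 3
```

The comparison is case-sensitive, so 'S' and 's' count as different characters.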

Assuming a simple numpy array:

import numpy as np
import Levenshtein as lv

ar = np.array([
      'string'
    , 'stum'
    , 'Such'
    , 'Say'
    , 'nay'
    , 'powder'
    , 'hiden'
    , 'parrot'
    , 'ming'
    ])

We define helpers that tell us, for every string in the array, whether its Levenshtein distance or Jaro-Winkler score against a given string is below a threshold.

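def levenshtein(dist, string):
    # Boolean mask: True where the edit distance to `string` is below `dist`.
    return np.array([lv.distance(string, x) < dist for x in ar])

def jaro(dist, string):
    # Boolean mask: True where the Jaro-Winkler score is below `dist`.
    return np.array([lv.jaro_winkler(string, x) < dist for x in ar])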

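Now, note that the Levenshtein distance is an integer counting character edits, whilst the Jaro-Winkler score is a floating-point value between 0 and 1. Be aware that lv.jaro_winkler returns a similarity (1 means identical strings), so the x < dist test in jaro selects the strings that are less similar than the threshold. Let's test this using np.where: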

print(ar[np.where(levenshtein(3, 'str'))])
print(ar[np.where(levenshtein(5, 'str'))])
print(ar[np.where(jaro(0.00000001, 'str'))])
print(ar[np.where(jaro(0.9, 'str'))])

And we get:

['stum']
['string' 'stum' 'Such' 'Say' 'nay' 'ming']
['Such' 'Say' 'nay' 'powder' 'hiden' 'ming']
['string' 'stum' 'Such' 'Say' 'nay' 'powder' 'hiden' 'parrot' 'ming']
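To see why the third result contains exactly the strings that share no matching characters with 'str', here is a rough pure-Python sketch of the textbook Jaro and Jaro-Winkler definitions. The function names and the prefix_weight parameter are made up for this illustration; this is not the package's code, and lv.jaro_winkler may differ in edge cases.

```python
def jaro_similarity(s1, s2):
    """Textbook Jaro similarity: 0.0 (nothing in common) to 1.0 (identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2  # number of transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler_similarity(s1, s2, prefix_weight=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    jaro = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * prefix_weight * (1 - jaro)


for word in ['string', 'stum', 'Such', 'parrot']:
    print(word, round(jaro_winkler_similarity('str', word), 4))
```

Under these definitions 'Such' scores 0 against 'str' because the comparison is case-sensitive, while 'string' scores about 0.88, still below the 0.9 threshold used above.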