I have 2 lists of over a million names each, with slightly different naming conventions. The goal here is to match those records that are similar, with the logic of 95% confidence.
There are several levels of optimization possible here to turn this problem from O(n^2) into a lower time complexity.
Preprocessing: Sort your list in the first pass and build an output map for the strings, where the key of the map is the normalized string. Normalizations may include lowercasing, removing whitespace and special characters, and converting accented characters to their ASCII equivalents.
This would result in "Andrew H Smith", "andrew h. smith", and "ándréw h. smith" generating the same key "andrewhsmith", and would reduce your set of a million names to a smaller set of unique/similar grouped names.
You can use this utility method to normalize your string (it does not include the Unicode part, though):
import re

def process_str_for_similarity_cmp(input_str, normalized=False, ignore_list=None):
    """ Processes string for similarity comparisons: removes the substrings which are
    in ignore_list and, if normalized is True, cleans special characters and extra whitespace

    Args:
        input_str (str) : input string to be processed
        normalized (bool) : if True, method removes special characters and extra whitespace from string,
                            and converts to lowercase
        ignore_list (list) : the substrings (used as regex patterns) which need to be removed from the input string
    Returns:
        str : returns processed string
    """
    # each entry is applied as a case-insensitive regular expression
    for ignore_str in ignore_list or []:
        input_str = re.sub(ignore_str, "", input_str, flags=re.IGNORECASE)
    if normalized is True:
        input_str = input_str.strip().lower()
        # clean special chars and extra whitespace
        input_str = re.sub(r"\W", "", input_str).strip()
    return input_str
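The utility above leaves out the Unicode step. A minimal sketch of how it could be added, assuming that folding accented letters to their base ASCII letters is acceptable for your data (the names fold_to_ascii and make_key are made up for this example):

```python
import re
import unicodedata


def fold_to_ascii(text):
    """Replace accented characters with their base letters where possible."""
    # NFKD splits a character like "á" into "a" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def make_key(name):
    """Build a normalized key: accent-folded, lowercase, ASCII letters and digits only."""
    folded = fold_to_ascii(name).lower()
    return re.sub(r"[^a-z0-9]", "", folded)


names = ["Andrew H Smith", "andrew h. smith", "ándréw h. smith"]
keys = [make_key(n) for n in names]  # all three names give the same key
```

Letters that have no decomposition (for example "ø" or "ß") are simply dropped by the final regex here, so you may want an explicit replacement table if they are common in your names.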
Now similar strings will already lie in the same bucket if their normalized key is the same.
For further comparison, you will need to compare only the keys, not the names, e.g. andrewhsmith and andrewhsmeeth, since this kind of name similarity needs fuzzy string matching on top of the normalized comparison done above.
Bucketing: Do you really need to compare a 5-character key with a 9-character key to see if they are a 95% match? No, you do not. So you can create length buckets for matching your strings, e.g. 5-character names will be matched with 4-6 character names, 6-character names with 5-7 character names, etc. A limit of n-1 to n+1 characters for an n-character key is a reasonably good bucket for most practical matching.
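A rough sketch of the length bucketing, assuming the keys are already normalized (bucket_by_length, candidates_for and the tolerance parameter are names invented for this example):

```python
from collections import defaultdict


def bucket_by_length(keys):
    """Group keys by their length."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[len(key)].append(key)
    return buckets


def candidates_for(key, buckets, tolerance=1):
    """Return the keys whose length is within `tolerance` of len(key)."""
    found = []
    for length in range(len(key) - tolerance, len(key) + tolerance + 1):
        # .get() avoids creating empty buckets in the defaultdict
        found.extend(buckets.get(length, []))
    return found


buckets = bucket_by_length(["smith", "smyth", "smithe", "andrewhsmith", "li"])
nearby = candidates_for("smith", buckets)
```

The length limit is safe for a score like difflib's ratio, which is 2*M/(len_a + len_b) with M matching characters: a 5-character and a 9-character key can score at most 10/14, about 0.71, so such a pair can never reach 95%.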
Beginning match: Most variations of names will have the same first character in the normalized format (e.g. Andrew H Smith, ándréw h. smith, and Andrew H. Smeeth generate the keys andrewhsmith, andrewhsmith, and andrewhsmeeth respectively). They will usually not differ in the first character, so you can run matching for keys starting with a against other keys which start with a and fall within the length buckets. This would greatly reduce your matching time. There is no need to match the key andrewhsmith to bndrewhsmith, as such a name variation in the first letter will rarely exist.
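Putting the first-character rule together with the length buckets, one possible sketch for two lists of keys looks like this. It assumes difflib's ratio as the score, and build_index and match_keys are hypothetical helper names:

```python
import difflib
from collections import defaultdict


def build_index(keys):
    """Index keys by (first character, length) so a lookup only touches one block."""
    index = defaultdict(set)
    for key in keys:
        if key:  # empty keys have no first character
            index[(key[0], len(key))].add(key)
    return index


def match_keys(left_keys, right_keys, threshold=0.95, tolerance=1):
    """Yield (left, right, score) for pairs in the same block scoring >= threshold."""
    index = build_index(right_keys)
    for left in set(left_keys):
        if not left:
            continue
        # Only look at keys with the same first character and a nearby length.
        for length in range(len(left) - tolerance, len(left) + tolerance + 1):
            for right in index.get((left[0], length), ()):
                score = difflib.SequenceMatcher(None, left, right).ratio()
                if score >= threshold:
                    yield left, right, score


list_a = ["andrewhsmith", "johndoe"]
list_b = ["andrewhsmeeth", "bndrewhsmith", "johndoe"]
# A looser threshold than 95% is used here only so the demo shows a fuzzy hit.
matches = sorted(match_keys(list_a, list_b, threshold=0.85))
```

Here andrewhsmith is never compared with bndrewhsmith, because the two keys land in different blocks.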
Then you can use something along the lines of this method (or the FuzzyWuzzy module) to find the string similarity percentage. You may exclude one of jaro_winkler or difflib to optimize your speed and result quality:
import difflib
import jellyfish

def find_string_similarity(first_str, second_str, normalized=False, ignore_list=[]):
    """ Calculates matching ratio between two strings

    Args:
        first_str (str) : First String
        second_str (str) : Second String
        normalized (bool) : if True, method removes special characters and extra whitespace
                            from strings then calculates matching ratio
        ignore_list (list) : list of substrings which have to be substituted with "" in string
    Returns:
        float : a matching ratio between 1.0 (most matching) and 0.0 (not matching),
                using difflib's SequenceMatcher and jellyfish's jaro_winkler algorithms with
                equal weightage to each
    Examples:
        >>> find_string_similarity("hello world","Hello,World!",normalized=True)
        1.0
        >>> find_string_similarity("entrepreneurship","entreprenaurship")
        0.95625
        >>> find_string_similarity("Taj-Mahal","The Taj Mahal",normalized= True,ignore_list=["the","of"])
        1.0
    """
    first_str = process_str_for_similarity_cmp(first_str, normalized=normalized, ignore_list=ignore_list)
    second_str = process_str_for_similarity_cmp(second_str, normalized=normalized, ignore_list=ignore_list)
    difflib_ratio = difflib.SequenceMatcher(None, first_str, second_str).ratio()
    # older jellyfish releases call this jaro_winkler; on Python 2 wrap both arguments in unicode()
    jaro_winkler_ratio = jellyfish.jaro_winkler_similarity(first_str, second_str)
    match_ratio = (difflib_ratio + jaro_winkler_ratio) / 2.0
    return match_ratio
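If you drop jaro_winkler and keep only difflib, a cheap pre-check can skip most of the expensive comparisons. This is only a sketch: is_similar and its threshold parameter are made up for this example, and it relies on SequenceMatcher's real_quick_ratio() and quick_ratio() being upper bounds on ratio():

```python
import difflib


def is_similar(first_str, second_str, threshold=0.95):
    """Return True if difflib's ratio for the two strings is at least `threshold`."""
    matcher = difflib.SequenceMatcher(None, first_str, second_str)
    # Both quick variants are cheap upper bounds on ratio(), so a pair that
    # fails either of them cannot pass the full comparison.
    if matcher.real_quick_ratio() < threshold:
        return False
    if matcher.quick_ratio() < threshold:
        return False
    return matcher.ratio() >= threshold
```

real_quick_ratio() only looks at the two lengths, so it rejects pairs like a 3-character and a 9-character key without reading a single character.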
You have to index, or normalize, the strings to avoid the O(n^2) run. Basically, you have to map each string to a normal form, and build a reverse dictionary with all the words linked to their corresponding normal forms.
Let's consider that the normal forms of 'world' and 'word' are the same. So, first build a reverse dictionary of Normalized -> [word1, word2, word3], i.e. go from:
"world" <-> Normalized('world')
"word" <-> Normalized('word')
to:
Normalized('world') -> ["world", "word"]
There you go: all the items (lists) in the Normalized dict which have more than one value are the matched words.
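A small sketch of that reverse dictionary. The normalizer is passed in as a function, and toy_normalize is only a stand-in (lowercase, letters and digits only), not a recommendation for real data:

```python
from collections import defaultdict


def build_reverse_index(words, normalize):
    """Map each normal form to the list of original words that produce it."""
    reverse = defaultdict(list)
    for word in words:
        reverse[normalize(word)].append(word)
    return reverse


def matched_groups(reverse):
    """Keep only the normal forms shared by more than one word."""
    return {form: group for form, group in reverse.items() if len(group) > 1}


def toy_normalize(word):
    """Stand-in normal form: lowercase, letters and digits only."""
    return "".join(ch for ch in word.lower() if ch.isalnum())


reverse = build_reverse_index(["World", "world!", "word"], toy_normalize)
groups = matched_groups(reverse)
```

Building the dictionary is a single pass over the words, so the cost grows linearly with the list size plus the cost of the normalizer itself.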
The normalization algorithm depends on data i.e. the words. Consider one of the many:
Specific to fuzzywuzzy, note that currently process.extractOne defaults to WRatio, which is by far the slowest of their algorithms, and processor defaults to utils.full_process. If you pass in, say, fuzz.QRatio as your scorer it will go much quicker, but it is not as powerful, depending on what you're trying to match. It may be just fine for names though. I personally have good luck with token_set_ratio, which is at least somewhat quicker than WRatio. You can also run utils.full_process() on all your choices beforehand and then run it with fuzz.ratio as your scorer and processor=None to skip the processing step (see below). If you're just using the basic ratio function, fuzzywuzzy is probably overkill though. Fwiw I have a JavaScript port (fuzzball.js) where you can pre-calculate the token sets too and use those instead of recalculating each time.
This doesn't cut down the sheer number of comparisons, but it helps. (A BK-tree for this, possibly? I've been looking into the same situation myself.)
Also be sure to have python-Levenshtein installed so you use the faster calculation.
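On the BK-tree idea mentioned above: the sketch below is untested on data of this size and uses a pure-Python edit distance for the sake of being self-contained. The BKTree class and its methods are invented for this example, and an edit-distance limit is not the same thing as a 95% ratio, so the limit would have to be chosen per key length:

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


class BKTree:
    """Tree where each child edge is labeled with its distance to the parent word."""

    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None  # a node is a tuple: (word, {distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already in the tree
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_distance):
        """Return sorted (distance, word) pairs within max_distance of `word`."""
        results = []
        stack = [self.root] if self.root is not None else []
        while stack:
            candidate, children = stack.pop()
            d = self.distance(word, candidate)
            if d <= max_distance:
                results.append((d, candidate))
            # Triangle inequality: matches can only sit under edges whose
            # label lies in [d - max_distance, d + max_distance].
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)


tree = BKTree()
for key in ["andrewhsmith", "andrewhsmeeth", "bndrewhsmith", "johndoe"]:
    tree.add(key)
close = tree.search("andrewhsmith", max_distance=1)
```

Each search visits only the subtrees that can still contain a match, which is where the reduction in comparisons would come from.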
**The behavior below may change, open issues under discussion etc.**
fuzz.ratio doesn't run full_process, and the token_set and token_sort functions accept a full_process=False parameter. If you don't set processor=None, the extract function will try to run full_process anyway. You can use functools.partial to pass in, say, fuzz.token_set_ratio with full_process=False as your scorer, and run utils.full_process on your choices beforehand.