Python fuzzywuzzy error string or buffer expect

大城市里の小女人 提交于 2019-12-10 17:47:31

问题


I'm using fuzzywuzzy to find near matches in a csv of company names. I'm comparing manually matched strings with the unmatched strings in the hope of finding some useful proximity matches, however, I'm getting a string or buffer error within fuzzywuzzy. My code is:

from fuzzywuzzy import process
from pandas import read_csv

if __name__ == '__main__':
    df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")
    df_false = df[df['match_manual'].isnull()]  
    df_true = df[df['match_manual'].notnull()]
    sss_false = df_false['sss'].values.tolist()
    sss_true = df_true['sss'].values.tolist()


    for sssf in sss_false:
        mmm = process.extractOne(sssf, sss_true) # find best choice
        print sssf + str(tuple(mmm))

This creates the following error:

Traceback (most recent call last):
File "fuzzywuzzy_usm2_csv_test.py", line 21, in <module>
mmm = process.extractOne(sssf, sss_true) # find best choice
File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 123, in extractOne
best_list = extract(query, choices, processor, scorer, limit=1)
File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 84, in extract
processed = processor(choice)
File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/utils.py", line 63, in full_process
string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/string_processing.py", line 25, in replace_non_letters_non_numbers_with_whitespace
return cls.regex.sub(u" ", a_string)
TypeError: expected string or buffer

This is something to do with the effects of importing into pandas with encoding specified, which I added to prevent UnicodeDecodeErrors but had the knock on effect of causing this error. I've tried to force the object using str(sssf) but that doesn't work.

So, I've isolated a line that is causing the error, here: #N/A,,,,,, (line 29 in code pasted below). I assumed it was the # that was causing the error, but strangely its not, its the A char that is causing the problem, because the file works when it is removed. What is strange to me is that the string two rows below is N/A which parses fine, however, row 29 won't parse when I delete the # symbol, even though the field appears identical to the field below.

sss,sid,match_manual,notes,match_date,source,match_by
N20 KIDS,1095543_cha,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 FESTIVAL,08190588_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 LTD,,,,,,
N21 LTD.,04615294_com,true,,2014-12-02,,OpenCorps
N2 CHECK,08105000_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 CHECK LIMITED,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LIMITED,08184223_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 3)
N 2 CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 CHECK LTD,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2E & BACK LTD,05218805_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 GROUP LLC,04627044_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 GROUP LTD,04475764_com,true,,2014-05-05,data taken from u_supplier_match,20140429_fuzzy_match.ktr (stream 2)
N2R PRODUCTIONS,SC266951_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 VISUAL COMMUNICATIONS LIMITED,,,,,,
N2 VISUAL COMMUNICATIONS LTD,03144224_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N2WEB,07636689_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N3 DISPLAY GRAPHICS LTD,04008480_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N3O LIMITED,06561158_com,true,,2014-12-02,,OpenCorps
N3O LTD,,,,,,
N400138,,,,,,
N400360,,,,,,
N4K LTD,07054740_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N51 LTD,,,,,,
N68 LTD,,,,,,
N8 LTD,,,,,,
N9 DESIGN,07342091_com,true,,2015-02-07,openrefine/opencorporates,IM
#N/A,,,,,,
N A,,,,,,
N/A,red_general_xtr,true,Matches done manually,2015-04-16,manual matching,IM
(N) A & A BUILDERS LTD,,,,,,

回答1:


By default, pandas.read_csv parses the string 'N/A' as Not a Number (NaN)

In your case, that means that you end up with a nan value rather than a string. In your sample data set, this happens in two places

The third line from the bottom (the line you highlight in the question) results in sss_false[-3] == nan

The last line results in sss_true[-1] == nan.


Option 1

If you want to parse the string 'N/A' as a string instead of nan, the way to do this is to replace

df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")

with

df = read_csv("usm_clean.csv", encoding = "ISO-8859-1", keep_default_na=False, na_values='')

The meaning of these extra options is described in the pandas docs.

na_values : list-like or dict, default None

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values

keep_default_na : bool, default True

If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to

So, the above modification tells pandas to recognize the empty string as NA and discard the default value 'N/A'


Option 2

If you want to discard lines with 'N/A' in the first column you need to remove the nan members from sss_true and sss_false. one way to do this is:

sss_true = [x for x in sss_true if type(x) != str]
sss_false = [x for x in sss_false if type(x) != str]



回答2:


Your sss_true variable contains:

[
    u'N21 LTD.',
    u'N2 CHECK LIMITED',
    u'N2 CHECK LTD',
    u'N2 GROUP LTD',
    u'N2 VISUAL COMMUNICATIONS LTD',
    u'N3 DISPLAY GRAPHICS LTD',
    u'N3O LIMITED',
    u'N9 DESIGN',
    nan              # <---- note this
]

Once you get rid of that not-a-number value everything starts to work as expected.



来源:https://stackoverflow.com/questions/30631879/python-fuzzywuzzy-error-string-or-buffer-expect

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!