Question
I'm using fuzzywuzzy to find near matches in a CSV of company names. I'm comparing manually matched strings with the unmatched strings in the hope of finding some useful proximity matches; however, I'm getting a "string or buffer" error from within fuzzywuzzy. My code is:
from fuzzywuzzy import process
from pandas import read_csv
if __name__ == '__main__':
    df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")
    df_false = df[df['match_manual'].isnull()]
    df_true = df[df['match_manual'].notnull()]
    sss_false = df_false['sss'].values.tolist()
    sss_true = df_true['sss'].values.tolist()
    for sssf in sss_false:
        mmm = process.extractOne(sssf, sss_true) # find best choice
        print sssf + str(tuple(mmm))
This creates the following error:
Traceback (most recent call last):
  File "fuzzywuzzy_usm2_csv_test.py", line 21, in <module>
    mmm = process.extractOne(sssf, sss_true) # find best choice
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 123, in extractOne
    best_list = extract(query, choices, processor, scorer, limit=1)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/process.py", line 84, in extract
    processed = processor(choice)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/utils.py", line 63, in full_process
    string_out = StringProcessor.replace_non_letters_non_numbers_with_whitespace(s)
  File "/usr/local/lib/python2.7/site-packages/fuzzywuzzy/string_processing.py", line 25, in replace_non_letters_non_numbers_with_whitespace
    return cls.regex.sub(u" ", a_string)
TypeError: expected string or buffer
This is something to do with importing into pandas with the encoding specified, which I added to prevent UnicodeDecodeErrors but which had the knock-on effect of causing this error. I've tried to force the object to a string using str(sssf), but that doesn't work.
So, I've isolated a line that is causing the error, here: #N/A,,,,,, (line 29 in the data pasted below). I assumed it was the # that was causing the error, but strangely it's not; it's the A character that is causing the problem, because the file works when it is removed. What is strange to me is that the string two rows below is N/A, which parses fine; however, row 29 won't parse when I delete the # symbol, even though the field then appears identical to the field below.
sss,sid,match_manual,notes,match_date,source,match_by
N20 KIDS,1095543_cha,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 FESTIVAL,08190588_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N21 LTD,,,,,,
N21 LTD.,04615294_com,true,,2014-12-02,,OpenCorps
N2 CHECK,08105000_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 CHECK LIMITED,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LIMITED,08184223_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 3)
N 2 CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 CHECK LTD,06139690_com,true,,2014-12-02,,OpenCorps
N2CHECK LTD,05729595_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2E & BACK LTD,05218805_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N2 GROUP LLC,04627044_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 GROUP LTD,04475764_com,true,,2014-05-05,data taken from u_supplier_match,20140429_fuzzy_match.ktr (stream 2)
N2R PRODUCTIONS,SC266951_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N2 VISUAL COMMUNICATIONS LIMITED,,,,,,
N2 VISUAL COMMUNICATIONS LTD,03144224_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N2WEB,07636689_com,,,2014-10-12,,20140429_fuzzy_match.ktr (stream 3)
N3 DISPLAY GRAPHICS LTD,04008480_com,true,,2014-12-02,data taken from u_supplier_match,OpenCorps
N3O LIMITED,06561158_com,true,,2014-12-02,,OpenCorps
N3O LTD,,,,,,
N400138,,,,,,
N400360,,,,,,
N4K LTD,07054740_com,,,2014-05-05,,20140429_fuzzy_match.ktr (stream 2)
N51 LTD,,,,,,
N68 LTD,,,,,,
N8 LTD,,,,,,
N9 DESIGN,07342091_com,true,,2015-02-07,openrefine/opencorporates,IM
#N/A,,,,,,
N A,,,,,,
N/A,red_general_xtr,true,Matches done manually,2015-04-16,manual matching,IM
(N) A & A BUILDERS LTD,,,,,,
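For reference, a quick way to see which rows pandas has not kept as text (a sketch against the same usm_clean.csv) is:
from pandas import read_csv
df = read_csv("usm_clean.csv", encoding="ISO-8859-1")
# rows whose 'sss' field was not parsed as a string
print(df[df['sss'].isnull()])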
Answer 1:
By default, pandas.read_csv parses a set of strings, including 'N/A' and '#N/A', as Not a Number (NaN). In your case, that means you end up with nan values rather than strings. In your sample data set, this happens in two places:
- the '#N/A' row you highlight in the question ends up as sss_false[-3] == nan;
- the 'N/A' row near the bottom (the last row with a manual match) ends up as sss_true[-1] == nan.
This also explains why the # is not the culprit: both '#N/A' and 'N/A' are on pandas' default list of NA markers, while a string like 'N A' is not.
Option 1
If you want to parse strings like 'N/A' and '#N/A' as literal strings instead of nan, the way to do this is to replace
df = read_csv("usm_clean.csv", encoding = "ISO-8859-1")
with
df = read_csv("usm_clean.csv", encoding = "ISO-8859-1", keep_default_na=False, na_values='')
The meaning of these extra options is described in the pandas docs.
na_values : list-like or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values
keep_default_na : bool, default True
If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to
So, the above modification tells pandas to recognize only the empty string as NA and to discard the default NA markers such as 'N/A' and '#N/A'.
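As a quick sanity check (a sketch, assuming the same usm_clean.csv as in the question), re-reading the file with these options leaves every value in the 'sss' column as text rather than a float nan:
from pandas import read_csv
df = read_csv("usm_clean.csv", encoding="ISO-8859-1",
              keep_default_na=False, na_values='')
# every 'sss' entry is now a string; '#N/A' and 'N/A' survive literally
print(df['sss'].apply(type).value_counts())
print(df[df['sss'].isin(['#N/A', 'N/A'])]['sss'].tolist())
Passing na_values='' still lets empty fields (such as the blank match_manual cells) parse as NaN, which is what the df['match_manual'].isnull()/notnull() split in the question relies on.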
Option 2
If you want to discard lines with 'N/A' or '#N/A' in the first column, you need to remove the nan members from sss_true and sss_false. One way to do this is:
sss_true = [x for x in sss_true if isinstance(x, basestring)]    # keep only real strings, dropping nan
sss_false = [x for x in sss_false if isinstance(x, basestring)]
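Alternatively (a sketch along the same lines), the nan entries can be dropped on the pandas side before converting to lists:
sss_true = df_true['sss'].dropna().tolist()
sss_false = df_false['sss'].dropna().tolist()
Series.dropna() removes the NaN entries, so only real strings reach process.extractOne.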
Answer 2:
Your sss_true variable contains:
[
u'N21 LTD.',
u'N2 CHECK LIMITED',
u'N2 CHECK LTD',
u'N2 GROUP LTD',
u'N2 VISUAL COMMUNICATIONS LTD',
u'N3 DISPLAY GRAPHICS LTD',
u'N3O LIMITED',
u'N9 DESIGN',
nan # <---- note this
]
Once you get rid of that not-a-number value, everything starts to work as expected.
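A quick way to confirm this (a sketch, assuming the lists built in the question) is to look for entries that are not real strings before calling fuzzywuzzy:
import pandas as pd
# any value that pandas parsed as NaN shows up here
print([x for x in sss_true if pd.isnull(x)])   # -> [nan]
print([x for x in sss_false if pd.isnull(x)])  # -> [nan]
Filtering those out, or re-reading the CSV as in Answer 1, lets process.extractOne run without the TypeError.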
Source: https://stackoverflow.com/questions/30631879/python-fuzzywuzzy-error-string-or-buffer-expect