Question
I am very new to Python programming. I have a CSV file with two columns of string values, and I want to compute the similarity ratio between the strings in the two columns for each row. Then I want to write those ratios to another file.
The csv may look like this:
Column 1|Column 2
tomato|tomatoe
potato|potatao
apple|appel
I want the output file to show for each row, how similar the string in Column 1 is to Column 2. I am using difflib to output the ratio score.
This is the code I have so far:
import csv
import difflib

f = open('test.csv')
csf_f = csv.reader(f)
row_a = []
row_b = []
for row in csf_f:
    row_a.append(row[0])
    row_b.append(row[1])
a = row_a
b = row_b

def similar(a, b):
    return difflib.SequenceMatcher(a, b).ratio()

match_ratio = similar(a, b)
match_list = []
for row in match_ratio:
    match_list.append(row)

with open("output.csv", "wb") as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(match_list)
f.close()
I get the error:
Traceback (most recent call last):
  File "comparison.py", line 24, in <module>
    for row in match_ratio:
TypeError: 'float' object is not iterable
I suspect I am not reading the columns into lists correctly, or not passing them to SequenceMatcher correctly.
Answer 1:
Here is another way to do this, using pandas.
Suppose your CSV data looks like this:
Column 1,Column 2
tomato,tomatoe
potato,potatao
apple,appel
Code
import pandas as pd
import difflib as diff

# Read the CSV
df = pd.read_csv('datac.csv')
# Create a new column 'diff' holding the similarity ratio for each row
df['diff'] = df.apply(lambda x: diff.SequenceMatcher(None, x.iloc[0].strip(), x.iloc[1].strip()).ratio(), axis=1)
# Save the DataFrame to CSV; other formats such as Excel or HTML also work
df.to_csv('outdata.csv', index=False)
Result
Column 1,Column 2,diff
tomato,tomatoe,0.923076923077
potato,potatao,0.923076923077
apple,appel,0.8
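For anyone wondering where these numbers come from: the difflib documentation defines ratio() as 2.0*M/T, where M is the number of matched elements and T is the combined length of both sequences. Below is a minimal sketch that recomputes the three scores from the matching blocks. The helper name ratio_by_hand is made up for illustration and is not part of difflib.

```python
from difflib import SequenceMatcher

def ratio_by_hand(a, b):
    """Recompute SequenceMatcher.ratio() as 2*M/T (hypothetical helper)."""
    matcher = SequenceMatcher(None, a, b)
    # M: total size of all matching blocks (the final dummy block has size 0)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    # T: combined length of both sequences
    total = len(a) + len(b)
    return 2.0 * matched / total if total else 1.0

pairs = [("tomato", "tomatoe"), ("potato", "potatao"), ("apple", "appel")]
scores = [ratio_by_hand(a, b) for a, b in pairs]
for (a, b), score in zip(pairs, scores):
    print(a, b, round(score, 4))
```

For tomato and tomatoe, six characters match out of thirteen in total, so the score is 12/13.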
Answer 2:
The for loop you're setting up expects an iterable, such as a list, where you have match_ratio, and judging by the error you're getting, that's not what you have. It looks like you're missing the first argument to difflib.SequenceMatcher, which should probably be None. See the SequenceMatcher documentation here: https://docs.python.org/3/library/difflib.html
Without that first argument, your two lists are bound to isjunk and a, and b keeps its default of an empty string, so I think ratio is returning 0.0. Even if you correct your SequenceMatcher call, you'll still be trying to iterate over the single float value that ratio returns. I think you need to call SequenceMatcher inside the loop, once for each pair of values you're comparing.
So you'd wind up with a call more like this in your function: difflib.SequenceMatcher(None, a, b). Or, if you'd prefer, since these can be passed as keyword arguments, you could do something like this: difflib.SequenceMatcher(a=a, b=b).
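To make the argument order concrete, here is a small sketch that uses the two columns from the question as plain lists. It shows what the two-argument call returns, and that building one SequenceMatcher per pair of strings gives one ratio per row. The variable names col_a and col_b are only for illustration.

```python
import difflib

col_a = ["tomato", "potato", "apple"]
col_b = ["tomatoe", "potatao", "appel"]

# Positional call as in the question: col_a is bound to isjunk, col_b to a,
# and b keeps its default of '', so nothing can match.
wrong = difflib.SequenceMatcher(col_a, col_b).ratio()
print(wrong)  # 0.0

# One SequenceMatcher per pair of strings gives one ratio per row.
ratios = [difflib.SequenceMatcher(None, a, b).ratio()
          for a, b in zip(col_a, col_b)]
print([round(r, 3) for r in ratios])
```

The list of ratios is something the for loop in the question can iterate over, unlike the single float.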
Answer 3:
Your sample file looks like it contains markup tags. Assuming you are actually reading a CSV file, the error occurs because match_ratio is not an iterable datatype. It is a floating point number, the return value of your function similar(). In your code, the function call would have to sit inside a for loop so that it is called once for each a, b string pair. Here's a working example I created that does away with the explicit for loops and uses a list comprehension instead:
import csv
from difflib import SequenceMatcher

path_in = 'csv1.csv'
path_out = 'csv2.csv'

with open(path_in, 'r', newline='') as csv_file_in:
    csv_reader = csv.reader(csv_file_in)
    # Read the header row separately so it is not scored
    col_headers = next(csv_reader)
    # One output row per input row: both strings plus their similarity ratio
    results = [[row[0],
                row[1],
                SequenceMatcher(None, row[0], row[1]).ratio()]
               for row in csv_reader]

# newline='' stops the csv module from writing blank lines on Windows
with open(path_out, 'w', newline='') as csv_file_out:
    col_headers.append('Ratio')
    out_rows = [col_headers] + results
    writer = csv.writer(csv_file_out, delimiter=',')
    writer.writerows(out_rows)
In addition to the error you received, you might also have run into a problem when instantiating the SequenceMatcher object, because its first parameter wasn't specified in your code. You can find more on list comprehensions and SequenceMatcher in the Python docs. Good luck in your future Python coding.
Answer 4:
You are getting that error because row[0] or row[1] most probably contains NaN values in some records. Try converting them to strings first with str(row[0]) and str(row[1]).
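A rough sketch of that idea, assuming the values come from something like pandas, where an empty cell can arrive as a float NaN (the csv module itself always yields strings). The helper safe_ratio is hypothetical. Note that str() turns NaN into the text 'nan', so the score for such rows is not meaningful; treating missing values as empty strings may suit you better.

```python
import difflib

def safe_ratio(a, b):
    """Compare two values after coercing them to str (hypothetical helper)."""
    return difflib.SequenceMatcher(None, str(a), str(b)).ratio()

missing = float("nan")  # stand-in for an empty cell read as NaN

# Without the conversion, a float in either position raises TypeError.
try:
    difflib.SequenceMatcher(None, "tomato", missing).ratio()
    raised = False
except TypeError:
    raised = True

print(raised)
print(safe_ratio("tomato", missing))
```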
Answer 5:
You are getting the error because you are running SequenceMatcher on the lists of strings, rather than on the strings themselves. When you do this, you get back a single float value, rather than the list of ratio values I think you were expecting.
If I understand what you are trying to do, then you don't need to read in the rows first. You can simply find the diff ratio as you iterate through the rows.
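import csv
import difflib

match_list = []
with open('test.csv', newline='') as f:
    csv_f = csv.reader(f)
    for row in csv_f:
        # Compare the two strings in this row; wrap the ratio in a list
        # because writerows expects each row to be a sequence
        match_list.append([difflib.SequenceMatcher(a=row[0], b=row[1]).ratio()])

# newline='' avoids blank lines between rows on Windows
with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter=',')
    writer.writerows(match_list)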
Source: https://stackoverflow.com/questions/36802453/comparing-two-columns-of-a-csv-and-outputting-string-similarity-ratio-in-another