Question
I have a dataframe df_sample with 10 parsed addresses and am comparing it to another dataframe, df, with hundreds of thousands of parsed address records. Both df_sample and df share the exact same structure:
zip_code city state street_number street_name unit_number country
12345 FAKEVILLE FLORIDA 123 FAKE ST NaN US
What I want to do is match a single row in df_sample against every row in df, starting with state, and take only the rows where fuzzy.ratio(df['state'], df_sample['state']) > 0.9 into a new dataframe. Once this new, smaller dataframe is created from those matches, I would continue to do this for city, zip_code, etc. Something like:
df_match = df[fuzzy.ratio(df_sample['state'], df['state']) > 0.9]
except that doesn't work.
My goal is to narrow down the number of matches each time I use a harder search criterion, and eventually end up with a dataframe with as few matches as possible based on narrowing it down by each column individually. But I am unsure as to how to do this for any single record.
Answer 1:
Create your dataframes, giving both a constant key column so that merging on it pairs every row of df_sample with every row of df:
import pandas as pd
from fuzzywuzzy import fuzz

df = pd.DataFrame({'key': [1, 1, 1, 1, 1],
                   'zip': [1, 2, 3, 4, 5],
                   'state': ['Florida', 'Nevada', 'Texas', 'Florida', 'Texas']})
df_sample = pd.DataFrame({'key': [1, 1, 1, 1, 1],
                          'zip': [6, 7, 8, 9, 10],
                          'state': ['florida', 'Flor', 'NY', 'Florida', 'Tx']})
# every key is 1, so this merge is a cross join
merged_df = df_sample.merge(df, on='key')
# score each (sample state, df state) pair on a 0-100 scale
merged_df['fuzzy_ratio'] = merged_df.apply(lambda row: fuzz.ratio(row['state_x'], row['state_y']), axis=1)
merged_df
You get the fuzzy ratio for each pair:
key zip_x state_x zip_y state_y fuzzy_ratio
0 1 6 florida 1 Florida 86
1 1 6 florida 2 Nevada 31
2 1 6 florida 3 Texas 17
3 1 6 florida 4 Florida 86
4 1 6 florida 5 Texas 17
5 1 7 Flor 1 Florida 73
6 1 7 Flor 2 Nevada 0
7 1 7 Flor 3 Texas 0
8 1 7 Flor 4 Florida 73
9 1 7 Flor 5 Texas 0
10 1 8 NY 1 Florida 0
11 1 8 NY 2 Nevada 25
12 1 8 NY 3 Texas 0
13 1 8 NY 4 Florida 0
14 1 8 NY 5 Texas 0
15 1 9 Florida 1 Florida 100
16 1 9 Florida 2 Nevada 31
17 1 9 Florida 3 Texas 17
18 1 9 Florida 4 Florida 100
19 1 9 Florida 5 Texas 17
20 1 10 Tx 1 Florida 0
21 1 10 Tx 2 Nevada 0
22 1 10 Tx 3 Texas 57
23 1 10 Tx 4 Florida 0
24 1 10 Tx 5 Texas 57
Then filter out what you don't want:
mask = (merged_df['fuzzy_ratio']>80)
merged_df[mask]
Result:
key zip_x state_x zip_y state_y fuzzy_ratio
0 1 6 florida 1 Florida 86
3 1 6 florida 4 Florida 86
15 1 9 Florida 1 Florida 100
18 1 9 Florida 4 Florida 100
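The question asks for a column-by-column narrowing against a single sample row, which can also be done without a cross join. The following is a minimal sketch of that idea, not part of the original answer. The helper names ratio and narrow and the sample data are made up for illustration, and difflib from the standard library stands in for fuzz.ratio so the sketch runs without fuzzywuzzy installed; the scores may differ slightly from fuzzywuzzy's.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    """Stand-in for fuzz.ratio (an assumption): similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, str(a), str(b)).ratio())


def narrow(df, sample_row, columns, min_ratio=90):
    """Filter df one column at a time against a single sample row."""
    matches = df
    for col in columns:
        target = sample_row[col]
        # Score only the rows that survived the previous columns.
        scores = matches[col].apply(lambda value: ratio(value, target))
        matches = matches[scores > min_ratio]
    return matches


# Made-up data for illustration only.
df = pd.DataFrame({
    'state': ['FLORIDA', 'FLORIDA', 'FLORIDA', 'TEXAS'],
    'city': ['FAKEVILLE', 'FAKEVILLE', 'MIAMI', 'FAKEVILLE'],
    'zip_code': ['12345', '12399', '12345', '12345'],
})
df_sample = pd.DataFrame({
    'state': ['FLORIDA'], 'city': ['FAKEVILLE'], 'zip_code': ['12345'],
})

df_match = narrow(df, df_sample.iloc[0], ['state', 'city', 'zip_code'])
print(df_match)
```

Each pass scores only the rows that survived the earlier columns, so the more selective columns are applied to an ever smaller frame. To use fuzzywuzzy instead, replacing ratio with fuzz.ratio should be enough as long as the compared values are strings.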
Answer 2:
I'm not familiar with fuzzy, so this is more of a comment than an answer. That said, you can do something like this:
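import pandas as pd
from fuzzywuzzy import fuzz

# cross join: give both frames a constant dummy key and merge on it
df_merge = pd.merge(*[d.assign(dummy=1) for d in (df, df_sample)],
                    on='dummy', how='left')
filters = pd.DataFrame()
# compute the fuzzy ratio for each pair of columns
# (values are cast to str so that numbers or NaN do not raise)
for col in df.columns:
    filters[col] = df_merge.apply(
        lambda x: fuzz.ratio(str(x[col + '_x']), str(x[col + '_y'])), axis=1)
# keep only the pairs where every column scores above 90
# (fuzz.ratio returns an integer from 0 to 100, not a fraction)
df_match = df_merge[filters.gt(90).all(axis=1)]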
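As an aside, the dummy-key merge above is one way to build a cross join. Assuming pandas 1.2 or later, merge also accepts how='cross' directly. The frames below are made up for illustration; this is a sketch of the alternative, not what the answer's author used.

```python
import pandas as pd

# Made-up frames for illustration only.
df = pd.DataFrame({'zip': [1, 2, 3], 'state': ['Florida', 'Nevada', 'Texas']})
df_sample = pd.DataFrame({'zip': [6, 7], 'state': ['florida', 'Tx']})

# how='cross' pairs every row of df with every row of df_sample;
# overlapping column names get the default _x / _y suffixes.
df_merge = pd.merge(df, df_sample, how='cross')
print(df_merge.shape)
print(list(df_merge.columns))
```

Either way, the merged frame has len(df) * len(df_sample) rows, so memory use grows with the product of the two sizes.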
Answer 3:
You wrote that your df has a very large number of rows, so a full cross join followed by elimination may cause your code to run out of memory.
Take a look at another solution that requires less memory:
import pandas as pd
from fuzzywuzzy import fuzz

minRatio = 90
result = []
# Series.items() replaces iteritems(), which was removed in pandas 2.0
for idx1, t1 in df_sample.state.items():
    for idx2, t2 in df.state.items():
        ratio = fuzz.WRatio(t1, t2)
        if ratio > minRatio:
            result.append([idx1, t1, idx2, t2, ratio])
df2 = pd.DataFrame(result, columns=['idx1', 'state1', 'idx2', 'state2', 'ratio'])
It contains two nested loops running over both DataFrames. The result is a DataFrame with rows containing:
- index and state from df_sample,
- index and state from df,
- the ratio.
This tells you which rows in the two DataFrames are "related" to each other.
The advantage is that you don't generate a full cross join and (for now) you operate only on the state columns instead of full rows.
You didn't describe exactly what the final result should be, but I think that based on the above code you will be able to proceed further.
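One possible way to proceed further, sketched here under assumptions: a column such as state usually holds few distinct values, so each distinct pair can be scored once and the matches mapped back to rows with isin. This is not part of the original answer. The helper ratio uses difflib from the standard library as a stand-in for a fuzzywuzzy scorer so the sketch runs on its own, the data is made up, and the threshold of 80 was picked to suit that data.

```python
from difflib import SequenceMatcher

import pandas as pd


def ratio(a, b):
    """Stand-in for a fuzzywuzzy scorer (an assumption): 0-100 similarity."""
    return round(100 * SequenceMatcher(None, str(a), str(b)).ratio())


# Made-up data for illustration only.
df = pd.DataFrame({'state': ['Florida', 'Nevada', 'Texas', 'Florida', 'Texas']})
df_sample = pd.DataFrame({'state': ['florida', 'Flor', 'NY', 'Florida', 'Tx']})

min_ratio = 80
result = []
# Score each distinct pair of values once, however often it occurs in df.
for t1 in df_sample['state'].unique():
    for t2 in df['state'].unique():
        score = ratio(t1, t2)
        if score > min_ratio:
            result.append([t1, t2, score])
pairs = pd.DataFrame(result, columns=['state1', 'state2', 'ratio'])

# Map the matching values back to the rows of df that carry them.
df_match = df[df['state'].isin(pairs['state2'])]
print(pairs)
print(df_match)
```

With hundreds of thousands of rows but only a few dozen distinct states, this cuts the number of scorer calls from rows times samples to distinct values times distinct values. The gain shrinks for columns with many distinct values, such as street_name.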
Source: https://stackoverflow.com/questions/59312265/how-to-compare-a-value-in-one-dataframe-to-a-column-in-another-using-fuzzywuzzy