I am new to pandas and can\'t seem to get this to work with merge function:
>>> left >>> right
a b c a c d
0 1 4
DataFrame.update() is nice, but it doesn't let you specify columns to join on and more importantly, if the other dataframe has NaN values, those NaN values will not overwrite non-nan values in the original DataFrame. To me, this is undesirable behavior.
Here's a custom method I rolled to fix these issues. It's freshly written, so users beware..
def join_insertion(into_df, from_df, on, cols, mult='error'):
"""
Suppose A and B are dataframes. A has columns {foo, bar, baz} and B has columns {foo, baz, buz}
This function allows you to do an operation like:
"where A and B match via the column foo, insert the values of baz and buz from B into A"
Note that this'll update A's values for baz and it'll insert buz as a new column.
This is a lot like DataFrame.update(), but that method annoyingly ignores NaN values in B!
:param into_df: dataframe you want to modify
:param from_df: dataframe with the values you want to insert
:param cols: list of column names (values to insert)
:param on: list of column names (values to join on), or a dict of {into:from} column name pairs
:param mult: if a key of into_df matches multiple rows of from_df, how should this be handled?
an error can be raised, or the first matching value can be inserted, or the last matching value
can be inserted
:return: a modified copy of into_df, with updated values using from_df
"""
# Infer left_on, right_on
if (isinstance(on, dict)):
left_on = list(on.keys())
right_on = list(on.values())
elif(isinstance(on, list)):
left_on = on
right_on = on
elif(isinstance(on, str)):
left_on = [on]
right_on = [on]
else:
raise Exception("on should be a list or dictionary")
# Make cols a list if it isn't already
if(isinstance(cols, str)):
cols = [cols]
# Setup
A = into_df.copy()
B = from_df[right_on + cols].copy()
# Insert row ids
A['_A_RowId_'] = np.arange(A.shape[0])
B['_B_RowId_'] = np.arange(B.shape[0])
A = pd.merge(
left=A,
right=B,
how='left',
left_on=left_on,
right_on=right_on,
suffixes=(None, '_y'),
indicator=True
).sort_values(['_A_RowId_', '_B_RowId_'])
# Check for rows of A which got duplicated by the merge, and then handle appropriately
if(mult == 'error'):
if(A.groupby('_A_RowId_').size().max() > 1):
raise Exception("At least one key of into_df matched multiple rows of from_df.")
elif(mult == 'first'):
A = A.groupby('_A_RowId_').first().reset_index()
elif(mult == 'last'):
A = A.groupby('_A_RowId_').last().reset_index()
mask = A._merge == 'both'
cols_in_both = list(set(into_df.columns.to_list()).intersection(set(cols)))
for col in cols_in_both:
A.loc[mask, col] = A.loc[mask, col + '_y']
# Drop unwanted columns
A.drop(columns=list(set(A.columns).difference(set(into_df.columns.to_list() + cols))), inplace=True)
return A
into_df = pd.DataFrame({
'foo': [1, 2, 3],
'bar': [4, 5, 6],
'baz': [7, 8, 9]
})
foo bar baz
0 1 4 7
1 2 5 8
2 3 6 9
from_df = pd.DataFrame({
'foo': [1, 3, 5, 7, 3],
'baz': [70, 80, 90, 30, 40],
'buz': [0, 1, 2, 3, 4]
})
foo baz buz
0 1 70 0
1 3 80 1
2 5 90 2
3 7 30 3
4 3 40 4
# Use it!
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='error')
Exception: At least one key of into_df matched multiple rows of from_df.
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='first')
foo bar baz buz
0 1 4 70.0 0.0
1 2 5 8.0 NaN
2 3 6 80.0 1.0
join_insertion(into_df, from_df, on='foo', cols=['baz','buz'], mult='last')
foo bar baz buz
0 1 4 70.0 0.0
1 2 5 8.0 NaN
2 3 6 40.0 4.0
As an aside, this is one of those things I severely miss from R's data.table package. With data.table, this is as easy as x[y, Foo := i.Foo, on = c("a", "b")]
One other way is to use pd.merge like so:
>>> import pandas as pd
>>> final = pd.merge(right, left,
how='outer',
left_index=True,
right_index=True,
on=('a', 'c')
).sort_index(axis=1)
>>> final
a b c d
0 1 4 7 13.0
1 2 5 8 14.0
2 3 6 9 15.0
3 4 7 12 NaN
You can compute the intersection of both DataFrames's columns names you want to update to pass it to the 'on=' parameter of the function.
It does not create unwanted columns that have to be dropped like with Zero's solution.
Edit: The NaN value might change integers to floats in the same column.
One way to do this is to set the a column as the index and update:
In [11]: left_a = left.set_index('a')
In [12]: right_a = right.set_index('a')
Note: update
only does a left join (not merges), so as well as set_index you also need to include the additional columns not present in left_a
.
In [13]: res = left_a.reindex(columns=left_a.columns.union(right_a.columns))
In [14]: res.update(right_a)
In [15]: res.reset_index(inplace=True)
In [16]: res
Out[16]:
a b c d
0 1 4 7 13
1 2 5 8 14
2 3 6 9 15
3 4 7 12 NaN
Here's a way to do it with join
:
In [632]: t = left.set_index('a').join(right.set_index('a'), rsuffix='_right')
In [633]: t
Out[633]:
b c c_right d
a
1 4 9 7 13
2 5 10 8 14
3 6 11 9 15
4 7 12 NaN NaN
Now, we want to set null values of c_right
(which is from the right
dataframe) with values from c
column from the left
dataframe. Updated the below process with a method taking from @John Galt's answer
In [657]: t['c_right'] = t['c_right'].fillna(t['c'])
In [658]: t
Out[658]:
b c c_right d
a
1 4 9 7 13
2 5 10 8 14
3 6 11 9 15
4 7 12 12 NaN
In [659]: t.drop('c_right', axis=1)
Out[659]:
b c d
a
1 4 9 13
2 5 10 14
3 6 11 15
4 7 12 NaN
You can use merge()
between left
and right
with how='left'
on 'a'
column.
In [74]: final = left.merge(right, on='a', how='left')
In [75]: final
Out[75]:
a b c_x c_y d
0 1 4 9 7 13
1 2 5 10 8 14
2 3 6 11 9 15
3 4 7 12 NaN NaN
Replace NaN
value from c_y
with c_x
value
In [76]: final['c'] = final['c_y'].fillna(final['c_x'])
In [77]: final
Out[77]:
a b c_x c_y d c
0 1 4 9 7 13 7
1 2 5 10 8 14 8
2 3 6 11 9 15 9
3 4 7 12 NaN NaN 12
Drop unwanted columns, and you have the result
In [79]: final.drop(['c_x', 'c_y'], axis=1)
Out[79]:
a b d c
0 1 4 13 7
1 2 5 14 8
2 3 6 15 9
3 4 7 NaN 12