I am trying to update the rows of df2 from df1 wherever unique_value matches. If there is no match, the row from df1 should be appended to df2 and given a new ID.
df1:
  unique_value Status  Price
0       xyz123    bad   6.67
1       eff987    bad   1.75
2       efg125   okay   5.77

df2:
  unique_value     Status  Price    ID
0       xyz123       good   1.25  1000
1       xyz123       good   1.25  1000
2       xyz123       good   1.25  1000
3       xyz123       good   1.25  1000
4       xyz985        bad   1.31  1001
5       abc987       okay   4.56  1002
6       eff987       good   9.85  1003
7       asd541  excellent   8.85  1004
Desired output for updated df2:
  unique_value     Status  Price    ID
0       xyz123        bad   6.67  1000  <-updated
1       xyz123        bad   6.67  1000  <-updated
2       xyz123        bad   6.67  1000  <-updated
3       xyz123        bad   6.67  1000  <-updated
4       xyz985        bad   1.31  1001
5       abc987       okay   4.56  1002
6       eff987        bad   1.75  1003  <-updated
7       asd541  excellent   8.85  1004
8       efg125       okay   5.77  1005  <-appended
Here is what I have done so far:
for i in range(0, len(df1)):
    if df1['unique_value'].isin(df2['unique_value'])[i] == True:
        ... update row in df2
    else:
        df2 = df2.append(i)
        ... assign row with new ID using pd.factorize and ID value at df2['ID'].max()+1
Note that I initially used pd.factorize to assign ID based on unique_value for df2, with values starting at 1000, 1001 (and so on), using this code: df2['ID'] = pd.factorize(df2['unique_value'])[0] + 1000
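For reference, the ID assignment described above can be reproduced on a small frame. This is only a minimal sketch: the four sample values are taken from df2, and the names codes and uniques are illustrative.

```python
import pandas as pd

# A few unique_value entries from df2, including a repeated one
df2 = pd.DataFrame({"unique_value": ["xyz123", "xyz123", "xyz985", "abc987"]})

# factorize returns one integer code per row (in order of first appearance)
# plus the distinct values those codes refer to
codes, uniques = pd.factorize(df2["unique_value"])

# Shift the codes so that IDs start at 1000
df2["ID"] = codes + 1000
print(df2)
```

Rows that share a unique_value receive the same code, which is why the repeated xyz123 rows end up with the same ID.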
I tried the solution from "Updating a dataframe rows based on another dataframe rows", but it sets my unique_value column as the index, which prevents me from iterating over another dataset afterwards.
Any way we can script this?
My strategy for implementing the two parts is as follows.
- Update existing rows: matched rows in df2 can be updated via broadcasting, provided that the row from df1 is correctly reshaped into (1, 3). The broadcasting concept in pandas is identical to that of numpy.
- Append new rows: assuming a consecutive index counting up from 0, a new row can be easily appended by directly calling df2.loc[len(df2), :] = ..., where len(df2) is the next unused natural number for the index column. Example: this answer.
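The two operations can be tried in isolation on a toy frame. This is a hedged sketch, not the full solution: the three-row df2 and the name new_values are made up for illustration, and the update uses a plain list with one value per column (which pandas applies to every selected row) instead of the reshaped (1, 3) array.

```python
import pandas as pd

# Toy df2 with a default 0..n-1 index (an assumption of the append step)
df2 = pd.DataFrame({
    "unique_value": ["xyz123", "xyz123", "eff987"],
    "Status": ["good", "good", "good"],
    "Price": [1.25, 1.25, 9.85],
    "ID": [1000, 1000, 1001],
})

cols = ["unique_value", "Status", "Price"]
new_values = ["xyz123", "bad", 6.67]  # hypothetical row coming from df1

# 1. Update: one value per column, repeated over all matched rows
matched = df2["unique_value"] == new_values[0]
df2.loc[matched, cols] = new_values

# 2. Append: len(df2) is the next unused label of a 0..n-1 index
next_id = int(df2["ID"].max()) + 1
df2.loc[len(df2), :] = ["efg125", "okay", 5.77, next_id]
print(df2)
```

Depending on the pandas version, the enlargement in step 2 may change the dtype of the ID column, so it can be worth casting it back with astype(int) afterwards.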
In addition, two state variables are constructed in my solution, as I think they are more efficient than searching through the entire df2 every time. They can of course be discarded if this is not a problem.
# additional state variables
# 1. for the ID to be added
current_max_id = df2["ID"].max()
# 2. for matching unique_values, avoiding searching df2["unique_value"] every time
current_value_set = set(df2["unique_value"].values)
# match unique_value's using the state variable instead of `df2`
mask = df1["unique_value"].isin(current_value_set)

for i in range(len(df1)):
    # current unique_value from df1
    uv1 = df1["unique_value"][i]
    # 1. update existing
    if mask[i]:
        # broadcast df1 into the matched rows in df2 (mind the shape)
        df2.loc[df2["unique_value"] == uv1, ["unique_value", "Status", "Price"]] = df1.iloc[i, :].values.reshape((1, 3))
    # 2. append new
    else:
        # update state variables
        current_max_id += 1
        # append the row (assumes df2.index=[0,1,2,3,...])
        df2.loc[len(df2), :] = [df1.iloc[i, 0], df1.iloc[i, 1], df1.iloc[i, 2], current_max_id]
Output:
  unique_value     Status  Price      ID
0       xyz123        bad   6.67  1000.0
1       xyz123        bad   6.67  1000.0
2       xyz123        bad   6.67  1000.0
3       xyz123        bad   6.67  1000.0
4       xyz985        bad   1.31  1001.0
5       abc987       okay   4.56  1002.0
6       eff987        bad   1.75  1003.0
7       asd541  excellent   8.85  1004.0
8       efg125       okay   5.77  1005.0
Tested with python 3.7, pandas 1.1.2, OS=debian 10 64-bit
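If the row-by-row loop becomes a bottleneck, the same update-or-append logic can probably be written without a loop. The following is a separate sketch, not the loop above: the function name upsert and its parameters are hypothetical, it assumes df2 has no missing values in the updated columns, and it keeps only the last row of df1 for each unique_value.

```python
import pandas as pd

df1 = pd.DataFrame({
    "unique_value": ["xyz123", "eff987", "efg125"],
    "Status": ["bad", "bad", "okay"],
    "Price": [6.67, 1.75, 5.77],
})
df2 = pd.DataFrame({
    "unique_value": ["xyz123"] * 4 + ["xyz985", "abc987", "eff987", "asd541"],
    "Status": ["good"] * 4 + ["bad", "okay", "good", "excellent"],
    "Price": [1.25] * 4 + [1.31, 4.56, 9.85, 8.85],
    "ID": [1000] * 4 + [1001, 1002, 1003, 1004],
})


def upsert(df1, df2, key="unique_value", id_col="ID"):
    """Update rows of df2 that match df1 on `key`; append the rest with new IDs."""
    # One row per key, indexed by key, so it can be used as a lookup table
    lookup = df1.drop_duplicates(key, keep="last").set_index(key)
    out = df2.copy()
    # Update: map each key to its new value, keep the old value where unmatched
    for col in lookup.columns:
        out[col] = out[key].map(lookup[col]).fillna(out[col])
    # Append: keys of df1 that are absent from df2 get consecutive new IDs
    new = lookup.loc[~lookup.index.isin(df2[key])].reset_index()
    next_id = int(df2[id_col].max()) + 1
    new[id_col] = range(next_id, next_id + len(new))
    return pd.concat([out, new], ignore_index=True)


result = upsert(df1, df2)
print(result)
```

Because the new IDs are generated as integers before concatenation, the ID column stays integer here instead of turning into floats.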