问题
Trying to update each row from df1 to df2 if an unique value is matched. If not, append the row to df2 and assign new ID column.
df1 (NO ID COLUMN):
unique_value Status Price
0 xyz123 bad 6.67
1 eff987 bad 1.75
2 efg125 okay 5.77
df2:
unique_value Status Price ID
0 xyz123 good 1.25 1000
1 xyz123 good 1.25 1000
2 xyz123 good 1.25 1000
3 xyz123 good 1.25 1000
4 xyz985 bad 1.31 1001
5 abc987 okay 4.56 1002
6 eff987 good 9.85 1003
7 asd541 excellent 8.85 1004
Desired output for updated df2:
unique_value Status Price ID
0 xyz123 bad 6.67 1000 <-updated
1 xyz123 bad 6.67 1000 <-updated
2 xyz123 bad 6.67 1000 <-updated
3 xyz123 bad 6.67 1000 <-updated
4 xyz985 bad 1.31 1001
5 abc987 okay 4.56 1002
6 eff987 bad 1.75 1003 <-updated
7 asd541 excellent 8.85 1004
8 efg125 okay 5.77 1005 <-appended
Here is what I have done so far:
for i in range(0, len(df1)):
if df1['unique_value'].isin(df2['unique_value'])[i] == True:
... update row in df2
else:
df2 = df2.append(i)
... assign row with new ID using pd.factorize and ID value at df2['ID'].max()+1
Note that I initial used pd.factorize
to assign ID based on unique_value for df2
with values starting at 1000
, 1001
(and so on) using this code: df2['ID'] = pd.factorize(df2['unique_value'])[0] + 1000
I tried using this solution (Updating a dataframe rows based on another dataframe rows), however it indexes my unique_value column, which prevents me from iterating another dataset moving forward.
Any way we can script this?
Thanks!
回答1:
My strategies of implementing the two parts is explained as follows.
- Update existing rows:
df2
can be updated via broadcasting, provided that the shape of the row fromdf1
is correctly reshaped into(1, 3)
. The broadcasting concept inpandas
is identical to that ofnumpy
. - Append new rows: Assuming a consecutive index counting up from
0
, a new row can be easily appended by directly callingdf2.loc[len(df2), :] = ...
, wherelen(df2)
is the next unused natural number for the index column. Example: this answer.
In addition, 2 additional state variables are constructed in my solution, as I think they would be more efficient than having to search through the entire df2
every time. They can of course be discarded if this is not a problem.
Code:
# additional state variables
# 1. for the ID to be added
current_max_id = df2["ID"].max()
# 2. for matching unique_values, avoiding searching df2["unique_value"] every time
current_value_set = set(df2["unique_value"].values)
# match unique_value's using the state variable instead of `df2`
mask = df1["unique_value"].isin(current_value_set)
for i in range(len(df1)):
# current unique_value from df1
uv1 = df1["unique_value"][i]
# 1. update existing
if mask[i]:
# broadcast df1 into the matched rows in df2 (mind the shape)
df2.loc[df2["unique_value"] == uv1, ["unique_value", "Status", "Price"]] = df1.iloc[i, :].values.reshape((1, 3))
# 2. append new
else:
# update state variables
current_max_id += 1
current_value_set.add(uv1)
# append the row (assumes df2.index=[0,1,2,3,...])
df2.loc[len(df2), :] = [df1.iloc[i, 0], df1.iloc[i, 1], df1.iloc[i, 2], current_max_id]
Output:
df2
Out[45]:
unique_value Status Price ID
0 xyz123 bad 6.67 1000.0
1 xyz123 bad 6.67 1000.0
2 xyz123 bad 6.67 1000.0
3 xyz123 bad 6.67 1000.0
4 xyz985 bad 1.31 1001.0
5 abc987 okay 4.56 1002.0
6 eff987 bad 1.75 1003.0
7 asd541 excellent 8.85 1004.0
8 efg125 okay 5.77 1005.0
Tested with python 3.7, pandas 1.1.2, OS=debian 10 64-bit
来源:https://stackoverflow.com/questions/64198456/iterate-each-row-by-updating-values-from-1st-dataframe-to-2nd-dataframe-based-on