Iterate each row by updating values from 1st dataframe to 2nd dataframe based on unique value w/ different index, otherwise append and assign new ID

六月ゝ 毕业季﹏ 提交于 2021-02-10 07:33:17

问题


Trying to update each row from df1 to df2 if an unique value is matched. If not, append the row to df2 and assign new ID column.

df1 (NO ID COLUMN):

        unique_value        Status        Price
0       xyz123              bad           6.67
1       eff987              bad           1.75
2       efg125              okay          5.77

df2:

        unique_value        Status        Price        ID
0       xyz123              good          1.25         1000
1       xyz123              good          1.25         1000
2       xyz123              good          1.25         1000
3       xyz123              good          1.25         1000
4       xyz985              bad           1.31         1001
5       abc987              okay          4.56         1002
6       eff987              good          9.85         1003
7       asd541              excellent     8.85         1004

Desired output for updated df2:

        unique_value        Status        Price        ID
0       xyz123              bad           6.67         1000     <-updated
1       xyz123              bad           6.67         1000     <-updated
2       xyz123              bad           6.67         1000     <-updated
3       xyz123              bad           6.67         1000     <-updated
4       xyz985              bad           1.31         1001
5       abc987              okay          4.56         1002
6       eff987              bad           1.75         1003     <-updated
7       asd541              excellent     8.85         1004
8       efg125              okay          5.77         1005     <-appended  

Here is what I have done so far:

for i in range(0, len(df1)):
    if df1['unique_value'].isin(df2['unique_value'])[i] == True:
        ... update row in df2
    else:
        df2 = df2.append(i)
        ... assign row with new ID using pd.factorize and ID value at df2['ID'].max()+1

Note that I initial used pd.factorize to assign ID based on unique_value for df2 with values starting at 1000, 1001 (and so on) using this code: df2['ID'] = pd.factorize(df2['unique_value'])[0] + 1000

I tried using this solution (Updating a dataframe rows based on another dataframe rows), however it indexes my unique_value column, which prevents me from iterating another dataset moving forward.

Any way we can script this?

Thanks!


回答1:


My strategies of implementing the two parts is explained as follows.

  1. Update existing rows: df2 can be updated via broadcasting, provided that the shape of the row from df1 is correctly reshaped into (1, 3). The broadcasting concept in pandas is identical to that of numpy.
  2. Append new rows: Assuming a consecutive index counting up from 0, a new row can be easily appended by directly calling df2.loc[len(df2), :] = ..., where len(df2) is the next unused natural number for the index column. Example: this answer.

In addition, 2 additional state variables are constructed in my solution, as I think they would be more efficient than having to search through the entire df2 every time. They can of course be discarded if this is not a problem.

Code:

# additional state variables
# 1. for the ID to be added
current_max_id = df2["ID"].max()
# 2. for matching unique_values, avoiding searching df2["unique_value"] every time
current_value_set = set(df2["unique_value"].values)

# match unique_value's using the state variable instead of `df2`
mask = df1["unique_value"].isin(current_value_set)

for i in range(len(df1)):
    
    # current unique_value from df1
    uv1 = df1["unique_value"][i]
    
    # 1. update existing
    if mask[i]:
        
        # broadcast df1 into the matched rows in df2 (mind the shape)
        df2.loc[df2["unique_value"] == uv1, ["unique_value", "Status", "Price"]] = df1.iloc[i, :].values.reshape((1, 3))
        
    # 2. append new
    else:
        # update state variables
        current_max_id += 1
        current_value_set.add(uv1)
        # append the row (assumes df2.index=[0,1,2,3,...])
        df2.loc[len(df2), :] = [df1.iloc[i, 0], df1.iloc[i, 1], df1.iloc[i, 2], current_max_id]

Output:

df2
Out[45]: 
  unique_value     Status  Price      ID
0       xyz123        bad   6.67  1000.0
1       xyz123        bad   6.67  1000.0
2       xyz123        bad   6.67  1000.0
3       xyz123        bad   6.67  1000.0
4       xyz985        bad   1.31  1001.0
5       abc987       okay   4.56  1002.0
6       eff987        bad   1.75  1003.0
7       asd541  excellent   8.85  1004.0
8       efg125       okay   5.77  1005.0

Tested with python 3.7, pandas 1.1.2, OS=debian 10 64-bit



来源:https://stackoverflow.com/questions/64198456/iterate-each-row-by-updating-values-from-1st-dataframe-to-2nd-dataframe-based-on

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!