Replace column values based on another dataframe python pandas - better way?

眼角桃花 2020-12-01 02:42

Note: for simplicity's sake, I'm using a toy example, because copy/pasting dataframes is difficult in Stack Overflow (please let me know if there's an easy way to do this).

4 Answers
  • 2020-12-01 03:24
    df2.set_index('Name').combine_first(df1.set_index('Name')).reset_index()
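
    A minimal sketch of what this one-liner does, assuming names are unique in both frames; the toy data is illustrative, not the asker's. Values from df2 take priority, gaps are filled from df1, and the result holds the union of both frames' rows, so a name found only in df2 (here 'U') shows up as a new row.

```python
import pandas as pd

# Illustrative data (an assumption; names are unique in each frame).
df1 = pd.DataFrame({"Name": ["X", "Y", "Z"],
                    "Nonprofit": [1, 0, 0],
                    "Business": [1, 1, 0],
                    "Education": [0, 0, 0]})
df2 = pd.DataFrame({"Name": ["Y", "Z", "U"],
                    "Nonprofit": [1, 1, 1],
                    "Education": [1, 1, 3]})

# df2's values win; anything missing in df2 is taken from df1.
result = df2.set_index("Name").combine_first(df1.set_index("Name")).reset_index()

by_name = result.set_index("Name")
print(by_name.loc["Y", ["Nonprofit", "Education"]].tolist())  # updated from df2
print(by_name.loc["X", ["Nonprofit", "Education"]].tolist())  # kept from df1
print(by_name.loc["U", "Business"])  # NaN, because 'U' exists only in df2
```

    If rows that exist only in df2 should not be added, filter them out afterwards or use one of the other approaches.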
    
  • 2020-12-01 03:26

    Use the boolean mask from isin to filter the df and assign the desired row values from the rhs df:

    In [27]:
    
    df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']]
    df
    Out[27]:
      Name  Nonprofit  Business  Education
    0    X          1         1          0
    1    Y          1         1          1
    2    Z          1         0          1
    3    Y          1         1          1
    
    [4 rows x 4 columns]
    
  • 2020-12-01 03:31

    This is the correct one:

    In [27]:

    df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = df1[['Nonprofit', 'Education']].values
    
    df
    Out[27]:
      Name  Nonprofit  Business  Education
    0    X          1         1          0
    1    Y          1         1          1
    2    Z          1         0          1
    3    Y          1         1          1

    [4 rows x 4 columns]

    The above works only when every row in df1 exists in df. In other words, df must be a superset of df1.

    If df1 contains rows that have no match in df (that is, df is not a superset of df1), use the following instead:

    df.loc[df.Name.isin(df1.Name), ['Nonprofit', 'Education']] = (
        df1.loc[df1.Name.isin(df.Name), ['Nonprofit', 'Education']].values
    )
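
    Because `.values` strips the index, this assignment is purely positional: the n-th matching row of df receives the n-th row of the right-hand side. The sketch below uses made-up data in which df1 lists the names in a different order, to show the resulting mix-up alongside one key-based alternative built on `Series.map`. The alternative is an illustration rather than part of the answer above, and it assumes the names in df1 are unique.

```python
import pandas as pd

# Made-up data: df1 lists the names in a different order than df.
df = pd.DataFrame({"Name": ["X", "Y", "Z"],
                   "Nonprofit": [1, 0, 0],
                   "Education": [0, 0, 0]})
df1 = pd.DataFrame({"Name": ["Z", "Y"],
                    "Nonprofit": [7, 8],
                    "Education": [5, 6]})
cols = ["Nonprofit", "Education"]

# Positional assignment: row 'Y' receives the values that belong to 'Z'.
positional = df.copy()
mask = positional["Name"].isin(df1["Name"])
positional.loc[mask, cols] = df1[cols].values

# Key-based assignment: look each name up in df1, keep the old value if absent.
lookup = df1.set_index("Name")
keyed = df.copy()
for col in cols:
    new_values = keyed["Name"].map(lookup[col])
    keyed[col] = new_values.fillna(keyed[col]).astype(keyed[col].dtype)

print(positional)
print(keyed)
```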
    
  • 2020-12-01 03:37

    Attention: in the latest version of pandas, neither of the answers above works anymore.

    KSD's answer raises an error:

    import pandas as pd

    df1 = pd.DataFrame([["X", 1, 1, 0],
                        ["Y", 0, 1, 0],
                        ["Z", 0, 0, 0],
                        ["Y", 0, 0, 0]], columns=["Name", "Nonprofit", "Business", "Education"])

    df2 = pd.DataFrame([["Y", 1, 1],
                        ["Z", 1, 1]], columns=["Name", "Nonprofit", "Education"])

    # Both forms of KSD's assignment fail on this data.
    df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2.loc[df2.Name.isin(df1.Name), ['Nonprofit', 'Education']].values

    df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']].values
    
    Out[851]:
    ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (3,)
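
    The mismatch arises because 'Y' occurs twice in df1: the mask selects three rows, while df2 supplies only two. A quick check along these lines (a sketch that rebuilds the same data) makes the problem visible before assigning:

```python
import pandas as pd

df1 = pd.DataFrame([["X", 1, 1, 0],
                    ["Y", 0, 1, 0],
                    ["Z", 0, 0, 0],
                    ["Y", 0, 0, 0]], columns=["Name", "Nonprofit", "Business", "Education"])
df2 = pd.DataFrame([["Y", 1, 1],
                    ["Z", 1, 1]], columns=["Name", "Nonprofit", "Education"])

mask = df1["Name"].isin(df2["Name"])
n_target = int(mask.sum())  # rows of df1 that would be overwritten
n_source = len(df2)         # rows available on the right-hand side
has_duplicates = bool(df1.loc[mask, "Name"].duplicated().any())

print(n_target, n_source, has_duplicates)  # 3 2 True
```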
    

    EdChum's answer, in turn, gives the wrong result:

    df1.loc[df1.Name.isin(df2.Name), ['Nonprofit', 'Education']] = df2[['Nonprofit', 'Education']]
    
    df1
    Out[852]: 
      Name  Nonprofit  Business  Education
    0    X        1.0         1        0.0
    1    Y        1.0         1        1.0
    2    Z        NaN         0        NaN
    3    Y        NaN         1        NaN
    

    It works safely only if the values in column 'Name' are unique and sorted in both data frames.

    Here is my answer:

    Way 1:

    # df2 shares 'Nonprofit' and 'Education' with df1, so only those columns get _x/_y suffixes
    df1 = df1.merge(df2, on='Name', how='left')
    df1['Nonprofit_y'] = df1['Nonprofit_y'].fillna(df1['Nonprofit_x'])
    df1['Education_y'] = df1['Education_y'].fillna(df1['Education_x'])
    df1.drop(['Education_x', 'Nonprofit_x'], inplace=True, axis=1)
    df1.rename(columns={'Education_y': 'Education', 'Nonprofit_y': 'Nonprofit'}, inplace=True)
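
    The merge-then-fill pattern can be wrapped so the shared columns do not have to be listed by hand. The following is only a sketch: `overwrite_from` is a hypothetical helper name, it assumes the key is unique in the right-hand frame (otherwise the left merge would duplicate rows), and a NaN in the right-hand frame is treated as "no new value".

```python
import pandas as pd

def overwrite_from(left, right, key):
    """Hypothetical helper: overwrite the columns of `left` that also exist
    in `right`, matching rows on `key`. Assumes `key` is unique in `right`."""
    shared = [c for c in right.columns if c != key and c in left.columns]
    merged = left.merge(right[[key] + shared], on=key, how="left",
                        suffixes=("", "_new"))
    for col in shared:
        # Take the new value where one was matched, else keep the old one,
        # then restore the original dtype lost to the NaN placeholders.
        merged[col] = merged[col + "_new"].fillna(merged[col]).astype(left[col].dtype)
    return merged.drop(columns=[c + "_new" for c in shared])

df1 = pd.DataFrame([["X", 1, 1, 0],
                    ["Y", 0, 1, 0],
                    ["Z", 0, 0, 0],
                    ["Y", 0, 0, 0]], columns=["Name", "Nonprofit", "Business", "Education"])
df2 = pd.DataFrame([["Y", 1, 1],
                    ["Z", 1, 1]], columns=["Name", "Nonprofit", "Education"])

result = overwrite_from(df1, df2, "Name")
print(result)
```

    Unlike the hand-written version, this keeps df1's column order and integer dtypes.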
    

    Way 2:

    df1 = df1.set_index('Name')
    df2 = df2.set_index('Name')
    df1.update(df2)
    df1.reset_index(inplace=True)
    

    More about update: the columns used as the index do not need to have the same name in both data frames before calling 'update'. You could try 'Name1' and 'Name2'. It also works when df2 contains extra rows that have no match in df1; those rows simply do not update anything. In other words, df1 doesn't need to be a superset of df2.

    Example:

    df1 = pd.DataFrame([["X",1,1,0],
                  ["Y",0,1,0],
                  ["Z",0,0,0],
                  ["Y",0,1,0]],columns=["Name1","Nonprofit","Business", "Education"])    
    
    df2 = pd.DataFrame([["Y",1,1],
                  ["Z",1,1],
                  ['U',1,3]],columns=["Name2","Nonprofit", "Education"])   
    
    df1 = df1.set_index('Name1')
    df2 = df2.set_index('Name2')
    
    
    df1.update(df2)
    

    result:

          Nonprofit  Business  Education
    Name1                                
    X           1.0         1        0.0
    Y           1.0         1        1.0
    Z           1.0         0        1.0
    Y           1.0         1        1.0
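
    Two further details of `update` are worth seeing in a small sketch: it modifies the frame in place and returns None, and a NaN in df2 leaves the existing value untouched. The data below is made up, the names are kept unique to keep the sketch simple, and the final `astype` step is an optional addition for restoring df1's original dtypes.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Name": ["X", "Y", "Z"],
                    "Nonprofit": [1, 0, 0],
                    "Education": [0, 0, 0]})
df2 = pd.DataFrame({"Name": ["Y", "Z"],
                    "Nonprofit": [1.0, np.nan],  # NaN: no new value for 'Z'
                    "Education": [1.0, 1.0]})

target = df1.set_index("Name")
returned = target.update(df2.set_index("Name"))  # modifies target in place
print(returned)  # None

# Depending on the pandas version, updated columns may come back as floats;
# cast them back to the dtypes df1 started with.
target = target.astype(df1.set_index("Name").dtypes.to_dict())
result = target.reset_index()
print(result)
```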
    