Pandas: how to merge two dataframes on a column by keeping the information of the first one?

没有蜡笔的小新 2020-11-21 06:46

I have two dataframes df1 and df2. df1 contains the information of the age of people, while df2 contains the information

2 answers
  • 2020-11-21 07:01

    Sample:

    import pandas as pd
    
    df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                        'Age': [34, 18, 44, 27, 30]})
    
    #print (df1)
    df3 = df1.copy()
    
    df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                        'Sex': ['M', 'M', 'F', 'M', 'F']})
    #print (df2)
    

    Use map with a Series created by set_index:

    df1['Sex'] = df1['Name'].map(df2.set_index('Name')['Sex'])
    print (df1)
        Name  Age  Sex
    0    Tom   34    M
    1   Sara   18  NaN
    2    Eva   44    F
    3   Jack   27    M
    4  Laura   30  NaN
    

    An alternative solution is merge with a left join:

    df = df3.merge(df2[['Name','Sex']], on='Name', how='left')
    print (df)
        Name  Age  Sex
    0    Tom   34    M
    1   Sara   18  NaN
    2    Eva   44    F
    3   Jack   27    M
    4  Laura   30  NaN
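
    To see which rows of df1 found no partner in df2, merge accepts indicator=True. The following is only a sketch on the sample data above; the names checked and unmatched are illustrative and not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F']})

# indicator=True adds a _merge column; with a left join its values
# are 'both' (match found) or 'left_only' (no match in df2)
checked = df1.merge(df2[['Name', 'Sex']], on='Name', how='left', indicator=True)

# names from df1 that have no counterpart in df2 (hypothetical helper variable)
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Name'].tolist()
print(unmatched)
```

    The _merge column can be dropped again once the check is done.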
    

    If you need to match on multiple columns (e.g. Year and Code), use merge with a left join:

    df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                        'Year':[2000,2003,2003,2004,2007],
                        'Code':[1,2,3,4,4],
                        'Age': [34, 18, 44, 27, 30]})
    
    print (df1)
        Name  Year  Code  Age
    0    Tom  2000     1   34
    1   Sara  2003     2   18
    2    Eva  2003     3   44
    3   Jack  2004     4   27
    4  Laura  2007     4   30
    
    df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'], 
                        'Sex': ['M', 'M', 'F', 'M', 'F'],
                        'Year':[2001,2003,2003,2004,2007],
                        'Code':[1,2,3,5,3],
                        'Val':[21,34,23,44,67]})
    print (df2)
           Name Sex  Year  Code  Val
    0       Tom   M  2001     1   21
    1      Paul   M  2003     2   34
    2       Eva   F  2003     3   23
    3      Jack   M  2004     5   44
    4  Michelle   F  2007     3   67
    
    #merge with all columns of df2
    df = df1.merge(df2, on=['Year','Code'], how='left')
    print (df)
      Name_x  Year  Code  Age Name_y  Sex   Val
    0    Tom  2000     1   34    NaN  NaN   NaN
    1   Sara  2003     2   18   Paul    M  34.0
    2    Eva  2003     3   44    Eva    F  23.0
    3   Jack  2004     4   27    NaN  NaN   NaN
    4  Laura  2007     4   30    NaN  NaN   NaN
    
    #selected columns only - the join columns (Year, Code) are always needed, plus the columns to append (Val)
    df = df1.merge(df2[['Year','Code', 'Val']], on=['Year','Code'], how='left')
    print (df)
        Name  Year  Code  Age   Val
    0    Tom  2000     1   34   NaN
    1   Sara  2003     2   18  34.0
    2    Eva  2003     3   44  23.0
    3   Jack  2004     4   27   NaN
    4  Laura  2007     4   30   NaN
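
    When both frames share a non-key column (Name above), merge renames them to Name_x and Name_y. One possible way to keep the left column name unchanged is the suffixes parameter. This is a sketch on the same sample data; the suffix '_df2' is an arbitrary choice.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Year': [2000, 2003, 2003, 2004, 2007],
                    'Code': [1, 2, 3, 4, 4],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F'],
                    'Year': [2001, 2003, 2003, 2004, 2007],
                    'Code': [1, 2, 3, 5, 3],
                    'Val': [21, 34, 23, 44, 67]})

# an empty left suffix keeps df1's Name as is; df2's Name becomes Name_df2
df = df1.merge(df2, on=['Year', 'Code'], how='left', suffixes=('', '_df2'))
print(df.columns.tolist())
```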
    

    If map raises an error, it means the join column (here Name) contains duplicates in df2:

    df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'], 
                        'Age': [34, 18, 44, 27, 30]})
    
    print (df1)
        Name  Age
    0    Tom   34
    1   Sara   18
    2    Eva   44
    3   Jack   27
    4  Laura   30
    
    df3, df4 = df1.copy(), df1.copy()
    
    df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'], 
                        'Val': [1,2,3,4,5]})
    print (df2)
           Name  Val
    0       Tom    1 <-duplicated name Tom
    1       Tom    2 <-duplicated name Tom
    2       Eva    3
    3      Jack    4
    4  Michelle    5
    
    s = df2.set_index('Name')['Val']
    df1['New'] = df1['Name'].map(s)
    print (df1)
    

    InvalidIndexError: Reindexing only valid with uniquely valued Index objects
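
    Before mapping, the lookup column can be checked for duplicates. A minimal sketch, assuming the df2 with the duplicated Tom from above; has_dupes and dupes are illustrative names.

```python
import pandas as pd

df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'],
                    'Val': [1, 2, 3, 4, 5]})

# is_unique is False as soon as any value in the column repeats
has_dupes = not df2['Name'].is_unique
print(has_dupes)

# keep=False marks every occurrence of a repeated name, not only the later ones
dupes = df2[df2['Name'].duplicated(keep=False)]
print(dupes)
```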

    Solutions: remove the duplicates with DataFrame.drop_duplicates, or map with a dict, which keeps the last duplicate match:

    #by default the first value is kept
    s = df2.drop_duplicates('Name').set_index('Name')['Val']
    print (s)
    Name
    Tom         1
    Eva         3
    Jack        4
    Michelle    5
    Name: Val, dtype: int64
    
    df1['New'] = df1['Name'].map(s)
    print (df1)
        Name  Age  New
    0    Tom   34  1.0
    1   Sara   18  NaN
    2    Eva   44  3.0
    3   Jack   27  4.0
    4  Laura   30  NaN
    
    #add keep='last' to keep the last value instead
    s = df2.drop_duplicates('Name', keep='last').set_index('Name')['Val']
    print (s)
    Name
    Tom         2
    Eva         3
    Jack        4
    Michelle    5
    Name: Val, dtype: int64
    
    df3['New'] = df3['Name'].map(s)
    print (df3)
        Name  Age  New
    0    Tom   34  2.0
    1   Sara   18  NaN
    2    Eva   44  3.0
    3   Jack   27  4.0
    4  Laura   30  NaN
    
    #map by dictionary (for duplicate names the last value wins)
    d = dict(zip(df2['Name'], df2['Val']))
    print (d)
    {'Tom': 2, 'Eva': 3, 'Jack': 4, 'Michelle': 5}
    
    df4['New'] = df4['Name'].map(d)
    print (df4)
        Name  Age  New
    0    Tom   34  2.0
    1   Sara   18  NaN
    2    Eva   44  3.0
    3   Jack   27  4.0
    4  Laura   30  NaN
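
    Note that merge does not raise on such duplicates; it repeats the matching left rows instead. The validate parameter is one way to make this fail loudly. This is a sketch on the sample data, not part of the original answer; n_rows and raised are illustrative names.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Tom', 'Eva', 'Jack', 'Michelle'],
                    'Val': [1, 2, 3, 4, 5]})

# Tom matches two rows in df2, so the result has one row more than df1
merged = df1.merge(df2, on='Name', how='left')
n_rows = len(merged)
print(n_rows)

# validate='many_to_one' requires the keys of the right frame to be unique
try:
    df1.merge(df2, on='Name', how='left', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print(raised)
```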
    
  • 2020-11-21 07:18

    You can also use the join method:

    df1.set_index("Name").join(df2.set_index("Name"), how="left")
    

    edit: added set_index("Name")
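
    A runnable version of the join above might look like the sketch below, using the sample frames from the first answer. The trailing reset_index() is an addition that turns Name back into an ordinary column; the variable name joined is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Sara', 'Eva', 'Jack', 'Laura'],
                    'Age': [34, 18, 44, 27, 30]})
df2 = pd.DataFrame({'Name': ['Tom', 'Paul', 'Eva', 'Jack', 'Michelle'],
                    'Sex': ['M', 'M', 'F', 'M', 'F']})

# join aligns on the index, so Name is moved into the index of both frames first
joined = (df1.set_index('Name')
             .join(df2.set_index('Name'), how='left')
             .reset_index())
print(joined)
```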
