Pandas populate new dataframe column based on matching columns in another dataframe

花落未央 2020-12-25 14:15

I have a df that contains my main data: one million rows and 30 columns. Now I want to add another column

5 Answers
  • 2020-12-25 14:43

    Consider the following dataframes df and df2

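    import pandas as pd

    # main data: one row per (AUTHOR_NAME, title) pair
    df = pd.DataFrame(dict(
            AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
            title=list('zyxwvutsrqponml')
        ))

    # lookup data: CATEGORY for some (AUTHOR_NAME, title) pairs
    df2 = pd.DataFrame(dict(
            AUTHOR_NAME=list('AABCCEGG'),
            title=list('zwvtrpml'),
            CATEGORY=list('11223344')
        ))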
    

    option 1
    merge

    df.merge(df2, how='left')
    

    option 2
    join

    cols = ['AUTHOR_NAME', 'title']
    df.join(df2.set_index(cols), on=cols)
    

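    Both options yield the same result: every row of df is kept, and CATEGORY is filled in where both AUTHOR_NAME and title match a row of df2, and is NaN elsewhere.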
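A minimal, self-contained sketch that runs both options on the sample frames and prints the result. The names `merged` and `joined` are introduced here for illustration only.

```python
import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml'),
))
df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344'),
))

# option 1: with no `on` argument, merge uses the columns common to both frames
merged = df.merge(df2, how='left')

# option 2: join df against df2 indexed by the same key columns
cols = ['AUTHOR_NAME', 'title']
joined = df.join(df2.set_index(cols), on=cols)

print(merged)
print(joined)
```

Rows of df whose (AUTHOR_NAME, title) pair does not appear in df2 get NaN in CATEGORY. The row count stays at 15 because each pair appears at most once in df2.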

  • 2020-12-25 14:44

    The other answers here give good solutions to the question as asked. I also found a resource that answers it and walks through clear, straightforward examples of joining and merging dataframes, covering LEFT, RIGHT, INNER and OUTER joins:

    Join And Merge Pandas Dataframe

    Anyone looking into this topic will probably want to go through its examples as well.
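As a rough illustration of those four join types (not taken from the linked resource), the sketch below applies each `how` value to the sample frames from the first answer and collects the resulting row counts. The name `row_counts` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml'),
))
df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344'),
))

cols = ['AUTHOR_NAME', 'title']

# number of rows produced by each join type on the same key columns
row_counts = {
    how: len(df.merge(df2, on=cols, how=how))
    for how in ['left', 'right', 'inner', 'outer']
}
print(row_counts)
```

On these frames, left keeps all 15 rows of df, right keeps the 8 rows of df2, inner keeps only the 7 pairs present in both, and outer keeps 16 rows (the 15 from df plus the one df2 pair that has no match). For adding a column to the main data without losing rows, left is the one that fits.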

  • 2020-12-25 14:52

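    Try combine_first. It aligns the two frames on their index labels rather than on column values, so it matches rows correctly only if both frames are indexed by the matching columns: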

    df = df.combine_first(df2)
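A small sketch of how combine_first could be used for key-based matching, assuming the matching columns are AUTHOR_NAME and title. The frames `main` and `lookup`, the `pages` column, and the `keys` list are hypothetical and exist only for this example.

```python
import pandas as pd

# hypothetical miniature frames; 'pages' stands in for the other columns of the main data
main = pd.DataFrame({
    'AUTHOR_NAME': ['A', 'B', 'C'],
    'title': ['z', 'w', 'v'],
    'pages': [10, 20, 30],
})
lookup = pd.DataFrame({
    'AUTHOR_NAME': ['A', 'C'],
    'title': ['z', 'v'],
    'CATEGORY': ['1', '2'],
})

keys = ['AUTHOR_NAME', 'title']

# index both frames by the matching columns so that combine_first aligns on them
combined = (
    main.set_index(keys)
        .combine_first(lookup.set_index(keys))
        .reset_index()
)
print(combined)
```

Note that combine_first works on the union of both indexes, so any key present only in the lookup frame would show up as an extra row, and the original row order is not guaranteed to survive. A left merge avoids both issues.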
    
  • 2020-12-25 14:57

    APPROACH 1:

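    You could use concat instead and drop the rows that are duplicated on the combination of the Index and AUTHOR_NAME columns. Because df2 is listed first, drop_duplicates keeps its rows (the ones carrying CATEGORY) over the matching rows from df. After that, use isin to keep only the rows whose index appears in df: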

    df_concat = pd.concat([df2, df]).reset_index().drop_duplicates(['Index', 'AUTHOR_NAME'])
    df_concat.set_index('Index', inplace=True)
    df_concat[df_concat.index.isin(df.index)]
    

    Note: Index is assumed to be the name of the index of both DataFrames.
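A minimal sketch of approach 1 on made-up data, assuming both frames have an index named Index as stated in the note. The sample values and the name `result` are hypothetical.

```python
import pandas as pd

# hypothetical frames whose index is named 'Index'
df = pd.DataFrame(
    {'AUTHOR_NAME': ['A', 'B', 'C']},
    index=pd.Index([1, 2, 3], name='Index'),
)
df2 = pd.DataFrame(
    {'AUTHOR_NAME': ['A', 'C'], 'CATEGORY': ['x', 'y']},
    index=pd.Index([1, 3], name='Index'),
)

# df2 goes first so its rows (with CATEGORY) win when duplicates are dropped
df_concat = pd.concat([df2, df]).reset_index().drop_duplicates(['Index', 'AUTHOR_NAME'])
df_concat = df_concat.set_index('Index')

# keep only rows that belong to df, then restore index order
result = df_concat[df_concat.index.isin(df.index)].sort_index()
print(result)
```

The final sort_index is an addition for readability: without it, the rows coming from df2 appear first, followed by the unmatched rows from df.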


    APPROACH 2:

    Use join after setting the same index columns on both frames, as shown:

    df2.set_index(['Index', 'AUTHOR_NAME'], inplace=True)
    df.set_index(['Index', 'AUTHOR_NAME'], inplace=True)
    
    df.join(df2).reset_index()
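A minimal sketch of approach 2 on made-up data, assuming Index and AUTHOR_NAME are ordinary columns in both frames. The sample values and the names `keys` and `result` are hypothetical, and set_index is used without inplace so the original frames stay untouched.

```python
import pandas as pd

# hypothetical frames sharing the key columns 'Index' and 'AUTHOR_NAME'
df = pd.DataFrame({
    'Index': [1, 2, 3],
    'AUTHOR_NAME': ['A', 'B', 'C'],
    'title': ['z', 'w', 'v'],
})
df2 = pd.DataFrame({
    'Index': [1, 3],
    'AUTHOR_NAME': ['A', 'C'],
    'CATEGORY': ['x', 'y'],
})

keys = ['Index', 'AUTHOR_NAME']

# join is a left join by default, so every row of df is kept
result = df.set_index(keys).join(df2.set_index(keys)).reset_index()
print(result)
```

If the two frames share a non-key column name (for example, if df2 also had a title column), join raises an error unless lsuffix or rsuffix is given, so it may be simplest to select only the key columns and CATEGORY from df2 first.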
    

  • 2020-12-25 15:09

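    You may try the following. It merges the two datasets using the specified columns as the key.

    # the key must exist in both frames; CATEGORY exists only in df2,
    # so merge on the shared columns instead
    expected_result = pd.merge(df, df2, on=['AUTHOR_NAME', 'title'], how='left')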
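On a million-row frame it can be worth guarding the merge against duplicate keys in df2, which would silently multiply rows in the result. The sketch below reuses the sample frames from the first answer and assumes AUTHOR_NAME and title are the matching columns. The `validate` and `indicator` arguments are optional extras, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml'),
))
df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344'),
))

result = pd.merge(
    df, df2,
    on=['AUTHOR_NAME', 'title'],
    how='left',
    validate='many_to_one',  # raises if a key pair occurs more than once in df2
    indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
)

# how many rows of df found a CATEGORY
matched = (result['_merge'] == 'both').sum()
print(len(result), matched)
```

Once the check has passed, the helper column can be removed with result.drop(columns='_merge').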
    