Pandas/Python: How to concatenate two dataframes without duplicates?

执念已碎 2020-11-28 23:33

I'd like to concatenate two dataframes, A and B, into a new one without duplicate rows (if a row in B already exists in A, don't add it):

Dataframe A: Dataframe B:



        
3 Answers
  • 2020-11-29 00:05

    I'm surprised that pandas doesn't offer a native solution for this task. I don't think it's efficient to concatenate everything and then drop the duplicates when you work with large datasets (as Rian G suggested).

    It is probably most efficient to use sets to find the index labels in b that are missing from a, and then use a list comprehension to turn those labels into a boolean row mask, which you can pass to iloc. The function below does this. If you don't choose a specific column (col, given as a column position) to check for duplicates, the index is used, as you requested. If you do choose a column, be aware that duplicate entries already present in a will remain in the result.

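    import pandas as pd

    def append_non_duplicates(a, b, col=None):
        """Append to a the rows of b whose key (index label, or value at column position col) is not in a."""
        for df in (a, b):
            if df is not None and not isinstance(df, pd.DataFrame):
                raise ValueError('a and b must be of type pandas.core.frame.DataFrame.')
        if a is None:
            return b
        if b is None:
            return a
        if col is not None:
            # Compare the values at column position col.
            aind = a.iloc[:, col].values
            bind = b.iloc[:, col].values
        else:
            # Compare index labels.
            aind = a.index.values
            bind = b.index.values
        # Keys present in b but not in a, kept as a set for fast membership tests.
        new_keys = set(bind) - set(aind)
        # Boolean mask over the rows of b, in b's original order.
        take_rows = [key in new_keys for key in bind]
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
        return pd.concat([a, b.iloc[take_rows, :]])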
    
    # Usage
    a = pd.DataFrame([[1,2,3],[1,5,6],[1,12,13]], index=[1000,2000,5000])
    b = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9]], index=[1000,2000,3000])
    
    append_non_duplicates(a,b)
    #        0   1   2
    # 1000   1   2   3    <- from a
    # 2000   1   5   6    <- from a
    # 5000   1  12  13    <- from a
    # 3000   7   8   9    <- from b
    
    append_non_duplicates(a,b,0)
    #       0   1   2
    # 1000  1   2   3    <- from a
    # 2000  1   5   6    <- from a
    # 5000  1  12  13    <- from a
    # 2000  4   5   6    <- from b
    # 3000  7   8   9    <- from b
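
    The same index-based and column-based filtering can probably be written without the Python-level loop. The following is only a sketch, not part of the function above: it assumes that `Index.isin` and `Series.isin` are acceptable substitutes for the set difference, and the variable names `is_new`, `by_index` and `by_column` are made up for illustration.

```python
import pandas as pd

a = pd.DataFrame([[1, 2, 3], [1, 5, 6], [1, 12, 13]], index=[1000, 2000, 5000])
b = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=[1000, 2000, 3000])

# Index-based: True for rows of b whose index label does not occur in a.
is_new = ~b.index.isin(a.index)
by_index = pd.concat([a, b[is_new]])

# Column-based: compare the values in column 0 instead of the index labels.
is_new_col = ~b[0].isin(a[0])
by_column = pd.concat([a, b[is_new_col]])

print(by_index)
print(by_column)
```

    With the usage data above this should select the same rows from b as append_non_duplicates(a,b) and append_non_duplicates(a,b,0).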
    
  • 2020-11-29 00:06

    The simplest way is to just do the concatenation, and then drop duplicates.

    >>> df1
       A  B
    0  1  2
    1  3  1
    >>> df2
       A  B
    0  5  6
    1  3  1
    >>> pandas.concat([df1,df2]).drop_duplicates().reset_index(drop=True)
       A  B
    0  1  2
    1  3  1
    2  5  6
    

    The reset_index(drop=True) call renumbers the index after concat() and drop_duplicates(). Without it you will have an index of [0,1,0] instead of [0,1,2], and the repeated label could cause problems for further operations on this dataframe if it isn't reset right away.
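
    To make the index issue concrete, here is a small sketch that rebuilds df1 and df2 from the output above and compares the index with and without renumbering. It assumes pandas 1.0 or later, where drop_duplicates accepts ignore_index=True as a shorter alternative to the trailing reset_index(drop=True); the names `deduped` and `renumbered` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({"A": [1, 3], "B": [2, 1]})
df2 = pd.DataFrame({"A": [5, 3], "B": [6, 1]})

# Without renumbering, the original row labels of both frames survive.
deduped = pd.concat([df1, df2]).drop_duplicates()
print(deduped.index.tolist())     # [0, 1, 0]

# Assumption: pandas >= 1.0, where drop_duplicates can renumber by itself.
renumbered = pd.concat([df1, df2]).drop_duplicates(ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```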

  • 2020-11-29 00:29

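    If DataFrame A already contains duplicate rows, concatenating and then dropping duplicates will also remove rows from A that you might want to keep.

    In that case, add a helper column holding a cumulative count of each repeated row to both frames, drop duplicates, and then remove the helper column. The count makes each repeated copy of a row distinct, so a row from B is dropped only if A already holds at least as many copies of it. Whether this is what you want depends on your use case, but the situation is common in time-series data.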

    Here is an example:

    import pandas as pd

    df_1 = pd.DataFrame([
        {'date': '11/20/2015', 'id': 4, 'value': 24},
        {'date': '11/20/2015', 'id': 4, 'value': 24},
        {'date': '11/20/2015', 'id': 6, 'value': 34},
    ])

    df_2 = pd.DataFrame([
        {'date': '11/20/2015', 'id': 4, 'value': 24},
        {'date': '11/20/2015', 'id': 6, 'value': 14},
    ])

    # Number the repeated copies of each row: 0 for the first copy, 1 for the second, ...
    df_1['count'] = df_1.groupby(['date', 'id', 'value']).cumcount()
    df_2['count'] = df_2.groupby(['date', 'id', 'value']).cumcount()

    df_tot = pd.concat([df_1, df_2], ignore_index=False)
    df_tot = df_tot.drop_duplicates()
    # Remove the helper column again.
    df_tot = df_tot.drop(['count'], axis=1)
    >>> df_tot
             date  id  value
    0  11/20/2015   4     24
    1  11/20/2015   4     24
    2  11/20/2015   6     34
    1  11/20/2015   6     14
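
    If the goal is simply "keep everything in A and add only the rows of B that A does not contain at all", a left anti-join may be an alternative to the counting column. This is a sketch under assumptions rather than the method of this answer: the helper name `concat_new_rows` is hypothetical, both frames are assumed to have the same columns, and unlike the cumulative count it never adds a row from B that matches any row in A, no matter how many copies B holds.

```python
import pandas as pd

def concat_new_rows(a, b):
    """Hypothetical helper: return a plus the rows of b that do not occur in a."""
    # Left-join b against the distinct rows of a on all shared columns;
    # indicator=True adds a "_merge" column saying where each row was found.
    marked = b.merge(a.drop_duplicates(), how="left", indicator=True)
    # "left_only" marks rows of b with no match in a.
    new_rows = marked.loc[marked["_merge"] == "left_only", b.columns]
    return pd.concat([a, new_rows], ignore_index=True)

df_1 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 34},
])
df_2 = pd.DataFrame([
    {'date': '11/20/2015', 'id': 4, 'value': 24},
    {'date': '11/20/2015', 'id': 6, 'value': 14},
])

result = concat_new_rows(df_1, df_2)
print(result)
```

    For these two frames the sketch should keep both copies of the duplicated row in df_1 and add only the (6, 14) row from df_2.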
    