pandas get rows which are NOT in other dataframe

春和景丽 2020-11-22 02:17

I have two pandas DataFrames which have some rows in common.

Suppose dataframe2 is a subset of dataframe1.

How can I get the rows of dataframe1 which are not in dataframe2?

13 Answers
  • 2020-11-22 02:48

    Suppose you have two dataframes, df_1 and df_2, each with multiple columns, and you want to find only those rows of df_1 that are not in df_2, comparing on some of the columns (e.g. field_x and field_y). Follow these steps.

    Step 1. Add a column key1 to df_1 and a column key2 to df_2.

    Step 2. Left-merge the dataframes as shown below. field_x and field_y are the columns to compare on.

    Step 3. Select only those rows of the merged frame where key1 is not equal to key2 (key2 is NaN for rows that have no match in df_2).

    Step 4. Drop key1 and key2.

    This method will solve your problem and works fast even with big data sets. I have tried it on dataframes with more than 1,000,000 rows.

    import pandas as pd

    df_1['key1'] = 1
    df_2['key2'] = 1
    # Left merge: rows of df_1 with no match in df_2 get NaN in key2
    df_1 = pd.merge(df_1, df_2, on=['field_x', 'field_y'], how='left')
    df_1 = df_1[~(df_1.key2 == df_1.key1)]
    df_1 = df_1.drop(['key1', 'key2'], axis=1)
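
The four steps above can be tried end to end on a small example. This is only a sketch: the sample data and the helper name `rows_not_in` are made up for illustration. It works on copies so the input frames are not modified, and it uses a single marker column (here called `_key`) on the df_2 side, since that is the only one the comparison actually depends on.

```python
import pandas as pd

def rows_not_in(df_1, df_2, fields):
    """Return rows of df_1 whose `fields` values do not occur in df_2 (hypothetical helper)."""
    # Mark every distinct key combination that is present in df_2.
    marker = df_2[fields].drop_duplicates().assign(_key=1)
    # Left merge keeps all rows of df_1; unmatched rows get NaN in _key.
    merged = df_1.merge(marker, on=fields, how="left")
    # Keep the unmatched rows and drop the helper column.
    return merged[merged["_key"].isna()].drop(columns="_key")

# Made-up sample data.
df_1 = pd.DataFrame({"field_x": [1, 2, 3, 4],
                     "field_y": ["a", "b", "c", "d"],
                     "value": [10, 20, 30, 40]})
df_2 = pd.DataFrame({"field_x": [2, 4], "field_y": ["b", "d"]})

result = rows_not_in(df_1, df_2, ["field_x", "field_y"])
print(result)
```

Dropping duplicates on the df_2 side before merging keeps the left merge from repeating rows of df_1 when df_2 contains the same key combination more than once.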
    
  • 2020-11-22 02:48

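    Here is another way of solving this. merge() returns a frame with a new default index, so carry df1's index through as a column with reset_index() first (this assumes df1's index is unnamed, so the column is called 'index'):

    df1[~df1.index.isin(df1.reset_index().merge(df2, how='inner', on=['col1', 'col2'])['index'])]
    

    Or:

    df1.loc[df1.index.difference(df1.reset_index().merge(df2, how='inner', on=['col1', 'col2'])['index'])]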
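
A small made-up example shows why the index needs care in this approach: the frame returned by merge() is renumbered from 0, so its index labels do not refer to rows of df1 unless the original index is carried through as a column. The sketch below assumes df1 has an unnamed index, in which case reset_index() names the new column 'index'.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({"col1": [3, 4], "col2": [12, 13]})

# The merge result is renumbered from 0, whatever df1's labels were.
plain = df1.merge(df2, how="inner", on=["col1", "col2"])
print(plain.index.tolist())

# reset_index() turns df1's (unnamed) index into a column called 'index',
# which survives the merge and still refers to rows of df1.
matched = df1.reset_index().merge(df2, how="inner", on=["col1", "col2"])["index"]
result = df1[~df1.index.isin(matched)]
print(result)
```

Here the matching rows sit at labels 2 and 3 of df1, while the plain merge result is labelled 0 and 1, so filtering on the plain merge index would drop the wrong rows.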
    
  • 2020-11-22 02:48

    My way of doing this involves adding a new column that exists in only one dataframe and using it to decide whether to keep a row:

    df2['Empt'] = 1
    nonuni = pd.merge(df1, df2, on=['field_x', 'field_y'], how='outer')
    nonuni['Empt'] = nonuni['Empt'].fillna(0)
    

    This gives every row of the merged frame a code: 0 if it is found only in df1, 1 if it is found in df2. You then use this to restrict to what you want:

    answer = nonuni[nonuni['Empt'] == 0]
    
  • 2020-11-22 02:50

    A bit late, but it might be worth checking the "indicator" parameter of pd.merge.

    See this other question for an example: Compare PandaS DataFrames and return rows that are missing from the first one
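
For reference, a minimal sketch of what the indicator parameter does, using made-up data and assuming the comparison is on all columns of df2. With indicator=True, merge() adds a `_merge` column whose value for each row is 'left_only', 'right_only' or 'both'.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3, 4, 5], "col2": [10, 11, 12, 13, 14]})
df2 = pd.DataFrame({"col1": [1, 2, 3], "col2": [10, 11, 12]})

# A left merge keeps every row of df1; '_merge' records where each row was found.
merged = df1.merge(df2.drop_duplicates(), on=["col1", "col2"],
                   how="left", indicator=True)

# Rows present in df1 but not in df2, with the helper column removed.
only_in_df1 = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_df1)
```

This avoids adding marker columns by hand, and the input frames are left untouched.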

  • 2020-11-22 02:52

    How about this:

    import numpy as np
    import pandas

    df1 = pandas.DataFrame(data = {'col1' : [1, 2, 3, 4, 5], 
                                   'col2' : [10, 11, 12, 13, 14]}) 
    df2 = pandas.DataFrame(data = {'col1' : [1, 2, 3], 
                                   'col2' : [10, 11, 12]})
    # Set of df2's rows as tuples, for fast membership tests
    records_df2 = set([tuple(row) for row in df2.values])
    in_df2_mask = np.array([tuple(row) in records_df2 for row in df1.values])
    result = df1[~in_df2_mask]
    
  • 2020-11-22 02:52

    Extract the rows that differ using the merge function:

    df = df.merge(same.drop_duplicates(), on=['col1','col2'], 
                   how='left', indicator=True)
    

    Save those rows to a CSV file:

    df[df['_merge'] == 'left_only'].to_csv('output.csv')
    