Detecting almost duplicate rows

花落未央 2021-01-14 14:08

Let's say I have a table that has dates and a value for each date (plus other columns). I can find the rows that have the same value on the same day by using



        
2 answers
  • 2021-01-14 15:01

    Use NumPy and upper-triangle indexing to enumerate every pair of rows:

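    import numpy as np

    day = df.DAY.values
    val = df.VALUE.values

    # All index pairs (i, j) with i < j, i.e. every unordered pair of rows
    i, j = np.triu_indices(len(df), k=1)
    c1 = np.abs(day[i] - day[j]) < 2    # days less than 2 apart
    c2 = np.abs(val[i] - val[j]) < 10   # values less than 10 apart

    c = c1 & c2
    # Rows that take part in at least one matching pair
    df.iloc[np.unique(np.append(i[c], j[c]))]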
    
        DAY  MTH   YYY  VALUE    NAME
    1    22    9  2016   43.0    John
    6    24    8  2016   10.0    Mike
    7    24    9  2016   10.0    Mike
    8    24   10  2016   10.0    Mike
    9    24   11  2016   10.0    Mike
    10   13    9  2016  170.0  Kathie
    11   13   10  2016  170.0  Kathie
    13    8    9  2016   16.0    Gina
    14    9   10  2016   16.0    Gina
    15    8   11  2016   16.0    Gina
    17   21   11  2016   45.0    Ross
    18   23    9  2016   50.0   Shari
    19   23   10  2016   50.0   Shari
    20   23   11  2016   50.0   Shari
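
For anyone who wants to try the pairwise approach without the original table, here is a minimal self-contained sketch. The function name `near_duplicates`, the parameters `day_tol` and `value_tol`, and the sample rows are all made up for illustration; only the `DAY` and `VALUE` column names and the strict `<` comparisons come from the answer above.

```python
import numpy as np
import pandas as pd


def near_duplicates(df, day_tol=2, value_tol=10):
    """Return rows that have at least one other row within both tolerances.

    Hypothetical helper: the name and the tolerance parameters are
    assumptions, not part of the original answer.
    """
    day = df["DAY"].to_numpy()
    val = df["VALUE"].to_numpy()

    # k=1 skips the diagonal, so i < j and each unordered pair appears once
    i, j = np.triu_indices(len(df), k=1)
    close = (np.abs(day[i] - day[j]) < day_tol) & (
        np.abs(val[i] - val[j]) < value_tol
    )

    # Collect the positions of both members of every matching pair
    keep = np.unique(np.concatenate([i[close], j[close]]))
    return df.iloc[keep]


# Made-up sample data, for illustration only
sample = pd.DataFrame(
    {
        "DAY": [8, 9, 8, 2, 16],
        "VALUE": [16.0, 16.0, 16.0, 50.0, 25.0],
    }
)
result = near_duplicates(sample)
print(result)
```

Note that this builds n(n-1)/2 index pairs, so memory use grows quadratically with the number of rows; it is probably only practical for tables of modest size.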
    
  • 2021-01-14 15:07

    Brute forcing this:

        import pandas as pd

        df_data = df_data.sort_values(['DAY', 'VALUE'])
        df_data['Dup'] = False

        # Compare each row only with the row just before it in sorted order
        prev_row = None
        prev_idx = None
        for idx, row in df_data.iterrows():
            if prev_row is not None:
                if (abs(row['DAY'] - prev_row['DAY']) <= 2 and
                        abs(row['VALUE'] - prev_row['VALUE']) <= 10):
                    # .loc avoids chained assignment, which may not write back
                    df_data.loc[idx, 'Dup'] = True
                    df_data.loc[prev_idx, 'Dup'] = True
            prev_row, prev_idx = row, idx

        print(df_data)
    

    gives:

        DAY  MTH   YYY   VALUE    Dup
    3     2   10  2016   50.00  False
    2     6   11  2016   28.25  False
    13    8    9  2016   16.00   True
    15    8   11  2016   16.00   True
    14    9   10  2016   16.00   True
    12   13   11  2016  160.00   True
    10   13    9  2016  170.00   True
    11   13   10  2016  170.00   True
    16   16   11  2016   25.00  False
    17   21   11  2016   45.00  False
    0    22    9  2016    8.25  False
    1    22    9  2016   43.00  False
    5    23   10  2016   30.00  False
    18   23    9  2016   50.00   True
    19   23   10  2016   50.00   True
    20   23   11  2016   50.00   True
    4    23   11  2016   90.00  False
    6    24    8  2016   10.00   True
    7    24    9  2016   10.00   True
    8    24   10  2016   10.00   True
    9    24   11  2016   10.00   True
    

    Is this the desired outcome?
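
The same adjacent-row comparison can probably be done without an explicit loop. The sketch below swaps in `Series.diff` and `Series.shift` in place of `iterrows`; it is not the answerer's code. The function name `flag_adjacent_duplicates`, the tolerance parameters, and the sample rows are assumptions made for illustration; the `<=` thresholds follow the brute-force answer.

```python
import pandas as pd


def flag_adjacent_duplicates(df, day_tol=2, value_tol=10):
    """Flag rows that are close to their neighbour in (DAY, VALUE) sort order.

    Hypothetical helper: the name and parameters are assumptions.
    """
    out = df.sort_values(["DAY", "VALUE"]).copy()

    # Gap to the previous row in sorted order; the first row gets NaN,
    # and NaN <= tol evaluates to False
    day_gap = out["DAY"].diff().abs()
    value_gap = out["VALUE"].diff().abs()
    close_to_prev = (day_gap <= day_tol) & (value_gap <= value_tol)

    # A row is flagged if it matches its predecessor or its successor
    out["Dup"] = close_to_prev | close_to_prev.shift(-1, fill_value=False)
    return out


# Made-up sample data, for illustration only
sample = pd.DataFrame(
    {
        "DAY": [8, 9, 8, 2, 16],
        "VALUE": [16.0, 16.0, 16.0, 50.0, 25.0],
    }
)
flagged = flag_adjacent_duplicates(sample)
print(flagged)
```

Like the loop, this only compares rows that end up next to each other after sorting, so two close rows separated by an unrelated row (for example, same day but a very different value) can be missed. Comparing every pair of rows avoids that, at a higher cost.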
