Detecting almost duplicate rows

花落未央 2021-01-14 14:08

Let's say I have a table that has dates and a value for each date (plus other columns). I can find the rows that have the same value on the same day by using



        
2 Answers
  •  悲&欢浪女
    2021-01-14 15:07

    Brute forcing this:

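        import pandas as pd

        # df_data is the question's DataFrame with numeric DAY and VALUE columns.
        # Sort so that candidate near-duplicates end up on adjacent rows.
        df_data = df_data.sort_values(['DAY', 'VALUE'])
        df_data['Dup'] = False

        prev_row = None
        prev_idx = None
        for idx, row in df_data.iterrows():
            if prev_row is not None:
                # Flag both rows when DAY differs by at most 2 and VALUE by at most 10.
                if (abs(row['DAY'] - prev_row['DAY']) <= 2 and
                        abs(row['VALUE'] - prev_row['VALUE']) <= 10):
                    df_data.loc[idx, 'Dup'] = True
                    df_data.loc[prev_idx, 'Dup'] = True
            prev_row, prev_idx = row, idx

        print(df_data)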
    

    gives:

        DAY  MTH   YYY   VALUE    Dup
    3     2   10  2016   50.00  False
    2     6   11  2016   28.25  False
    13    8    9  2016   16.00   True
    15    8   11  2016   16.00   True
    14    9   10  2016   16.00   True
    12   13   11  2016  160.00   True
    10   13    9  2016  170.00   True
    11   13   10  2016  170.00   True
    16   16   11  2016   25.00  False
    17   21   11  2016   45.00  False
    0    22    9  2016    8.25  False
    1    22    9  2016   43.00  False
    5    23   10  2016   30.00  False
    18   23    9  2016   50.00   True
    19   23   10  2016   50.00   True
    20   23   11  2016   50.00   True
    4    23   11  2016   90.00  False
    6    24    8  2016   10.00   True
    7    24    9  2016   10.00   True
    8    24   10  2016   10.00   True
    9    24   11  2016   10.00   True
    

    Is this the desired outcome?
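
For larger tables, the same adjacent-row comparison could probably be done without an explicit loop. The following is only a sketch under assumptions: the function name `flag_near_duplicates`, its `day_tol` and `value_tol` parameters, and the sample data are all made up for illustration, and the thresholds default to the 2 and 10 used in the loop above. Like the loop, it only compares each row with its immediate neighbour after sorting, and it treats DAY as a plain number.

```python
import pandas as pd


def flag_near_duplicates(df, day_tol=2, value_tol=10):
    """Hypothetical helper: flag rows whose sorted neighbour is within both tolerances."""
    out = df.sort_values(["DAY", "VALUE"]).copy()
    # True where a row is close to the row just before it in sorted order.
    # The first row has no predecessor, so its NaN difference compares as False.
    close_to_prev = (
        out["DAY"].diff().abs().le(day_tol)
        & out["VALUE"].diff().abs().le(value_tol)
    )
    # A match involves two rows, so also flag the earlier row of each pair.
    out["Dup"] = close_to_prev | close_to_prev.shift(-1, fill_value=False)
    return out


# Made-up sample data, not taken from the question.
sample = pd.DataFrame({
    "DAY": [1, 1, 2, 10, 20, 21],
    "VALUE": [100.0, 105.0, 300.0, 50.0, 7.0, 9.5],
})
result = flag_near_duplicates(sample)
print(result)
```

Whether this is acceptable depends on the same open question as the loop: rows that are close in VALUE but separated by another row after sorting are not compared with each other.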
