Detecting almost duplicate rows


Let's say I have a table that has dates and a value for each date (plus other columns). I can find the rows that have the same value on the same day by using


2 Answers

青春惊慌失措

2021-01-14 15:01

Use NumPy's upper-triangle indices to enumerate every pair of rows, then keep the pairs whose days and values are both close:

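import numpy as np

day = df.DAY.values
val = df.VALUE.values

# every pair of row positions (i, j) with i < j
i, j = np.triu_indices(len(df), k=1)
c1 = np.abs(day[i] - day[j]) < 2    # days less than 2 apart
c2 = np.abs(val[i] - val[j]) < 10   # values less than 10 apart

c = c1 & c2
# rows that take part in at least one matching pair
df.iloc[np.unique(np.append(i[c], j[c]))]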

    DAY  MTH   YYY  VALUE    NAME
1    22    9  2016   43.0    John
6    24    8  2016   10.0    Mike
7    24    9  2016   10.0    Mike
8    24   10  2016   10.0    Mike
9    24   11  2016   10.0    Mike
10   13    9  2016  170.0  Kathie
11   13   10  2016  170.0  Kathie
13    8    9  2016   16.0    Gina
14    9   10  2016   16.0    Gina
15    8   11  2016   16.0    Gina
17   21   11  2016   45.0    Ross
18   23    9  2016   50.0   Shari
19   23   10  2016   50.0   Shari
20   23   11  2016   50.0   Shari
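
To see what the triangle indexing does on a small case, here is a minimal self-contained sketch. The helper name `near_duplicates`, its tolerance parameters, and the sample frame are assumptions made for illustration; the thresholds simply mirror the ones used in this answer.

```python
import numpy as np
import pandas as pd


def near_duplicates(df, day_tol=2, val_tol=10):
    """Return rows that have at least one partner within both tolerances.

    Hypothetical helper: the name and default tolerances are illustrative.
    """
    day = df["DAY"].to_numpy()
    val = df["VALUE"].to_numpy()

    # All position pairs (i, j) with i < j, i.e. the strict upper triangle.
    i, j = np.triu_indices(len(df), k=1)

    close = (np.abs(day[i] - day[j]) < day_tol) & (np.abs(val[i] - val[j]) < val_tol)

    # Collect every row position that takes part in a matching pair.
    hits = np.unique(np.concatenate([i[close], j[close]]))
    return df.iloc[hits]


# Made-up sample data, not the asker's table.
sample = pd.DataFrame(
    {
        "DAY": [8, 9, 8, 22, 2],
        "VALUE": [16.0, 16.0, 16.0, 43.0, 50.0],
        "NAME": ["Gina", "Gina", "Gina", "John", "Ross"],
    }
)

result = near_duplicates(sample)
print(result)
```

Two things to keep in mind with this approach: it builds all n(n-1)/2 pairs at once, so memory grows quadratically with the number of rows, and it compares only `DAY` and `VALUE`, so rows from different months can still pair up.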


悲&欢浪女

2021-01-14 15:07

Brute forcing this:

    df_data = df_data.sort_values(['DAY', 'VALUE'])
    df_data['Dup'] = False

    prev_row = None
    prev_idx = None
    for idx, row in df_data.iterrows():
        if prev_row is not None:
            # compare each row with the previous row in sorted order
            if (abs(row['DAY'] - prev_row['DAY']) <= 2 and
                    abs(row['VALUE'] - prev_row['VALUE']) <= 10):
                # .loc avoids chained assignment, which may not write back
                df_data.loc[idx, 'Dup'] = True
                df_data.loc[prev_idx, 'Dup'] = True
        prev_row, prev_idx = row, idx

    print(df_data)

gives:

    DAY  MTH   YYY   VALUE    Dup
3     2   10  2016   50.00  False
2     6   11  2016   28.25  False
13    8    9  2016   16.00   True
15    8   11  2016   16.00   True
14    9   10  2016   16.00   True
12   13   11  2016  160.00   True
10   13    9  2016  170.00   True
11   13   10  2016  170.00   True
16   16   11  2016   25.00  False
17   21   11  2016   45.00  False
0    22    9  2016    8.25  False
1    22    9  2016   43.00  False
5    23   10  2016   30.00  False
18   23    9  2016   50.00   True
19   23   10  2016   50.00   True
20   23   11  2016   50.00   True
4    23   11  2016   90.00  False
6    24    8  2016   10.00   True
7    24    9  2016   10.00   True
8    24   10  2016   10.00   True
9    24   11  2016   10.00   True

Is this the desired outcome?
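
The same neighbour-to-neighbour comparison can probably be written without an explicit loop. The sketch below is one possible way, assuming the columns are named `DAY` and `VALUE` as above; the function name `flag_adjacent_duplicates` and the sample frame are hypothetical.

```python
import pandas as pd


def flag_adjacent_duplicates(df, day_tol=2, val_tol=10):
    """Flag rows that are close to their neighbour in sorted order.

    Hypothetical helper: the name and default tolerances are illustrative.
    """
    out = df.sort_values(["DAY", "VALUE"]).copy()

    # Difference to the previous row; the first row gives NaN, which
    # compares as False against the tolerance.
    close_to_prev = (out["DAY"].diff().abs() <= day_tol) & (
        out["VALUE"].diff().abs() <= val_tol
    )

    # A row is flagged if it matches its predecessor or its successor.
    out["Dup"] = close_to_prev | close_to_prev.shift(-1, fill_value=False)
    return out


# Made-up sample data, not the asker's table.
sample = pd.DataFrame(
    {
        "DAY": [8, 9, 8, 22, 2],
        "VALUE": [16.0, 16.0, 16.0, 43.0, 50.0],
    }
)

flagged = flag_adjacent_duplicates(sample)
print(flagged)
```

Like the loop, this only looks at rows that end up next to each other after sorting. Two rows that are close in both `DAY` and `VALUE` can be missed when a row with a very different value sorts between them, so an all-pairs comparison is the safer choice if every match must be found.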
