How to find duplicate based upon multiple columns in a rolling window in pandas?


Question


Sample Data

{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 90, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 90, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:02:30.000Z"}}
.
.

I have some code like this

import sys
import json

import pandas as pd

df = pd.DataFrame()
for line in sys.stdin:
    data = json.loads(line)
    # one row per incoming transaction, indexed by its timestamp string
    # df1 = pd.DataFrame(data["transaction"], index=[len(df.index)])
    df1 = pd.DataFrame(data["transaction"], index=[data['transaction']['time']])
    df1['time'] = pd.to_datetime(df1['time'])
    df = df.append(df1)
    # df['count'] = df.rolling('2min', on='time', min_periods=1)['amount'].count()

print(df)
print(len(df[df.merchant.eq(data['transaction']['merchant']) & df.amount.eq(data['transaction']['amount'])].index))

Current output

2019-02-13T10:00:00.000Z  merchantA      20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z  merchantB      90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z  merchantC      90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z  merchantD      90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z  merchantE      90 2019-02-13 11:01:30
2019-02-13T11:02:30.000Z  merchantE      90 2019-02-13 11:02:30

2

Expected output

2019-02-13T10:00:00.000Z  merchantA      20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z  merchantB      90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z  merchantC      90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z  merchantD      90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z  merchantE      90 2019-02-13 11:01:30

As the data is streaming, I want to check whether a duplicate record (one whose merchant and amount values are the same) arrives within two minutes; if so, I want to discard it, do no processing on it, and just print it as a duplicate.

Do I have to do something with index zipping or groupby? But then how do I compare on multiple columns? Or some rolling condition on two columns, but I can't find anything on how to do that.
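
For context, the commented-out rolling count in the code above only works on a single numeric column over the trailing two minutes; a minimal sketch of what it does, using a few made-up rows:

import pandas as pd

df = pd.DataFrame({
    'time': pd.to_datetime(['2019-02-13 11:00:01', '2019-02-13 11:00:10', '2019-02-13 11:02:30']),
    'amount': [90, 90, 90],
})
# count of amounts seen in the trailing 2-minute window ending at each row
df['count'] = df.rolling('2min', on='time', min_periods=1)['amount'].count()
print(df)  # counts 1.0, 2.0, 1.0 -- the merchant string cannot take part in this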

What am I missing here?

Thanks

EDIT

# dup = df[df.duplicated(subset=['merchant', 'amount'], keep=False)]
res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])]
# res['timediff'] = pd.to_timedelta((data['transaction']['time'] - res['time']), unit='T')
res['timediff'] = (data['transaction']['time'] - res['time'])
if len(res.index) > 1:
    print(res)

So I'm trying something like this, and if the result is less than 120 seconds I can process it. But the resulting df is currently in this form:

                      merchant  amount                time       concat          timediff
2019-02-13 11:03:00  merchantF      10 2019-02-13 11:03:00  merchantF10 -1 days +23:59:20
2019-02-13 11:02:20  merchantF      10 2019-02-13 11:02:20  merchantF10          00:00:00

2019-02-13 11:01:30  merchantE      10 2019-02-13 11:01:30  merchantE10 00:01:00
2019-02-13 11:02:00  merchantE      10 2019-02-13 11:02:00  merchantE10 00:00:30
2019-02-13 11:02:30  merchantE      10 2019-02-13 11:02:30  merchantE10 00:00:00

The -1 days +23:59:20 format can, I think, be dealt with by taking the absolute value?

How can I convert the time difference into a format that I can compare with 120 seconds? pd.to_timedelta() didn't work for me, or maybe I'm using it wrong.
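
A minimal sketch of the comparison I was after, with made-up values (taking the absolute value handles the -1 days +23:59:20 representation):

import pandas as pd

td = pd.to_timedelta(pd.Series(['-1 days +23:59:20', '00:00:00', '00:01:00']))
# -40 s, 0 s and 60 s -- take the absolute value, then compare against the 2-minute limit
within_2min = td.dt.total_seconds().abs() <= 120
print(within_2min)  # True, True, True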


Answer 1:


First, you could form rolling 120-second blocks of data. You could then apply:

Block and evaluate using duplicated: df = df[df.duplicated(subset=['val1', 'val2', 'val3'], keep=False)]

Or groupby: df.groupby(['val1', 'val2', 'val3']).count()

Or even a SQL DISTINCT: https://www.w3schools.com/sql/sql_distinct.asp

Please post what you have tried. The above methods work for string, float, datetime and integer data types.
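
A rough sketch of the duplicated/groupby suggestions, assuming the asker's merchant and amount columns instead of val1/val2/val3:

import pandas as pd

df = pd.DataFrame({
    'merchant': ['merchantE', 'merchantE', 'merchantF'],
    'amount': [90, 90, 10],
    'time': pd.to_datetime(['2019-02-13 11:01:30', '2019-02-13 11:02:30', '2019-02-13 11:03:00']),
})

# keep=False marks every row that shares merchant and amount with another row
print(df[df.duplicated(subset=['merchant', 'amount'], keep=False)])

# or count occurrences per (merchant, amount) pair
print(df.groupby(['merchant', 'amount']).size())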




Answer 2:


So I made it work, but not with rolling windows, since they don't support the string type; that feature has been reported and requested on the pandas repo as well.

My solution snippet to the problem:

    # inside the same stdin-reading loop as in the question
    if len(df.index) > 0:
        # rows already kept that share merchant and amount with the incoming record
        res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])]
        # True where such a row lies within 120 seconds of the new one
        res['timediff'] = (data['transaction']['time'] - res['time']).dt.total_seconds().abs() <= 120
        if res.timediff.any():
            continue  # duplicate within two minutes -> discard, no processing
    df = df.append(df1)
print(df)

Sample data:

{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 10, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 10, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:03:00.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:00.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:02:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:05:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:00:30.000Z"}}

Output:

                      merchant  amount                time
2019-02-13 10:00:00  merchantA      20 2019-02-13 10:00:00
2019-02-13 11:00:01  merchantB      90 2019-02-13 11:00:01
2019-02-13 11:00:10  merchantC      10 2019-02-13 11:00:10
2019-02-13 11:00:20  merchantD      10 2019-02-13 11:00:20
2019-02-13 11:01:30  merchantE      10 2019-02-13 11:01:30
2019-02-13 11:03:00  merchantF      10 2019-02-13 11:03:00
2019-02-13 11:05:20  merchantF      10 2019-02-13 11:05:20
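
For completeness, once the whole stream has been collected into a DataFrame, a similar two-minute rule can be written without scanning per record. This is only a rough sketch, not part of the snippet above, and it compares each row only to the immediately preceding row with the same merchant and amount, so it is not exactly equivalent to the drop-as-you-go loop:

df = df.sort_values('time')
# seconds since the previous transaction with the same merchant and amount
gap = df.groupby(['merchant', 'amount'])['time'].diff().dt.total_seconds()
# NaN means "first of its kind"; only gaps of 120 s or less count as duplicates
print(df[~(gap <= 120)])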


Source: https://stackoverflow.com/questions/60285964/how-to-find-duplicate-based-upon-multiple-columns-in-a-rolling-window-in-pandas
