Question
Sample Data
{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 90, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 90, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:02:30.000Z"}}
...
I have some code like this:

import sys
import json
import pandas as pd

df = pd.DataFrame()
for line in sys.stdin:
    data = json.loads(line)
    # df1 = pd.DataFrame(data["transaction"], index=[len(df.index)])
    df1 = pd.DataFrame(data["transaction"], index=[data['transaction']['time']])
    df1['time'] = pd.to_datetime(df1['time'])
    df = df.append(df1)
    # df['count'] = df.rolling('2min', on='time', min_periods=1)['amount'].count()
    print(df)
    print(len(df[df.merchant.eq(data['transaction']['merchant']) & df.amount.eq(data['transaction']['amount'])].index))
Current output
2019-02-13T10:00:00.000Z merchantA 20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z merchantB 90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z merchantC 90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z merchantD 90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z merchantE 90 2019-02-13 11:01:30
2019-02-13T11:02:30.000Z merchantE 90 2019-02-13 11:02:30
2
Expected output
2019-02-13T10:00:00.000Z merchantA 20 2019-02-13 10:00:00
2019-02-13T11:00:01.000Z merchantB 90 2019-02-13 11:00:01
2019-02-13T11:00:10.000Z merchantC 90 2019-02-13 11:00:10
2019-02-13T11:00:20.000Z merchantD 90 2019-02-13 11:00:20
2019-02-13T11:01:30.000Z merchantE 90 2019-02-13 11:01:30
As the data is streaming, I want to check whether a duplicate record (one whose merchant and amount values are the same) arrives within two minutes, so that I can discard it, do no processing on it, and print it as a duplicate.
Do I have to do something with index zipping or groupby? But then how do I compare on multiple columns? Or some rolling condition on two columns, but I can't find anything on how to do that.
What am I missing here?
Thanks
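A minimal sketch of the check described above, assuming the stream is processed row by row and previously accepted rows are kept in a DataFrame. The helper name `is_duplicate` and the `window_seconds` parameter are hypothetical, not part of the original code:

```python
import pandas as pd

# Hypothetical helper: a transaction is a duplicate if a previously
# accepted row has the same merchant and amount and arrived within
# the last `window_seconds` seconds.
def is_duplicate(df, tx, window_seconds=120):
    if df.empty:
        return False
    same = df[(df["merchant"] == tx["merchant"]) & (df["amount"] == tx["amount"])]
    if same.empty:
        return False
    diff = (pd.to_datetime(tx["time"]) - same["time"]).dt.total_seconds().abs()
    return bool((diff <= window_seconds).any())

# One previously accepted row, then the second merchantE record
# from the sample data arriving 60 seconds later.
df = pd.DataFrame({
    "merchant": ["merchantE"],
    "amount": [90],
    "time": pd.to_datetime(["2019-02-13T11:01:30.000Z"]),
})
tx = {"merchant": "merchantE", "amount": 90, "time": "2019-02-13T11:02:30.000Z"}
print(is_duplicate(df, tx))  # True: same merchant/amount within 120 s
```

The same check returns False once the gap exceeds the window, so only in-window duplicates are discarded.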
EDIT
#dup = df[df.duplicated(subset=['merchant', 'amount'], keep=False)]
res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])]
# res['timediff'] = pd.to_timedelta((data['transaction']['time'] - res['time']), unit='T')
res['timediff'] = pd.to_datetime(data['transaction']['time']) - res['time']
if len(res.index) > 1:
    print(res)
So I'm trying something like this, and if the result is less than 120 seconds I can process it. But the resulting DataFrame is currently in the form of
merchant amount time concat timediff
2019-02-13 11:03:00 merchantF 10 2019-02-13 11:03:00 merchantF10 -1 days +23:59:20
2019-02-13 11:02:20 merchantF 10 2019-02-13 11:02:20 merchantF10 00:00:00
2019-02-13 11:01:30 merchantE 10 2019-02-13 11:01:30 merchantE10 00:01:00
2019-02-13 11:02:00 merchantE 10 2019-02-13 11:02:00 merchantE10 00:00:30
2019-02-13 11:02:30 merchantE 10 2019-02-13 11:02:30 merchantE10 00:00:00
The -1 days +23:59:20 format can, I think, be dealt with by taking the absolute value?
How can I convert the time difference into a format that I can compare against 120 seconds? pd.to_timedelta() didn't work for me, or maybe I'm using it wrong.
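A sketch of the comparison being asked about: a negative Timedelta such as -1 days +23:59:20 is just -40 seconds, so taking the absolute value and converting with .total_seconds() makes it comparable to 120 (or to a pd.Timedelta directly):

```python
import pandas as pd

# Reproduce the "-1 days +23:59:20" value from the table above by
# subtracting two of the sample timestamps in the "wrong" order.
td = pd.Timestamp("2019-02-13T11:02:20Z") - pd.Timestamp("2019-02-13T11:03:00Z")
print(td)                                    # -1 days +23:59:20

# Either compare in seconds...
print(abs(td.total_seconds()))               # 40.0
# ...or compare Timedelta against Timedelta.
print(abs(td) <= pd.Timedelta(seconds=120))  # True
```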
Answer 1:
First, you could form rolling 120-second blocks of data. You could then evaluate each block using duplicated: df = df[df.duplicated(subset=['val1', 'val2', 'val3'], keep=False)]
Or groupby: df.groupby(['val1', 'val2', 'val3']).count()
Or even a SQL DISTINCT: https://www.w3schools.com/sql/sql_distinct.asp
Please post what you have tried. The above methods work for string, float, datetime and integer data types.
Answer 2:
So I made it work, but not with rolling windows, since rolling doesn't support the string type; the feature has been requested on the pandas repo as well.
My solution snippet for the problem:
if len(df.index) > 0:
    res = df.loc[(df.merchant == data['transaction']['merchant']) & (df.amount == data['transaction']['amount'])]
    res['timediff'] = (pd.to_datetime(data['transaction']['time']) - res['time']).dt.total_seconds().abs() <= 120
    if res.timediff.any():
        continue
df = df.append(df1)
print(df)
Sample data:
{"transaction": {"merchant": "merchantA", "amount": 20, "time": "2019-02-13T10:00:00.000Z"}}
{"transaction": {"merchant": "merchantB", "amount": 90, "time": "2019-02-13T11:00:01.000Z"}}
{"transaction": {"merchant": "merchantC", "amount": 10, "time": "2019-02-13T11:00:10.000Z"}}
{"transaction": {"merchant": "merchantD", "amount": 10, "time": "2019-02-13T11:00:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:01:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:03:00.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:00.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:02:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:02:30.000Z"}}
{"transaction": {"merchant": "merchantF", "amount": 10, "time": "2019-02-13T11:05:20.000Z"}}
{"transaction": {"merchant": "merchantE", "amount": 10, "time": "2019-02-13T11:00:30.000Z"}}
Output:
merchant amount time
2019-02-13 10:00:00 merchantA 20 2019-02-13 10:00:00
2019-02-13 11:00:01 merchantB 90 2019-02-13 11:00:01
2019-02-13 11:00:10 merchantC 10 2019-02-13 11:00:10
2019-02-13 11:00:20 merchantD 10 2019-02-13 11:00:20
2019-02-13 11:01:30 merchantE 10 2019-02-13 11:01:30
2019-02-13 11:03:00 merchantF 10 2019-02-13 11:03:00
2019-02-13 11:05:20 merchantF 10 2019-02-13 11:05:20
Source: https://stackoverflow.com/questions/60285964/how-to-find-duplicate-based-upon-multiple-columns-in-a-rolling-window-in-pandas