Question
I have the following returned from an API call as part of a larger dataset:
{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), 'Price': '0.052600'}
{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), 'Price': '0.052500'}
Ideally I would use the timestamp as an index on the pandas DataFrame; however, this appears to fail because there is a duplicate when converting to JSON:
df = df.set_index(pd.to_datetime(df['Time']))
print(df.to_json(orient='index'))
ValueError: DataFrame index must be unique for orient='index'.
Any guidance on the best way to deal with this situation? Throw away one data point? The time does not get more fine-grained than the second, and there is obviously a price change within that second.
Answer 1:
I think you can make the duplicate datetimes unique by adding milliseconds, generated with cumcount and converted with to_timedelta:
import datetime
import pandas as pd

d = [{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052600'},
     {'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052500'}]
df = pd.DataFrame(d)
print (df)
Price Time
0 0.052600 2017-05-21 18:18:01
1 0.052500 2017-05-21 18:18:01
print (pd.to_timedelta(df.groupby('Time').cumcount(), unit='ms'))
0 00:00:00
1 00:00:00.001000
dtype: timedelta64[ns]
df['Time'] = df['Time'] + pd.to_timedelta(df.groupby('Time').cumcount(), unit='ms')
print (df)
Price Time
0 0.052600 2017-05-21 18:18:01.000
1 0.052500 2017-05-21 18:18:01.001
new_df = df.set_index('Time')
print(new_df.to_json(orient='index'))
{"1495390681000":{"Price":"0.052600"},"1495390681001":{"Price":"0.052500"}}
Answer 2:
You could use .duplicated to flag the repeated timestamps and keep only the first or last entry. Have a look at pandas.DataFrame.duplicated.
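A minimal sketch of that approach, assuming the df from the question:

import datetime
import pandas as pd

d = [{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052600'},
     {'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052500'}]
df = pd.DataFrame(d)

# Keep the last observation per timestamp; keep='first' would keep the first one
mask = df.duplicated(subset='Time', keep='last')
new_df = df[~mask].set_index('Time')
print(new_df.to_json(orient='index'))
# {"1495390681000":{"Price":"0.052500"}}

df.drop_duplicates(subset='Time', keep='last') does the same filtering in a single call.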
Answer 3:
Just to expand upon the accepted answer: adding a loop helps deal with any new duplicates introduced by the first pass. The isnull check is important to catch any NaTs in your data, since any timedelta + NaT is still NaT.
import logging

import pandas

LOGGER = logging.getLogger(__name__)

def deduplicate_start_times(frame, col='start_time', max_iterations=10):
    """
    Fuzz duplicate start times from a frame so we can stack and unstack
    this frame.
    """
    for _ in range(max_iterations):
        # Flag repeated, non-NaT timestamps only (timedelta + NaT is still NaT)
        dups = frame.duplicated(subset=col) & ~pandas.isnull(frame[col])
        if not dups.any():
            break
        LOGGER.debug("Removing %i duplicates", dups.sum())
        # Add several ms of time to each time
        frame[col] += pandas.to_timedelta(frame.groupby(col).cumcount(),
                                          unit='ms')
    else:
        LOGGER.error("Exceeded max iterations removing duplicates. "
                     "%i duplicates remain", dups.sum())
    return frame
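A quick usage sketch, assuming the function above; the column name and data here are illustrative:

times = pandas.to_datetime(['2017-05-21 18:18:01', '2017-05-21 18:18:01',
                            '2017-05-21 18:18:01', None])
frame = pandas.DataFrame({'start_time': times, 'price': [1.0, 2.0, 3.0, None]})
frame = deduplicate_start_times(frame)
print(frame['start_time'].tolist())
# [Timestamp('2017-05-21 18:18:01'), Timestamp('2017-05-21 18:18:01.001000'),
#  Timestamp('2017-05-21 18:18:01.002000'), NaT]

Note the NaT row passes through untouched, which is exactly what the isnull guard is for.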
Source: https://stackoverflow.com/questions/44128600/how-should-i-handle-duplicate-times-in-time-series-data-with-pandas