Question
I have the following returned from an API call as part of a larger dataset:
{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), 'Price': '0.052600'}
{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1, tzinfo=tzutc()), 'Price': '0.052500'}
Ideally I would use the timestamp as an index on the pandas DataFrame; however, this appears to fail because there is a duplicate when converting to JSON:
df = df.set_index(pd.to_datetime(df['Time']))
print(df.to_json(orient='index'))
ValueError: DataFrame index must be unique for orient='index'.
Any guidance on the best way to deal with this situation? Throw away one data point? The time does not get more fine-grained than the second, and there is obviously a price change within that second.
Answer 1:
I think you can make the duplicate datetimes unique by adding milliseconds, generated with cumcount and converted with to_timedelta:
import datetime
import pandas as pd

d = [{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052600'},
     {'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052500'}]
df = pd.DataFrame(d)
print (df)
Price Time
0 0.052600 2017-05-21 18:18:01
1 0.052500 2017-05-21 18:18:01
print (pd.to_timedelta(df.groupby('Time').cumcount(), unit='ms'))
0 00:00:00
1 00:00:00.001000
dtype: timedelta64[ns]
df['Time'] = df['Time'] + pd.to_timedelta(df.groupby('Time').cumcount(), unit='ms')
print (df)
Price Time
0 0.052600 2017-05-21 18:18:01.000
1 0.052500 2017-05-21 18:18:01.001
new_df = df.set_index('Time')
print(new_df.to_json(orient='index'))
{"1495390681000":{"Price":"0.052600"},"1495390681001":{"Price":"0.052500"}}
Answer 2:
You could use .duplicated to flag the repeated timestamps and keep only the first or last entry. Have a look at pandas.DataFrame.duplicated.
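A minimal sketch of that approach, assuming the df from the question:

import datetime
import pandas as pd

d = [{'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052600'},
     {'Time': datetime.datetime(2017, 5, 21, 18, 18, 1), 'Price': '0.052500'}]
df = pd.DataFrame(d)

# Keep the last observation per timestamp; keep='first' would keep the first one
mask = df.duplicated(subset='Time', keep='last')
new_df = df[~mask].set_index('Time')
print(new_df.to_json(orient='index'))
# {"1495390681000":{"Price":"0.052500"}}

df.drop_duplicates(subset='Time', keep='last') does the same filtering in a single call.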
Answer 3:
Just to expand upon the accepted answer: adding a loop helps deal with any new duplicates introduced by the first pass. The isnull check is important to catch any NaTs in your data, since any timedelta + NaT is still NaT.
import logging

import pandas

LOGGER = logging.getLogger(__name__)

def deduplicate_start_times(frame, col='start_time', max_iterations=10):
    """
    Fuzz duplicate start times from a frame so we can stack and unstack
    this frame.
    """
    for _ in range(max_iterations):
        # Flag repeated, non-NaT timestamps only (timedelta + NaT is still NaT)
        dups = frame.duplicated(subset=col) & ~pandas.isnull(frame[col])
        if not dups.any():
            break
        LOGGER.debug("Removing %i duplicates", dups.sum())
        # Add several ms of time to each time
        frame[col] += pandas.to_timedelta(frame.groupby(col).cumcount(),
                                          unit='ms')
    else:
        LOGGER.error("Exceeded max iterations removing duplicates. "
                     "%i duplicates remain", dups.sum())
    return frame
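A quick usage sketch, assuming the function above; the column name and data here are illustrative:

times = pandas.to_datetime(['2017-05-21 18:18:01', '2017-05-21 18:18:01',
                            '2017-05-21 18:18:01', None])
frame = pandas.DataFrame({'start_time': times, 'price': [1.0, 2.0, 3.0, None]})
frame = deduplicate_start_times(frame)
print(frame['start_time'].tolist())
# [Timestamp('2017-05-21 18:18:01'), Timestamp('2017-05-21 18:18:01.001000'),
#  Timestamp('2017-05-21 18:18:01.002000'), NaT]

Note the NaT row passes through untouched, which is exactly what the isnull guard is for.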
Source: https://stackoverflow.com/questions/44128600/how-should-i-handle-duplicate-times-in-time-series-data-with-pandas