Remove rows with duplicate indices (Pandas DataFrame and TimeSeries)


Question:

I'm reading some automated weather data from the web. The observations occur every 5 minutes and are compiled into monthly files for each weather station. Once I'm done parsing a file, the DataFrame looks something like this:

                      Sta  Precip1hr  Precip5min  Temp  DewPnt  WindSpd  WindDir  AtmPress
Date
2001-01-01 00:00:00  KPDX          0           0     4       3        0        0     30.31
2001-01-01 00:05:00  KPDX          0           0     4       3        0        0     30.30
2001-01-01 00:10:00  KPDX          0           0     4       3        4       80     30.30
2001-01-01 00:15:00  KPDX          0           0     3       2        5       90     30.30
2001-01-01 00:20:00  KPDX          0           0     3       2       10      110     30.28

The problem I'm having is that sometimes a scientist goes back and corrects observations -- not by editing the erroneous rows, but by appending duplicate rows to the end of the file. A simple example of such a case is illustrated below:

import pandas
import datetime

startdate = datetime.datetime(2001, 1, 1, 0, 0)
enddate = datetime.datetime(2001, 1, 1, 5, 0)
index = pandas.date_range(start=startdate, end=enddate, freq='H')
data = {'A': range(6), 'B': range(6)}
data1 = {'A': [20, -30, 40], 'B': [-50, 60, -70]}
df1 = pandas.DataFrame(data=data, index=index)
df2 = pandas.DataFrame(data=data1, index=index[:3])
# Put df2's rows first, so df1's rows play the role of the corrected
# observations appended to the bottom of the file.
df3 = pandas.concat([df2, df1])
df3
                       A   B
2001-01-01 00:00:00   20 -50
2001-01-01 01:00:00  -30  60
2001-01-01 02:00:00   40 -70
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5

And so I need df3 to eventually become:

                       A   B
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5

I thought that adding a column of row numbers (df3['rownum'] = range(df3.shape[0])) would help me select the bottom-most row for any value of the DatetimeIndex, but I am stuck on figuring out the groupby or pivot (or ???) statements to make that work.
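
For reference, one way that idea can be made concrete is the sketch below; it builds on the df3 example above, and the 'rownum' helper column and the df4 name are purely illustrative.

df3['rownum'] = range(df3.shape[0])
# For each timestamp, keep the row with the largest row number, i.e. the
# bottom-most (most recently appended) row, then drop the helper column.
is_last = df3.groupby(level=0)['rownum'].transform('max') == df3['rownum']
df4 = df3[is_last].drop(columns='rownum').sort_index()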

Answer 1:

Note: there is a better answer (below) based on more recent versions of Pandas, and it should be the accepted answer.

My original answer, now outdated, is kept for reference.

A simple solution is to use drop_duplicates:

df4 = df3.drop_duplicates(subset='rownum', keep='last') 

For me, this operated quickly on large data sets.

This requires that 'rownum' be the column containing the duplicates. In the modified example, however, 'rownum' has no duplicates, so nothing gets eliminated. What we really want is for the subset to be the index itself, and I have not found a way to tell drop_duplicates to consider only the index.

Here is a solution that adds the index as a dataframe column, drops duplicates on that, then removes the new column:

df3 = df3.reset_index().drop_duplicates(subset='index', keep='last').set_index('index') 

And if you want things back in chronological order, just call sort_index on the DataFrame.

df3 = df3.sort_index()
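
Putting the pieces together on the df3 from the question, the whole chain might look like the sketch below. It assumes the index is unnamed, so reset_index produces a column literally called 'index', and keep='last' is what retains the corrected, bottom-most rows.

df3 = (df3.reset_index()
          .drop_duplicates(subset='index', keep='last')
          .set_index('index')
          .sort_index())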


Answer 2:

I would suggest using the duplicated method on the Pandas Index itself:

df3 = df3[~df3.index.duplicated(keep='first')] 
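
Note that keep='first' retains the rows that appear first. For the question's scenario, where the corrected rows are appended to the bottom of the file, keep='last' is presumably what you want:

df3 = df3[~df3.index.duplicated(keep='last')]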

While all the other methods work, the currently accepted answer is by far the least performant for the provided example. And while the groupby approach (sketched below) is only slightly less performant, I find the duplicated method to be more readable.
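
A rough sketch of that groupby approach: grouping on the index and taking the last row of each group again implements "the bottom-most row wins", and because groupby sorts the group keys by default, the result comes back in chronological order.

df3 = df3.groupby(level=0).last()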
