Remove rows with duplicate indices (Pandas DataFrame and TimeSeries)


Question:

I'm reading some automated weather data from the web. The observations occur every 5 minutes and are compiled into monthly files for each weather station. Once I'm done parsing a file, the DataFrame looks something like this:

                      Sta  Precip1hr  Precip5min  Temp  DewPnt  WindSpd  WindDir  AtmPress
Date
2001-01-01 00:00:00  KPDX          0           0     4       3        0        0     30.31
2001-01-01 00:05:00  KPDX          0           0     4       3        0        0     30.30
2001-01-01 00:10:00  KPDX          0           0     4       3        4       80     30.30
2001-01-01 00:15:00  KPDX          0           0     3       2        5       90     30.30
2001-01-01 00:20:00  KPDX          0           0     3       2       10      110     30.28

The problem I'm having is that sometimes a scientist goes back and corrects observations -- not by editing the erroneous rows, but by appending duplicate rows to the end of the file. A simple example of such a case is illustrated below:

import pandas
import datetime

startdate = datetime.datetime(2001, 1, 1, 0, 0)
enddate = datetime.datetime(2001, 1, 1, 5, 0)
index = pandas.date_range(start=startdate, end=enddate, freq='H')
data = {'A': range(6), 'B': range(6)}
data1 = {'A': [20, -30, 40], 'B': [-50, 60, -70]}
df1 = pandas.DataFrame(data=data, index=index)
df2 = pandas.DataFrame(data=data1, index=index[:3])
# Put df2's rows first, so df1's rows play the role of the corrected
# observations appended to the bottom of the file.
df3 = pandas.concat([df2, df1])
df3
                       A   B
2001-01-01 00:00:00   20 -50
2001-01-01 01:00:00  -30  60
2001-01-01 02:00:00   40 -70
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5

And so I need df3 to eventually become:

                       A   B
2001-01-01 00:00:00    0   0
2001-01-01 01:00:00    1   1
2001-01-01 02:00:00    2   2
2001-01-01 03:00:00    3   3
2001-01-01 04:00:00    4   4
2001-01-01 05:00:00    5   5

I thought that adding a column of row numbers (df3['rownum'] = range(df3.shape[0])) would help me select the bottom-most row for any value of the DatetimeIndex, but I am stuck on figuring out the groupby or pivot (or ???) statements to make that work.
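
For reference, one way that idea can be made concrete is the sketch below; it builds on the df3 example above, and the 'rownum' helper column and the df4 name are purely illustrative.

df3['rownum'] = range(df3.shape[0])
# For each timestamp, keep the row with the largest row number, i.e. the
# bottom-most (most recently appended) row, then drop the helper column.
is_last = df3.groupby(level=0)['rownum'].transform('max') == df3['rownum']
df4 = df3[is_last].drop(columns='rownum').sort_index()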

Answer 1:

Note: there is a better answer (below) based on more recent versions of Pandas, and it should be the accepted answer.

My original answer, now outdated, is kept for reference.

A simple solution is to use drop_duplicates:

df4 = df3.drop_duplicates(subset='rownum', keep='last') 

For me, this operated quickly on large data sets.

This requires that 'rownum' be the column containing the duplicates. In the modified example, however, 'rownum' has no duplicates, so nothing gets eliminated. What we really want is for the subset to be the index itself, and I have not found a way to tell drop_duplicates to consider only the index.

Here is a solution that adds the index as a dataframe column, drops duplicates on that, then removes the new column:

df3 = df3.reset_index().drop_duplicates(subset='index', keep='last').set_index('index') 

And if you want things back in chronological order, just call sort_index on the DataFrame.

df3 = df3.sort_index()
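
Putting the pieces together on the df3 from the question, the whole chain might look like the sketch below. It assumes the index is unnamed, so reset_index produces a column literally called 'index', and keep='last' is what retains the corrected, bottom-most rows.

df3 = (df3.reset_index()
          .drop_duplicates(subset='index', keep='last')
          .set_index('index')
          .sort_index())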


Answer 2:

I would suggest using the duplicated method on the Pandas Index itself:

df3 = df3[~df3.index.duplicated(keep='first')] 
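
Note that keep='first' retains the rows that appear first. For the question's scenario, where the corrected rows are appended to the bottom of the file, keep='last' is presumably what you want:

df3 = df3[~df3.index.duplicated(keep='last')]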

While all the other methods work, the currently accepted answer is by far the least performant for the provided example. And while the groupby approach (sketched below) is only slightly less performant, I find the duplicated method to be more readable.
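
A rough sketch of that groupby approach: grouping on the index and taking the last row of each group again implements "the bottom-most row wins", and because groupby sorts the group keys by default, the result comes back in chronological order.

df3 = df3.groupby(level=0).last()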
