I'm reading some automated weather data from the web. The observations occur every 5 minutes and are compiled into monthly files for each weather station. Once I'm done parsing a file, I need to remove any rows whose timestamp duplicates one that's already in the index.
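Here's a minimal sketch of a frame like the ones I end up with (the values and the rownum column are illustrative, not the real weather data, but they match the outputs shown below):
import pandas as pd

# six hourly observations
index = pd.date_range('2001-01-01', periods=6, freq='h')
df3 = pd.DataFrame({'A': range(6), 'B': range(6)}, index=index)

# simulate re-parsed data: the first three timestamps show up a second time
df3 = pd.concat([df3, df3.iloc[:3]])
df3['rownum'] = range(len(df3))  # track the original row order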
This adds the index as a dataframe column, drops duplicates on that column, then restores it as the index:
df = df.reset_index().drop_duplicates(subset='index', keep='last').set_index('index').sort_index()
Note that the .sort_index() at the end is optional; include it only if you need the result back in index order.
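Applied to the df3 sketched in the question, the chain keeps the last occurrence of each duplicated timestamp; a quick check (the rownum values assume that illustrative frame):
deduped = (df3.reset_index()
              .drop_duplicates(subset='index', keep='last')
              .set_index('index')
              .sort_index())
print(deduped['rownum'].tolist())  # [6, 7, 8, 3, 4, 5]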
Oh my. This is actually so simple!
grouped = df3.groupby(level=0)  # group on the (duplicated) index itself
df4 = grouped.last()            # keep the last row within each group
df4
                     A  B  rownum
2001-01-01 00:00:00  0  0       6
2001-01-01 01:00:00  1  1       7
2001-01-01 02:00:00  2  2       8
2001-01-01 03:00:00  3  3       3
2001-01-01 04:00:00  4  4       4
2001-01-01 05:00:00  5  5       5
Follow-up edit 2013-10-29
In the case where I have a fairly complex MultiIndex, I think I prefer the groupby approach. Here's a simple example for posterity:
import numpy as np
import pandas

# fake index
idx = pandas.MultiIndex.from_tuples([('a', letter) for letter in 'abcde'])

# random data + naming the index levels
df1 = pandas.DataFrame(np.random.normal(size=(5, 2)), index=idx, columns=['colA', 'colB'])
df1.index.names = ['iA', 'iB']

# artificially append some duplicate data (DataFrame.append and DataFrame.select
# are gone from modern pandas, so concat a boolean selection instead)
dups = df1[df1.index.get_level_values('iB').isin(['c', 'e'])]
df1 = pandas.concat([df1, dups])
df1
df1
# colA colB
#iA iB
#a a -1.297535 0.691787
# b -1.688411 0.404430
# c 0.275806 -0.078871
# d -0.509815 -0.220326
# e -0.066680 0.607233
# c 0.275806 -0.078871 # <--- dup 1
# e -0.066680 0.607233 # <--- dup 2
And here's the important part:
# group the data; passing df1.index.names tells pandas to group on every index level
groups = df1.groupby(level=df1.index.names)
groups.last() # or .first()
# colA colB
#iA iB
#a a -1.297535 0.691787
# b -1.688411 0.404430
# c 0.275806 -0.078871
# d -0.509815 -0.220326
# e -0.066680 0.607233
If, like me, you like chainable data manipulation using the pandas dot notation (like piping), then the following may be useful:
df3 = df3.query('~index.duplicated()')
This enables chaining statements like this:
df3.assign(C=2).query('~index.duplicated()').mean()
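If query() feels opaque, pandas also accepts a callable inside .loc, so the same chain can be written like this (a sketch using the same illustrative df3):
df3.assign(C=2).loc[lambda d: ~d.index.duplicated()].mean()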
Remove duplicates (Keeping First)
idx = np.unique(df.index.values, return_index=True)[1]  # positions of first occurrences
df = df.iloc[idx]
Remove duplicates (Keeping Last)
df = df[::-1]  # reverse so each last occurrence becomes a first occurrence
df = df.iloc[np.unique(df.index.values, return_index=True)[1]]
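This works because np.unique(..., return_index=True) returns, for each unique value, the position of its first occurrence; a tiny sketch:
import numpy as np

vals = np.array([10, 20, 10, 30, 20])
uniq, first_pos = np.unique(vals, return_index=True)
print(uniq)       # [10 20 30]
print(first_pos)  # [0 1 3] -- first occurrence of each unique value
One side effect: because np.unique sorts, the deduplicated frame comes back ordered by index value rather than in its original row order.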
Tests: 10k loops using OP's data
numpy method - 3.03 seconds
df.loc[~df.index.duplicated(keep='first')] - 4.43 seconds
df.groupby(df.index).first() - 21 seconds
reset_index() method - 29 seconds
I would suggest using the duplicated method on the Pandas Index itself:
df3 = df3[~df3.index.duplicated(keep='first')]
While all the other methods work, the currently accepted answer is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.
Using the sample data provided:
>>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
1000 loops, best of 3: 1.54 ms per loop
>>> %timeit df3.groupby(df3.index).first()
1000 loops, best of 3: 580 µs per loop
>>> %timeit df3[~df3.index.duplicated(keep='first')]
1000 loops, best of 3: 307 µs per loop
Note that you can keep the last element by changing the keep argument to 'last'.
It should also be noted that this method works with MultiIndex as well (using df1 as specified in Paul's example):
>>> %timeit df1.groupby(level=df1.index.names).last()
1000 loops, best of 3: 771 µs per loop
>>> %timeit df1[~df1.index.duplicated(keep='last')]
1000 loops, best of 3: 365 µs per loop
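Note too that keep accepts False, which marks every occurrence of a duplicated label, so you can drop all of them at once:
df3[~df3.index.duplicated(keep=False)]  # only timestamps that occur exactly once survive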
Unfortunately, I don't think Pandas allows one to drop duplicates off the index directly. I would suggest the following:
df3 = df3.reset_index()  # makes the date index a column of the data
df3.columns = ['timestamp', 'A', 'B', 'rownum']  # name the columns
df3 = df3.drop_duplicates('timestamp', keep='last').set_index('timestamp')  # done!