Remove rows with duplicate indices (Pandas DataFrame and TimeSeries)

Asked by 我寻月下人不归 on 2020-11-22 14:31

I'm reading some automated weather data from the web. The observations occur every 5 minutes and are compiled into monthly files for each weather station. Once I'm done parsing a file, the DataFrame occasionally ends up with duplicate timestamps in its index, and I need to remove those duplicate rows.

6 Answers
  • 2020-11-22 15:14

    This resets the index into a regular column, drops duplicates on that column, and then restores it as the index:

    df = df.reset_index().drop_duplicates(subset='index', keep='last').set_index('index').sort_index()
    

    Note that the trailing .sort_index() is optional; include it only if you want the result sorted by index.
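
    As a self-contained sketch (the small DataFrame below is invented for illustration, since the question's sample data is truncated above):

    import pandas as pd
    
    # hypothetical frame with a duplicated timestamp in the index
    idx = pd.to_datetime(['2001-01-01 00:00', '2001-01-01 01:00', '2001-01-01 00:00'])
    df = pd.DataFrame({'A': [0, 1, 2]}, index=idx)
    
    # move the index into a column, deduplicate on it, then restore it as the index
    deduped = (df.reset_index()
                 .drop_duplicates(subset='index', keep='last')
                 .set_index('index')
                 .sort_index())
    print(deduped)  # the 00:00 row with A == 2 wins because keep='last'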

  • 2020-11-22 15:15

    Oh my. This is actually so simple!
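
    For reference, here's a hypothetical reconstruction of df3 (the question's sample data is truncated above, so the exact values are inferred from the output below):

    import pandas as pd
    
    # hourly observations where the first three timestamps were re-reported,
    # producing duplicate index entries (rownum tracks the original row order)
    rng = pd.date_range('2001-01-01', periods=6, freq='h')
    df3 = pd.DataFrame({'A': list(range(6)) + [0, 1, 2],
                        'B': list(range(6)) + [0, 1, 2],
                        'rownum': range(9)},
                       index=rng.append(rng[:3]))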

    grouped = df3.groupby(level=0)
    df4 = grouped.last()
    df4
                          A   B  rownum
    
    2001-01-01 00:00:00   0   0       6
    2001-01-01 01:00:00   1   1       7
    2001-01-01 02:00:00   2   2       8
    2001-01-01 03:00:00   3   3       3
    2001-01-01 04:00:00   4   4       4
    2001-01-01 05:00:00   5   5       5
    

    Follow-up edit (2013-10-29): in the case where I have a fairly complex MultiIndex, I think I prefer the groupby approach. Here's a simple example for posterity:

    import numpy as np
    import pandas
    
    # fake index
    idx = pandas.MultiIndex.from_tuples([('a', letter) for letter in list('abcde')])
    
    # random data + naming the index levels
    df1 = pandas.DataFrame(np.random.normal(size=(5, 2)), index=idx, columns=['colA', 'colB'])
    df1.index.names = ['iA', 'iB']
    
    # artificially append some duplicate data
    # (.append and .select are gone from recent pandas; concat plus a mask
    # on the 'iB' index level is the modern equivalent)
    dups = df1[df1.index.get_level_values('iB').isin(['c', 'e'])]
    df1 = pandas.concat([df1, dups])
    df1
    #           colA      colB
    #iA iB                    
    #a  a  -1.297535  0.691787
    #   b  -1.688411  0.404430
    #   c   0.275806 -0.078871
    #   d  -0.509815 -0.220326
    #   e  -0.066680  0.607233
    #   c   0.275806 -0.078871  # <--- dup 1
    #   e  -0.066680  0.607233  # <--- dup 2
    

    and here's the important part

    # group the data, using df1.index.names tells pandas to look at the entire index
    groups = df1.groupby(level=df1.index.names)  
    groups.last() # or .first()
    #           colA      colB
    #iA iB                    
    #a  a  -1.297535  0.691787
    #   b  -1.688411  0.404430
    #   c   0.275806 -0.078871
    #   d  -0.509815 -0.220326
    #   e  -0.066680  0.607233
    
  • 2020-11-22 15:16

    If, like me, you prefer chainable data manipulation with pandas method chaining (similar to piping), the following may be useful:

    df3 = df3.query('~index.duplicated()')
    

    This enables chaining statements like this:

    df3.assign(C=2).query('~index.duplicated()').mean()
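
    One caveat: with numexpr installed, query can reject method calls such as .duplicated() inside the expression string; passing engine='python' sidesteps that. A minimal sketch, assuming a df3 with duplicate index entries as in the other answers:

    # same chain as above; engine='python' is a precaution in case the
    # default numexpr engine refuses the .duplicated() method call
    result = (df3.assign(C=2)
                 .query('~index.duplicated()', engine='python')
                 .mean())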
    
  • 2020-11-22 15:19

    Remove duplicates (Keeping First)

    import numpy as np
    # positions of the first occurrence of each unique index value
    idx = np.unique(df.index.values, return_index=True)[1]
    df = df.iloc[idx]
    

    Remove duplicates (Keeping Last)

    # reverse the frame so np.unique's "first occurrence" is really the last row
    df = df[::-1]
    df = df.iloc[np.unique(df.index.values, return_index=True)[1]]
    

    Tests: 10k loops using OP's data

    numpy method - 3.03 seconds
    df.loc[~df.index.duplicated(keep='first')] - 4.43 seconds
    df.groupby(df.index).first() - 21 seconds
    reset_index() method - 29 seconds
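
    As a quick sanity check (a minimal sketch with invented data; the timings above are the answerer's own), the numpy route should agree with the index.duplicated approach:

    import numpy as np
    import pandas as pd
    
    # invented frame with duplicate index labels
    df = pd.DataFrame({'A': range(5)}, index=['x', 'y', 'x', 'z', 'y'])
    
    first_np = df.iloc[np.unique(df.index.values, return_index=True)[1]]
    first_pd = df[~df.index.duplicated(keep='first')]
    
    # np.unique sorts its output, so compare after sorting both results
    assert first_np.sort_index().equals(first_pd.sort_index())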
    
  • 2020-11-22 15:28

    I would suggest using the duplicated method on the Pandas Index itself:

    df3 = df3[~df3.index.duplicated(keep='first')]
    

    While all the other methods work, the currently accepted answer is by far the least performant for the provided example. Furthermore, while the groupby method is only slightly less performant, I find the duplicated method to be more readable.

    Using the sample data provided:

    >>> %timeit df3.reset_index().drop_duplicates(subset='index', keep='first').set_index('index')
    1000 loops, best of 3: 1.54 ms per loop
    
    >>> %timeit df3.groupby(df3.index).first()
    1000 loops, best of 3: 580 µs per loop
    
    >>> %timeit df3[~df3.index.duplicated(keep='first')]
    1000 loops, best of 3: 307 µs per loop
    

    Note that you can keep the last element by changing the keep argument to 'last'.

    It should also be noted that this method works with a MultiIndex as well (using df1 from Paul's example above):

    >>> %timeit df1.groupby(level=df1.index.names).last()
    1000 loops, best of 3: 771 µs per loop
    
    >>> %timeit df1[~df1.index.duplicated(keep='last')]
    1000 loops, best of 3: 365 µs per loop
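
    If you instead want to drop every row whose index label occurs more than once, keep=False marks all occurrences as duplicates. A small sketch with invented data:

    import pandas as pd
    
    df = pd.DataFrame({'A': range(4)}, index=['x', 'y', 'x', 'z'])
    
    # keep=False flags every occurrence of a repeated label, so only
    # labels that appear exactly once survive
    unique_only = df[~df.index.duplicated(keep=False)]
    print(unique_only)  # rows 'y' and 'z' remain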
    
  • 2020-11-22 15:31

    Unfortunately, I don't think pandas allows one to drop duplicates directly from the index. I would suggest the following:

    df3 = df3.reset_index()  # makes the date column part of your data
    df3.columns = ['timestamp', 'A', 'B', 'rownum']  # set names
    # note: keep='last' replaces the long-removed take_last=True keyword
    df3 = df3.drop_duplicates('timestamp', keep='last').set_index('timestamp')  # done!
    