Pandas every nth row

别那么骄傲 2020-11-30 19:40

DataFrame.resample() works only with time-series data. I cannot find a way of getting every nth row from non-time-series data. What is the best method?

5 Answers
  • 2020-11-30 20:02

    Although @chrisb's accepted answer does answer the question, I would like to add the following.

    A simple method I use to select or drop every nth row is the following:

    df1 = df[df.index % 3 != 0]  # Excludes every 3rd row starting from 0
    df2 = df[df.index % 3 == 0]  # Selects every 3rd row starting from 0
    

    This arithmetic-based sampling also makes more complex row selections possible.

    This assumes, of course, that the index consists of ordered, consecutive integers starting at 0.
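
    When the index is not a 0-based run of consecutive integers (for example after filtering, or with string labels), the same arithmetic can be applied to row positions instead of index labels. A minimal sketch, assuming a hypothetical frame `df` with string labels; `np.arange(len(df))` stands in for the index:

```python
import numpy as np
import pandas as pd

# Hypothetical example frame with a non-numeric index.
df = pd.DataFrame({"value": range(7)}, index=list("abcdefg"))

# Row positions 0..len(df)-1, independent of the index labels.
positions = np.arange(len(df))

every_third = df[positions % 3 == 0]    # keeps positions 0, 3, 6
without_third = df[positions % 3 != 0]  # drops positions 0, 3, 6
```

    Because the mask is built from positions, it should behave the same whether or not the index has gaps or duplicates.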

  • 2020-11-30 20:02

    I had a similar requirement, but I wanted the n'th item in a particular group. This is how I solved it.

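    # group_keys=False keeps the original row index on the result,
    # so the boolean mask lines up with `data`.
    groups = data.groupby('group_key', group_keys=False)
    # Assumption: 'index_col' holds each row's position within its group.
    selection = groups['index_col'].apply(lambda x: x % 3 == 0)
    subset = data[selection]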
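
    If the data has no column that already numbers the rows within each group, `GroupBy.cumcount()` can generate that numbering. A minimal sketch, assuming hypothetical data with a `group_key` column and a made-up `value` column:

```python
import pandas as pd

# Hypothetical data: two groups of unequal size.
data = pd.DataFrame({
    "group_key": ["a", "a", "a", "a", "b", "b", "b", "b", "b"],
    "value": range(9),
})

# cumcount() numbers the rows 0, 1, 2, ... within each group.
position_in_group = data.groupby("group_key").cumcount()

# Keep every 3rd row of each group, starting from the group's first row.
subset = data[position_in_group % 3 == 0]
```

    The result of `cumcount()` carries the original row index, so the mask can be used on `data` directly.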
    
  • 2020-11-30 20:19

    There is an even simpler alternative to the accepted answer: slice the DataFrame directly, which invokes df.__getitem__.

    df = pd.DataFrame('x', index=range(5), columns=list('abc'))
    df
    
       a  b  c
    0  x  x  x
    1  x  x  x
    2  x  x  x
    3  x  x  x
    4  x  x  x
    

    For example, to get every 2nd row, you can do

    df[::2]
    
       a  b  c
    0  x  x  x
    2  x  x  x
    4  x  x  x
    

    There's also GroupBy.first/GroupBy.head, where you group on the index:

    df.index // 2
    # Int64Index([0, 0, 1, 1, 2], dtype='int64')
    
    df.groupby(df.index // 2).first()
    # Alternatively,
    # df.groupby(df.index // 2).head(1)
    
       a  b  c
    0  x  x  x
    1  x  x  x
    2  x  x  x
    

    The index is floor-divided by the stride (2, in this case). If the index is non-numeric, instead do

    # df.groupby(np.arange(len(df)) // 2).first()
    df.groupby(pd.RangeIndex(len(df)) // 2).first()
    
       a  b  c
    0  x  x  x
    1  x  x  x
    2  x  x  x
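
    The two GroupBy variants are not interchangeable when the data contains missing values: `first()` returns the first non-null value of each column and labels the result with the group keys, while `head(1)` returns the first row of each group unchanged with its original index. A small sketch on a hypothetical frame with one NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing value in the first row.
df = pd.DataFrame({"a": [np.nan, 1.0, 2.0, 3.0], "b": [10, 11, 12, 13]})

keys = np.arange(len(df)) // 2  # group labels: 0, 0, 1, 1

by_first = df.groupby(keys).first()  # first non-null value per column
by_head = df.groupby(keys).head(1)   # first row of each group, as-is
by_slice = df.iloc[::2]              # same rows as head(1)
```

    Here `by_first` fills column `a` of the first group from the second row, whereas `by_head` and `by_slice` keep the NaN.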
    
  • 2020-11-30 20:23

    I'd use iloc, which takes row and column slices based on integer position and follows normal Python slice syntax. If you want every 5th row:

    df.iloc[::5, :]
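
    Since this is an ordinary Python slice, a start offset works too, and the trailing `, :` can be dropped when all columns are wanted. A short sketch on a hypothetical 12-row frame:

```python
import pandas as pd

# Hypothetical frame with 12 rows.
df = pd.DataFrame({"value": range(12)})

every_fifth = df.iloc[::5]   # positions 0, 5, 10
from_third = df.iloc[2::5]   # positions 2, 7 (start at the 3rd row)
```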
    
  • 2020-11-30 20:23

    This is a solution I came up with when using the index was not viable (possibly the multi-gigabyte .csv was too large, or I missed some technique that would let me reindex without crashing).
    Walk through the file one row at a time and collect every nth row into a new DataFrame.

    import pandas as pd
    from csv import DictReader

    def make_downsampled_df(filename, interval):
        """Read a CSV row by row and keep every `interval`-th row."""
        with open(filename, 'r', newline='') as read_obj:
            csv_dict_reader = DictReader(read_obj)
            column_names = csv_dict_reader.fieldnames
            # Collect the rows in a list and build the frame once:
            # DataFrame.append was removed in pandas 2.0 and copied
            # the whole frame on every call.
            rows = [row for index, row in enumerate(csv_dict_reader)
                    if index % interval == 0]

        return pd.DataFrame(rows, columns=column_names)
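
    A possible alternative that stays inside pandas is to pass a callable to the `skiprows` parameter of `pd.read_csv`, so that unwanted lines are skipped while the file is parsed. A minimal sketch, assuming a CSV with a single header line; the helper name `read_every_nth` and the in-memory sample data are made up for illustration:

```python
import io
import pandas as pd

def read_every_nth(source, interval):
    """Read the header plus every `interval`-th data row of a CSV."""
    # File line 0 is the header; data row k sits on file line k + 1.
    return pd.read_csv(
        source,
        skiprows=lambda line: line > 0 and (line - 1) % interval != 0,
    )

# Hypothetical in-memory CSV standing in for a large file on disk.
csv_text = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(10))
sampled = read_every_nth(io.StringIO(csv_text), 3)
```

    Unlike the DictReader version, this lets pandas infer column dtypes instead of leaving every value as a string.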
    