Conditional row read of CSV in pandas

无人及你 2020-12-06 12:43

I have large CSVs where I'm only interested in a subset of the rows. In particular, I'd like to read in all the rows which occur before a particular condition is met.

4 answers
  • 2020-12-06 12:54

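    You can use the built-in csv module to find the index of the first row that meets the condition, then pass the row count to pd.read_csv via the nrows argument. The example uses nrows=counter+1, which also reads the matching row; use nrows=counter to stop just before it: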

    from io import StringIO
    import pandas as pd
    import csv, copy
    
    mycsv = StringIO(""" A      B     C
    34   3.20   'b'
    24   9.21   'b'
    34   3.32   'c'
    24   24.3   'c'
    35   1.12   'a'""")
    
    mycsv2 = copy.copy(mycsv)  # copying StringIO object [for demonstration purposes]
    
    with mycsv as fin:
        reader = csv.reader(fin, delimiter=' ', skipinitialspace=True)
        header = next(reader)
        counter = next(idx for idx, row in enumerate(reader) if float(row[1]) > 10)
    
    # sep=r'\s+' splits on runs of whitespace (delim_whitespace is deprecated in recent pandas)
    df = pd.read_csv(mycsv2, sep=r'\s+', nrows=counter+1)
    
    print(df)
    
        A      B    C
    0  34   3.20  'b'
    1  24   9.21  'b'
    2  34   3.32  'c'
    3  24  24.30  'c'
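
    For a file on disk, the same two-pass idea can be wrapped in a helper. This is a minimal sketch, not part of the original answer: the name read_rows_before, its parameters, and the comma-separated temporary file are assumptions for illustration. It passes nrows=count so the matching row itself is excluded, and it reads the whole file if no row matches.

```python
import csv
import os
import tempfile

import pandas as pd


def read_rows_before(path, column, stop):
    """Read only the rows that precede the first row where stop(value) is true.

    Hypothetical helper: `column` is a header name and `stop` receives the
    raw string value of that column for each row.
    """
    with open(path, newline="") as fin:
        reader = csv.DictReader(fin)
        count = 0
        for row in reader:
            if stop(row[column]):
                break
            count += 1
    # nrows=count stops just before the matching row
    return pd.read_csv(path, nrows=count)


# Demo on a temporary file holding the sample data from the answer above
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("A,B,C\n34,3.20,b\n24,9.21,b\n34,3.32,c\n24,24.3,c\n35,1.12,a\n")

df = read_rows_before(tmp.name, "B", lambda value: float(value) > 10)
os.remove(tmp.name)
print(df)
```

    The first pass only tokenizes lines with the csv module, which is typically much cheaper than building a DataFrame, so pandas parses only the rows that are kept.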
    
  • 2020-12-06 12:55

    You could read the csv in chunks. Since pd.read_csv will return an iterator when the chunksize parameter is specified, you can use itertools.takewhile to read only as many chunks as you need, without reading the whole file.

    import itertools as IT
    import pandas as pd
    
    chunksize = 10 ** 5
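    # filename is the path to your CSV; the header row is kept so 'B' exists as a column name
    chunks = pd.read_csv(filename, chunksize=chunksize)
    # Keep chunks while their last B value is below 10. takewhile drops the first
    # failing chunk entirely, so qualifying rows at its start are lost.
    chunks = IT.takewhile(lambda chunk: chunk['B'].iloc[-1] < 10, chunks)
    df = pd.concat(chunks)
    # Remove any remaining rows with B >= 10 from the chunks that were kept
    mask = df['B'] < 10
    df = df.loc[mask]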
    

    Or, to keep the qualifying rows of the chunk in which the condition is first met and to avoid the separate df.loc[mask] step on the concatenated result, a cleaner solution is to define a custom generator:

    import itertools as IT
    import pandas as pd
    
    def valid(chunks):
        for chunk in chunks:
            mask = chunk['B'] < 10
            if mask.all():
                yield chunk
            else:
                yield chunk.loc[mask]
                break
    
    chunksize = 10 ** 5
    # filename is the path to your CSV; the header row is kept so 'B' exists as a column name
    chunks = pd.read_csv(filename, chunksize=chunksize)
    df = pd.concat(valid(chunks))
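
    If "before the condition is met" should be taken literally, the generator can cut the final chunk at the first failing row instead of masking it, so later rows that happen to satisfy B < 10 are not kept. A minimal sketch under assumptions: rows_before and its stop argument are hypothetical names, and the inline CSV with a tiny chunksize stands in for a large file.

```python
import io

import pandas as pd


def rows_before(chunks, stop):
    """Yield rows up to, but not including, the first row where stop is True.

    Hypothetical helper: `stop` maps a chunk to a boolean Series.
    """
    for chunk in chunks:
        hit = stop(chunk).to_numpy()
        if hit.any():
            first = hit.argmax()  # position of the first True within this chunk
            yield chunk.iloc[:first]
            return  # stop pulling further chunks from the reader
        yield chunk


csv_text = "A,B\n34,3.20\n24,9.21\n34,3.32\n24,24.3\n35,1.12\n"
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=2)
df = pd.concat(rows_before(chunks, lambda c: c["B"] > 10))
print(df)
```

    Because the generator returns as soon as the condition is met, the remaining chunks are never requested from the reader.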
    
  • 2020-12-06 13:02

    Building on @joanwa's answer (note that this reads the whole file before filtering):

    df = (pd.read_csv("filename.csv")
          [lambda x: x['B'] > 10])
    

    From Wes McKinney's "Python for Data Analysis" chapter on "Advanced pandas":

    We cannot refer to the result of load_data until it has been assigned to the temporary variable df. To help with this, assign and many other pandas functions accept function-like arguments, also known as callables.

    To show callables in action, consider ...

    df = load_data()
    df2 = df[df['col2'] < 0]
    

    Can be rewritten as:

    df = (load_data()
          [lambda x: x['col2'] < 0])
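
    To check that the two forms select the same rows, here is a small sketch on the sample data from this thread. The inline CSV text and the variable names are assumptions for illustration; in both forms the whole input is parsed before any row is dropped.

```python
import io

import pandas as pd

csv_text = "A,B,C\n34,3.20,b\n24,9.21,b\n34,3.32,c\n24,24.3,c\n35,1.12,a\n"

# Two-step form: the intermediate frame needs a name before it can be filtered
df = pd.read_csv(io.StringIO(csv_text))
two_step = df[df["B"] < 10]

# Chained form: the callable receives the frame returned by read_csv
chained = pd.read_csv(io.StringIO(csv_text))[lambda x: x["B"] < 10]

print(chained.equals(two_step))
```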
    
  • 2020-12-06 13:04

    I would go the easy route described here:

    http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing

    df[df['B'] > 10]
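
    Boolean indexing selects every row that satisfies the mask, wherever it sits in the file. If only the rows ahead of the first match are wanted, one option is to build the mask from a running count of matches. A minimal sketch, assuming the whole file fits in memory; the inline CSV and the names met and before are illustrative.

```python
import io

import pandas as pd

csv_text = "A,B,C\n34,3.20,b\n24,9.21,b\n34,3.32,c\n24,24.3,c\n35,1.12,a\n"
df = pd.read_csv(io.StringIO(csv_text))

met = df["B"] > 10              # True on rows where the condition holds
before = df[met.cumsum() == 0]  # rows ahead of the first True

print(before)
```

    The plain mask df[~met] would also keep the last row (B = 1.12), because it appears after the condition was first met but does not itself satisfy it.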
    