pandas read_csv and filter columns with usecols

小鲜肉 2020-11-28 03:00

I have a CSV file that isn't read correctly by pandas.read_csv when I filter the columns with usecols and use multiple index columns.

5 answers
  • 2020-11-28 03:11

    This code achieves what you want, although the behaviour it relies on is odd and certainly buggy.

    I observed that it works when:

    a) you specify index_col relative to the columns you actually use, so three columns in this example, not four (you drop dummy and start counting from there)

    b) you do the same for parse_dates

    c) you do not do so for usecols, for obvious reasons: it has to refer to the columns as they appear in the file

    d) you adapt names to mirror this behaviour, as done here

    import pandas as pd
    from io import StringIO
    
    csv = """dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5
    """
    
    df = pd.read_csv(StringIO(csv),
            index_col=[0,1],
            usecols=[1,2,3], 
            parse_dates=[0],
            header=0,
            names=["date", "loc", "", "x"])
    
    print(df)
    

    which prints

                    x
    date       loc   
    2009-01-01 a    1
    2009-01-02 a    3
    2009-01-03 a    5
    2009-01-01 b    1
    2009-01-02 b    3
    2009-01-03 b    5
    
  • 2020-11-28 03:14

    If your csv file contains extra data, columns can be deleted from the DataFrame after import.

    import pandas as pd
    from io import StringIO
    
    csv = r"""dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5"""
    
    df = pd.read_csv(StringIO(csv),
            index_col=["date", "loc"], 
            usecols=["dummy", "date", "loc", "x"],
            parse_dates=["date"],
            header=0,
            names=["dummy", "date", "loc", "x"])
    del df['dummy']
    

    Which gives us:

                    x
    date       loc
    2009-01-01 a    1
    2009-01-02 a    3
    2009-01-03 a    5
    2009-01-01 b    1
    2009-01-02 b    3
    2009-01-03 b    5
    
  • 2020-11-28 03:17

    The answer by @chip completely misses the point of two keyword arguments.

    • names is only necessary when there is no header and you want to specify other arguments using column names rather than integer indices.
    • usecols is supposed to provide a filter before reading the whole DataFrame into memory; if used properly, there should never be a need to delete columns after reading.

    This solution corrects those oddities:

    import pandas as pd
    from io import StringIO
    
    csv = r"""dummy,date,loc,x
    bar,20090101,a,1
    bar,20090102,a,3
    bar,20090103,a,5
    bar,20090101,b,1
    bar,20090102,b,3
    bar,20090103,b,5"""
    
    df = pd.read_csv(StringIO(csv),
            header=0,
            index_col=["date", "loc"], 
            usecols=["date", "loc", "x"],
            parse_dates=["date"])
    

    Which gives us:

                    x
    date       loc
    2009-01-01 a    1
    2009-01-02 a    3
    2009-01-03 a    5
    2009-01-01 b    1
    2009-01-02 b    3
    2009-01-03 b    5
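
The first bullet above covers the case where the file has no header row. A minimal sketch of that case follows, assuming a hypothetical header-less variant of the sample data; the variable name `raw` and the shortened data are illustrative and not part of the original answer. Here `names` supplies the labels, and `usecols` still filters out `dummy` while parsing:

```python
from io import StringIO

import pandas as pd

# Hypothetical header-less variant of the sample data (assumption:
# the "dummy,date,loc,x" header row is missing from the file).
raw = """bar,20090101,a,1
bar,20090102,a,3
bar,20090101,b,1
"""

df = pd.read_csv(
    StringIO(raw),
    header=None,                          # the data has no header row
    names=["dummy", "date", "loc", "x"],  # so names provides the labels
    usecols=["date", "loc", "x"],         # skip "dummy" while parsing
    index_col=["date", "loc"],            # refer to columns by label
    parse_dates=["date"],
)
print(df)
```

Because `dummy` is never loaded, nothing has to be deleted from the DataFrame afterwards.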
    
  • 2020-11-28 03:22

    Import csv first and use csv.DictReader; it's easy to process...
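
A rough sketch of what this suggestion might look like with the standard-library `csv` module, using a shortened copy of the sample data from the other answers. The names `raw`, `wanted` and `rows` are illustrative assumptions. Note that `csv.DictReader` yields plain strings, so dates and numbers would still need converting by hand:

```python
import csv
from io import StringIO

# Shortened copy of the sample data used in the other answers.
raw = """dummy,date,loc,x
bar,20090101,a,1
bar,20090102,a,3
bar,20090101,b,1
"""

# Columns to keep; "dummy" is dropped row by row.
wanted = ["date", "loc", "x"]

rows = [
    {key: row[key] for key in wanted}
    for row in csv.DictReader(StringIO(raw))
]

print(rows[0])  # {'date': '20090101', 'loc': 'a', 'x': '1'}
```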

  • 2020-11-28 03:23

    You just have to add the index_col=False parameter:

    import pandas as pd

    df1 = pd.read_csv('foo.csv',
         header=0,
         index_col=False,  # keep a default integer index; do not promote the first column
         names=["dummy", "date", "loc", "x"],
         usecols=["dummy", "date", "loc", "x"],
         parse_dates=["date"])
    print(df1)
    