pandas.read_csv: how to skip comment lines

礼貌的吻别 2020-12-05 07:00

I think I misunderstand the intention of read_csv. If I have a file 'j' like

# notes
a,b,c
# more notes
1,2,3

How can I pandas.read_csv t

3 answers
  • 2020-12-05 07:19

    In recent releases of pandas (version 0.16.0), I believe you can pass the comment='#' parameter to pd.read_csv, and it should skip commented-out lines.

    These GitHub issues show that you can do this:

    • https://github.com/pydata/pandas/issues/10548
    • https://github.com/pydata/pandas/issues/4623

    See the documentation on read_csv: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
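
    As a minimal sketch of the suggestion above, the snippet below feeds the sample from the question to pd.read_csv with comment='#'. It assumes a reasonably recent pandas release in which fully commented lines are skipped; the variable names are only illustrative.

```python
from io import StringIO

import pandas as pd

# The sample file from the question, held in memory instead of on disk.
s = "# notes\na,b,c\n# more notes\n1,2,3"

# comment='#' tells the parser to ignore everything from '#' to the end of
# the line; in recent pandas releases a line that is entirely a comment is
# skipped rather than parsed as an empty row.
df = pd.read_csv(StringIO(s), sep=",", comment="#")
print(df)
```

    On older releases, such as those mentioned in the other answers, the same call may behave differently.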

  • 2020-12-05 07:25

    I am on pandas version 0.13.1, and this comments-in-CSV problem still bothers me.

    Here is my present workaround:

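    import pandas as pd
    from io import StringIO

    def read_csv(filename, comment='#', sep=','):
        # Drop whole-line comments, then let pandas parse what is left.
        with open(filename) as f:
            lines = "".join(line for line in f if not line.startswith(comment))
        return pd.read_csv(StringIO(lines), sep=sep)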
    

    Otherwise, with pd.read_csv(filename, comment='#'), I get:

    pandas.parser.CParserError: Error tokenizing data. C error: Expected 1 fields in line 16, saw 3.
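
    For anyone who wants to try the pre-filtering idea end to end, here is a hedged variant of it. The helper name read_csv_skipping_comments, the use of lstrip(), and the pass-through of keyword arguments are my own assumptions, not part of the workaround above.

```python
import os
import tempfile
from io import StringIO

import pandas as pd


def read_csv_skipping_comments(path, comment="#", **kwargs):
    """Hypothetical helper: drop whole-line comments, then parse the rest."""
    with open(path) as f:
        # lstrip() also catches comment markers preceded by whitespace.
        kept = [line for line in f if not line.lstrip().startswith(comment)]
    return pd.read_csv(StringIO("".join(kept)), **kwargs)


# Write the sample from the question to a temporary file and read it back.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("# notes\na,b,c\n# more notes\n1,2,3\n")

try:
    df = read_csv_skipping_comments(tmp.name, sep=",")
finally:
    os.remove(tmp.name)

print(df)
```

    Because the comment lines never reach the parser, this approach does not depend on how a given pandas version treats the comment parameter. It only removes whole-line comments, not trailing ones.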

  • 2020-12-05 07:29

    One workaround is to pass skiprows to skip the first few lines:

    In [11]: s = '# notes\na,b,c\n# more notes\n1,2,3'
    
    In [12]: pd.read_csv(StringIO(s), sep=',', comment='#', skiprows=1)
    Out[12]: 
        a   b   c
    0 NaN NaN NaN
    1   1   2   3
    

    Otherwise read_csv gets a little confused:

    In [13]: pd.read_csv(StringIO(s), sep=',', comment='#')
    Out[13]: 
            Unnamed: 0
    a   b            c
    NaN NaN        NaN
    1   2            3
    

    This seems to be the case in 0.12.0; I've filed a bug report.

    As Viktor points out, you can use dropna to remove the NaN rows after the fact (there is a recent open issue to have commented lines ignored completely):

    In [14]: pd.read_csv(StringIO(s2), comment='#', sep=',').dropna(how='all')
    Out[14]: 
       a  b  c
    1  1  2  3
    

    Note: the default index will "give away" the fact that there was missing data.
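
    If the gap in the index matters, one option is to renumber it after dropping the empty rows. The sketch below is only an illustration: since newer pandas versions may skip comment lines entirely, the all-NaN row is simulated here with an empty ",," record, which is my assumption rather than the data above.

```python
from io import StringIO

import pandas as pd

# The ",," record parses as a row of three missing values, standing in for
# the all-NaN row that a commented line produced in old pandas versions.
raw = "a,b,c\n,,\n1,2,3"

df = pd.read_csv(StringIO(raw))

# dropna(how='all') removes rows in which every column is NaN;
# reset_index(drop=True) renumbers the remaining rows from 0.
cleaned = df.dropna(how="all").reset_index(drop=True)
print(cleaned)
```

    Note that the columns stay floating point here, because the NaN row was present when the dtypes were inferred.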
