Pandas: How to work around “error tokenizing data”?

盖世英雄少女心 2021-02-03 10:51

A lot of questions have already been asked about this topic on SO (and on many other sites). Among the numerous answers, none has really been helpful to me so far. If I missed

4 answers
  •  深忆病人
    2021-02-03 11:04

    Read the CSV with Python's tolerant csv module and fix the loaded rows before handing them to pandas. pandas fails on this kind of malformed CSV data regardless of which CSV engine it uses.

    import pandas as pd
    import csv
    
    not_csv = """1,2,3,4,5
    1,2,3,4,5,6
    ,,3,4,5
    1,2,3,4,5,6,7
    ,2,,4
    """
    
    with open('not_a.csv', 'w') as csvfile:
        csvfile.write(not_csv)
    
    d = []
    with open('not_a.csv', newline='') as csvfile:
        areader = csv.reader(csvfile)
        max_elems = 0
        for row in areader:
            if max_elems < len(row): max_elems = len(row)
        csvfile.seek(0)
        for row in areader:
            # fix my csv by padding short rows with empty strings
            d.append(row + [""] * (max_elems - len(row)))
    
    df = pd.DataFrame(d)
    print(df)
    
    # the default engine
    # raises "pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 6 "
    #df = pd.read_csv('not_a.csv', header=None, engine='c')
    
    # the python csv engine
    # raises "pandas.errors.ParserError: Expected 6 fields in line 4, saw 7 "
    #df = pd.read_csv('not_a.csv', header=None, engine='python')
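
The padding step above can also be written as a single read followed by a transpose, which may be easier to reuse. This is only a sketch using the standard library: the function name `read_ragged_csv` and its `fill` parameter are made up for illustration, and it assumes the whole file fits in memory.

```python
import csv
import io
from itertools import zip_longest


def read_ragged_csv(stream, fill=""):
    """Read all rows, then pad short rows with `fill` up to the widest row.

    `read_ragged_csv` is a hypothetical helper, not part of pandas or csv.
    """
    rows = list(csv.reader(stream))
    # zip_longest transposes rows into columns and pads the missing cells;
    # the outer zip transposes the padded columns back into rows.
    return [list(r) for r in zip(*zip_longest(*rows, fillvalue=fill))]


# Same ragged sample as in the answer above.
sample = "1,2,3,4,5\n1,2,3,4,5,6\n,,3,4,5\n1,2,3,4,5,6,7\n,2,,4\n"
padded = read_ragged_csv(io.StringIO(sample))
for row in padded:
    print(row)
```

The resulting list of equal-length rows can then be passed to `pd.DataFrame(padded)` just like `d` in the code above.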
    
    

    Alternatively, preprocess the file outside of Python if you would rather not add more code to your Python script.

    Richs-MBP:tmp randrews$ cat test.csv
    1,2,3
    1,
    2
    1,2,
    ,,,
    Richs-MBP:tmp randrews$ awk 'BEGIN {FS=","}; {print $1","$2","$3","$4","$5}' < test.csv
    1,2,3,,
    1,,,,
    2,,,,
    1,2,,,
    ,,,,
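
For comparison, roughly the same fixed-width normalization as the awk one-liner can be sketched in Python if awk is not available. The helper name `normalize_width` and its parameters are assumptions for illustration. Like the awk command, it forces every row to a fixed number of fields, so rows with more fields than `width` are truncated.

```python
import csv
import io


def normalize_width(src, dst, width, fill=""):
    """Copy CSV rows from src to dst, padding or truncating each to `width` fields.

    `normalize_width` is a hypothetical helper; it streams row by row,
    so the whole file never has to be held in memory.
    """
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        # Append `width` filler cells, then cut back to exactly `width`.
        writer.writerow((row + [fill] * width)[:width])


# Same input as test.csv in the shell session above.
src = io.StringIO("1,2,3\n1,\n2\n1,2,\n,,,\n")
dst = io.StringIO()
normalize_width(src, dst, width=5)
print(dst.getvalue(), end="")
```

With real files, `src` and `dst` would be file objects opened with `newline=''`, and the cleaned output file can then be read with `pd.read_csv(..., header=None)`.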
    
