Python Pandas Error tokenizing data

不知归路 2020-11-22 04:49

I'm trying to use pandas to manipulate a .csv file, but I get this error:

pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 field

30 answers
  • 2020-11-22 05:25

    I've had a similar problem while trying to read a tab-delimited table with spaces, commas and quotes:

    1115794 4218    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", ""
    1144102 3180    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", "g__Bacillus", ""
    368444  2328    "k__Bacteria", "p__Bacteroidetes", "c__Bacteroidia", "o__Bacteroidales", "f__Bacteroidaceae", "g__Bacteroides", ""
    
    
    
    import pandas as pd
    # Same error for read_table
    counts = pd.read_csv(path_counts, sep='\t', index_col=2, header=None, engine = 'c')
    
    pandas.io.common.CParserError: Error tokenizing data. C error: out of memory
    

This suggests it has something to do with the C parsing engine (which is the default one). Maybe switching to the Python engine will change something:

    counts = pd.read_table(path_counts, sep='\t', index_col=2, header=None, engine='python')
    
    Segmentation fault (core dumped)
    

    Now that is a different error.
    If we go ahead and remove the spaces from the table, the error from the Python engine changes once again:

    1115794 4218    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae",""
    1144102 3180    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae","g__Bacillus",""
    368444  2328    "k__Bacteria","p__Bacteroidetes","c__Bacteroidia","o__Bacteroidales","f__Bacteroidaceae","g__Bacteroides",""
    
    
    _csv.Error: '   ' expected after '"'
    

    It becomes clear that pandas was having trouble parsing our rows. To parse the table with the Python engine, I needed to remove all spaces and quotes from the table beforehand. Meanwhile, the C engine kept crashing even with the commas removed from the rows.

    To avoid creating a new file with replacements I did this, as my tables are small:

    from io import StringIO
    with open(path_counts) as f:
        # Strip empty trailing fields, quotes, spaces after commas, and NUL bytes
        # (note: don't name this variable `input`, which shadows the builtin).
        cleaned = StringIO(f.read().replace('", ""', '').replace('"', '').replace(', ', ',').replace('\0', ''))
        counts = pd.read_table(cleaned, sep='\t', index_col=2, header=None, engine='python')
    

    tl;dr
    Change the parsing engine, and try to avoid any non-delimiting quotes/commas/spaces in your data.
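
    As an alternative to stripping the quotes by hand, pandas can be told to treat quote characters as ordinary data via `quoting=csv.QUOTE_NONE`. A minimal sketch with a hypothetical sample shaped like the table above (the `raw` string is made up for illustration):

    ```python
    import csv
    import io
    import pandas as pd

    # Hypothetical tab-delimited sample with quoted, comma-laden third field.
    raw = ('1115794\t4218\t"k__Bacteria","p__Firmicutes"\n'
           '368444\t2328\t"k__Bacteria","p__Bacteroidetes"\n')

    # quoting=csv.QUOTE_NONE makes the parser ignore quote characters
    # entirely, so the embedded commas and quotes stay inside field 3.
    df = pd.read_csv(io.StringIO(raw), sep='\t', header=None,
                     quoting=csv.QUOTE_NONE, engine='python')
    print(df.shape)
    ```

    This keeps the quotes in the data (you can strip them afterwards with `str.replace`), but it avoids rewriting the file.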

  • 2020-11-22 05:26

    Sometimes the problem is not how you use Python, but the raw data itself.
    I got this error message:

    Error tokenizing data. C error: Expected 18 fields in line 72, saw 19.
    

    It turned out that the values in the description column sometimes contained commas. This means the CSV file needs to be cleaned up, or a different separator used.
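
    If cleaning the file isn't practical, recent pandas can also just drop the malformed rows. A sketch with a made-up three-column CSV where one row has a stray comma (`on_bad_lines` exists in pandas >= 1.3; older versions used `error_bad_lines=False` instead):

    ```python
    import io
    import pandas as pd

    # Hypothetical data: the second data row has an extra comma in its
    # description, giving it 4 fields instead of 3.
    raw = ("id,description,value\n"
           "1,first item,10\n"
           "2,second, item,20\n"
           "3,third item,30\n")

    # on_bad_lines='skip' silently drops rows with too many fields.
    df = pd.read_csv(io.StringIO(raw), on_bad_lines='skip')
    print(len(df))
    ```

    Note that this discards data, so it is a diagnostic shortcut rather than a fix for the underlying file.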

  • 2020-11-22 05:26

    Use

    pandas.read_csv('CSVFILENAME', header=None, sep=', ')
    

    when trying to read CSV data from the link

    http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

    I copied the data from the site into my CSV file. It had extra spaces, so using sep=', ' worked :)
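
    A small sketch of this trick, using a hypothetical two-line snippet shaped like the adult.data file (comma followed by a space). Note that a multi-character separator forces pandas onto the slower Python engine:

    ```python
    import io
    import pandas as pd

    # Hypothetical rows in the "comma + space" style of adult.data.
    raw = ("39, State-gov, 77516\n"
           "50, Self-emp-not-inc, 83311\n")

    # sep=', ' absorbs the stray space after each comma; engine='python'
    # is required (and otherwise implied) for multi-character separators.
    df = pd.read_csv(io.StringIO(raw), header=None, sep=', ', engine='python')
    print(df.iloc[0, 1])
    ```

    An equivalent option that keeps the fast C engine is `sep=','` together with `skipinitialspace=True`.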

  • 2020-11-22 05:28

    You can try:

    data = pd.read_csv('file1.csv', sep='\t')
    
  • 2020-11-22 05:29

    It might be an issue with

    • the delimiters in your data
    • the first row, as @TomAugspurger noted

    To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance,

    df = pandas.read_csv(fileName, sep='delimiter', header=None)
    

    In the code above, sep defines your delimiter, and header=None tells pandas that your source data has no header row with column titles. As the docs put it: "If file contains no header row, then you should explicitly pass header=None". In that case, pandas automatically creates integer column labels {0, 1, 2, ...}.

    According to the docs, the delimiter should not be an issue: "if sep is None [not specified], will try to automatically determine this." However, I have not had good luck with this, even in cases with obvious delimiters.
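
    When `sep=None` fails, the standard library's `csv.Sniffer` can guess the delimiter explicitly, and you can pass its answer to pandas yourself. A minimal sketch with a made-up semicolon-delimited sample:

    ```python
    import csv
    import io
    import pandas as pd

    # Hypothetical semicolon-delimited data.
    raw = "a;b;c\n1;2;3\n4;5;6\n"

    # Let csv.Sniffer infer the dialect from a sample of the text,
    # then hand the detected delimiter to read_csv.
    dialect = csv.Sniffer().sniff(raw)
    df = pd.read_csv(io.StringIO(raw), sep=dialect.delimiter)
    print(list(df.columns))
    ```

    For a real file, you would sniff the first few kilobytes (`f.read(4096)`) rather than the whole thing.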

  • 2020-11-22 05:34

    Your CSV file might have a variable number of columns, and read_csv inferred the number of columns from the first few rows. Two ways to solve it in this case:

    1) Change the CSV file to add a dummy first line with the max number of columns (and specify header=[0]).

    2) Or use names=list(range(0, N)), where N is the max number of columns.
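
    A sketch of option 2, with a hypothetical ragged file whose first row is narrower than the later ones (the very case where inference from the first rows fails). Passing `names` with the maximum width makes pandas pad short rows with NaN instead of raising the tokenizing error:

    ```python
    import io
    import pandas as pd

    # Hypothetical ragged data: rows have 2, 3, and 4 fields.
    raw = "1,2\n3,4,5\n6,7,8,9\n"

    # names fixes the column count up front; shorter rows get NaN.
    df = pd.read_csv(io.StringIO(raw), header=None, names=list(range(4)))
    print(df.shape)
    ```

    If you don't know N in advance, a quick pre-pass such as `max(line.count(',') for line in open(path)) + 1` (assuming a simple comma-delimited file with no quoted fields) gives an upper bound.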
