Python Pandas Error tokenizing data

不知归路

I'm trying to use pandas to manipulate a .csv file, but I get this error:

```
pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 field
```

30 answers

無奈伤痛

2020-11-22 05:25

I've had a similar problem while trying to read a tab-delimited table with spaces, commas and quotes:

```
1115794 4218    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", ""
1144102 3180    "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", "g__Bacillus", ""
368444  2328    "k__Bacteria", "p__Bacteroidetes", "c__Bacteroidia", "o__Bacteroidales", "f__Bacteroidaceae", "g__Bacteroides", ""
```

```
import pandas as pd
# Same error for read_table
counts = pd.read_csv(path_counts, sep='\t', index_col=2, header=None, engine='c')
```

```
pandas.io.common.CParserError: Error tokenizing data. C error: out of memory
```

The message points at the C parsing engine, which is the default one. Maybe switching to the Python engine will change something:

```
counts = pd.read_table(path_counts, sep='\t', index_col=2, header=None, engine='python')
```

```
Segmentation fault (core dumped)
```

Now that is a different error.
If we go ahead and remove the spaces from the table, the error from the Python engine changes once again:

```
1115794 4218    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae",""
1144102 3180    "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae","g__Bacillus",""
368444  2328    "k__Bacteria","p__Bacteroidetes","c__Bacteroidia","o__Bacteroidales","f__Bacteroidaceae","g__Bacteroides",""
```

```
_csv.Error: '   ' expected after '"'
```

It became clear that pandas was having trouble parsing these rows. To parse the table with the Python engine, I had to remove all spaces and quotes from it beforehand. Meanwhile, the C engine kept crashing even with commas in the rows.

My tables are small, so to avoid creating a new file with the replacements, I did this:

```
from io import StringIO

with open(path_counts) as f:
    # Drop the trailing empty "" field, then all quotes, the space after each comma, and NUL bytes
    cleaned = StringIO(
        f.read().replace('", ""', '').replace('"', '').replace(', ', ',').replace('\0', '')
    )
counts = pd.read_table(cleaned, sep='\t', index_col=2, header=None, engine='python')
```

tl;dr
Change the parsing engine, and try to avoid any non-delimiting quotes, commas, or spaces in your data.
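
Another option, not part of the answer above and not verified against the original file, is to tell the parser to ignore quote characters altogether, so that only the tab acts as a delimiter. The sketch below assumes a made-up two-row sample in the same shape as the table above and uses the `quoting=csv.QUOTE_NONE` option of `read_csv`:

```python
import csv
from io import StringIO

import pandas as pd

# Made-up sample (assumption): two tab-separated numeric columns followed by
# a free-form taxonomy string that contains quotes, commas, and spaces.
raw = (
    '1115794\t4218\t"k__Bacteria", "p__Firmicutes", ""\n'
    '1144102\t3180\t"k__Bacteria", "p__Firmicutes", "g__Bacillus", ""\n'
)

# QUOTE_NONE makes the parser treat '"' as an ordinary character, so the tab
# is the only special character and every row has exactly three fields.
counts = pd.read_csv(StringIO(raw), sep="\t", header=None, quoting=csv.QUOTE_NONE)

# Clean the taxonomy column after parsing instead of editing the raw text.
counts[2] = counts[2].str.replace('"', "", regex=False)
```

This keeps the cleanup on the parsed column rather than on the whole file contents, which may be easier to reason about when other columns can legitimately contain quotes.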

后悔当初

2020-11-22 05:26
Sometimes the problem is not how you use Python but the raw data itself.
I got this error message:
```
Error tokenizing data. C error: Expected 18 fields in line 72, saw 19.
```
It turned out that the description column sometimes contained commas. This means that the CSV file needs to be cleaned up, or a different separator must be used.
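
To find which rows are affected before deciding how to clean them, one possible approach is to count fields per row with the standard `csv` module. This is only a sketch: the sample data and the helper name `find_ragged_rows` are made up for illustration.

```python
import csv
from io import StringIO

# Hypothetical sample: the row on line 3 has an unquoted comma in its description.
raw = (
    "id,description,price\n"
    "1,red widget,9.99\n"
    "2,blue widget, large,12.50\n"
    "3,green widget,7.25\n"
)


def find_ragged_rows(handle, delimiter=","):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(handle, delimiter=delimiter)
    expected = len(next(reader))  # field count of the header row
    return [(reader.line_num, len(row)) for row in reader if len(row) != expected]


bad_rows = find_ragged_rows(StringIO(raw))
print(bad_rows)
```

With a real file, pass an open file handle instead of the `StringIO` object.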
渐次进展

2020-11-22 05:26

Use pandas.read_csv('CSVFILENAME', header=None, sep=', ')

I ran into this when trying to read CSV data from the link

http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

I copied the data from the site into my CSV file. Each comma was followed by an extra space, so I used sep=', ' and it worked :)
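
A multi-character `sep` such as `', '` is interpreted by pandas as a regular expression and makes it fall back to the Python engine. An alternative sketch, assuming the only problem is a space after each comma, is the `skipinitialspace` option of `read_csv`. The sample rows below are made up and only imitate that layout:

```python
from io import StringIO

import pandas as pd

# Made-up rows (assumption) with a space after every comma.
raw = "30, Private, 1000, Bachelors\n41, Local-gov, 2000, Masters\n"

# skipinitialspace=True drops whitespace that directly follows the delimiter,
# so the default single-character sep=',' can be kept.
df = pd.read_csv(StringIO(raw), header=None, skipinitialspace=True)
```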

爱一瞬间的悲伤

2020-11-22 05:28
You can try:
```
data = pd.read_csv('file1.csv', sep='\t')
```
南笙

2020-11-22 05:29
It might be an issue with
- the delimiters in your data
- the first row, as @TomAugspurger noted
To solve it, try specifying the sep and/or header arguments when calling read_csv. For instance,
```
df = pandas.read_csv(fileName, sep='delimiter', header=None)
```
In the code above, sep defines your delimiter and header=None tells pandas that your source data has no row for headers / column titles. Thus saith the docs: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indices for each field {0,1,2,...}.

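According to the docs, the delimiter should not be an issue. The docs say that "if sep is None [not specified], will try to automatically determine this." I have not had good luck with this, however, even in cases with obvious delimiters. Note that in current pandas versions the default is sep=',', so detection only happens when sep=None is passed explicitly, and it is handled by the Python engine.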
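
A minimal sketch of delimiter detection, assuming a made-up semicolon-separated sample. Passing `engine="python"` alongside `sep=None` states the engine choice explicitly:

```python
from io import StringIO

import pandas as pd

# Made-up sample (assumption): semicolon-separated with a header row.
raw = "name;score\nalice;3\nbob;5\n"

# sep=None asks the Python engine to sniff the delimiter with csv.Sniffer.
df = pd.read_csv(StringIO(raw), sep=None, engine="python")
```

Detection is a heuristic, so specifying `sep` explicitly is the safer choice whenever the delimiter is known.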
南方客

2020-11-22 05:34

Your CSV file might have a variable number of columns, and read_csv inferred the number of columns from the first few rows. There are two ways to solve it in this case:

1) Change the CSV file to have a dummy first line with the maximum number of columns (and specify header=[0])

2) Or use names=list(range(N)), where N is the maximum number of columns.
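
The second option can be sketched as follows. The sample data is made up, and the field count uses a naive comma count that assumes no field contains a quoted comma:

```python
from io import StringIO

import pandas as pd

# Made-up ragged sample (assumption): rows with 2, 4, and 1 fields.
raw = "a,b\nc,d,e,f\ng\n"

# First pass: find the widest row (naive, ignores quoting).
max_cols = max(line.count(",") for line in raw.splitlines()) + 1

# Second pass: explicit column names tell pandas how wide the table is;
# shorter rows are padded with NaN.
df = pd.read_csv(StringIO(raw), header=None, names=list(range(max_cols)))
```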
