I\'m trying to use pandas to manipulate a .csv file but I get this error:
pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 field
I've had a similar problem while trying to read a tab-delimited table with spaces, commas and quotes:
1115794 4218 "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", ""
1144102 3180 "k__Bacteria", "p__Firmicutes", "c__Bacilli", "o__Bacillales", "f__Bacillaceae", "g__Bacillus", ""
368444 2328 "k__Bacteria", "p__Bacteroidetes", "c__Bacteroidia", "o__Bacteroidales", "f__Bacteroidaceae", "g__Bacteroides", ""
import pandas as pd
# Same error for read_table
counts = pd.read_csv(path_counts, sep='\t', index_col=2, header=None, engine = 'c')
pandas.io.common.CParserError: Error tokenizing data. C error: out of memory
This says it has something to do with C parsing engine (which is the default one). Maybe changing to a python one will change anything
counts = pd.read_table(path_counts, sep='\t', index_col=2, header=None, engine='python')
Segmentation fault (core dumped)
Now that is a different error.
If we go ahead and try to remove spaces from the table, the error from python-engine changes once again:
1115794 4218 "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae",""
1144102 3180 "k__Bacteria","p__Firmicutes","c__Bacilli","o__Bacillales","f__Bacillaceae","g__Bacillus",""
368444 2328 "k__Bacteria","p__Bacteroidetes","c__Bacteroidia","o__Bacteroidales","f__Bacteroidaceae","g__Bacteroides",""
_csv.Error: ' ' expected after '"'
And it gets clear that pandas was having problems parsing our rows. To parse a table with python engine I needed to remove all spaces and quotes from the table beforehand. Meanwhile C-engine kept crashing even with commas in rows.
To avoid creating a new file with replacements I did this, as my tables are small:
from io import StringIO
with open(path_counts) as f:
input = StringIO(f.read().replace('", ""', '').replace('"', '').replace(', ', ',').replace('\0',''))
counts = pd.read_table(input, sep='\t', index_col=2, header=None, engine='python')
tl;dr
Change parsing engine, try to avoid any non-delimiting quotes/commas/spaces in your data.
Sometimes the problem is not how to use python, but with the raw data.
I got this error message
Error tokenizing data. C error: Expected 18 fields in line 72, saw 19.
It turned out that in the column description there were sometimes commas. This means that the CSV file needs to be cleaned up or another separator used.
use
pandas.read_csv('CSVFILENAME',header=None,sep=', ')
when trying to read csv data from the link
http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
I copied the data from the site into my csvfile. It had extra spaces so used sep =', ' and it worked :)
You can try;
data = pd.read_csv('file1.csv', sep='\t')
It might be an issue with
To solve it, try specifying the sep
and/or header
arguments when calling read_csv
. For instance,
df = pandas.read_csv(fileName, sep='delimiter', header=None)
In the code above, sep
defines your delimiter and header=None
tells pandas that your source data has no row for headers / column titles. Thus saith the docs: "If file contains no header row, then you should explicitly pass header=None". In this instance, pandas automatically creates whole-number indices for each field {0,1,2,...}.
According to the docs, the delimiter thing should not be an issue. The docs say that "if sep is None [not specified], will try to automatically determine this." I however have not had good luck with this, including instances with obvious delimiters.
Your CSV file might have variable number of columns and read_csv
inferred the number of columns from the first few rows. Two ways to solve it in this case:
1) Change the CSV file to have a dummy first line with max number of columns (and specify header=[0]
)
2) Or use names = list(range(0,N))
where N is the max number of columns.