I'm trying to use pandas to manipulate a .csv file but I get this error:
pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 field
The issue for me was that a new column was appended to my CSV intraday. The accepted answer's solution would not work, as every future row would be discarded if I used error_bad_lines=False.
The solution in this case was to use the usecols parameter in pd.read_csv(). This way I can specify only the columns that I need to read from the CSV, and my Python code will remain resilient to future CSV changes so long as a header row exists (and the column names do not change).
usecols : list-like or callable, optional Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
my_columns = ['foo', 'bar', 'bob']
df = pd.read_csv(file_path, usecols=my_columns)
Another benefit of this is that I can load way less data into memory if I am only using 3-4 columns of a CSV that has 18-20 columns.
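Here is a minimal sketch of the usecols approach on a small in-memory CSV. The sample data and the "extra" column are made up for illustration. Since usecols ignores element order, the result is indexed with the same list afterwards to get the order I want.

```python
import io

import pandas as pd

# Hypothetical sample data: the file has gained an "extra" column
# that the code does not care about.
sample = io.StringIO(
    "foo,bar,bob,extra\n"
    "1,2,3,x\n"
    "4,5,6,y\n"
)

my_columns = ["bob", "foo", "bar"]

# usecols limits parsing to the named columns; indexing with the same
# list afterwards fixes the column order, which usecols itself ignores.
df = pd.read_csv(sample, usecols=my_columns)[my_columns]
print(df)
```

The "extra" column is never loaded, so adding or removing unrelated columns in the file should not affect this code.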
You could also try:
data = pd.read_csv('file1.csv', error_bad_lines=False)
Do note that this will cause the offending lines to be skipped.
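On newer pandas versions, error_bad_lines was deprecated in 1.3 and removed in 2.0; the replacement is the on_bad_lines parameter. A small sketch, assuming pandas 1.3 or later and made-up sample data:

```python
import io

import pandas as pd

# Hypothetical sample: the third line has three fields instead of two.
sample = io.StringIO(
    "a,b\n"
    "1,2\n"
    "3,4,5\n"
    "6,7\n"
)

# on_bad_lines="skip" drops rows with too many fields instead of raising.
data = pd.read_csv(sample, on_bad_lines="skip")
print(data)
```

Passing on_bad_lines="warn" instead should skip the same rows but emit a warning for each one, which makes it easier to see what was lost.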
This is most likely a delimiter issue, as many CSV files are created using sep='\t'. Try read_csv with the tab character (\t) as the separator, using the following line of code:
data=pd.read_csv("File_path", sep='\t')
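If you are not sure which delimiter the file uses, the standard library's csv.Sniffer can guess it from a sample of the text. This is only a rough sketch: the sample string and the list of candidate delimiters are assumptions, and sniffing is heuristic, so it can guess wrong on unusual files.

```python
import csv

# Hypothetical sample: in practice this would be the first few KB of the file.
sample = "name\tage\tcity\nAnn\t31\tOslo\nBob\t45\tLima\n"

# Restrict the guess to a few plausible delimiters.
dialect = csv.Sniffer().sniff(sample, delimiters=",\t;|")
print(repr(dialect.delimiter))
```

The detected dialect.delimiter can then be passed as sep to pd.read_csv.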
In my case, it is because the format of the first and last two lines of the csv file is different from the middle content of the file.
So what I do is open the CSV file as a string, parse the content of the string, then use read_csv to get a DataFrame.
from io import StringIO
import pandas as pd
# file_path and file_name are assumed to be defined elsewhere
with open(f'{file_path}/{file_name}', 'r') as file:
    content = file.read()
# change new line character from '\r\n' to '\n'
lines = content.replace('\r', '').split('\n')
# Remove the first and last 2 lines of the file
# StringIO can be considered as a file stored in memory
df = pd.read_csv(StringIO("\n".join(lines[2:-2])), header=None)
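If the odd lines are always a fixed number at the top and bottom, read_csv can also skip them itself with skiprows and skipfooter. A sketch assuming two junk lines at each end (the sample data is made up); skipfooter needs engine='python', since the C engine does not support it.

```python
import io

import pandas as pd

# Hypothetical file content: two junk lines, two data rows, two junk lines.
sample = "\n".join([
    "Report export",
    "Generated by some tool",
    "1,2,3",
    "4,5,6",
    "Total rows: 2",
    "End of report",
])

# skiprows drops the leading lines, skipfooter the trailing ones.
df = pd.read_csv(
    io.StringIO(sample),
    skiprows=2,
    skipfooter=2,
    header=None,
    engine="python",
)
print(df)
```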
Pass the delimiter parameter explicitly:
pd.read_csv(filename, delimiter=",", encoding='utf-8')
With the delimiter specified, the file should be read.
I had this problem as well but perhaps for a different reason. I had some trailing commas in my CSV that were adding an additional column that pandas was attempting to read. Using the following works but it simply ignores the bad lines:
data = pd.read_csv('file1.csv', error_bad_lines=False)
If you want to keep the lines, an ugly kind of hack for handling the errors is to do something like the following:
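import pandas as pd

line = []      # zero-based indices of the bad rows, passed to skiprows
expected = []  # number of fields pandas expected on each bad row
saw = []       # number of fields pandas actually saw
cont = True
while cont:
    try:
        data = pd.read_csv('file1.csv', skiprows=line)
        cont = False
    except Exception as e:
        # str(e) replaces e.message, which no longer exists in Python 3
        message = str(e)
        errortype = message.split('.')[0].strip()
        if errortype == 'Error tokenizing data':
            cerror = message.split(':')[1].strip().replace(',', '')
            nums = [n for n in cerror.split(' ') if str.isdigit(n)]
            expected.append(int(nums[0]))
            saw.append(int(nums[2]))
            line.append(int(nums[1]) - 1)
        else:
            cerror = 'Unknown'
            print('Unknown Error - 222')
            cont = False  # stop retrying on an unrelated error
if line != []:
    # Handle the errors however you want
    pass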
I proceeded to write a script to reinsert the lines into the DataFrame since the bad lines will be given by the variable 'line' in the above code. This can all be avoided by simply using the csv reader. Hopefully the pandas developers can make it easier to deal with this situation in the future.
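For reference, here is a rough sketch of what the csv reader route could look like for the trailing-comma case. The sample data and the policy of truncating surplus fields are my assumptions, not the only way to handle it.

```python
import csv
import io

# Hypothetical sample: the second data row has a trailing comma,
# which produces an extra empty field.
sample = io.StringIO(
    "a,b\n"
    "1,2\n"
    "3,4,\n"
    "5,6\n"
)

reader = csv.reader(sample)
header = next(reader)

good_rows = []
bad_rows = []  # (line number, raw fields) for rows with the wrong width
for row in reader:
    if len(row) == len(header):
        good_rows.append(row)
    else:
        bad_rows.append((reader.line_num, row))
        # One possible policy: keep the row but drop the surplus fields.
        good_rows.append(row[: len(header)])

print(good_rows)
print(bad_rows)
```

The cleaned rows can then be turned into a DataFrame with pd.DataFrame(good_rows, columns=header). Note that every value is still a string at that point.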