I'm trying to use pandas to manipulate a .csv file, but I get this error:
pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 field
The issue for me was that a new column was appended to my CSV intraday. The accepted answer's solution would not work, because every future row (the ones carrying the new column) would be discarded if I used error_bad_lines=False.
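To illustrate the failure mode, here is a minimal sketch with a made-up in-memory CSV (the column names and values are hypothetical; in recent pandas versions the exception is pandas.errors.ParserError, and error_bad_lines=False has been superseded by on_bad_lines='skip'):

import io
import pandas as pd

# Hypothetical file: the header has two columns, but rows written later in the
# day carry a third, newly appended column.
data = (
    "foo,bar\n"
    "1,2\n"
    "3,4,99\n"  # extra field -> "Error tokenizing data. C error: Expected 2 fields ..."
)

try:
    pd.read_csv(io.StringIO(data))
except pd.errors.ParserError as exc:
    print(exc)

# Skipping bad lines parses the file, but silently drops every row that carries the new column.
df = pd.read_csv(io.StringIO(data), on_bad_lines="skip")
print(df)  # only the "1,2" row survives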
The solution in this case was to use the usecols parameter in pd.read_csv(). This way I can specify only the columns that I need to read from the CSV, and my Python code remains resilient to future CSV changes as long as the header row exists and those column names do not change.
From the pandas read_csv documentation on usecols:

usecols : list-like or callable, optional
Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
import pandas as pd

my_columns = ['foo', 'bar', 'bob']
df = pd.read_csv(file_path, usecols=my_columns)  # columns not listed are ignored, even brand-new ones
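The documentation excerpt above also notes that usecols accepts a callable and that element order is ignored. A small sketch of both points, assuming the same hypothetical column names and an in-memory CSV (with the callable form, a wanted column that is missing from the file is simply skipped instead of raising an error):

import io
import pandas as pd

data = "foo,extra,bar,bob\n1,2,3,4\n"
wanted = {'foo', 'bar', 'bob'}

# Callable form: keep a column if its name is in the wanted set.
df = pd.read_csv(io.StringIO(data), usecols=lambda name: name in wanted)
print(df.columns.tolist())  # ['foo', 'bar', 'bob'] -- file order, not the order of `wanted`

# List form with an explicit reorder afterwards, as the docs show, to force a specific column order.
df = pd.read_csv(io.StringIO(data), usecols=['foo', 'bar', 'bob'])[['bob', 'foo', 'bar']]
print(df.columns.tolist())  # ['bob', 'foo', 'bar']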
Another benefit of this is that I can load way less data into memory if I am only using 3-4 columns of a CSV that has 18-20 columns.
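A rough way to see the memory saving is to compare memory_usage for a full read versus a usecols read; this is just a sketch with a hypothetical 20-column CSV, and the exact numbers will depend on dtypes and data:

import io
import pandas as pd

# Hypothetical wide CSV: 20 integer columns, of which only 3 are needed.
header = ",".join(f"c{i}" for i in range(20))
rows = "\n".join(",".join(str(i + j) for j in range(20)) for i in range(1000))
data = f"{header}\n{rows}\n"

full = pd.read_csv(io.StringIO(data))
subset = pd.read_csv(io.StringIO(data), usecols=['c0', 'c1', 'c2'])

print(full.memory_usage(deep=True).sum())    # all 20 columns
print(subset.memory_usage(deep=True).sum())  # roughly 3/20 of the above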