Python Pandas Error tokenizing data

不知归路 2020-11-22 04:49

I'm trying to use pandas to manipulate a .csv file but I get this error:

pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 field

30 answers
  • 2020-11-22 05:34

    As far as I can tell, and after taking a look at your file, the problem is that the CSV file you're trying to load contains multiple tables: there are empty lines, or lines that contain table titles, between them. Have a look at this Stack Overflow answer, which shows how to achieve that programmatically.

    A more dynamic approach would be to use the csv module: read one row at a time and apply sanity checks or regular expressions to infer whether the row is a title, a header, values, or blank. A further advantage of this approach is that you can split/append/collect your data into Python objects as desired; see the sketch below.
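    A minimal sketch of that idea, assuming a comma-separated file ("data.csv" is a placeholder path, and treating a single-field row as a table title is just one possible sanity check):

    import csv

    tables = {}                 # rows collected per table, keyed by the last title seen
    current_title = "untitled"

    with open("data.csv", newline="") as f:
        for row in csv.reader(f):
            if not any(cell.strip() for cell in row):
                continue                        # skip blank separator lines
            if len(row) == 1:
                current_title = row[0].strip()  # a lone field is treated as a table title
                tables[current_title] = []
            else:
                tables.setdefault(current_title, []).append(row)

    # each entry in tables can now become its own DataFrame, e.g.
    # pd.DataFrame(rows[1:], columns=rows[0]) for rows = tables[title]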

    The easiest option of all, if you can open the CSV in Excel or something similar, is pandas' pd.read_clipboard() after manually selecting and copying the table to the clipboard.

    Irrelevant:

    Additionally, irrelevant to your problem, but because no one has mentioned it: I had this same issue when loading some datasets such as seeds_dataset.txt from UCI. In my case, the error occurred because some separators contained more whitespace than a single tab \t. See line 3 in the following, for instance:

    14.38   14.21   0.8951  5.386   3.312   2.462   4.956   1
    14.69   14.49   0.8799  5.563   3.259   3.586   5.219   1
    14.11   14.1    0.8911  5.42    3.302   2.7     5       1
    

    Therefore, use \t+ as the separator pattern instead of \t. (A regex separator like this is handled by the Python parsing engine; pandas falls back to it automatically, or you can pass engine='python' explicitly to avoid the warning.)

    data = pd.read_csv(path, sep='\t+', header=None, engine='python')
    
  • 2020-11-22 05:36

    I had this problem, where I was trying to read in a CSV without passing in column names.

    df = pd.read_csv(filename, header=None)
    

    I specified the column names in a list beforehand and then passed them into names, and that solved it immediately. If you don't have set column names, you can just create as many placeholder names as the maximum number of columns that might be in your data; a sketch of that follows the example below.

    col_names = ["col1", "col2", "col3", ...]
    df = pd.read_csv(filename, names=col_names)
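
    A sketch of the placeholder-name idea, assuming the file is comma-separated (the widest-row scan is an assumption, not part of the original answer):

    import csv

    import pandas as pd

    # find the widest row first, then hand read_csv that many dummy names
    with open(filename, newline="") as f:
        max_cols = max(len(row) for row in csv.reader(f))

    col_names = [f"col{i}" for i in range(max_cols)]
    df = pd.read_csv(filename, names=col_names)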
    
  • 2020-11-22 05:36

    The following worked for me (I posted this answer because I specifically had this problem in a Google Colaboratory notebook):

    df = pd.read_csv("/path/foo.csv", delimiter=';', skiprows=0, low_memory=False)
    
  • 2020-11-22 05:36

    I have encountered this error with a stray quotation mark. I use mapping software that puts quotation marks around text items when exporting comma-delimited files. Text that uses quote marks (e.g. ' = feet and " = inches) can be problematic when they induce delimiter collisions. Consider this example, which notes that a 5-inch well log print is poor:

    UWI_key,Latitude,Longitude,Remark
    US42051316890000,30.4386484,-96.4330734,"poor 5""

    Using 5" as shorthand for 5 inches ends up throwing a wrench in the works. Excel will simply strip off the extra quote mark, but pandas breaks down without the error_bad_lines=False argument mentioned above (renamed on_bad_lines in pandas 1.3+).
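
    For reference, a hedged sketch of skipping the offending rows (on_bad_lines replaced error_bad_lines in pandas 1.3+; "wells.csv" is a placeholder path):

    import pandas as pd

    # skip any row the tokenizer cannot parse and keep going
    df = pd.read_csv("wells.csv", on_bad_lines="skip")

    # on older pandas versions the equivalent is:
    # df = pd.read_csv("wells.csv", error_bad_lines=False)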

  • 2020-11-22 05:38

    I had the same problem with read_csv: ParserError: Error tokenizing data. I just saved the old CSV file to a new CSV file, and the problem was solved!

  • The parser is getting confused by the header of the file. It reads the first row and infers the number of columns from that row. But the first two rows aren't representative of the actual data in the file.

    Try it with:

    data = pd.read_csv(path, skiprows=2)
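
    If you're not sure how many leading rows to skip, a quick sketch like this (reusing path from above) prints the first few raw lines so you can count them:

    with open(path) as f:
        for i, line in enumerate(f):
            print(i, repr(line))
            if i >= 4:          # only inspect the first five lines
                break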
