Python Pandas Error tokenizing data

不知归路 2020-11-22 04:49

I'm trying to use pandas to manipulate a .csv file but I get this error:

pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 field

30 answers
  • 2020-11-22 05:34

    As far as I can tell, and after taking a look at your file, the problem is that the CSV file you're trying to load contains multiple tables: there are empty lines, or lines that contain table titles, between them. Have a look at this Stack Overflow answer, which shows how to achieve that programmatically.

    A more dynamic approach would be to use the csv module: read one row at a time and apply sanity checks or regular expressions to infer whether the row is a title, a header, values, or blank. A further advantage of this approach is that you can split/append/collect your data into Python objects as desired; see the sketch below.
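    A minimal sketch of that idea, assuming a comma-separated file ("data.csv" is a placeholder path, and treating a single-field row as a table title is just one possible sanity check):

    import csv

    tables = {}                 # rows collected per table, keyed by the last title seen
    current_title = "untitled"

    with open("data.csv", newline="") as f:
        for row in csv.reader(f):
            if not any(cell.strip() for cell in row):
                continue                        # skip blank separator lines
            if len(row) == 1:
                current_title = row[0].strip()  # a lone field is treated as a table title
                tables[current_title] = []
            else:
                tables.setdefault(current_title, []).append(row)

    # each entry in tables can now become its own DataFrame, e.g.
    # pd.DataFrame(rows[1:], columns=rows[0]) for rows = tables[title]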

    The easiest option of all, if you can open the CSV in Excel or something similar, is pandas' pd.read_clipboard() after manually selecting and copying the table to the clipboard.

    Irrelevant:

    Additionally, irrelevant to your problem, but because no one has mentioned it: I had this same issue when loading some datasets such as seeds_dataset.txt from UCI. In my case, the error occurred because some separators contained more whitespace than a single tab \t. See line 3 in the following, for instance:

    14.38   14.21   0.8951  5.386   3.312   2.462   4.956   1
    14.69   14.49   0.8799  5.563   3.259   3.586   5.219   1
    14.11   14.1    0.8911  5.42    3.302   2.7     5       1
    

    Therefore, use \t+ as the separator pattern instead of \t. (A regex separator like this is handled by the Python parsing engine; pandas falls back to it automatically, or you can pass engine='python' explicitly to avoid the warning.)

    data = pd.read_csv(path, sep='\t+', header=None, engine='python')
    
  • 2020-11-22 05:36

    I had this problem, where I was trying to read in a CSV without passing in column names.

    df = pd.read_csv(filename, header=None)
    

    I specified the column names in a list beforehand and then passed them into names, and that solved it immediately. If you don't have set column names, you can just create as many placeholder names as the maximum number of columns that might be in your data; a sketch of that follows the example below.

    col_names = ["col1", "col2", "col3", ...]
    df = pd.read_csv(filename, names=col_names)
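
    A sketch of the placeholder-name idea, assuming the file is comma-separated (the widest-row scan is an assumption, not part of the original answer):

    import csv

    import pandas as pd

    # find the widest row first, then hand read_csv that many dummy names
    with open(filename, newline="") as f:
        max_cols = max(len(row) for row in csv.reader(f))

    col_names = [f"col{i}" for i in range(max_cols)]
    df = pd.read_csv(filename, names=col_names)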
    
  • 2020-11-22 05:36

    The following worked for me (I posted this answer because I specifically had this problem in a Google Colaboratory notebook):

    df = pd.read_csv("/path/foo.csv", delimiter=';', skiprows=0, low_memory=False)
    
  • 2020-11-22 05:36

    I have encountered this error with a stray quotation mark. I use mapping software that puts quotation marks around text items when exporting comma-delimited files. Text that uses quote marks (e.g. ' = feet and " = inches) can be problematic when they induce delimiter collisions. Consider this example, which notes that a 5-inch well log print is poor:

    UWI_key,Latitude,Longitude,Remark
    US42051316890000,30.4386484,-96.4330734,"poor 5""

    Using 5" as shorthand for 5 inches ends up throwing a wrench in the works. Excel will simply strip off the extra quote mark, but pandas breaks down without the error_bad_lines=False argument mentioned above (renamed on_bad_lines in pandas 1.3+).
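
    For reference, a hedged sketch of skipping the offending rows (on_bad_lines replaced error_bad_lines in pandas 1.3+; "wells.csv" is a placeholder path):

    import pandas as pd

    # skip any row the tokenizer cannot parse and keep going
    df = pd.read_csv("wells.csv", on_bad_lines="skip")

    # on older pandas versions the equivalent is:
    # df = pd.read_csv("wells.csv", error_bad_lines=False)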

  • 2020-11-22 05:38

    I had the same problem with read_csv: ParserError: Error tokenizing data. I just saved the old CSV file to a new CSV file, and the problem was solved!

  • The parser is getting confused by the header of the file. It reads the first row and infers the number of columns from that row. But the first two rows aren't representative of the actual data in the file.

    Try it with:

    data = pd.read_csv(path, skiprows=2)
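
    If you're not sure how many leading rows to skip, a quick sketch like this (reusing path from above) prints the first few raw lines so you can count them:

    with open(path) as f:
        for i, line in enumerate(f):
            print(i, repr(line))
            if i >= 4:          # only inspect the first five lines
                break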
