Python Pandas Error tokenizing data

Asked by 不知归路 on 2020-11-22 04:49

I'm trying to use pandas to manipulate a .csv file but I get this error:

pandas.parser.CParserError: Error tokenizing data. C error: Expected 2 field

30 Answers
  • 2020-11-22 05:41

    The issue for me was that a new column was appended to my CSV intraday. The accepted answer solution would not work as every future row would be discarded if I used error_bad_lines=False.

    The solution in this case was to use the usecols parameter in pd.read_csv(). This way I can specify only the columns that I need to read from the CSV, and my Python code will remain resilient to future CSV changes, so long as a header row exists (and the column names do not change).

    usecols : list-like or callable, optional 
    
    Return a subset of the columns. If list-like, all elements must either
    be positional (i.e. integer indices into the document columns) or
    strings that correspond to column names provided either by the user in
    names or inferred from the document header row(s). For example, a
    valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar',
    'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1,
    0]. To instantiate a DataFrame from data with element order preserved
    use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for
    columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo',
    'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
    

    Example

    my_columns = ['foo', 'bar', 'bob']
    df = pd.read_csv(file_path, usecols=my_columns)
    

    Another benefit of this is that I can load way less data into memory if I am only using 3-4 columns of a CSV that has 18-20 columns.
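A minimal sketch of the idea above, using a hypothetical in-memory CSV where an "extra" column was appended later (the file path and column names here are made up for illustration):

```python
import io

import pandas as pd

# Hypothetical CSV: an "extra" column was appended after the code was written
csv_text = "foo,bar,bob,extra\n1,2,3,99\n4,5,6,98\n"

# usecols restricts parsing to the named columns, so the new column is ignored
df = pd.read_csv(io.StringIO(csv_text), usecols=['foo', 'bar', 'bob'])
print(list(df.columns))  # → ['foo', 'bar', 'bob']
```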

  • 2020-11-22 05:42

    You could also try:

    data = pd.read_csv('file1.csv', error_bad_lines=False)
    

    Do note that this will cause the offending lines to be skipped.
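Note that in recent pandas versions (1.3 and later), error_bad_lines is deprecated in favor of the on_bad_lines parameter. A small sketch with a made-up in-memory CSV containing one ragged row:

```python
import io

import pandas as pd

# Ragged CSV: the second data row has an extra field
bad_csv = "a,b\n1,2\n3,4,5\n6,7\n"

# on_bad_lines='skip' drops the offending row instead of raising
df = pd.read_csv(io.StringIO(bad_csv), on_bad_lines='skip')
print(len(df))  # → 2
```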

  • 2020-11-22 05:42

    This may well be a delimiter issue: many files with a .csv extension are actually tab-separated. Try reading the file with the tab character (\t) as the separator:

    data = pd.read_csv("File_path", sep='\t')
    
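If you are not sure which delimiter a file uses, the standard library's csv.Sniffer can guess it from a sample before you call read_csv. A sketch with a made-up tab-separated sample:

```python
import csv

# A sample of the file's contents (here, tab-separated for illustration)
sample = "x\ty\tz\n1\t2\t3\n"

# Sniffer guesses the delimiter, restricted to a few likely candidates
dialect = csv.Sniffer().sniff(sample, delimiters=",\t;")
print(repr(dialect.delimiter))  # → '\t'
```

The detected delimiter can then be passed along, e.g. pd.read_csv(path, sep=dialect.delimiter).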
  • 2020-11-22 05:42

    In my case, the format of the first and last two lines of the csv file differed from the rest of the file.

    So I open the csv file, read its contents as a string, trim the problem lines, and then pass the result to read_csv to get a DataFrame.

    from io import StringIO

    import pandas as pd

    with open(f'{file_path}/{file_name}', 'r') as file:
        content = file.read()

    # Normalize line endings from '\r\n' to '\n'
    lines = content.replace('\r', '').split('\n')

    # Drop the first 2 and last 2 lines of the file;
    # StringIO lets read_csv treat the string as an in-memory file
    df = pd.read_csv(StringIO("\n".join(lines[2:-2])), header=None)
    
  • 2020-11-22 05:43

    Pass the delimiter explicitly as a parameter:

    pd.read_csv(filename, delimiter=",", encoding='utf-8')
    

    It should then read correctly.
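If the delimiter is unknown, pandas can also be asked to detect it: passing sep=None together with engine='python' makes read_csv sniff the separator from the file. A sketch with a made-up semicolon-separated sample:

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data
semi = "a;b;c\n1;2;3\n"

# sep=None with the python engine tells pandas to sniff the delimiter
df = pd.read_csv(io.StringIO(semi), sep=None, engine='python')
print(list(df.columns))  # → ['a', 'b', 'c']
```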

  • 2020-11-22 05:45

    I had this problem as well but perhaps for a different reason. I had some trailing commas in my CSV that were adding an additional column that pandas was attempting to read. Using the following works but it simply ignores the bad lines:

    data = pd.read_csv('file1.csv', error_bad_lines=False)
    

    If you want to keep the lines, an ugly kind of hack for handling the errors is to do something like the following:

    line     = []
    expected = []
    saw      = []
    cont     = True

    while cont == True:
        try:
            data = pd.read_csv('file1.csv', skiprows=line)
            cont = False
        except Exception as e:
            errortype = str(e).split('.')[0].strip()
            if errortype == 'Error tokenizing data':
                cerror = str(e).split(':')[1].strip().replace(',', '')
                nums   = [n for n in cerror.split(' ') if str.isdigit(n)]
                expected.append(int(nums[0]))
                saw.append(int(nums[2]))
                line.append(int(nums[1]) - 1)
            else:
                cerror = 'Unknown'
                print('Unknown Error - 222')

    if line != []:
        # Handle the errors however you want
        pass
    

    I proceeded to write a script to reinsert the lines into the DataFrame since the bad lines will be given by the variable 'line' in the above code. This can all be avoided by simply using the csv reader. Hopefully the pandas developers can make it easier to deal with this situation in the future.
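The csv-reader approach mentioned above can be sketched roughly like this: read every row with the stdlib csv module, then pad or trim each row to the header's width before building the DataFrame (the in-memory data here is made up for illustration):

```python
import csv
import io

import pandas as pd

# Hypothetical ragged CSV: one row has a trailing extra field
ragged = "a,b\n1,2\n3,4,5\n"

# csv.reader tolerates rows of varying length
rows = list(csv.reader(io.StringIO(ragged)))
width = len(rows[0])

# Pad short rows with None and trim long rows to the header width
fixed = [(r + [None] * width)[:width] for r in rows[1:]]
df = pd.DataFrame(fixed, columns=rows[0])
print(df.shape)  # → (2, 2)
```

This keeps every line (trimmed rather than discarded), at the cost of silently dropping the extra fields.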
