Pandas: How to work around “error tokenizing data”?

盖世英雄少女心 2021-02-03 10:51

A lot of questions have already been asked about this topic on SO (and many others). None of the numerous answers has really helped me so far. If I missed

4 answers
  • 2021-02-03 11:04

    Read the CSV using the tolerant Python csv module, and fix the loaded data before handing it off to pandas, which otherwise fails on the malformed CSV data regardless of the CSV engine pandas uses.

    import pandas as pd
    import csv
    
    not_csv = """1,2,3,4,5
    1,2,3,4,5,6
    ,,3,4,5
    1,2,3,4,5,6,7
    ,2,,4
    """
    
    with open('not_a.csv', 'w') as csvfile:
        csvfile.write(not_csv)
    
    # read every row with the tolerant csv module
    with open('not_a.csv', newline='') as csvfile:
        rows = list(csv.reader(csvfile))
    
    # fix my csv by padding the rows to the length of the widest one
    max_elems = max(len(row) for row in rows)
    d = [row + [""] * (max_elems - len(row)) for row in rows]
    
    df = pd.DataFrame(d)
    print(df)
    
    # the default engine
    # raises "pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 6"
    #df = pd.read_csv('not_a.csv', header=None, engine='c')
    
    # the python csv engine
    # raises "pandas.errors.ParserError: Expected 6 fields in line 4, saw 7"
    #df = pd.read_csv('not_a.csv', header=None, engine='python')
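
    If padding the rows by hand feels like too much, a related sketch is to use the csv module only to find the widest row and then pass explicit column names to read_csv, so that pandas pads the shorter rows with NaN itself. This is an assumption-laden variant, not part of the original answer: the in-memory sample and the names `raw` and `df_named` are illustrative only.

```python
import csv
import io

import pandas as pd

# Same ragged sample as in the answer, kept in memory for this sketch.
raw = "1,2,3,4,5\n1,2,3,4,5,6\n,,3,4,5\n1,2,3,4,5,6,7\n,2,,4\n"

# First pass: find the widest row with the tolerant csv module.
max_elems = max(len(row) for row in csv.reader(io.StringIO(raw)))

# Second pass: explicit column names tell read_csv how many fields to expect,
# so rows with fewer fields are filled with NaN instead of raising.
df_named = pd.read_csv(io.StringIO(raw), header=None, names=list(range(max_elems)))
print(df_named)
```

    Unlike the padded version, missing entries come back as NaN rather than empty strings, and the columns are parsed as numbers where possible.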
    
    

    Alternatively, preprocess the file outside of Python if you are concerned about adding too much extra Python code.

    Richs-MBP:tmp randrews$ cat test.csv
    1,2,3
    1,
    2
    1,2,
    ,,,
    Richs-MBP:tmp randrews$ awk 'BEGIN {FS=","}; {print $1","$2","$3","$4","$5}' < test.csv
    1,2,3,,
    1,,,,
    2,,,,
    1,2,,,
    ,,,,
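
    Note that the awk command prints exactly five fields, so a row with more than five fields would lose the extra ones. For comparison, the same padding can be sketched in Python without truncating longer rows. The helper name `pad_fields` and its `width` parameter are hypothetical and chosen only for this illustration.

```python
def pad_fields(line, width=5, sep=","):
    """Pad one delimited line with empty fields up to `width` (hypothetical helper)."""
    fields = line.rstrip("\r\n").split(sep)
    # A negative repeat count yields an empty list, so longer rows are left untouched.
    fields += [""] * (width - len(fields))
    return sep.join(fields)


# The same rows as in test.csv above.
sample = ["1,2,3", "1,", "2", "1,2,", ",,,"]
padded = [pad_fields(line) for line in sample]
print("\n".join(padded))
```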
    
  • 2021-02-03 11:18

    In my case:

    1. I opened the *.csv in Excel.
    2. I saved the *.csv as CSV (comma-delimited).
    3. I loaded the file in Python via:

    import pandas as pd
    df = pd.read_csv('yourcsvfile.csv', sep=',')
    

    Hope it helps!

  • 2021-02-03 11:21

    Thank you @ALollz for the "very fresh" link (lucky coincidence) and @Rich Andrews for pointing out that my example actually is not "strictly correct" CSV data.

    So, the way it works for me for the time being is adapted from @ALollz' compact solution (https://stackoverflow.com/a/55129746/7295599)

    ### reading an "incorrect" CSV to dataframe having a variable number of columns/tokens 
    import pandas as pd
    
    df = pd.read_csv('Test.csv', header=None, sep='\n')
    df = df[0].str.split(',', expand=True)
    # ... do some modifications with df
    ### end of code
    

    df contains empty string '' for the missing entries at the beginning and the middle, and None for the missing tokens at the end.

       0  1  2  3     4     5     6
    0  1  2  3  4     5  None  None
    1  1  2  3  4     5     6  None
    2        3  4     5  None  None
    3  1  2  3  4     5     6     7
    4     2     4  None  None  None
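
    Newer pandas releases may reject sep='\n' with a ValueError, because the line terminator is not accepted as a separator. In that case the same idea can be sketched by reading the lines with plain Python and splitting them with the string accessor. The function name `read_ragged` and the in-memory sample are assumptions made for this illustration.

```python
import io

import pandas as pd


def read_ragged(source, sep=","):
    """Read lines from a file-like object and split them into columns (illustrative helper)."""
    # One string per line, without the trailing line break.
    lines = [line.rstrip("\r\n") for line in source]
    # expand=True spreads the tokens over columns; short rows get missing values at the end.
    return pd.Series(lines, dtype=object).str.split(sep, expand=True)


sample = io.StringIO("1,2,3,4,5\n1,2,3,4,5,6\n,,3,4,5\n1,2,3,4,5,6,7\n,2,,4\n")
df_split = read_ragged(sample)
print(df_split)
```

    With a real file, `read_ragged(open('Test.csv'))` should behave the same way, since an open text file also yields one line per iteration.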
    

    If you write this again to a file via:

    df.to_csv("Test.tab",sep="\t",header=False,index=False)

    1   2   3   4   5       
    1   2   3   4   5   6   
            3   4   5       
    1   2   3   4   5   6   7
        2       4           
    

    None will be converted to empty string '' and everything is fine.

    The next level would be to account for data strings in quotes which contain the separator, but that's another topic.

    1,2,3,4,5
    ,,3,"Hello, World!",5,6
    1,2,3,4,5,6,7
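
    As a rough sketch of that next level: the standard-library csv module already honours double-quoted fields, so it can do the tokenizing, and pd.DataFrame pads the shorter rows. The in-memory sample and the variable names are illustrative assumptions, not part of the original answer.

```python
import csv
import io

import pandas as pd

# The quoted example from above, kept in memory for this sketch.
quoted = '1,2,3,4,5\n,,3,"Hello, World!",5,6\n1,2,3,4,5,6,7\n'

# csv.reader keeps the comma inside the quoted field and strips the quotes.
rows = list(csv.reader(io.StringIO(quoted)))

# Rows of different lengths are padded with missing values on the right.
df_quoted = pd.DataFrame(rows)
print(df_quoted)
```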
    
  • 2021-02-03 11:22

    I have a different take on the solution. Let pandas take care of creating the table and deleting None values and let us take care of writing a proper tokenizer.

    Tokenizer

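    def tokenize(line):
        # positions of all double quotes; ignore a trailing unmatched quote
        idx = [pos for pos, ch in enumerate(line) if ch == '"']
        if len(idx) % 2 != 0:
            idx = idx[:-1]
        memory = {}
        # walk the quote pairs from right to left so earlier positions stay valid
        for i in range(len(idx) - 2, -1, -2):
            val = line[idx[i]:idx[i+1]+1]
            # comma-free placeholder that stands in for the quoted text
            key = "_"*(len(val)-1)+"{0}".format(i)
            memory[key] = val
            line = line[:idx[i]] + key + line[idx[i+1]+1:]
        # split on commas, then swap the placeholders back for the quoted text
        return [memory.get(token, token) for token in line.split(",")]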
    

    Test cases for Tokenizer

    print (tokenize("1,2,3,4,5"))
    print (tokenize(",,3,\"Hello, World!\",5,6"))
    print (tokenize(",,3,\"Hello,,,, World!\",5,6"))
    print (tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello, World!\",5,6"))
    print (tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello,,5,6"))
    

    Output

    ['1', '2', '3', '4', '5']
    ['', '', '3', '"Hello, World!"', '5', '6']
    ['', '', '3', '"Hello,,,, World!"', '5', '6']
    ['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello, World!"', '5', '6']
    ['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello', '', '5', '6']

    Putting the tokenizer into action

    import pandas as pd
    
    with open("test1.csv", "r") as fp:
        lines = fp.readlines()
    
    # tokenize every line, then let pandas pad the short rows
    lines = [tokenize(x.strip()) for x in lines]
    df = pd.DataFrame(lines).fillna('')
    

    Advantage:

    Now we can tweak the tokenizer function as per our needs.
