Pandas: How to work around “error tokenizing data”?

盖世英雄少女心 2021-02-03 10:51

A lot of questions have already been asked about this topic on SO (and many others). Among the numerous answers, none has really helped me so far. If I missed …

4 Answers
  •  情深已故
    2021-02-03 11:22

    I have a different take on the solution: let pandas take care of building the table and of the missing (NaN) values, and let us take care of writing a proper tokenizer.
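
    For context, here is a minimal sketch of the failure being worked around (the file name bad.csv and its contents are hypothetical): when a row has more fields than the header, pandas' default C parser gives up with the "error tokenizing data" message.

    import pandas as pd

    # bad.csv (hypothetical contents):
    #   1,2,3
    #   1,2,3,4,5
    # The second row has more fields than the first, so the default C
    # parser raises pandas.errors.ParserError with a message like
    # "Error tokenizing data. C error: Expected 3 fields in line 2, saw 5".
    try:
        df = pd.read_csv("bad.csv")
    except pd.errors.ParserError as err:
        print(err)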

    Tokenizer

    def tokenize(line):
        # Positions of all double quotes; drop a final unmatched quote.
        idx = [x for x, v in enumerate(line) if v == '"']
        if len(idx) % 2 != 0:
            idx = idx[:-1]
        memory = {}
        for i in range(0, len(idx), 2):
            # Swap each quoted field for a same-length placeholder
            # (underscores plus the loop index) so the commas it hides
            # cannot confuse the split below.
            val = line[idx[i]:idx[i + 1] + 1]
            key = "_" * (len(val) - len(str(i))) + str(i)
            memory[key] = val
            line = line.replace(val, key, 1)
        # Split on commas, then swap the placeholders back in.
        return [memory.get(token, token) for token in line.split(",")]
    

    Test cases for Tokenizer

    print(tokenize("1,2,3,4,5"))
    print(tokenize(",,3,\"Hello, World!\",5,6"))
    print(tokenize(",,3,\"Hello,,,, World!\",5,6"))
    print(tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello, World!\",5,6"))
    print(tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello,,5,6"))
    

    Output

    ['1', '2', '3', '4', '5']
    ['', '', '3', '"Hello, World!"', '5', '6']
    ['', '', '3', '"Hello,,,, World!"', '5', '6']
    ['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello, World!"', '5', '6']
    ['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello', '', '5', '6']

    Putting the tokenizer into action

    import numpy as np
    import pandas as pd

    with open("test1.csv", "r") as fp:
        lines = fp.readlines()

    lines = list(map(lambda x: tokenize(x.strip()), lines))
    df = pd.DataFrame(lines).replace(np.nan, '')
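
    As a usage note, if the first row of test1.csv happens to hold column names (an assumption; the source does not say), the tokenized header can be promoted in the same step, provided every row tokenizes to the same number of fields:

    # Assumes row 0 holds the column names and all rows have equal length.
    df = pd.DataFrame(lines[1:], columns=lines[0])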
    

    Advantage:

    Now we can tweak the tokenizer function to suit our needs; see the sketch below.
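
    As one hypothetical tweak, a small wrapper can return the fields without their surrounding quotes:

    def tokenize_unquoted(line):
        # Hypothetical helper: strip the surrounding quotes from each
        # restored field, leaving unquoted tokens untouched.
        return [t[1:-1] if len(t) > 1 and t.startswith('"') and t.endswith('"') else t
                for t in tokenize(line)]

    print(tokenize_unquoted(',,3,"Hello, World!",5,6'))
    # ['', '', '3', 'Hello, World!', '5', '6']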
