read_csv with missing/incomplete header or irregular number of columns

小鲜肉 2021-01-18 08:12

I have a file.csv with ~15k rows that looks like this:

SAMPLE_TIME,          POS,        OFF,  HISTOGRAM
2015-07-15 16:41:56,  0-0-0-0-3,   1,           


        
4 answers
  •  别那么骄傲
    2021-01-18 08:35

    You can create columns based on the length of the first actual row:

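    import pandas as pd

    with open("out.txt") as f:
        # Read the header line, then count the fields in the first data row.
        h, ln = next(f), len(next(f).split(","))
        header = h.strip().split(",")
        # Rewind and skip the header line so pandas starts at the first data row.
        f.seek(0)
        next(f)
        # Append integer names 0..ln-1. ln already counts the named fields, so this
        # creates more names than needed; the surplus shows up as trailing NaN columns.
        header += range(ln)
        print(pd.read_csv(f, names=header))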
    

    Which will give you:

              SAMPLE_TIME           POS          OFF    HISTOGRAM  0  1   2  3  \
    0  2015-07-15 16:41:56     0-0-0-0-3            1            2  0  5  59  0   
    1  2015-07-15 16:42:55     0-0-0-0-3            1            0  0  5   9  0   
    2  2015-07-15 16:43:55     0-0-0-0-3            1            0  0  5   5  0   
    3  2015-07-15 16:44:56     0-0-0-0-3            1            2  0  5   0  0   
    
       4  5 ...  13  14  15  16  17  18  19  20  21  22  
    0  0  0 ...   0   0   0   0   0 NaN NaN NaN NaN NaN  
    1  0  0 ...   0 NaN NaN NaN NaN NaN NaN NaN NaN NaN  
    2  0  0 ...   4   0   0   0 NaN NaN NaN NaN NaN NaN  
    3  0  0 ...   0   0   0   0 NaN NaN NaN NaN NaN NaN  
    
    [4 rows x 27 columns]
    

    Or you could clean the file before passing to pandas:

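    import pandas as pd
    from tempfile import TemporaryFile

    with open("in.csv") as f, TemporaryFile("w+") as t:
        # Copy the file with every space removed (including the one inside the timestamp).
        for line in f:
            t.write(line.replace(" ", ""))
        t.seek(0)
        # Field count of the last line read, used to size the extra column names.
        ln = len(line.strip().split(","))
        header = t.readline().strip().split(",")
        header += range(ln)
        print(pd.read_csv(t, names=header))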
    

    Which gives you:

              SAMPLE_TIME        POS  OFF  HISTOGRAM  0  1   2  3  4  5 ...  11  \
    0  2015-07-1516:41:56  0-0-0-0-3    1          2  0  5  59  0  0  0 ...   0   
    1  2015-07-1516:42:55  0-0-0-0-3    1          0  0  5   9  0  0  0 ...   0   
    2  2015-07-1516:43:55  0-0-0-0-3    1          0  0  5   5  0  0  0 ...   0   
    3  2015-07-1516:44:56  0-0-0-0-3    1          2  0  5   0  0  0  0 ...   0   
    
       12  13  14  15  16  17  18  19  20  
    0   0   0   0   0   0   0 NaN NaN NaN  
    1  50   0 NaN NaN NaN NaN NaN NaN NaN  
    2   0   4   0   0   0 NaN NaN NaN NaN  
    3   6   0   0   0   0 NaN NaN NaN NaN  
    
    [4 rows x 25 columns]
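
Note that removing every space also removes the one inside the timestamp (`2015-07-1516:41:56` in the output above). If that matters, one option is to strip only the padding that follows each comma. This is a minimal sketch, not part of the original answer; `strip_padding` is a hypothetical helper and the sample row is made up in the shape of the question's data:

```python
import re

def strip_padding(line):
    """Remove spaces/tabs that follow a comma, plus trailing whitespace
    (including the newline), leaving spaces inside a field untouched."""
    return re.sub(r",[ \t]+", ",", line.rstrip())

# Made-up row in the same shape as the question's file.
row = "2015-07-15 16:41:56,  0-0-0-0-3,   1,    2,0,5"
cleaned = strip_padding(row)
print(cleaned)
```

Because `rstrip()` also drops the newline, add `"\n"` back when writing cleaned lines to a file. pandas also has a `skipinitialspace=True` option on `read_csv` that skips spaces after the delimiter without rewriting the file.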
    

    Or, to drop the columns that are all NaN:

    print(pd.read_csv(f, names=header).dropna(axis=1,how="all"))
    

    Gives you:

               SAMPLE_TIME           POS          OFF    HISTOGRAM  0  1   2  3  \
    0  2015-07-15 16:41:56     0-0-0-0-3            1            2  0  5  59  0   
    1  2015-07-15 16:42:55     0-0-0-0-3            1            0  0  5   9  0   
    2  2015-07-15 16:43:55     0-0-0-0-3            1            0  0  5   5  0   
    3  2015-07-15 16:44:56     0-0-0-0-3            1            2  0  5   0  0   
    
       4  5 ...  8  9  10  11  12  13  14  15  16  17  
    0  0  0 ...  2  0   0   0   0   0   0   0   0   0  
    1  0  0 ...  2  0   0   0  50   0 NaN NaN NaN NaN  
    2  0  0 ...  2  0   0   0   0   4   0   0   0 NaN  
    3  0  0 ...  2  0   0   0   6   0   0   0   0 NaN  
    
    [4 rows x 22 columns]
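
As an alternative to sizing the names from a single row and dropping empty columns afterwards, the column names could be sized from the widest row in the file. The sketch below is an assumption-laden illustration, not the answer's method: `build_names` and the `h` prefix are hypothetical, and the sample text is made up to resemble the question's file.

```python
import csv
import io

def build_names(text, extra_prefix="h"):
    """Return the header names plus one generated name for every
    extra field found in the widest data row."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    header = [name.strip() for name in next(rows)]
    # Widest data row in the file, ignoring blank lines.
    widest = max((len(row) for row in rows if row), default=len(header))
    extra = max(widest - len(header), 0)
    return header + [f"{extra_prefix}{i}" for i in range(extra)]

# Made-up sample: a 4-name header followed by rows of differing width.
sample = (
    "SAMPLE_TIME,          POS,        OFF,  HISTOGRAM\n"
    "2015-07-15 16:41:56,  0-0-0-0-3,   1,    2,0,5,59\n"
    "2015-07-15 16:42:55,  0-0-0-0-3,   1,    0,0,5\n"
)
names = build_names(sample)
print(names)
```

The resulting list could then be passed to pandas, for example `pd.read_csv("file.csv", names=names, skiprows=1, skipinitialspace=True)`, so that the original header line is skipped and shorter rows are padded with NaN.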
    
