read_csv with missing/incomplete header or irregular number of columns

小鲜肉 2021-01-18 08:12

I have a file.csv with ~15k rows that looks like this:

SAMPLE_TIME,          POS,        OFF,  HISTOGRAM
2015-07-15 16:41:56,  0-0-0-0-3,   1,           


        
4 answers
  • 2021-01-18 08:31

    Assuming your data is in a file called foo.csv, you could do the following. This was tested against pandas 0.17.

    import pandas as pd

    # Four named columns plus 17 numbered histogram columns.
    names = ['sample_time', 'pos', 'off', 'histogram'] + [str(i) for i in range(1, 18)]
    # skiprows=1 discards the incomplete header line in the file.
    df = pd.read_csv('foo.csv', names=names, skiprows=1)
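
To try the `names` plus `skiprows=1` idea without the real file, here is a minimal sketch on an in-memory sample. The `SAMPLE` text and the helper name `read_ragged` are made up for illustration; only `pd.read_csv` and its `names` and `skiprows` parameters come from the answer above.

```python
import io
import pandas as pd

# Hypothetical stand-in for foo.csv: a 4-name header followed by ragged rows.
SAMPLE = (
    "SAMPLE_TIME,POS,OFF,HISTOGRAM\n"
    "2015-07-15 16:41:56,0-0-0-0-3,1,2,0,5,59\n"
    "2015-07-15 16:42:55,0-0-0-0-3,1,0,0,5\n"
)

def read_ragged(buffer, base_names, n_extra):
    """Read a CSV whose rows carry up to n_extra unnamed trailing fields."""
    names = list(base_names) + [str(i) for i in range(1, n_extra + 1)]
    # skiprows=1 discards the incomplete header; short rows are padded with NaN.
    return pd.read_csv(buffer, names=names, skiprows=1)

df = read_ragged(io.StringIO(SAMPLE), ["sample_time", "pos", "off", "histogram"], 3)
print(df)
```

The number of extra names has to be at least the number of unnamed fields in the widest row, so it needs to be known or estimated up front.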
    
  • 2021-01-18 08:35

    You can create column names based on the length of the first data row:

    import pandas as pd

    with open("out.txt") as f:
        # Read the header line and count the fields in the first data row.
        h, ln = next(f), len(next(f).split(","))
        header = h.strip().split(",")
        # Add numbered names for the unnamed trailing columns.
        header += range(ln)
        # Rewind and let pandas skip the original header line.
        f.seek(0)
        print(pd.read_csv(f, names=header, skiprows=1))
    

    Which will give you:

              SAMPLE_TIME           POS          OFF    HISTOGRAM  0  1   2  3  \
    0  2015-07-15 16:41:56     0-0-0-0-3            1            2  0  5  59  0   
    1  2015-07-15 16:42:55     0-0-0-0-3            1            0  0  5   9  0   
    2  2015-07-15 16:43:55     0-0-0-0-3            1            0  0  5   5  0   
    3  2015-07-15 16:44:56     0-0-0-0-3            1            2  0  5   0  0   
    
       4  5 ...  13  14  15  16  17  18  19  20  21  22  
    0  0  0 ...   0   0   0   0   0 NaN NaN NaN NaN NaN  
    1  0  0 ...   0 NaN NaN NaN NaN NaN NaN NaN NaN NaN  
    2  0  0 ...   4   0   0   0 NaN NaN NaN NaN NaN NaN  
    3  0  0 ...   0   0   0   0 NaN NaN NaN NaN NaN NaN  
    
    [4 rows x 27 columns]
    

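    Or you could clean the file before passing it to pandas. Note that removing every space also removes the one inside the timestamp:

    import pandas as pd
    from tempfile import TemporaryFile

    with open("in.csv") as f, TemporaryFile("w+") as t:
        # Copy the file with all spaces removed.
        for line in f:
            t.write(line.replace(" ", ""))
        t.seek(0)
        # Field count of the last line read, used to size the extra names.
        ln = len(line.strip().split(","))
        header = t.readline().strip().split(",")
        header += range(ln)
        print(pd.read_csv(t, names=header))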
    

    Which gives you:

              SAMPLE_TIME        POS  OFF  HISTOGRAM  0  1   2  3  4  5 ...  11  \
    0  2015-07-1516:41:56  0-0-0-0-3    1          2  0  5  59  0  0  0 ...   0   
    1  2015-07-1516:42:55  0-0-0-0-3    1          0  0  5   9  0  0  0 ...   0   
    2  2015-07-1516:43:55  0-0-0-0-3    1          0  0  5   5  0  0  0 ...   0   
    3  2015-07-1516:44:56  0-0-0-0-3    1          2  0  5   0  0  0  0 ...   0   
    
       12  13  14  15  16  17  18  19  20  
    0   0   0   0   0   0   0 NaN NaN NaN  
    1  50   0 NaN NaN NaN NaN NaN NaN NaN  
    2   0   4   0   0   0 NaN NaN NaN NaN  
    3   6   0   0   0   0 NaN NaN NaN NaN  
    
    [4 rows x 25 columns]
    

    Or, to drop the columns that are all NaN:

    print(pd.read_csv(f, names=header).dropna(axis=1,how="all"))
    

    Gives you:

               SAMPLE_TIME           POS          OFF    HISTOGRAM  0  1   2  3  \
    0  2015-07-15 16:41:56     0-0-0-0-3            1            2  0  5  59  0   
    1  2015-07-15 16:42:55     0-0-0-0-3            1            0  0  5   9  0   
    2  2015-07-15 16:43:55     0-0-0-0-3            1            0  0  5   5  0   
    3  2015-07-15 16:44:56     0-0-0-0-3            1            2  0  5   0  0   
    
       4  5 ...  8  9  10  11  12  13  14  15  16  17  
    0  0  0 ...  2  0   0   0   0   0   0   0   0   0  
    1  0  0 ...  2  0   0   0  50   0 NaN NaN NaN NaN  
    2  0  0 ...  2  0   0   0   0   4   0   0   0 NaN  
    3  0  0 ...  2  0   0   0   6   0   0   0   0 NaN  
    
    [4 rows x 22 columns]
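
Both snippets above size the header from a single row (the first or the last), which can overshoot or undershoot when row lengths vary. As a hedged alternative, the sketch below scans every row for the widest one before naming the extra columns. The function name `read_with_widest_row` and the tiny sample text are assumptions for illustration, not part of the original answer.

```python
import csv
import io
import pandas as pd

def read_with_widest_row(text, sep=","):
    """Name extra columns after scanning every row for the maximum field count."""
    rows = list(csv.reader(io.StringIO(text), delimiter=sep))
    header = [h.strip() for h in rows[0]]
    # Widest data row decides how many numbered columns are needed.
    width = max(len(r) for r in rows[1:])
    names = header + list(range(width - len(header)))
    # Shorter rows are padded with NaN up to len(names).
    return pd.read_csv(io.StringIO(text), names=names, skiprows=1, sep=sep)

# Hypothetical sample: the first data row is shorter than the second.
SAMPLE = "A,B\n1,2,3\n4,5,6,7,8\n"
df = read_with_widest_row(SAMPLE)
print(df)
```

This reads the text twice, which should be acceptable for a file of roughly 15k rows, and it avoids both the all-NaN padding columns and the need for `dropna`.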
    
  • 2021-01-18 08:47

    You can split the HISTOGRAM column into a new DataFrame and concat it to the original.

    print(df)
             SAMPLE_TIME,        POS, OFF,  \
    0 2015-07-15 16:41:56  0-0-0-0-3,   1,   
    1 2015-07-15 16:42:55  0-0-0-0-3,   1,   
    2 2015-07-15 16:43:55  0-0-0-0-3,   1,   
    3 2015-07-15 16:44:56  0-0-0-0-3,   1,   
    
                                     HISTOGRAM  
    0  2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,  
    1          0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,  
    2     0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,  
    3      2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0  
    
    # Create a new DataFrame from the HISTOGRAM column
    h = pd.DataFrame([x.split(',') for x in df['HISTOGRAM'].tolist()])
    print(h)
      0  1  2   3  4  5  6  7  8  9  10 11 12  13 14 15    16    17    18    19
    0  2  0  5  59  0  0  0  0  0  2  0  0  0   0  0  0     0     0     0      
    1  0  0  5   9  0  0  0  0  0  2  0  0  0  50  0     None  None  None  None
    2  0  0  5   5  0  0  0  0  0  2  0  0  0   0  4  0     0     0        None
    3  2  0  5   0  0  0  0  0  0  2  0  0  0   6  0  0     0     0  None  None
    
    # Append to the original and rename column 0
    df = pd.concat([df, h], axis=1).rename(columns={0: 'HISTOGRAM'})
    print(df)
                                     HISTOGRAM HISTOGRAM  1  2   3  4  5  ...  10  \
    0  2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,         2  0  5  59  0  0  ...   0   
    1          0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,         0  0  5   9  0  0  ...   0   
    2     0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,         0  0  5   5  0  0  ...   0   
    3      2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0         2  0  5   0  0  0  ...   0   
    
      11 12  13 14 15    16    17    18    19  
    0  0  0   0  0  0     0     0     0        
    1  0  0  50  0     None  None  None  None  
    2  0  0   0  4  0     0     0        None  
    3  0  0   6  0  0     0     0  None  None  
    
    [4 rows x 24 columns]
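
The same split can also be sketched with the pandas string accessor. This is a variation under assumptions rather than the answerer's method: the two-row frame is invented, and the `bin_` prefix is a hypothetical naming choice that avoids ending up with two columns called HISTOGRAM.

```python
import pandas as pd

# Hypothetical frame with the histogram still packed into one string column.
df = pd.DataFrame({
    "SAMPLE_TIME": ["2015-07-15 16:41:56", "2015-07-15 16:42:55"],
    "HISTOGRAM": ["2,0,5,59,", "0,0,5"],
})

# Trim the trailing comma first so it does not produce an empty last column.
bins = df["HISTOGRAM"].str.rstrip(",").str.split(",", expand=True)
# Convert to numbers; cells missing from shorter rows become NaN.
bins = bins.apply(pd.to_numeric).add_prefix("bin_")

# Replace the packed column with the expanded numeric columns.
result = pd.concat([df.drop(columns="HISTOGRAM"), bins], axis=1)
print(result)
```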
    
  • 2021-01-18 08:47

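    How about this? I made a CSV from your sample data and read its lines:

    import csv
    import pandas as pd

    # Python 3: open in text mode with newline='', as the csv module recommends.
    with open('test.csv', newline='') as f:
        lines = list(csv.reader(f))
    headers, values = lines[0], lines[1:]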
    

    To generate usable header names, use this line:

    headers = [i or ind for ind, i in enumerate(headers)]
    

    Because of how (I assume) the csv module works, headers should contain a run of empty strings for the unnamed columns. Empty strings evaluate to False, so this comprehension substitutes the column index for every column without a header.

    Then just make a df:

    df = pd.DataFrame(values,columns=headers)
    

    which looks like:

                SAMPLE_TIME           POS         OFF   HISTOGRAM  4  5   6  7  8  9  \
    0  15/07/2015 16:41     0-0-0-0-3           1           2  0  5  59  0  0  0   
    1  15/07/2015 16:42     0-0-0-0-3           1           0  0  5   9  0  0  0   
    2  15/07/2015 16:43     0-0-0-0-3           1           0  0  5   5  0  0  0   
    3  15/07/2015 16:44     0-0-0-0-3           1           2  0  5   0  0  0  0   
    
      ... 12 13 14 15  16 17 18 19 20 21  
    0 ...  2  0  0  0   0  0  0  0  0  0  
    1 ...  2  0  0  0  50  0              
    2 ...  2  0  0  0   0  4  0  0  0     
    3 ...  2  0  0  0   6  0  0  0  0     
    
    [4 rows x 22 columns]
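
The comprehension only names as many columns as the header row has fields. If the header line carries no trailing empty fields, it can be padded first. The sketch below assumes exactly that situation, with an invented in-memory sample, and pads the header to the widest data row before applying the same `name or index` trick.

```python
import csv
import io
import pandas as pd

# Hypothetical sample: the header has 4 fields, the data rows have up to 6.
SAMPLE = "SAMPLE_TIME,POS,OFF,HISTOGRAM\nt1,p,1,2,0,5\nt2,p,1,0\n"

rows = list(csv.reader(io.StringIO(SAMPLE)))
headers, values = rows[0], rows[1:]

width = max(len(r) for r in values)
# Pad the header with empty strings so every column has a slot to name.
headers = headers + [""] * (width - len(headers))
# Empty names are falsy, so they are replaced by their column index.
headers = [name or idx for idx, name in enumerate(headers)]

# Short rows are padded with None; all values stay strings at this point.
df = pd.DataFrame(values, columns=headers)
print(df)
```

Unlike `pd.read_csv`, this path does no type inference, so the numeric columns would still need converting afterwards.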
    