read_csv with missing/incomplete header or irregular number of columns

前端未结

关注

 4  1863

I have a file.csv with ~15k rows that looks like this

SAMPLE_TIME,          POS,        OFF,  HISTOGRAM
2015-07-15 16:41:56,  0-0-0-0-3,   1,


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  不思量自难忘°        
                
              
                            
                2021-01-18 08:31
              
            
            
                                                                       
Assuming your data is in a file called foo.csv, you could do the following.  This was tested against Pandas 0.17

df = pd.read_csv('foo.csv', names=['sample_time', 'pos', 'off', 'histogram', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17'], skiprows=1)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  别那么骄傲        
                
              
                            
                2021-01-18 08:35
              
            
            
                                                                       
You can create columns based on the length of the first actual row:

from tempfile import TemporaryFile
with open("out.txt") as f, TemporaryFile("w+") as t:
    h, ln = next(f), len(next(f).split(","))
    header = h.strip().split(",")
    f.seek(0), next(f)
    header += range(ln)
    print(pd.read_csv(f, names=header))


Which will give you:

          SAMPLE_TIME           POS          OFF    HISTOGRAM  0  1   2  3  \
0  2015-07-15 16:41:56     0-0-0-0-3            1            2  0  5  59  0   
1  2015-07-15 16:42:55     0-0-0-0-3            1            0  0  5   9  0   
2  2015-07-15 16:43:55     0-0-0-0-3            1            0  0  5   5  0   
3  2015-07-15 16:44:56     0-0-0-0-3            1            2  0  5   0  0   

   4  5 ...  13  14  15  16  17  18  19  20  21  22  
0  0  0 ...   0   0   0   0   0 NaN NaN NaN NaN NaN  
1  0  0 ...   0 NaN NaN NaN NaN NaN NaN NaN NaN NaN  
2  0  0 ...   4   0   0   0 NaN NaN NaN NaN NaN NaN  
3  0  0 ...   0   0   0   0 NaN NaN NaN NaN NaN NaN  

[4 rows x 27 columns]


Or you could clean the file before passing to pandas:

import pandas as pd

from tempfile import TemporaryFile
with open("in.csv") as f, TemporaryFile("w+") as t:
    for line in f:
        t.write(line.replace(" ", ""))
    t.seek(0)
    ln = len(line.strip().split(","))
    header = t.readline().strip().split(",")
    header += range(ln)
    print(pd.read_csv(t,names=header))


Which gives you:

          SAMPLE_TIME        POS  OFF  HISTOGRAM  0  1   2  3  4  5 ...  11  \
0  2015-07-1516:41:56  0-0-0-0-3    1          2  0  5  59  0  0  0 ...   0   
1  2015-07-1516:42:55  0-0-0-0-3    1          0  0  5   9  0  0  0 ...   0   
2  2015-07-1516:43:55  0-0-0-0-3    1          0  0  5   5  0  0  0 ...   0   
3  2015-07-1516:44:56  0-0-0-0-3    1          2  0  5   0  0  0  0 ...   0   

   12  13  14  15  16  17  18  19  20  
0   0   0   0   0   0   0 NaN NaN NaN  
1  50   0 NaN NaN NaN NaN NaN NaN NaN  
2   0   4   0   0   0 NaN NaN NaN NaN  
3   6   0   0   0   0 NaN NaN NaN NaN  

[4 rows x 25 columns]


or to drop the columns will all nana:

print(pd.read_csv(f, names=header).dropna(axis=1,how="all"))


Gives you:

           SAMPLE_TIME           POS          OFF    HISTOGRAM  0  1   2  3  \
0  2015-07-15 16:41:56     0-0-0-0-3            1            2  0  5  59  0   
1  2015-07-15 16:42:55     0-0-0-0-3            1            0  0  5   9  0   
2  2015-07-15 16:43:55     0-0-0-0-3            1            0  0  5   5  0   
3  2015-07-15 16:44:56     0-0-0-0-3            1            2  0  5   0  0   

   4  5 ...  8  9  10  11  12  13  14  15  16  17  
0  0  0 ...  2  0   0   0   0   0   0   0   0   0  
1  0  0 ...  2  0   0   0  50   0 NaN NaN NaN NaN  
2  0  0 ...  2  0   0   0   0   4   0   0   0 NaN  
3  0  0 ...  2  0   0   0   6   0   0   0   0 NaN  

[4 rows x 22 columns]

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  天涯浪人        
                
              
                            
                2021-01-18 08:47
              
            
            
                                                                       
You can split column HISTOGRAM to new DataFrame and concat it to original.

print df
         SAMPLE_TIME,        POS, OFF,  \
0 2015-07-15 16:41:56  0-0-0-0-3,   1,   
1 2015-07-15 16:42:55  0-0-0-0-3,   1,   
2 2015-07-15 16:43:55  0-0-0-0-3,   1,   
3 2015-07-15 16:44:56  0-0-0-0-3,   1,   

                                 HISTOGRAM  
0  2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,  
1          0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,  
2     0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,  
3      2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0  




#create new dataframe from column HISTOGRAM
h = pd.DataFrame([ x.split(',') for x in df['HISTOGRAM'].tolist()])
print h
  0  1  2   3  4  5  6  7  8  9  10 11 12  13 14 15    16    17    18    19
0  2  0  5  59  0  0  0  0  0  2  0  0  0   0  0  0     0     0     0      
1  0  0  5   9  0  0  0  0  0  2  0  0  0  50  0     None  None  None  None
2  0  0  5   5  0  0  0  0  0  2  0  0  0   0  4  0     0     0        None
3  2  0  5   0  0  0  0  0  0  2  0  0  0   6  0  0     0     0  None  None

#append to original, rename 0 column
df = pd.concat([df, h], axis=1).rename(columns={0:'HISTOGRAM'})
print df
                                 HISTOGRAM HISTOGRAM  1  2   3  4  5  ...  10  \
0  2,0,5,59,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,         2  0  5  59  0  0  ...   0   
1          0,0,5,9,0,0,0,0,0,2,0,0,0,50,0,         0  0  5   9  0  0  ...   0   
2     0,0,5,5,0,0,0,0,0,2,0,0,0,0,4,0,0,0,         0  0  5   5  0  0  ...   0   
3      2,0,5,0,0,0,0,0,0,2,0,0,0,6,0,0,0,0         2  0  5   0  0  0  ...   0   

  11 12  13 14 15    16    17    18    19  
0  0  0   0  0  0     0     0     0        
1  0  0  50  0     None  None  None  None  
2  0  0   0  4  0     0     0        None  
3  0  0   6  0  0     0     0  None  None  

[4 rows x 24 columns]

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  傲寒        
                
              
                            
                2021-01-18 08:47
              
            
            
                                                                       
So how about this. I made a csv from your sample data. 

When I import lines:

with open('test.csv','rb') as f:
    lines = list(csv.reader(f))
headers, values =lines[0],lines[1:]


to generate nice header names, use this line:

headers = [i or ind for ind, i in enumerate(headers)]


so because of how (I assume) csv works, headers should have a bunch of empty string values. empty strings evaluate to False, so this comprehension returns numbered columns for each column without a header.

Then just make a df:

df = pd.DataFrame(values,columns=headers)


which looks like:

11:         SAMPLE_TIME           POS         OFF   HISTOGRAM  4  5   6  7  8  9  \
0  15/07/2015 16:41     0-0-0-0-3           1           2  0  5  59  0  0  0   
1  15/07/2015 16:42     0-0-0-0-3           1           0  0  5   9  0  0  0   
2  15/07/2015 16:43     0-0-0-0-3           1           0  0  5   5  0  0  0   
3  15/07/2015 16:44     0-0-0-0-3           1           2  0  5   0  0  0  0   

  ... 12 13 14 15  16 17 18 19 20 21  
0 ...  2  0  0  0   0  0  0  0  0  0  
1 ...  2  0  0  0  50  0              
2 ...  2  0  0  0   0  4  0  0  0     
3 ...  2  0  0  0   6  0  0  0  0     

[4 rows x 22 columns]

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复