read_csv with missing/incomplete header or irregular number of columns


小鲜肉 2021-01-18 08:12

I have a file.csv with ~15k rows that looks like this:

SAMPLE_TIME,          POS,        OFF,  HISTOGRAM
2015-07-15 16:41:56,  0-0-0-0-3,   1,


      
      
        
别那么骄傲 (OP) 2021-01-18 08:35

You can create columns based on the length of the first actual row:

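import pandas as pd

with open("out.txt") as f:
    h = next(f)                       # header line
    ln = len(next(f).split(","))      # field count of the first data row
    header = h.strip().split(",")
    # Append integer names 0..ln-1. This is more names than the row has
    # fields, so the surplus trailing columns come out as all NaN.
    header += range(ln)
    f.seek(0)                         # rewind to the start of the file
    next(f)                           # skip the header line again
    print(pd.read_csv(f, names=header))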


Which will give you:

          SAMPLE_TIME           POS          OFF    HISTOGRAM  0  1   2  3  \
0  2015-07-15 16:41:56     0-0-0-0-3            1            2  0  5  59  0   
1  2015-07-15 16:42:55     0-0-0-0-3            1            0  0  5   9  0   
2  2015-07-15 16:43:55     0-0-0-0-3            1            0  0  5   5  0   
3  2015-07-15 16:44:56     0-0-0-0-3            1            2  0  5   0  0   

   4  5 ...  13  14  15  16  17  18  19  20  21  22  
0  0  0 ...   0   0   0   0   0 NaN NaN NaN NaN NaN  
1  0  0 ...   0 NaN NaN NaN NaN NaN NaN NaN NaN NaN  
2  0  0 ...   4   0   0   0 NaN NaN NaN NaN NaN NaN  
3  0  0 ...   0   0   0   0 NaN NaN NaN NaN NaN NaN  

[4 rows x 27 columns]
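
The first data row is not necessarily the widest one. A minimal sketch that sizes the name list from the longest row instead, assuming a plain comma-separated file with no quoted commas. The helper name `read_ragged_csv` and the sample data are made up for illustration:

```python
import io

import pandas as pd


def read_ragged_csv(handle):
    """Hypothetical helper: pad the header so the widest row fits."""
    header = handle.readline().strip().split(",")
    # Pre-scan the remaining lines for the largest field count.
    width = max((len(line.rstrip("\n").split(",")) for line in handle), default=0)
    # Only name the fields that the header does not already cover.
    names = header + list(range(max(width - len(header), 0)))
    handle.seek(0)
    # skiprows=1 skips the original header line, which is replaced by `names`.
    return pd.read_csv(handle, names=names, skiprows=1)


# Made-up sample in the shape of the question's file.
sample = io.StringIO(
    "SAMPLE_TIME,POS,OFF,HISTOGRAM\n"
    "2015-07-15 16:41:56,0-0-0-0-3,1,2,0,5,59\n"
    "2015-07-15 16:42:55,0-0-0-0-3,1,0,0,5,9,0,50\n"
)
df = read_ragged_csv(sample)
print(df)
```

Because the names stop at the widest row, no surplus all-NaN columns are created; shorter rows are still padded with NaN.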


Or you could clean the file before passing to pandas:

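import pandas as pd
from tempfile import TemporaryFile

with open("in.csv") as f, TemporaryFile("w+") as t:
    for line in f:
        # Remove every space. Note this also removes the space inside
        # the timestamp, as the output below shows.
        t.write(line.replace(" ", ""))
    t.seek(0)
    # `line` still holds the last row of the file; use its field count.
    ln = len(line.strip().split(","))
    header = t.readline().strip().split(",")
    header += range(ln)               # integer names for the unnamed columns
    print(pd.read_csv(t, names=header))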


Which gives you:

          SAMPLE_TIME        POS  OFF  HISTOGRAM  0  1   2  3  4  5 ...  11  \
0  2015-07-1516:41:56  0-0-0-0-3    1          2  0  5  59  0  0  0 ...   0   
1  2015-07-1516:42:55  0-0-0-0-3    1          0  0  5   9  0  0  0 ...   0   
2  2015-07-1516:43:55  0-0-0-0-3    1          0  0  5   5  0  0  0 ...   0   
3  2015-07-1516:44:56  0-0-0-0-3    1          2  0  5   0  0  0  0 ...   0   

   12  13  14  15  16  17  18  19  20  
0   0   0   0   0   0   0 NaN NaN NaN  
1  50   0 NaN NaN NaN NaN NaN NaN NaN  
2   0   4   0   0   0 NaN NaN NaN NaN  
3   6   0   0   0   0 NaN NaN NaN NaN  

[4 rows x 25 columns]
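
Removing every space also fuses the date and time in SAMPLE_TIME, as the output above shows. A hedged alternative, assuming the stray whitespace only follows the commas: the `skipinitialspace=True` option of `pd.read_csv` skips spaces after each delimiter and leaves spaces inside a field alone. The sample data below is made up for illustration:

```python
import io

import pandas as pd

# Made-up sample in the shape of the question's file.
raw = (
    "SAMPLE_TIME,          POS,        OFF,  HISTOGRAM\n"
    "2015-07-15 16:41:56,  0-0-0-0-3,   1,    2,0,5,59\n"
    "2015-07-15 16:42:55,  0-0-0-0-3,   1,    0,0,5,9,0\n"
)
lines = raw.splitlines()
header = [name.strip() for name in lines[0].split(",")]
# The widest data row decides how many integer-named columns to add.
width = max(len(line.split(",")) for line in lines[1:])
names = header + list(range(width - len(header)))

df = pd.read_csv(
    io.StringIO(raw),
    names=names,
    skiprows=1,             # the header line was already parsed by hand
    skipinitialspace=True,  # drop the spaces that follow each comma
)
print(df)
```

This keeps the timestamp intact and needs no temporary file.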


Or, to drop the columns that are all NaN:

print(pd.read_csv(f, names=header).dropna(axis=1, how="all"))


Gives you:

           SAMPLE_TIME           POS          OFF    HISTOGRAM  0  1   2  3  \
0  2015-07-15 16:41:56     0-0-0-0-3            1            2  0  5  59  0   
1  2015-07-15 16:42:55     0-0-0-0-3            1            0  0  5   9  0   
2  2015-07-15 16:43:55     0-0-0-0-3            1            0  0  5   5  0   
3  2015-07-15 16:44:56     0-0-0-0-3            1            2  0  5   0  0   

   4  5 ...  8  9  10  11  12  13  14  15  16  17  
0  0  0 ...  2  0   0   0   0   0   0   0   0   0  
1  0  0 ...  2  0   0   0  50   0 NaN NaN NaN NaN  
2  0  0 ...  2  0   0   0   0   4   0   0   0 NaN  
3  0  0 ...  2  0   0   0   6   0   0   0   0 NaN  

[4 rows x 22 columns]
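
A small sketch on a made-up frame of what `how="all"` does: a column is dropped only when every value in it is NaN, so partly filled columns survive.

```python
import pandas as pd

nan = float("nan")

# Made-up frame: column 1 is partly filled, column 2 is entirely NaN.
df = pd.DataFrame(
    {
        "HISTOGRAM": [2, 0],
        0: [0.0, 5.0],
        1: [59.0, nan],
        2: [nan, nan],
    }
)
trimmed = df.dropna(axis=1, how="all")
print(list(trimmed.columns))
```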

    
             
                                                        
            
            
              
                