Splitting Rows in csv on several header rows

前端未结

关注

 2  1406

I am very new to python, so please be gentle.

I have a .csv file, reported to me in this format, so I cannot do much about it:

ClientAccountID   Acc


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  谎友^        
                
              
                            
                2021-01-03 16:16
              
            
            
                                                                       
If I've understood your question correctly, you have a single CSV file which contains multiple tables. Tables are delimited by header rows which always begin with the string "ClientAccountID".

So the job is to read the CSV file into a list of lists-of-dictionaries. Each entry in the list corresponds to one of the tables in your CSV file.

Here's how I'd do it:


Break up the single CSV file with multiple tables into multiple files each with a single table. (These files could be in-memory.) Do this by looking for lines which start with "ClientAccountID".
Read each of these files into a list of dictionaries using a DictReader.


Here's some code to read the file into a list of StringIOs. (A StringIO is an in-memory file. It works by wrapping a string up into a file-like interface).

from csv import DictReader
from io import StringIO

stringios = []

with open('file.csv', 'r') as f:
    stringio = None
    for line in f:
        if line.startswith('ClientAccountID'):
            if stringio is not None:
                stringio.seek(0)
                stringios.append(stringio)
            stringio = StringIO()
        stringio.write(line)
        stringio.write("\n")
    stringio.seek(0)
    stringios.append(stringio)


If we encounter a line starting with 'ClientAccountID', we put the current StringIO into the list and start writing to a new one. When you've finished, remember to add the last one to the list too.
Don't forget (as I did, in an earlier version of this answer) to rewind the StringIO after you've written to it using stringio.seek(0). 

Now it's straightforward to loop over the StringIOs to get a table of dictionaries.

data = [list(DictReader(x, delimiter='\t')) for x in stringios]


For each file-like object in the list stringios, create a DictReader and read it into a list.

It's not too hard to modify this approach if your data is too big to fit into memory. Use generators instead of lists and do the processing line-by-line.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  刺人心        
                
              
                            
                2021-01-03 16:22
              
            
            
                                                                       
If your data was not comma or tab delimited you could use str.split, you can combine it with itertools.groupby to delimit the headers and rows:

from itertools import groupby, izip, imap

with open("test.txt") as f:
    grps, data = groupby(imap(str.split, f), lambda x: x[0] == "ClientAccountID"), []
    for k, v in grps:
        if k:
            names = next(v)
            vals = izip(*next(grps)[1])
            data.append(dict(izip(names, vals)))

from pprint import pprint as pp

pp(data)


Output:

[{'AccountAlias': ('SomeAlias', 'OtherAlias'),
  'ClientAccountID': ('SomeID', 'OtherID'),
  'CurrencyPrimary': ('SomeCurr', 'OtherCurr'),
  'FromDate': ('SomeDate', 'OtherDate')},
 {'AccountAlias': ('SomeAlias', 'OtherAlias', 'AnotherAlias'),
  'AssetClass': ('SomeClass', 'OtherDate', 'AnotherDate'),
  'ClientAccountID': ('SomeID', 'OtherID', 'AnotherID'),
  'CurrencyPrimary': ('SomeCurr', 'OtherCurr', 'AnotherCurr')}]


If it is tab delimited just change one line:

with open("test.txt") as f:
    grps, data = groupby(csv.reader(f, delimiter="\t"), lambda x: x[0] == "ClientAccountID"), []
    for k, v in grps:
        if k:
            names = next(v)
            vals = izip(*next(grps)[1])
            data.append(dict(izip(names, vals)))

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复