I searched for a while and couldn't find a response to this. I have a standard TSV file with the following format:
1 100 101 350 A
1 101 102
So using grep, something like:
for L in `grep -oE '[A-Z]+$' input.txt | uniq | sort | uniq`
do
    grep -E "${L}\$" input.txt > "file.${L}.txt"
done
The pipeline grep -oE '[A-Z]+$' input.txt | uniq | sort | uniq should find all the unique keys, which you then use to re-parse the file multiple times. The uniq | sort | uniq sequence is there to reduce the amount of input handed to sort: the first uniq collapses runs of adjacent duplicates before sorting, and the second removes the remaining duplicates once sorting has made them adjacent.
If you really need to do it in a single pass, then you could process each line, and append it immediately to the appropriate output file.
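For illustration, here is a minimal sketch of that single-pass idea in Python, assuming whitespace-separated fields with the key in the last column; the in_file.txt and out_<key>.txt names are just placeholders, and it keeps each output handle open in a dict rather than reopening a file in append mode for every line:
out_files = {}
try:
    with open("in_file.txt") as f_in:
        for line in f_in:
            fields = line.split()
            if not fields:                  # skip blank lines
                continue
            key = fields[-1]                # last field is the grouping key
            if key not in out_files:        # open each output file only once
                out_files[key] = open(f"out_{key}.txt", "w")
            out_files[key].write(line)
finally:
    for f_out in out_files.values():
        f_out.close()
Because every output file is opened exactly once and then only written to, the keys do not have to be grouped or sorted in the input.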
Here is a small solution in Python using itertools.groupby and str.rpartition. Note that groupby only groups consecutive lines, so lines sharing a key need to be adjacent in the input (or the file sorted by the last column first); otherwise a later group will reopen the same output file in "w" mode and overwrite the earlier one:
from itertools import groupby

with open("in_file.txt") as f_in:
    # group consecutive lines that share the same last field
    for name, lines in groupby(f_in.readlines(), key=lambda x: x.rpartition(" ")[2].strip()):
        with open(f"out_{name}.txt", "w") as f_out:
            f_out.writelines(lines)
So, the single-pass, low-memory, line-by-line scripting approach:
while IFS=" " read -r value1 value2 value3 value4 value5 remainder
do
    echo $value1 $value2 $value3 $value4 $value5 $remainder >> output.${value5}.txt
done < "input.txt"
Of course, you need to ensure there are no pre-existing output files (since the loop appends with >>), but that can be achieved efficiently in a number of ways.
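For example, a small sketch in Python, assuming the output.<key>.txt naming used by the loop above:
import glob
import os

# Remove leftover output files from a previous run so the ">>" appends
# start from empty files; output.*.txt matches the naming scheme above.
for path in glob.glob("output.*.txt"):
    os.remove(path)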
With awk
awk '{out = "File" $NF ".txt"; print >> out; close(out)}' file
More efficient, closing the destination file only when the key in the last field changes rather than after every line:
awk '
$NF != dest {if (out) close(out); dest = $NF; out = "File" dest ".txt"}
{print >> out}
' file