How to cut an HTML tag and its content from a very large multiline text file using perl, sed or awk?

I want to transform this text (remove <math>.*?</math>) with sed, awk or perl:

{|
|-
| colspan=\"2\"|
:  <math>[\\underbrace{\\col


                      
4 answers

Answer by 臣服心动, 2021-01-28 08:54
                                                                       
This should do it:

perl -0777 -pe 's!<math>.*?</math>!!sg' dirt-math.txt


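-p wraps the code in a sed-like loop that reads each line, runs the code on it, and prints the result; -0777 makes each "line" the whole input file; and -e specifies the code to run (on each "line", i.e. the file). The /s modifier lets . match newlines, so tags that span several lines are removed, and /g repeats the substitution for every tag in the file.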



If your text files are too big to fit into memory (?!), you can try this:

perl -pe 's!<math>.*?</math>!!g; if ($cut) { if (s!^.*?</math>!!) { $cut = 0 } else { $_ = "" } } if (!$cut && s!<math>.*!!s) { $cut = 1 }' dirt-math.txt


or (slightly more readable):

perl -pe '
    s!<math>.*?</math>!!g;
    if ($cut) {
        if (s!^.*?</math>!!) { $cut = 0 }
        else { $_ = "" }
    }
    if (!$cut && s!<math>.*!!s) { $cut = 1 }
' dirt-math.txt


This is effectively a little state machine.

$cut records whether we're in an unclosed <math> tag (and so need to cut out input). If so, we check whether we were able to find/remove </math>. If so, we're done cutting (we found a closing </math> tag); otherwise we overwrite the "current line" with the empty string ($_ = ""; this is the actual cutting part).

If we are not cutting after this step, we try to remove <math>... from the rest of the line. (This is a separate if rather than an else so that a line such as "... </math> not math text <math> ...", which closes one tag and opens another, is handled correctly.) If that substitution succeeds, we have just seen an opening <math> tag and need to start cutting.
                                                                        
                                                        
            
            
              
                
Answer by 青春惊慌失措, 2021-01-28 09:01
                                                                       
This isn't quite a one-liner, but it does what you're looking for. As always, there are many ways of doing this. Here I use '|' as the record separator and ':' as the field separator. That lets me iterate over the fields of any record that contains math and print only the fields that don't contain <math></math>.

BEGIN {RS="|";FS=":";ORS=""}

/math/ {
    for (i=1;i<=NF;i++) {
        if ($i ~ /math/) {print ":\n"}
        else {print $i}
    }
    print "|";next;
}

/^\}/ {
    print "}";
    next;
}

{
    print $0"|"
}

END {print "\n"}

                                                                        
                                                        
            
            
              
                
Answer by 再見小時候, 2021-01-28 09:09
                                                                       
If all your data is as nicely formatted as in your example, then your solution is very close. I modified it only slightly.

In AWK:

sub(/<math>.*/, "") {print; cut=1}
/<\/math>/          {cut=0; next}
!cut
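A quick way to check this script, and to see what "nicely formatted" has to mean for it, is to run it on a few lines. The wrapper name strip_math_awk and the sample input below are assumptions for illustration.

```shell
strip_math_awk() {
  awk '
    sub(/<math>.*/, "") { print; cut = 1 }   # opening line: keep the text before <math>
    /<\/math>/          { cut = 0; next }    # closing line: stop cutting, drop the whole line
    !cut                                     # print every line outside a tag
  ' "$@"
}

printf '%s\n' 'keep 1' ': <math>x' 'hidden' 'y</math>' 'keep 2' | strip_math_awk
# prints:
# keep 1
# :          (the colon is followed by one space)
# keep 2
```

The script assumes that <math> and </math> sit on different lines and that nothing worth keeping follows </math> on its line. If both tags appear on one line, the sub() removes the closing tag too, so cut stays set until some later line contains </math>.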

                                                                        
                                                        
            
            
              
                
Answer by 名媛妹妹, 2021-01-28 09:10
                                                                       
This can also be done with the .. flip-flop (not range) operator, without reading the whole file into memory, by cutting everything from <math> onward on the opening line:

perl -wlne 'unless(((/<math>/../<\/math>/)||0) > 1){s/<math>.*//;print}' your-file

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复