How to cut an HTML tag, together with its content, from a very large multiline text file using perl, sed or awk?

伪装坚强ぢ 2021-01-28 08:10

I want to transform this text (remove <math>.*?</math>) with sed, awk or perl:

{|
|-
| colspan="2"|
:  <math>[\underbrace{\col
4 answers

臣服心动 (original poster) 2021-01-28 08:54

This should do it:

perl -0777 -pe 's!<math>.*?</math>!!sg' dirt-math.txt


-p says we're doing a sed-like readline/printline loop, -0777 says each "line" is actually the whole input file, and -e specifies the code to run (on each "line" (file)).



If your text files are too big to fit into memory (?!), you can try this:

perl -pe 's!<math>.*?</math>!!g; if ($cut) { if (s!^.*?</math>!!) { $cut = 0 } else { $_ = "" } } if (!$cut && s!<math>.*!!s) { $cut = 1 }' dirt-math.txt

or (slightly more readable):

perl -pe '
    s!<math>.*?</math>!!g;
    if ($cut) {
        if (s!^.*?</math>!!) { $cut = 0 }
        else { $_ = "" }
    }
    if (!$cut && s!<math>.*!!s) { $cut = 1 }
' dirt-math.txt

This is effectively a little state machine.

$cut records whether we're inside an unclosed <math> tag (and so need to cut out input). If so, we check whether we were able to find and remove </math>. If we were, we're done cutting (we found a closing </math> tag); otherwise we overwrite the "current line" with the empty string ($_ = ""; this is the actual cutting part).

If, after this, we're not cutting (we're not using else, so that we also handle the case where ... </math> not math text <math> ... appears on a single line), we try to remove <math> and everything after it on the line. If that succeeds, we've just seen an opening <math> tag and need to start cutting.
    
             
                                                        
            
            
              
                