I am trying to write a regular expression to strip all HTML with the exception of links (the <a> and </a> tags respectively). So far it doesn't work.
How about
<(?!/?a\b)(.|\n)+?>
? The simpler <[^a](.|\n)+?> looks tempting, but it would also strip the closing </a> and keep any tag whose name merely starts with a, such as <abbr>.
In general there are problems with this approach. Regexes are best for 'flat' text matches; nested data pushes regex engines into areas for which they were not designed. General HTML parsing needs a parser, not a regex engine (Google the difference between regular and context-free languages if you want the full technical details).
It is easy to strip out all tags by replacing /</ and />/ with the empty string or their entity equivalents, but selectively filtering HTML with regexes will be vulnerable to a wide range of accidental or malicious inputs breaking things.
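To make the parser route concrete, here is a minimal sketch using PHP's DOMDocument that unwraps every element except a tags; the sample markup is invented and this is an illustration, not production code:
<?php
// Parse the markup, then unwrap everything that is not an <a> element.
$doc = new DOMDocument();
// @ silences warnings about sloppy real-world markup; note that
// loadHTML() also wraps the fragment in <html><body>.
@$doc->loadHTML('<p>See <a href="/x">this</a> <b>bold</b> text</p>');
$xpath = new DOMXPath($doc);
// Iterate over a static snapshot so removals don't disturb the walk.
foreach (iterator_to_array($xpath->query('//body//*')) as $node) {
    if (strtolower($node->nodeName) === 'a') {
        continue; // keep links intact
    }
    // Promote the children, then drop the now-empty element.
    while ($node->firstChild) {
        $node->parentNode->insertBefore($node->firstChild, $node);
    }
    $node->parentNode->removeChild($node);
}
echo $doc->saveHTML();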
I keep going on about it, but there's no way I can recommend regexr too often. It's fantastic for testing this type of thing.
Here you go:
{<(?!(?:i|b|h[1-6]|/i|/b|/h[1-6])[\s>/])[^>]*>}
This keeps the i, b and h1-h6 tags and strips everything else. Note that the boundary class [\s>/] has to apply to every alternative; without it, tags like <img> and <blockquote> would slip through. To keep only links, swap the tag list for a|/a.
<(?!\/?a(?=>|\s.*>))\/?.*?>
Try this. I had something similar for p tags; it worked for them, so I don't see why this wouldn't. It uses a negative lookahead to avoid matching an a (prefixed with an optional / character) where, using a positive lookahead, that a (again with the optional / prefix) is followed by > or by a space, some attributes and then >. The expression then matches up to the next > character. Put this in a substitution:
s/<(?!\/?a(?=>|\s.*>))\/?.*?>//g;
This should leave only the opening and closing a tags.
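As a quick sanity check, here is the same pattern dropped into PHP's preg_replace; the sample input below is made up, and this is just a sketch:
<?php
$html = '<p>Read <a href="https://example.com">this</a> <b>now</b></p>';
// Strip every tag except opening and closing <a> tags.
echo preg_replace('{<(?!/?a(?=>|\s.*>))/?.*?>}', '', $html);
// Prints: Read <a href="https://example.com">this</a> now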
strip_tags() does this.
Here, I am keeping all <a><p><font><b><i><sup> tags and outputting a tidied version:
cat input.htm | tr -d '\n' | php -r '$input=fgets(STDIN); echo strip_tags($input,"<a><p><font><b><i><sup>");' | tidy -i -wrap 0 -o output.htm
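Since the question only asks to keep links, the same idea stripped down to plain PHP (the file name is a placeholder):
<?php
// The second argument lists the tags strip_tags() should leave alone.
$html = file_get_contents('input.htm');
echo strip_tags($html, '<a>');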