Parse HTML with jsoup and remove the tag block

I want to remove a tag together with everything inside it. An example input might be:

Input:


  start
  
    delete from below

4 Answers

独厮守ぢ (2021-01-05 06:09)

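Try this code:

String data;
StringBuilder builder = new StringBuilder();
// try-with-resources closes the reader even if reading fails
try (BufferedReader br = new BufferedReader(new FileReader("e://XMLFile.xml"))) {
    while ((data = br.readLine()) != null) {
        builder.append(data); // line breaks are dropped, so '.' can match across the original lines
    }
}
System.out.println(builder);
// Non-greedy match: it ends at the first </div> after the opening tag
String replaceAll = builder.toString().replaceAll("<div class=\"XYZ\".+?</div>", "");
System.out.println(replaceAll);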


I read the input XML from a file line by line into a StringBuilder, and then replaced the entire <div class="XYZ"> tag with an empty string.
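
One caveat with the regex approach: `.+?` is non-greedy, so the match ends at the first `</div>` it meets. That is fine for a flat block, but with nested divs (as in the nested example in this thread) part of the block is left behind. A minimal stdlib-only sketch of that behaviour follows. The class name `RegexDivDemo`, the method `strip`, and the sample strings are made up for illustration, and `Pattern.DOTALL` is an addition so that `.` also matches line breaks.

```java
import java.util.regex.Pattern;

public class RegexDivDemo {

    // Same pattern as in the answer above; DOTALL is added so '.' also matches line breaks.
    private static final Pattern XYZ_DIV =
            Pattern.compile("<div class=\"XYZ\".+?</div>", Pattern.DOTALL);

    /** Removes every match of the pattern from the given markup. */
    public static String strip(String html) {
        return XYZ_DIV.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String flat = "<p>keep</p><div class=\"XYZ\">drop</div><p>keep too</p>";
        String nested = "<div class=\"XYZ\">first<div>waste</div>more waste</div><p>keep</p>";

        // Flat block: removed completely.
        System.out.println(strip(flat));   // <p>keep</p><p>keep too</p>

        // Nested block: the match stops at the inner </div>, leaving a tail behind.
        System.out.println(strip(nested)); // more waste</div><p>keep</p>
    }
}
```

For nested markup, a parser-based removal such as `doc.select("div.XYZ")` followed by `remove()` on each element does not have this problem.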

慢半拍i (2021-01-05 06:23)

I asked this question yesterday and, thanks to ollo's answer, it was solved.
There is an extension of the above problem. I did not know whether to start a new post or continue this one, so I am adding it here. Admins, please pardon me if this should have been a separate post.

In the above problem, I had to remove a tag block with a matching class.

The real scenario is:
It should remove the tag block with the matching class and also remove the <br /> tags surrounding it.

Referring to the above example.

<body>
  start
  <div>
    delete from below
    <br />
    <br />
    <div class="XYZ">
      first div having this class
      <div>
        waste
      </div>
      <div class="XYZ">
        second div having this class
      </div>
      waste
    </div>
    <br />
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>


It should also give the same output:

<body>
  start
  <div>
    delete from below
    delete till above
  </div>
  <div>
    this will also remain
  </div>
  end
</body>


Because it has <br /> tags above and below the HTML tag block to remove.

To reiterate, I am using the solution given by ollo to match and remove the tag block.

for( Element element : doc.select("div.XYZ") )
{
    element.remove();
}


Thanks,
Shekhar
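
One possible way to handle the follow-up above is to look at the nodes next to each matched div before removing it, and to drop any <br /> that is separated from the div only by whitespace. This is a sketch, not ollo's answer: the class `RemoveBlockWithBreaks` and its helpers are hypothetical names, jsoup is assumed to be on the classpath, and a <br /> with real text between it and the div is deliberately left alone.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class RemoveBlockWithBreaks {

    /** True for a <br> element. */
    private static boolean isBr(Node node) {
        return node instanceof Element && ((Element) node).tagName().equals("br");
    }

    /** True for a text node that holds only whitespace. */
    private static boolean isBlankText(Node node) {
        return node instanceof TextNode && ((TextNode) node).isBlank();
    }

    /** Walks away from 'start' in one direction and removes the <br> run next to it. */
    private static void removeAdjacentBreaks(Node start, boolean forward) {
        Node current = forward ? start.nextSibling() : start.previousSibling();
        while (current != null && (isBr(current) || isBlankText(current))) {
            // Fetch the neighbour first: a removed node no longer has siblings.
            Node next = forward ? current.nextSibling() : current.previousSibling();
            if (isBr(current)) {
                current.remove();
            }
            current = next;
        }
    }

    /** Removes every div.XYZ plus the <br> elements directly around it. */
    public static String clean(String html) {
        Document doc = Jsoup.parse(html);
        for (Element element : doc.select("div.XYZ")) {
            removeAdjacentBreaks(element, false); // <br> elements before the block
            removeAdjacentBreaks(element, true);  // <br> elements after the block
            element.remove();
        }
        return doc.body().html();
    }
}
```

Calling `RemoveBlockWithBreaks.clean(html)` on the example input above should leave only the "delete from below" and "delete till above" text inside the first div. The walk stops at the first node that is neither a <br /> nor blank text, so line breaks elsewhere in the document are untouched.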

栀梦 (2021-01-05 06:24)

You'd better iterate over all the elements found, so you can be sure that


a.) all elements are removed and
b.) nothing is done if there's no element.


Example:

Document doc = ...

for( Element element : doc.select("div.XYZ") )
{
    element.remove();
}




Edit:

( An addition to my comment )

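Don't use exception handling when a simple null / range check is enough. The line below throws a NullPointerException when nothing matches, because first() returns null for an empty result:

doc.select("div.XYZ").first().remove();


Instead, check the result first: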

Elements divs = doc.select("div.XYZ");

if( !divs.isEmpty() )
{
    /*
     * Here it's safe to call 'first()' since there is at least one element.
     */
    divs.first().remove();
}
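
Both variants above can be wrapped in small helpers. The sketch below assumes jsoup is on the classpath; `SafeRemoval`, `removeFirst`, and `removeAll` are illustrative names, not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SafeRemoval {

    /** Removes the first match only; returns false when nothing matched. */
    public static boolean removeFirst(Document doc, String cssQuery) {
        Elements matches = doc.select(cssQuery);
        if (matches.isEmpty()) {
            return false; // nothing to remove, and no NullPointerException
        }
        matches.first().remove();
        return true;
    }

    /** Removes every match; returns how many elements were selected. */
    public static int removeAll(Document doc, String cssQuery) {
        Elements matches = doc.select(cssQuery);
        for (Element match : matches) {
            match.remove();
        }
        return matches.size();
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<div>keep</div><div class=\"XYZ\">drop</div>");
        System.out.println(removeFirst(doc, "div.ABC")); // false: no such element
        System.out.println(removeAll(doc, "div.XYZ"));   // 1
        System.out.println(doc.body().html());
    }
}
```

If no count is needed, `doc.select("div.XYZ").remove()` does the same as the loop in one call, and it is also a no-op on an empty result.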


情话喂你 (2021-01-05 06:29)

This may help you.

// parsedDoc is assumed to be an already parsed jsoup Document
// Select some specific tags
String selectTags = "div,li,p,ul,ol,span,table,tr,td,address,em";
Elements webContentElements = parsedDoc.select(selectTags);
// Remove some tags found within the selected elements
String removeTags = "img,a,form";
webContentElements.select(removeTags).remove();
