Parse HTML with jsoup and remove a tag block

傲寒 2021-01-05 06:00

I want to remove a tag along with everything inside it. An example input might be:

Input:


  start
  
delete from below
4 Answers
  • 2021-01-05 06:09

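    Try this code:

    // imports: java.io.BufferedReader, java.io.FileReader
    StringBuilder builder = new StringBuilder();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader br = new BufferedReader(new FileReader("e://XMLFile.xml"))) {
        String data;
        while ((data = br.readLine()) != null) {
            builder.append(data);
        }
    }
    System.out.println(builder);
    // The lazy match stops at the first </div>, so a div.XYZ that
    // contains nested divs is only partly removed.
    String replaceAll = builder.toString().replaceAll("<div class=\"XYZ\".+?</div>", "");
    System.out.println(replaceAll);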
    

    I read the input XML from a file line by line into a StringBuilder, and then replaced the entire tag with an empty string.
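
    A caveat with the regex route: a lazy match ends at the first closing div, so a block that contains nested divs is cut short. Below is a minimal sketch of a depth-counting alternative using only the standard library. The class and method names are hypothetical, and it assumes well-formed markup in which the target tag is written exactly as <div class="XYZ"> and div tags are lowercase. A real HTML parser such as jsoup is the more robust choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class NestedDivRemover {
    // Matches either an opening <div ...> tag or a closing </div> tag.
    private static final Pattern DIV_TAG = Pattern.compile("<div\\b[^>]*>|</div>");
    // Assumption: the block to remove always opens with exactly this text.
    private static final String TARGET_OPEN = "<div class=\"XYZ\">";

    static String removeBlocks(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = DIV_TAG.matcher(html);
        int copiedUpTo = 0; // start of the text that still has to be copied
        int depth = 0;      // div nesting depth inside the block being dropped
        while (m.find()) {
            String tag = m.group();
            if (depth == 0) {
                if (tag.equals(TARGET_OPEN)) {
                    // Keep everything before the block, then start skipping.
                    out.append(html, copiedUpTo, m.start());
                    depth = 1;
                }
            } else if (tag.startsWith("</")) {
                depth--;
                if (depth == 0) {
                    copiedUpTo = m.end(); // resume copying after the block
                }
            } else {
                depth++; // a nested div inside the block
            }
        }
        // If a block was never closed, the rest of the input is dropped.
        if (depth == 0) {
            out.append(html, copiedUpTo, html.length());
        }
        return out.toString();
    }
}
```

    For the input a<div class="XYZ">x<div>y</div>w</div>b this returns ab, whereas the lazy regex would stop after y</div> and leave w</div> behind.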

  • 2021-01-05 06:23

    I asked this question yesterday, and thanks to ollo's answer it was solved. There is an extension of the above problem. I did not know whether I should start a new post or continue this one, so I am continuing it here. Admins, please pardon me if I should have made a separate post for this.

    In the above problem, I had to remove a tag block matching a given selector.

    The real scenario is: it should remove the matching tag block and also remove the <br /> tags surrounding it.

    Referring to the above example.

    <body>
      start
      <div>
        delete from below
        <br />
        <br />
        <div class="XYZ">
          first div having this class
          <div>
            waste
          </div>
          <div class="XYZ">
            second div having this class
          </div>
          waste
        </div>
        <br />
        delete till above
      </div>
      <div>
        this will also remain
      </div>
      end
    </body>
    

    should also give the same output:

    <body>
      start
      <div>
        delete from below
        delete till above
      </div>
      <div>
        this will also remain
      </div>
      end
    </body>
    

    Because there are <br /> tags above and below the HTML tag block to remove.

    Just to reiterate, I am using the solution given by ollo to match and remove the tag block.

    for( Element element : doc.select("div.XYZ") )
    {
        element.remove();
    }
    

    Thanks, Shekhar
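
    One possible way to handle the surrounding <br /> tags is sketched below. It walks the siblings on both sides of each match and removes <br> elements until it meets something other than a <br> or whitespace-only text. The class and method names are hypothetical, and it assumes jsoup is on the classpath and that only breaks directly adjacent to the block should go.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

// Hypothetical helper, not part of jsoup.
final class RemoveBlockWithBreaks {

    // Removes <br> siblings next to 'start' in one direction, skipping
    // whitespace-only text, and stops at the first other node.
    static void removeAdjacentBreaks(Node start, boolean forward) {
        Node node = forward ? start.nextSibling() : start.previousSibling();
        while (node != null) {
            // Read the following sibling before a possible removal detaches 'node'.
            Node following = forward ? node.nextSibling() : node.previousSibling();
            boolean isBreak = "br".equalsIgnoreCase(node.nodeName());
            boolean isBlankText = node instanceof TextNode && ((TextNode) node).isBlank();
            if (!isBreak && !isBlankText) {
                break;
            }
            if (isBreak) {
                node.remove();
            }
            node = following;
        }
    }

    static Document clean(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        for (Element element : doc.select(cssQuery)) {
            removeAdjacentBreaks(element, false); // breaks above the block
            removeAdjacentBreaks(element, true);  // breaks below the block
            element.remove();
        }
        return doc;
    }
}
```

    Calling RemoveBlockWithBreaks.clean(html, "div.XYZ") on the example above should leave no div.XYZ and no <br /> in the document, while the text on either side of the block is kept.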

  • 2021-01-05 06:24

    You had better iterate over all elements found, so you can be sure that

    • a.) all elements are removed and
    • b.) there's nothing done if there's no element.

    Example:

    Document doc = ...
    
    for( Element element : doc.select("div.XYZ") )
    {
        element.remove();
    }
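
    For completeness, here is a small self-contained sketch of the same idea, assuming jsoup is on the classpath. The class and method names are hypothetical. It uses Elements.remove(), which removes every matched element from the DOM and so should be equivalent to the loop; it also does nothing when the selection is empty.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Hypothetical helper, not part of jsoup.
final class RemoveAllMatches {
    static Document removeAll(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        // Removes each matched element (and its children) from the document.
        doc.select(cssQuery).remove();
        return doc;
    }
}
```

    Nested matches are not a problem here: once the outer div.XYZ is removed, the inner one goes with it.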
    

    Edit:

    ( An addition to my comment )

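    Don't rely on exception handling when a simple null or range check is enough. The following throws a NullPointerException when nothing matches, because first() returns null for an empty selection:

    doc.select("div.XYZ").first().remove();
    

    Instead, check first:

    Elements divs = doc.select("div.XYZ");
    
    if( !divs.isEmpty() )
    {
        /*
         * Here it's safe to call 'first()' since there is at least one element.
         */
        divs.first().remove();
    }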
    
  • 2021-01-05 06:29

    This may help you.

    // Select some specific tags
    String selectTags = "div,li,p,ul,ol,span,table,tr,td,address,em";
    Elements webContentElements = parsedDoc.select(selectTags);
    // Remove some tags from within the selected elements
    String removeTags = "img,a,form";
    webContentElements.select(removeTags).remove();
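
    A self-contained sketch of this two-step selection follows, assuming jsoup is on the classpath. The class and method names are hypothetical. One thing to be aware of: the second select only searches within the elements picked by the first one, so a tag that sits outside all of them is left alone.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// Hypothetical helper, not part of jsoup.
final class NestedTagStripper {
    static Document strip(String html, String containerTags, String removeTags) {
        Document doc = Jsoup.parse(html);
        Elements containers = doc.select(containerTags);
        // Search only inside the containers, then detach every hit from the document.
        containers.select(removeTags).remove();
        return doc;
    }
}
```

    For example, with containers "div,p" and tags to remove "img,a", a link nested in a paragraph is removed, but a link placed directly under the body stays.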
    