Using BeautifulSoup to grab all the HTML between two tags

情深已故 2020-12-25 12:57

I have some HTML that looks like this:

<h1>Title</h1>

//a random amount of p/uls or tagless text

<h1> Next Title</h1>

4 Answers
  • 2020-12-25 13:30

    I have the same problem. I'm not sure whether there is a better solution, but what I've done is use a regular expression to find the positions of the two nodes I'm looking for. Once I have those, I slice out the HTML between the two indices and create a new BeautifulSoup object from it.

    Example:

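    import re
    from bs4 import BeautifulSoup
    
    # Match lazily from the first header up to the next opening <h1>
    m = re.search(r'<h1>Title</h1>.*?<h1>', html, re.DOTALL)
    s = m.start()
    e = m.end() - len('<h1>')  # drop the trailing '<h1>' from the match
    target_html = html[s:e]
    new_bs = BeautifulSoup(target_html, 'html.parser')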
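
The regex approach above can be wrapped in a small helper. This is only a sketch under some assumptions: the function name `html_between_headers` and the sample markup are made up for illustration, the headers are plain `<h1>` tags without attributes, and, unlike the snippet above, the first header itself is left out of the result. It also returns `None` instead of failing when the title is not found.

```python
import re

from bs4 import BeautifulSoup


def html_between_headers(html, title):
    """Return the markup after <h1>title</h1> and before the next <h1>, or None."""
    # Lazy match that stops right before the next opening <h1> or at end of input.
    pattern = r"<h1>\s*" + re.escape(title) + r"\s*</h1>(.*?)(?=<h1\b|\Z)"
    match = re.search(pattern, html, re.DOTALL)
    return match.group(1) if match else None


# Hypothetical sample input, modelled on the question.
sample = "<h1>Title</h1><p>hi</p>loose text<h1>Next Title</h1><p>later</p>"

between = html_between_headers(sample, "Title")

# Re-parse the slice if a tree is needed again.
fragment = BeautifulSoup(between, "html.parser")
```

Regex matching on raw HTML is fragile (attributes on the `h1`, different casing, or nested markup will break the pattern), so this is best kept for input whose shape is known.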
    
  • 2020-12-25 13:31

    Interesting question. You can't select this range with DOM navigation alone. You'll have to loop through all elements up to and including the first h1 and store them as a string (intro = str(intro)), then collect everything up to the second h1 into chapter1. Then remove the intro from chapter1 using:

    chapter = chapter1.replace(intro, '')
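
One possible reading of the steps described above, as a rough sketch. The helper `serialize_up_to_h1`, its parameters, and the sample markup are assumptions made for illustration; the answer does not say how `intro` and `chapter1` are built. Both headers are assumed to be direct children of the same parent.

```python
from bs4 import BeautifulSoup


def serialize_up_to_h1(container, which, inclusive):
    """Join the container's child nodes up to its `which`-th <h1> (1-based)."""
    parts, seen = [], 0
    for node in container.children:
        # Text nodes are not tags, so fall back to None for their name.
        is_h1 = getattr(node, "name", None) == "h1"
        if is_h1:
            seen += 1
        if is_h1 and seen == which:
            if inclusive:
                parts.append(str(node))
            break
        parts.append(str(node))
    return "".join(parts)


# Hypothetical sample input, modelled on the question.
html = "<div>lead<h1>Title</h1><p>hi</p>loose text<h1>Next Title</h1><p>later</p></div>"
soup = BeautifulSoup(html, "html.parser")
container = soup.find("h1").parent

intro = serialize_up_to_h1(container, 1, inclusive=True)      # through the first h1
chapter1 = serialize_up_to_h1(container, 2, inclusive=False)  # up to the second h1
chapter = chapter1.replace(intro, "", 1)                      # strip the intro once
```

Because `intro` is always a prefix of `chapter1` here, slicing with `chapter1[len(intro):]` would give the same result without relying on string search.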
    
  • 2020-12-25 13:39

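    Here is a complete, up-to-date example. It copies every node between the two h1 tags and inserts the copies after the second h1 (in reverse order, as the output shows):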

    Contents of temp.html:

    <h1>Title</h1>
    <p>hi</p>
    //a random amount of p/uls or tagless text
    <h1> Next Title</h1>
    

    Code:

    import copy
    
    from bs4 import BeautifulSoup
    
    with open("resources/temp.html") as file_in:
        soup = BeautifulSoup(file_in, "lxml")
    
    print(f"Before:\n{soup.prettify()}")
    
    first_header = soup.find("body").find("h1")
    
    siblings_to_add = []
    
    for curr_sibling in first_header.next_siblings:
        if curr_sibling.name == "h1":
            for curr_sibling_to_add in siblings_to_add:
                curr_sibling.insert_after(curr_sibling_to_add)
            break
        else:
            siblings_to_add.append(copy.copy(curr_sibling))
    
    print(f"\nAfter:\n{soup.prettify()}")
    

    Output:

    Before:
    <html>
     <body>
      <h1>
       Title
      </h1>
      <p>
       hi
      </p>
      //a random amount of p/uls or tagless text
      <h1>
       Next Title
      </h1>
     </body>
    </html>
    
    After:
    <html>
     <body>
      <h1>
       Title
      </h1>
      <p>
       hi
      </p>
      //a random amount of p/uls or tagless text
      <h1>
       Next Title
      </h1>
      //a random amount of p/uls or tagless text
      <p>
       hi
      </p>
     </body>
    </html>
    
  • 2020-12-25 13:44

    This is the clean BeautifulSoup way, which works when the second h1 tag is a sibling of the first:

    html = ""
    for tag in soup.find("h1").next_siblings:
        if tag.name == "h1":
            break  # stop at the next header
        html += str(tag)  # Python 3: str() replaces unicode()
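
For reuse, the sibling walk above can be put in a function. A minimal sketch, assuming Python 3 and the built-in `html.parser`; the name `html_until_next_h1` and the sample markup are made up for illustration.

```python
from bs4 import BeautifulSoup


def html_until_next_h1(first_header):
    """Serialize every sibling after `first_header` until the next <h1>."""
    parts = []
    for node in first_header.next_siblings:
        # Text nodes are not tags, so fall back to None for their name.
        if getattr(node, "name", None) == "h1":
            break
        parts.append(str(node))
    return "".join(parts)


# Hypothetical sample input, modelled on the question.
sample = "<h1>Title</h1><p>hi</p>loose text<h1>Next Title</h1><p>later</p>"
soup = BeautifulSoup(sample, "html.parser")

between = html_until_next_h1(soup.find("h1"))
```

Collecting the pieces in a list and joining once avoids repeated string concatenation. If no later h1 exists, the function simply returns everything up to the end of the parent.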
    