I have some HTML that looks like this:
<h1>Title</h1>
//a random amount of p/uls or tagless text
<h1>Next Title</h1>
I have the same problem. I'm not sure whether there is a better solution, but what I've done is use a regular expression to find the positions of the two nodes I'm looking for. Once I have those, I extract the HTML between the two indices and create a new BeautifulSoup object from it.
Example:
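import re
from bs4 import BeautifulSoup

# html holds the page source as a string
m = re.search(r'<h1>Title</h1>.*?<h1>', html, re.DOTALL)
s = m.start()
e = m.end() - len('<h1>')  # stop just before the next <h1>
target_html = html[s:e]
new_bs = BeautifulSoup(target_html, 'html.parser')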
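A minimal sketch of the same regex idea as a reusable function, using only the standard library. The function name html_between_headings and the sample string are made up for illustration. Unlike the snippet above, it returns only the markup after the matching heading (not the heading itself), returns None when the title is not found, and also handles the last heading on the page. It assumes the h1 tags carry no attributes.

```python
import re

def html_between_headings(html, title):
    """Return the markup after <h1>title</h1> up to the next <h1>, or None."""
    # The lookahead stops the lazy match at the next <h1 or at end of input.
    pattern = r"<h1>\s*" + re.escape(title) + r"\s*</h1>(.*?)(?=<h1\b|\Z)"
    match = re.search(pattern, html, re.DOTALL | re.IGNORECASE)
    return match.group(1) if match else None

# Hypothetical sample input, modeled on the question.
sample = "<h1>Title</h1>\n<p>hi</p>\nloose text\n<h1>Next Title</h1>\n<p>later</p>"

first_section = html_between_headings(sample, "Title")
last_section = html_between_headings(sample, "Next Title")
missing_section = html_between_headings(sample, "No Such Title")
```

The returned string can then be passed to BeautifulSoup as in the example above.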
Interesting question. There is no way to select it using the DOM alone. You'll have to loop through all elements up to and including the first h1 and put them into intro = str(intro), then get everything up to the second h1 into chapter1. Then remove the intro from chapter1 using
chapter = chapter1.replace(intro, '')
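The steps above could be sketched roughly like this. This is only one reading of the description: the helper serialize_up_to_heading and the sample markup are hypothetical, and it assumes the h1 tags sit at the top level of the parsed fragment. The replace is limited to one occurrence so that a repeated intro string elsewhere is left alone.

```python
from bs4 import BeautifulSoup

def serialize_up_to_heading(nodes, nth, inclusive):
    """Concatenate nodes up to the nth <h1>; include that <h1> only if inclusive."""
    parts = []
    seen = 0
    for node in nodes:
        if getattr(node, "name", None) == "h1":
            seen += 1
            if seen == nth:
                if inclusive:
                    parts.append(str(node))
                break
        parts.append(str(node))
    return "".join(parts)

# Hypothetical sample input with some content before the first heading.
html = "<p>preface</p><h1>Title</h1><p>hi</p>loose text<h1>Next Title</h1><p>later</p>"
soup = BeautifulSoup(html, "html.parser")

# Everything up to and including the first h1.
intro = serialize_up_to_heading(soup.contents, 1, inclusive=True)
# Everything up to, but not including, the second h1.
chapter1 = serialize_up_to_heading(soup.contents, 2, inclusive=False)
# Strip the intro prefix to leave only the first section's body.
chapter = chapter1.replace(intro, "", 1)
```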
Here is a complete, up-to-date solution:
Contents of temp.html:
<h1>Title</h1>
<p>hi</p>
//a random amount of p/uls or tagless text
<h1> Next Title</h1>
Code:
import copy
from bs4 import BeautifulSoup

with open("resources/temp.html") as file_in:
    soup = BeautifulSoup(file_in, "lxml")

print(f"Before:\n{soup.prettify()}")

first_header = soup.find("body").find("h1")
siblings_to_add = []

# Collect copies of every node between the first h1 and the next h1.
for curr_sibling in first_header.next_siblings:
    if curr_sibling.name == "h1":
        # Each insert lands directly after the h1, so the copies end up in reverse order.
        for curr_sibling_to_add in siblings_to_add:
            curr_sibling.insert_after(curr_sibling_to_add)
        break
    else:
        siblings_to_add.append(copy.copy(curr_sibling))

print(f"\nAfter:\n{soup.prettify()}")
Output:
Before:
<html>
 <body>
  <h1>
   Title
  </h1>
  <p>
   hi
  </p>
  //a random amount of p/uls or tagless text
  <h1>
   Next Title
  </h1>
 </body>
</html>
After:
<html>
 <body>
  <h1>
   Title
  </h1>
  <p>
   hi
  </p>
  //a random amount of p/uls or tagless text
  <h1>
   Next Title
  </h1>
  //a random amount of p/uls or tagless text
  <p>
   hi
  </p>
 </body>
</html>
This is the clear BeautifulSoup way, when the second h1 tag is a sibling of the first:
# Python 3: str() replaces the Python 2 unicode() call
html = ""
for tag in soup.find("h1").next_siblings:
    if tag.name == "h1":
        break
    else:
        html += str(tag)
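For reference, here is a self-contained sketch of the sibling-walking approach wrapped in a function, so it can start from any heading rather than only the first one. The name section_after and the sample markup are made up for illustration, and it assumes, as above, that the headings are siblings of each other.

```python
from bs4 import BeautifulSoup

def section_after(heading, stop_name="h1"):
    """Return the markup between `heading` and its next sibling named `stop_name`."""
    parts = []
    for node in heading.next_siblings:
        # Text nodes have no tag name, so getattr keeps the check safe.
        if getattr(node, "name", None) == stop_name:
            break
        parts.append(str(node))
    return "".join(parts)

# Hypothetical sample input, modeled on the question.
html = "<h1>Title</h1><p>hi</p>loose text<h1>Next Title</h1><p>later</p>"
soup = BeautifulSoup(html, "html.parser")

section = section_after(soup.find("h1", string="Title"))
last = section_after(soup.find("h1", string="Next Title"))
```

If no later h1 exists, the loop simply runs to the end of the siblings, so the last section is returned in full.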