Using BeautifulSoup to grab all the HTML between two tags


I have some HTML that looks like this:

```
<h1>Title</h1>
//a random amount of p/uls or tagless text
<h1> Next Title</h1>
```


4 Answers

庸人自扰

2020-12-25 13:30
I had the same problem. I'm not sure whether there is a better solution, but what I did was use a regular expression to find the indices of the two nodes I'm looking for. With those in hand, I extract the HTML between the two indices and create a new BeautifulSoup object from it.

Example:
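```
import re
from bs4 import BeautifulSoup

# html holds the page source as a string
m = re.search(r'<h1>Title</h1>.*?<h1>', html, re.DOTALL)
s = m.start()                 # start of the first <h1>
e = m.end() - len('<h1>')     # stop just before the next <h1>
target_html = html[s:e]
new_bs = BeautifulSoup(target_html, 'html.parser')
```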
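The snippet above raises an `AttributeError` when the pattern does not match, and it finds nothing when the header is the last one on the page. Below is a minimal sketch of the same regex idea wrapped in a function that covers both cases. The name `html_between_h1` and the sample string are made up for illustration, and it assumes the starting header is a plain `<h1>` with no attributes. Unlike the snippet above, it leaves the starting header out of the result.

```python
import re


def html_between_h1(html, title):
    """Return the markup after <h1>title</h1> up to the next <h1>, or None.

    Hypothetical helper: assumes the starting header has no attributes.
    """
    start = re.search(r"<h1>\s*" + re.escape(title) + r"\s*</h1>", html)
    if start is None:
        return None  # the starting header is not in the document
    # Look for the next opening <h1> after the header we matched.
    nxt = re.search(r"<h1\b", html[start.end():])
    # No later header: take everything to the end of the input.
    end = start.end() + nxt.start() if nxt else len(html)
    return html[start.end():end]


# Sample markup modelled on the question (illustrative only).
sample = "<h1>Title</h1>\n<p>hi</p>\ntagless text\n<h1> Next Title</h1>"
fragment = html_between_h1(sample, "Title")
print(fragment)
```

The returned string can then be passed to `BeautifulSoup(fragment, "html.parser")` if a parsed object is needed.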
孤独总比滥情好

2020-12-25 13:31
Interesting question. You can't select this with the DOM alone. You'll have to loop through all the elements up to and including the first h1 and put them into intro = str(intro), then get everything up to the second h1 into chapter1. Then remove the intro from chapter1 using:
```
chapter = chapter1.replace(intro, '')
```
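The steps described above might look roughly like the sketch below. It assumes both h1 tags are direct children of the same container; the helper `serialize_up_to_h1`, the sample markup, and the choice of `html.parser` are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup, Tag


def serialize_up_to_h1(container, n, inclusive):
    """Join the children of `container` up to the n-th <h1>.

    Hypothetical helper. With inclusive=True the n-th <h1> itself
    is part of the result; otherwise the result stops just before it.
    """
    parts = []
    seen = 0
    for child in container.children:
        if isinstance(child, Tag) and child.name == "h1":
            seen += 1
            if seen == n:
                if inclusive:
                    parts.append(str(child))
                break
        parts.append(str(child))
    return "".join(parts)


# Made-up sample document with some text before the first header.
doc = ("<div>lead-in<h1>Title</h1><p>hi</p>tagless text"
       "<h1>Next Title</h1><p>later</p></div>")
soup = BeautifulSoup(doc, "html.parser")
container = soup.div

intro = serialize_up_to_h1(container, 1, inclusive=True)      # through the first h1
chapter1 = serialize_up_to_h1(container, 2, inclusive=False)  # up to the second h1
chapter = chapter1.replace(intro, "", 1)  # drop the intro prefix once
print(chapter)
```

Passing `1` as the count to `replace` keeps a repeated intro string later in the chapter from being removed as well.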

没有蜡笔的小新

2020-12-25 13:39

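Here is a complete, up-to-date example. Note that it does not return the fragment as a string: it copies every node between the two headers and inserts the copies after the second header.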

Contents of temp.html:

```
<h1>Title</h1>
<p>hi</p>
//a random amount of p/uls or tagless text
<h1> Next Title</h1>
```

Code:

```
import copy

from bs4 import BeautifulSoup

with open("resources/temp.html") as file_in:
    soup = BeautifulSoup(file_in, "lxml")

print(f"Before:\n{soup.prettify()}")

first_header = soup.find("body").find("h1")

siblings_to_add = []

for curr_sibling in first_header.next_siblings:
    if curr_sibling.name == "h1":
        for curr_sibling_to_add in siblings_to_add:
            curr_sibling.insert_after(curr_sibling_to_add)
        break
    else:
        siblings_to_add.append(copy.copy(curr_sibling))

print(f"\nAfter:\n{soup.prettify()}")
```

Output:

```
Before:
<html>
 <body>
  <h1>
   Title
  </h1>
  <p>
   hi
  </p>
  //a random amount of p/uls or tagless text
  <h1>
   Next Title
  </h1>
 </body>
</html>

After:
<html>
 <body>
  <h1>
   Title
  </h1>
  <p>
   hi
  </p>
  //a random amount of p/uls or tagless text
  <h1>
   Next Title
  </h1>
  //a random amount of p/uls or tagless text
  <p>
   hi
  </p>
 </body>
</html>
```


时光说笑

2020-12-25 13:44
This is the clear BeautifulSoup way, when the second h1 tag is a sibling of the first:
```
# soup is an already-parsed BeautifulSoup object
html = ""
for tag in soup.find("h1").next_siblings:
    if tag.name == "h1":
        break
    else:
        html += str(tag)  # Python 3: str() replaces Python 2's unicode()
```
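For reference, here is a self-contained sketch of the sibling walk applied to the markup from the question. The function name `html_after_first_h1`, the sample string, and the use of `html.parser` are assumptions made for the example. It also returns an empty string when the document has no h1, a case the loop above does not handle.

```python
from bs4 import BeautifulSoup, Tag


def html_after_first_h1(markup):
    """Return the markup between the first <h1> and its next sibling <h1>.

    Hypothetical helper: only works when the second <h1> is a sibling
    of the first, as the answer above points out.
    """
    soup = BeautifulSoup(markup, "html.parser")
    first = soup.find("h1")
    if first is None:
        return ""  # no header to start from
    parts = []
    for node in first.next_siblings:
        # Text nodes are not Tag instances, so they are always kept.
        if isinstance(node, Tag) and node.name == "h1":
            break
        parts.append(str(node))
    return "".join(parts)


# Sample markup modelled on the question (illustrative only).
sample = "<h1>Title</h1>\n<p>hi</p>\ntagless text\n<h1> Next Title</h1>"
result = html_after_first_h1(sample)
print(result)
```

Collecting the pieces in a list and joining once avoids repeated string concatenation when there are many siblings.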