Question
I need to use Beautiful Soup to accomplish the following.
Example HTML
<div id = "div1">
Text1
<div id="div2">
Text2
<div id="div3">
Text3
</div>
</div>
</div>
I need to search this and get each div's own text back as a separate item in a list:
Text1
Text2
Text3
I tried findAll('div'), but it repeated the same text multiple times, i.e. it returned:
Text1 Text2 Text3
Text2 Text3
Text3
Answer 1:
Well, your problem is that .text also includes the text of all descendant nodes. You'll have to manually pick out only the text nodes that are immediate children of a node. Also, a node can have several text nodes as direct children, for example:
<div>
Hello
<div>
foobar
</div>
world!
</div>
How do you want them to be concatenated? Here is a function that joins them with a space:
def extract_text(node):
    return ' '.join(t.strip() for t in node(text=True, recursive=False))
With my example:
In [27]: t = """
<div>
Hello
<div>
foobar
</div>
world!
</div>"""
In [28]: soup = BeautifulSoup(t)
In [29]: map(extract_text, soup('div'))
Out[29]: [u'Hello world!', u'foobar']
And your example:
In [32]: t = """
<div id = "div1">
Text1
<div id="div2">
Text2
<div id="div3">
Text3
</div>
</div>
</div>"""
In [33]: soup = BeautifulSoup(t)
In [34]: map(extract_text, soup('div'))
Out[34]: [u'Text1 ', u'Text2 ', u'Text3']
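The trailing spaces in 'Text1 ' and 'Text2 ' come from the whitespace-only text node that follows each nested div: it strips down to an empty string, which still gets joined with a space. Here is a minimal Python 3 sketch of the same idea that drops those empty pieces. It assumes the bs4 package with the built-in "html.parser" backend, and the name extract_direct_text is just an illustrative choice, not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup


def extract_direct_text(node):
    """Join the non-empty text nodes that are immediate children of node."""
    # recursive=False restricts the search to direct children only,
    # so text belonging to nested divs is not picked up here.
    pieces = (s.strip() for s in node.find_all(string=True, recursive=False))
    # Skip whitespace-only nodes so they do not leave stray spaces behind.
    return " ".join(p for p in pieces if p)


html = """
<div id="div1">
Text1
<div id="div2">
Text2
<div id="div3">
Text3
</div>
</div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
texts = [extract_direct_text(div) for div in soup.find_all("div")]
print(texts)  # ['Text1', 'Text2', 'Text3']
```

If you would rather keep each text node as its own item instead of joining them, return the filtered pieces as a list.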
Source: https://stackoverflow.com/questions/17030605/find-all-text-within-1-level-in-html-using-beautiful-soup-python