问题
Is there a way to do a DFT on a BeautifulSoup parse tree? I'm trying to do something like starting at the root, usually , get all the child elements and then for each child element get their children, etc until I hit a terminal node at which point I'll build my way back up the tree. Problem is I can't seem to find a method that will allow me to do this. I found the findChildren method but that seems to just put the entire page in a list multiple times with each subsequent entry getting reduced. I might be able to use this to do a traversal however other than the last entry in the list it doesn't appear there is any way to identify entries as terminal nodes or not. Any ideas?
回答1:
recursiveChildGenerator()
already does that:
soup = BeautifulSoup.BeautifulSoup(html)
for child in soup.recursiveChildGenerator():
name = getattr(child, "name", None)
if name is not None:
print name
elif not child.isspace(): # leaf node, don't print spaces
print child
Output
For the html from @msalvadores's answer:
html
ul
li
Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
li
Aliquam tincidunt mauris eu risus.
li
Vestibulum auctor dapibus neque.
html
NOTE: html
is printed twice due to the example contains two opening <html>
tags.
回答2:
I think you can use the method "childGenerator" and recursively use this one to parse the tree in a DFT fashion.
def recursiveChildren(x):
if "childGenerator" in dir(x):
for child in x.childGenerator():
name = getattr(child, "name", None)
if name is not None:
print "[Container Node]",child.name
recursiveChildren(child)
else:
if not x.isspace(): #Just to avoid printing "\n" parsed from document.
print "[Terminal Node]",x
if __name__ == "__main__":
soup = BeautifulSoup(your_data)
for child in soup.childGenerator():
recursiveChildren(child)
With "childGenerator" in dir(x)
we make sure that an element is a container, terminal nodes such as NavigableStrings
are not containers and do not contain children.
For some example HTML like:
<html>
<ul>
<li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
<li>Aliquam tincidunt mauris eu risus.</li>
<li>Vestibulum auctor dapibus neque.</li>
</ul>
</html>
This scripts prints ...
[Container Node] ul
[Container Node] li
[Terminal Node] Lorem ipsum dolor sit amet, consectetuer adipiscing elit.
[Container Node] li
[Terminal Node] Aliquam tincidunt mauris eu risus.
[Container Node] li
[Terminal Node] Vestibulum auctor dapibus neque.
来源:https://stackoverflow.com/questions/4814317/depth-first-traversal-on-beautifulsoup-parse-tree