Find All text within 1 level in HTML using Beautiful Soup - Python

Question

I need to use Beautiful Soup to accomplish the following.

Example HTML

<div id = "div1">
 Text1
 <div id="div2">
   Text2
   <div id="div3">
    Text3
   </div>
 </div>
</div>

I need to search this and get each piece of text back as a separate item in a list:

Text1
Text2
Text3

I tried findAll('div'), but it repeated the same text multiple times, i.e. it returned:

Text1 Text2 Text3
Text2 Text3
Text3

Answer 1:

Well, your problem is that .text also includes the text of all descendant nodes. You'll have to manually get only those text nodes that are immediate children of a node. Also, a node might have several text nodes as immediate children, for example:

<div>
    Hello
        <div>
            foobar
        </div>
    world!
</div>

How do you want them to be concatenated? Here is a function that joins them with a space:

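def extract_text(node):
    # Calling a node is shorthand for findAll(); text=True selects text nodes
    # and recursive=False restricts the search to immediate children.
    return ' '.join(t.strip() for t in node(text=True, recursive=False))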

With my example:

In [27]: t = """
<div>
    Hello
        <div>
            foobar
        </div>
    world!
</div>"""

In [28]: soup = BeautifulSoup(t)

In [29]: map(extract_text, soup('div'))
Out[29]: [u'Hello world!', u'foobar']

And your example:

In [32]: t = """
<div id = "div1">
 Text1
 <div id="div2">
   Text2
   <div id="div3">
    Text3
   </div>
 </div>
</div>"""

In [33]: soup = BeautifulSoup(t)

In [34]: map(extract_text, soup('div'))
Out[34]: [u'Text1 ', u'Text2 ', u'Text3']
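
The sessions above are Python 2 (note the u'' prefixes and map returning a list), and the trailing spaces in the last output come from whitespace-only text nodes being joined along with the real ones. Below is a minimal sketch of the same idea for Python 3, assuming the bs4 package is installed and using the built-in "html.parser" backend. The name extract_direct_text is hypothetical, and dropping empty fragments is an addition that the original function does not make.

```python
from bs4 import BeautifulSoup


def extract_direct_text(node):
    """Join only the text nodes that are immediate children of `node`."""
    # string=True matches text nodes; recursive=False skips descendants.
    pieces = (s.strip() for s in node.find_all(string=True, recursive=False))
    # Drop whitespace-only fragments so no stray spaces end up in the result.
    return " ".join(p for p in pieces if p)


html = """
<div id="div1">
 Text1
 <div id="div2">
   Text2
   <div id="div3">
    Text3
   </div>
 </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
# find_all('div') yields the divs in document order, outermost first.
direct_texts = [extract_direct_text(div) for div in soup.find_all("div")]
print(direct_texts)  # ['Text1', 'Text2', 'Text3']
```

A list comprehension is used instead of map because map returns a lazy iterator in Python 3, so it would not print as a list.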

Source: https://stackoverflow.com/questions/17030605/find-all-text-within-1-level-in-html-using-beautiful-soup-python

Tags

python

html-parsing

beautifulsoup

findall
