BeautifulSoup `find_all` generator

情书的邮戳 2021-02-01 11:17

Is there any way to turn `find_all` into a more memory-efficient generator? For example:

Given:

soup = BeautifulSoup(content, "html.parser")

3 Answers
  •  北海茫月
    2021-02-01 11:54

    As far as I know, BeautifulSoup has no "find" generator, but we can combine SoupStrainer with the .children generator.

    Let's imagine we have this sample HTML:

    <item>Item 1</item>
    <item>Item 2</item>
    <item>Item 3</item>
    <item>Item 4</item>
    <item>Item 5</item>

    from which we need to get the text of all item nodes.

    We can use the SoupStrainer to parse only the item tags and then iterate over the .children generator and get the texts:

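    from bs4 import BeautifulSoup, SoupStrainer

    data = """
    <item>Item 1</item>
    <item>Item 2</item>
    <item>Item 3</item>
    <item>Item 4</item>
    <item>Item 5</item>
    """

    # Parse only the <item> tags, so the tree contains nothing else.
    parse_only = SoupStrainer('item')
    soup = BeautifulSoup(data, "html.parser", parse_only=parse_only)

    # .children is a generator over the top-level nodes of the reduced tree.
    for item in soup.children:
        print(item.get_text())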

    Prints:

    Item 1
    Item 2
    Item 3
    Item 4
    Item 5
    

    In other words, the idea is to cut the tree down to the desired tags and use one of the available generators, like .children. You can also use one of these generators directly and manually filter the tag by name or other criteria inside the generator body, e.g. something like:

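    from bs4 import Tag

    def generate_items(soup):
        # Walk every node under soup and yield the text of each <item> tag.
        for tag in soup.descendants:
            # Text nodes are not Tag instances, so skip them explicitly.
            if isinstance(tag, Tag) and tag.name == "item":
                yield tag.get_text()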
    

    The .descendants generator yields child elements recursively, while .children only yields the direct children of a node.
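
To make the difference between the two generators concrete, here is a minimal sketch. The markup, the `iter_tags` helper and its `recursive` parameter are all made up for illustration and are not part of BeautifulSoup. Note that the parsed tree itself still lives in memory. The generator only avoids building the full result list, and lets you stop early with `itertools.islice`.

```python
from itertools import islice

from bs4 import BeautifulSoup, Tag

# Hypothetical sample markup: two <item> tags are direct children of <root>,
# and one is nested a level deeper inside <group>.
data = "<root><item>A</item><group><item>B</item></group><item>C</item></root>"
soup = BeautifulSoup(data, "html.parser")
root = soup.find("root")


def iter_tags(node, name, recursive=True):
    """Lazily yield tags called `name` under `node` (hypothetical helper)."""
    source = node.descendants if recursive else node.children
    for element in source:
        # Text nodes are not Tag instances, so filter them out first.
        if isinstance(element, Tag) and element.name == name:
            yield element


# .descendants reaches the nested <item>; .children does not.
all_items = [tag.get_text() for tag in iter_tags(root, "item")]
direct_items = [tag.get_text() for tag in iter_tags(root, "item", recursive=False)]

# islice stops the generator after two matches instead of scanning everything.
first_two = [tag.get_text() for tag in islice(iter_tags(root, "item"), 2)]

print(all_items)     # ['A', 'B', 'C']
print(direct_items)  # ['A', 'C']
print(first_two)     # ['A', 'B']
```

If the document is large and only a few tag types matter, combining a helper like this with `parse_only=SoupStrainer(...)` should keep the tree itself small as well.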
