BeautifulSoup `find_all` generator

Asked 2021-02-01 11:17

Is there any way to turn find_all into a more memory efficient generator? For example:

Given:

soup = BeautifulSoup(content, "html.parser")
3 Answers
  • 2021-02-01 11:43

    From the documentation:

    I gave the generators PEP 8-compliant names, and transformed them into properties:

    childGenerator() -> children
    nextGenerator() -> next_elements
    nextSiblingGenerator() -> next_siblings
    previousGenerator() -> previous_elements
    previousSiblingGenerator() -> previous_siblings
    recursiveChildGenerator() -> descendants
    parentGenerator() -> parents
    

    The documentation has a chapter named Generators that covers these; it is worth reading.

    SoupStrainer parses only the selected part of the HTML, which can save memory, but it only excludes the irrelevant tags. If your HTML has thousands of tags that you do want, you will run into the same memory problem.
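    As a quick illustration of those renamed generator properties (a minimal sketch, assuming a small hypothetical HTML string):

    ```python
    from bs4 import BeautifulSoup

    html = "<div><p>one</p><p>two</p></div>"
    soup = BeautifulSoup(html, "html.parser")

    # .descendants walks the whole tree lazily, one element at a time
    # (NavigableStrings have name None, so we filter them out here)
    names = [t.name for t in soup.descendants if t.name is not None]
    print(names)  # ['div', 'p', 'p']

    # .parents walks upward from a tag to the document root
    p = soup.find("p")
    parent_names = [t.name for t in p.parents]
    print(parent_names)  # ['div', '[document]']
    ```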

  • 2021-02-01 11:48

    The simplest method is to use find_next:

    soup = BeautifulSoup(content, "html.parser")
    
    def find_iter(tagname):
        tag = soup.find(tagname)
        while tag is not None:
            yield tag
            tag = tag.find_next(tagname)
    
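    A usage sketch of the find_iter generator above (the sample HTML is hypothetical): because the tags are yielded lazily, you can stop early without building a full result list the way find_all would.

    ```python
    from itertools import islice
    from bs4 import BeautifulSoup

    content = "<ul><li>a</li><li>b</li><li>c</li></ul>"
    soup = BeautifulSoup(content, "html.parser")

    def find_iter(tagname):
        # yield matching tags one at a time, in document order
        tag = soup.find(tagname)
        while tag is not None:
            yield tag
            tag = tag.find_next(tagname)

    # Take only the first two <li> tags; the third is never visited
    first_two = [t.get_text() for t in islice(find_iter("li"), 2)]
    print(first_two)  # ['a', 'b']
    ```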
  • 2021-02-01 11:54

    As far as I know there is no "find" generator in BeautifulSoup, but we can combine SoupStrainer with the .children generator.

    Let's imagine we have this sample HTML:

    <div>
        <item>Item 1</item>
        <item>Item 2</item>
        <item>Item 3</item>
        <item>Item 4</item>
        <item>Item 5</item>
    </div>
    

    from which we need to get the text of all item nodes.

    We can use the SoupStrainer to parse only the item tags and then iterate over the .children generator and get the texts:

    from bs4 import BeautifulSoup, SoupStrainer
    
    data = """
    <div>
        <item>Item 1</item>
        <item>Item 2</item>
        <item>Item 3</item>
        <item>Item 4</item>
        <item>Item 5</item>
    </div>"""
    
    parse_only = SoupStrainer('item')
    soup = BeautifulSoup(data, "html.parser", parse_only=parse_only)
    for item in soup.children:
        print(item.get_text())
    

    Prints:

    Item 1
    Item 2
    Item 3
    Item 4
    Item 5
    

    In other words, the idea is to cut the tree down to the desired tags and use one of the available generators, like .children. You can also use one of these generators directly and filter tags by name or other criteria inside the generator body, e.g. something like:

    def generate_items(soup):
        for tag in soup.descendants:
            if tag.name == "item":
                yield tag.get_text()
    

    .descendants generates descendant elements recursively, while .children only yields the direct children of a node.
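    Putting the generator together with the earlier sample data gives a minimal runnable sketch (NavigableStrings in .descendants have name None, so the name check skips them automatically):

    ```python
    from bs4 import BeautifulSoup

    data = """
    <div>
        <item>Item 1</item>
        <item>Item 2</item>
    </div>"""

    def generate_items(soup):
        # walk the whole tree lazily and yield text of <item> tags only
        for tag in soup.descendants:
            if tag.name == "item":
                yield tag.get_text()

    soup = BeautifulSoup(data, "html.parser")
    items = list(generate_items(soup))
    print(items)  # ['Item 1', 'Item 2']
    ```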
