BeautifulSoup - combine consecutive tags


I have to work with the messiest HTML where individual words are split into separate tags, like in the following example:


                      
2 answers

有刺的猬 (2021-01-19 14:18)

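Perhaps you could check whether b.previous_sibling is a b tag and, if it is, append the inner text of the current node to it. After that, you should be able to remove the current node from the tree with b.decompose(). Note that whitespace between two tags is itself a text node, so previous_sibling may be a string rather than the preceding tag.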
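
A minimal sketch of that idea follows. It is not the answerer's code: the sample HTML and the helper name previous_tag_if_adjacent are made up for illustration, and it uses the built-in html.parser instead of lxml so that it runs without extra dependencies. It assumes the b tags are not nested inside one another.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical sample: two adjacent <b> tags (separated only by a newline),
# then real text, then another <b> that must NOT be merged.
html = (
    '<p>'
    '<b><span>I</span></b>\n'
    '<b><span>NTRODUCTION</span></b>'
    ' plain text '
    '<b>Next</b>'
    '</p>'
)

soup = BeautifulSoup(html, "html.parser")


def previous_tag_if_adjacent(tag):
    """Return the previous sibling tag, skipping whitespace-only text nodes.

    Returns None when real text (or nothing at all) precedes the tag.
    """
    sibling = tag.previous_sibling
    while isinstance(sibling, NavigableString) and not sibling.strip():
        sibling = sibling.previous_sibling
    return sibling if isinstance(sibling, Tag) else None


# find_all returns a materialized list, so removing tags while looping is safe.
for b in soup.find_all("b"):
    prev = previous_tag_if_adjacent(b)
    if prev is not None and prev.name == "b":
        prev.append(b.get_text(strip=True))  # move the text into the earlier <b>
        b.decompose()                        # then drop the now-redundant <b>

print([b.get_text() for b in soup.find_all("b")])  # ['INTRODUCTION', 'Next']
```

In this sketch the merged text is appended directly to the earlier b tag, after its span, so any styling carried by the removed tag's span is lost. If the styling matters, the text could be appended to prev.span instead.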

我在风中等你 (2021-01-19 14:19)

The solution below combines text from all the selected <b> tags into one <b> of your choice and decomposes the others.

If you only want to merge the text from consecutive tags, follow Danny's approach.

Code:

```python
from bs4 import BeautifulSoup

html = '''
<div id="wrapper">
  <b style="mso-bidi-font-weight:normal">
    <span style='font-size:14.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>I</span>
  </b>
  <b style="mso-bidi-font-weight:normal">
    <span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>NTRODUCTION</span>
  </b>
</div>
'''

soup = BeautifulSoup(html, 'lxml')
container = soup.select_one('#wrapper')  # it contains b tags to combine
b_tags = container.find_all('b')

# combine all the text from b tags
text = ''.join(b.get_text(strip=True) for b in b_tags)

# here you choose a tag you want to preserve and update its text
b_main = b_tags[0]  # you can target it however you want, I just take the first one from the list
b_main.span.string = text  # replace the text

for tag in b_tags:
    if tag is not b_main:
        tag.decompose()

print(soup)
```


Any comments appreciated.