BeautifulSoup - combine consecutive tags

孤独总比滥情好 2021-01-19 14:11

I have to work with the messiest HTML where individual words are split into separate tags, like in the following example:



        
2 Answers
  • 2021-01-19 14:18

    Perhaps you could check whether b.previous_sibling is a <b> tag and, if so, append the inner text of the current node to it. Keep in mind that the previous sibling may be a whitespace text node rather than a tag. After that, you should be able to remove the current node from the tree with b.decompose().
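
The idea above could be sketched roughly as follows. This is only a minimal illustration, not a tested solution for the asker's real documents: the helper names previous_tag and merge_consecutive, the sample markup, and the choice of html.parser are all assumptions, and the merge keeps only the text, so any nested <span> styling is dropped.

```python
from bs4 import BeautifulSoup, NavigableString, Tag


def previous_tag(node):
    """Return the nearest previous sibling that is a Tag, skipping whitespace-only text."""
    sibling = node.previous_sibling
    while isinstance(sibling, NavigableString) and not sibling.strip():
        sibling = sibling.previous_sibling
    return sibling if isinstance(sibling, Tag) else None


def merge_consecutive(soup, name="b"):
    """Fold each <name> tag into the <name> tag directly before it, if there is one."""
    for tag in soup.find_all(name):
        prev = previous_tag(tag)
        if prev is not None and prev.name == name:
            # Assumption: only the text matters, so nested markup is flattened.
            prev.string = prev.get_text(strip=True) + tag.get_text(strip=True)
            tag.decompose()


# Hypothetical sample: two adjacent <b> tags, then a third separated by real text.
sample = "<p><b><span>I</span></b>\n<b><span>NTRODUCTION</span></b> to <b>X</b></p>"
soup = BeautifulSoup(sample, "html.parser")
merge_consecutive(soup)
print(soup)
```

Because non-whitespace text between two tags stops the sibling search, only tags that are truly adjacent get merged; the last <b> in the sample is left alone.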

  • 2021-01-19 14:19

    The solution below combines text from all the selected <b> tags into one <b> of your choice and decomposes the others.

    If you only want to merge the text of consecutive tags, follow Danny's approach.

    Code:

    from bs4 import BeautifulSoup
    
    html = '''
    <div id="wrapper">
      <b style="mso-bidi-font-weight:normal">
        <span style='font-size:14.0pt;mso-bidi-font-size:11.0pt;line-height:107%;font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>I</span>
      </b>
      <b style="mso-bidi-font-weight:normal">
        <span style='font-family:"Times New Roman",serif;mso-fareast-font-family:"Times New Roman"'>NTRODUCTION</span>
      </b>
    </div>
    '''
    
    soup = BeautifulSoup(html, 'lxml')
    container = soup.select_one('#wrapper')  # it contains b tags to combine
    b_tags = container.find_all('b')
    
    # combine all the text from b tags
    text = ''.join(b.get_text(strip=True) for b in b_tags)
    
    # choose the tag to preserve and update its text
    b_main = b_tags[0]  # any of the selected tags would do; here the first one is kept
    b_main.span.string = text  # replace the text (this relies on the kept <b> containing a <span>)
    
    for tag in b_tags:
        if tag is not b_main:
            tag.decompose()
    
    print(soup)
    

    Any comments appreciated.
