BeautifulSoup4 stripped_strings gives me byte objects?

隐瞒了意图╮ 2021-01-21 15:16

I'm trying to get the text out of a blockquote which looks like this:

01 Oyasumi

1 answer
  • 2021-01-21 16:10

    The problem is that, when from_encoding is not specified, Beautiful Soup guesses the document's original encoding with a sub-library called Unicode, Dammit, and then converts the document to Unicode. Here the guess (windows-1252) is wrong. More info in the Encodings section of the documentation.

    >>> from bs4 import BeautifulSoup
    >>> doc = '''<blockquote class="postcontent restore ">
    ...     01 Oyasumi
    ...     <br></br>
    ...     02 DanSin'
    ...     <br></br>
    ...     03 w.t.s.
    ...     <br></br>
    ...     04 Lovism
    ...     <br></br>
    ...     05 NoName
    ...     <br></br>
    ...     06 Gakkou
    ...     <br></br>
    ...     07 Happy☆Day
    ...     <br></br>
    ...     08 Endless End.
    ... </blockquote>'''
    >>> soup = BeautifulSoup(doc, 'html5lib')
    >>> soup.original_encoding 
    u'windows-1252'
    >>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
    >>> for line in content:
    ...     print(line)
    ... 
    01 Oyasumi
    02 DanSin'
    03 w.t.s.
    04 Lovism
    05 NoName
    06 Gakkou
    07 Happy☆Day
    08 Endless End.
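
To see why a wrong guess such as windows-1252 matters, here is a minimal standard-library sketch (it does not use Beautiful Soup at all; the variable names are only illustrative). It decodes the same UTF-8 bytes with both encodings:

```python
# One line from the blockquote, encoded the way a UTF-8 page would send it.
text = "07 Happy☆Day"
raw = text.encode("utf-8")

# Decoding with the wrong guess succeeds silently but garbles the star,
# because its three UTF-8 bytes are read as three windows-1252 characters.
garbled = raw.decode("windows-1252")

# Decoding with the right encoding round-trips the original text.
fixed = raw.decode("utf-8")

print(repr(garbled))
print(repr(fixed))
```

Only the non-ASCII character is affected, which is why a wrong guess can go unnoticed on mostly-ASCII pages.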
    

    To fix this you have two options:

    1. Pass the correct from_encoding parameter, or exclude the wrong encoding that Unicode, Dammit is guessing. One problem is that not all parsers support the exclude_encodings argument. For example, the html5lib tree builder doesn't support exclude_encodings.

      >>> soup = BeautifulSoup(doc, 'html5lib', from_encoding='utf-8')
      >>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
      >>> for line in content:
      ...     print(line)
      ... 
      01 Oyasumi
      02 DanSin'
      03 w.t.s.
      04 Lovism
      05 NoName
      06 Gakkou
      07 Happy☆Day
      08 Endless End.
      
    2. Use the lxml Parser

      >>> soup = BeautifulSoup(doc, 'lxml')
      >>> soup.original_encoding
      'utf-8'
      >>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
      >>> for line in content:
      ...     print(line)
      ... 
      01 Oyasumi
      02 DanSin'
      03 w.t.s.
      04 Lovism
      05 NoName
      06 Gakkou
      07 Happy☆Day
      08 Endless End.
      