I'm trying to get the text out of a blockquote which looks like this:
01 Oyasumi
The problem is that, when from_encoding is not specified, Beautiful Soup converts the document to Unicode by guessing its original encoding with a sub-library called Unicode, Dammit, and here the guess is wrong. More info is in the Encodings section of the documentation.
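To see what a wrong guess does to the text, the mismatch can be reproduced with plain Python and no parser at all. This is only a sketch of the effect, assuming the page is really UTF-8 and gets decoded as windows-1252; the variable names are made up for illustration.

```python
# Hypothetical sample line taken from the blockquote in the question.
title = "07 Happy☆Day"

# Assumption: the server sends the page as UTF-8 bytes.
raw = title.encode("utf-8")

# Decoding with the wrongly guessed encoding turns the star into mojibake.
wrong = raw.decode("windows-1252")

# Decoding with the real encoding round-trips cleanly.
right = raw.decode("utf-8")

print(wrong)
print(right)
```

The three UTF-8 bytes of the star are each read as a separate windows-1252 character, which is why a single symbol turns into several.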
>>> from bs4 import BeautifulSoup
>>> doc = '''<blockquote class="postcontent restore ">
... 01 Oyasumi
... <br></br>
... 02 DanSin'
... <br></br>
... 03 w.t.s.
... <br></br>
... 04 Lovism
... <br></br>
... 05 NoName
... <br></br>
... 06 Gakkou
... <br></br>
... 07 Happy☆Day
... <br></br>
... 08 Endless End.
... </blockquote>'''
>>> soup = BeautifulSoup(doc, 'html5lib')
>>> soup.original_encoding
u'windows-1252'
>>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
>>> for line in content:
...     print(line)
...
01 Oyasumi
02 DanSin'
03 w.t.s.
04 Lovism
05 NoName
06 Gakkou
07 Happy☆Day
08 Endless End.
To fix this you have two options:
Pass in the correct from_encoding parameter, or exclude the wrong encoding that Unicode, Dammit is guessing. One problem is that not all parsers support the exclude_encodings argument. For example, the html5lib tree builder doesn't support exclude_encodings.
>>> soup = BeautifulSoup(doc, 'html5lib', from_encoding='utf-8')
>>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
>>> for line in content:
...     print(line)
...
01 Oyasumi
02 DanSin'
03 w.t.s.
04 Lovism
05 NoName
06 Gakkou
07 Happy☆Day
08 Endless End.
>>>
Or use the lxml parser:
>>> soup = BeautifulSoup(doc, 'lxml')
>>> soup.original_encoding
'utf-8'
>>> content = soup.find("blockquote", {"class": "postcontent restore "}).stripped_strings
>>> for line in content:
...     print(line)
...
01 Oyasumi
02 DanSin'
03 w.t.s.
04 Lovism
05 NoName
06 Gakkou
07 Happy☆Day
08 Endless End.
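One thing worth keeping in mind: encoding detection only runs when the markup is handed over as bytes, since a str is already Unicode. The following is a rough sketch of both encoding arguments on bytes input, assuming bs4 is installed and using the standard library's html.parser so no extra parser is needed. The shortened markup and the names raw, soup and soup2 are illustrative, not from the original post.

```python
from bs4 import BeautifulSoup

# Assumption: the page arrives as raw UTF-8 bytes (e.g. read from a socket or file).
raw = (
    '<blockquote class="postcontent restore ">'
    "06 Gakkou<br>07 Happy☆Day"
    "</blockquote>"
).encode("utf-8")

# Option A: state the encoding outright with from_encoding.
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
lines = list(soup.find("blockquote").stripped_strings)
print(soup.original_encoding)
print(lines)

# Option B: keep the guessing, but rule out the encoding known to be wrong.
# html.parser accepts exclude_encodings; what gets detected instead can depend
# on whether an optional detector such as chardet is installed.
soup2 = BeautifulSoup(raw, "html.parser", exclude_encodings=["windows-1252"])
print(soup2.original_encoding)
```

from_encoding is the more predictable of the two when the real encoding is known, because it is tried before any guessing happens.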