I'm using BeautifulSoup to parse some web pages.
Occasionally I run into a "unicode hell" error like the following:
Looking at the source of this article
You aren't encountering a problem. Everything is behaving as intended.
&nbsp; indicates a non-breaking space character. It isn't replaced with a space because it doesn't represent a space; it represents a non-breaking space. Replacing it with a space would lose information: namely, that a text rendering engine shouldn't put a line break where that space occurs.
The Unicode code point for non-breaking space is U+00A0, which is written in a Unicode string in Python as \xa0.
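As a small illustration of that mapping, the sketch below decodes the entity with the standard library's html.unescape instead of BeautifulSoup (an assumption made only to keep the example dependency-free; Python 3 is also assumed) and inspects the resulting character. The sample string is hypothetical.

```python
import html
import unicodedata

# Hypothetical input: two letters separated by the &nbsp; entity.
decoded = html.unescape("a&nbsp;b")
nbsp = decoded[1]

print(repr(nbsp))              # '\xa0'
print(hex(ord(nbsp)))          # 0xa0
print(unicodedata.name(nbsp))  # NO-BREAK SPACE
print(nbsp == " ")             # False: it is not an ordinary space
```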
The UTF-8 encoding of U+00A0 is, in hexadecimal, the two-byte sequence C2 A0, or, written in a Python string representation, \xc2\xa0. In UTF-8, anything beyond the 7-bit ASCII set needs two or more bytes to represent it. In this case, the highest bit set is the eighth bit, which means the code point can be represented by the two-byte sequence (in binary) 110xxxxx 10xxxxxx, where the x's are the bits of the binary representation of the code point, padded on the left with zeros to 11 bits. In the case of A0, that is 10100000, or, when encoded in UTF-8, 11000010 10100000, which is C2 A0.
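The bit-packing for a two-byte UTF-8 sequence can be checked by hand. This is only a sketch for this one code point, assuming Python 3; the variable names are illustrative, and it compares the manual result against Python's own encoder.

```python
code_point = 0xA0  # U+00A0, binary 10100000

# Two-byte UTF-8 layout: 110xxxxx 10xxxxxx.
# The first byte carries the upper 5 payload bits, the second the lower 6.
first = 0b11000000 | (code_point >> 6)
second = 0b10000000 | (code_point & 0b00111111)
manual = bytes([first, second])

print(format(first, "08b"), format(second, "08b"))  # 11000010 10100000
print(manual.hex())                                  # c2a0
print(manual == "\xa0".encode("utf-8"))              # True
```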
Many people use &nbsp; in HTML to get spaces which aren't collapsed by the usual HTML whitespace collapsing rules (in HTML, all runs of consecutive spaces, tabs, and newlines get interpreted as a single space unless one of the CSS white-space rules is applied), but that's not really what non-breaking spaces are intended for; they are supposed to be used for things like names, such as "Mr. Miyagi", where you don't want a line break between the "Mr." and "Miyagi". I'm not sure why it was used in this particular case; it seems out of place here, but that's a problem with your source, not with the code that interprets it.
Now, if you don't really care about layout, so you don't mind whether or not text layout algorithms choose that as a place to wrap, and would like to interpret this merely as a regular space, normalizing using NFKD is a perfectly reasonable answer (or NFKC if you prefer pre-composed accents to decomposed accents). The NFKC and NFKD normalizations replace compatibility characters, which carry essentially the same semantic value in most contexts, with their plain equivalents. For instance, ligatures are expanded out (ﬃ -> ffi), archaic long s characters are converted into s (ſ -> s), Roman numeral characters are expanded into their individual letters (Ⅳ -> IV), and a non-breaking space is converted into a normal space. For some characters, NFKC or NFKD normalization may lose information that is important in some contexts: ℌ and ℍ will both normalize to H, but in mathematical texts they can be used to refer to different things.
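A minimal sketch of that normalization, assuming Python 3 and the standard unicodedata module; the sample strings are made up for illustration and use escapes so the characters are unambiguous.

```python
import unicodedata

text = "Mr.\xa0Miyagi"  # hypothetical text containing U+00A0
normalized = unicodedata.normalize("NFKD", text)
print(repr(normalized))  # 'Mr. Miyagi'

# The other compatibility mappings mentioned above.
samples = {
    "\ufb03": "ffi",  # LATIN SMALL LIGATURE FFI
    "\u017f": "s",    # LATIN SMALL LETTER LONG S
    "\u2163": "IV",   # ROMAN NUMERAL FOUR
    "\u210c": "H",    # BLACK-LETTER CAPITAL H
    "\u210d": "H",    # DOUBLE-STRUCK CAPITAL H
}
for char, expected in samples.items():
    assert unicodedata.normalize("NFKD", char) == expected
```

If only the non-breaking space matters and everything else should be left untouched, a narrower alternative is text.replace("\xa0", " ").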