I have the following code:
import unicodedata
my_var = "this is a string"
my_var2 = " Esta es una oración que está en español "
my_var3 = unicodedata.nor
In Python 3, string.encode() creates a byte string, which cannot be mixed with a regular string. You have to convert the result back to a string; the method is, predictably, called decode.
my_var3 = unicodedata.normalize('NFKD', my_var2).encode('ascii', 'ignore').decode('ascii')
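To make the bytes/str round trip explicit, here is a sketch that splits the one-liner above into steps so you can see the types involved (the intermediate variable name is mine, not from your code):

```python
import unicodedata

my_var2 = " Esta es una oración que está en español "

# encode() produces bytes, not str; 'ignore' drops anything outside 7-bit ASCII
as_bytes = unicodedata.normalize('NFKD', my_var2).encode('ascii', 'ignore')
print(type(as_bytes))   # <class 'bytes'>

# decode() turns the bytes back into a regular string
my_var3 = as_bytes.decode('ascii')
print(my_var3)          #  Esta es una oracion que esta en espanol
```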
In Python 2, there was no hard distinction between Unicode strings and "regular" (byte) strings, but that meant many hard-to-catch bugs were introduced when programmers had careless assumptions about the encoding of strings they were manipulating.
As for what the normalization does, it makes sure characters which look identical actually are identical. For example, ñ can be represented either as the single code point U+00F1 LATIN SMALL LETTER N WITH TILDE or as the combining sequence U+006E LATIN SMALL LETTER N followed by U+0303 COMBINING TILDE. Normalization coerces every such variation into the same representation (the D forms prefer the decomposed, combining sequence), so that strings which represent the same text are also guaranteed to contain exactly the same code points.
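You can see the effect directly: the two spellings of ñ compare unequal as raw code points, but equal after normalizing both to the same form.

```python
import unicodedata

precomposed = "\u00f1"   # ñ as a single code point
combining = "n\u0303"    # n followed by COMBINING TILDE

print(precomposed == combining)   # False: different code point sequences

# After NFKD, both become the decomposed sequence and compare equal
print(unicodedata.normalize('NFKD', precomposed)
      == unicodedata.normalize('NFKD', combining))   # True
```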
Because the decomposed forms in many Latin-based languages are a plain ASCII character followed by one or more combining diacritics (which are not ASCII), converting the string to 7-bit ASCII with the 'ignore' error handler will often strip the accents but leave the text almost readable: Götterdämmerung becomes Gotterdammerung, and so on.
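Note the "almost": characters that have no decomposition into an ASCII base letter, such as ß, are simply dropped by 'ignore'. A small sketch (the helper name to_ascii is mine, for illustration):

```python
import unicodedata

def to_ascii(text: str) -> str:
    # Decompose, then drop anything outside 7-bit ASCII
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(to_ascii("Götterdämmerung"))  # Gotterdammerung
print(to_ascii("Straße"))           # Strae -- ß has no decomposition, so it vanishes
```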