Are there any standalone-ish solutions for normalizing international Unicode text to safe ids and filenames in Python?
E.g. turn 'My International Text: åäö' into a safe ASCII id or filename.
I'll throw my own (partial) solution here too:
    import unicodedata

    def deaccent(some_unicode_string):
        return u''.join(c for c in unicodedata.normalize('NFD', some_unicode_string)
                        if unicodedata.category(c) != 'Mn')
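For example, applied to the question's sample string:

    >>> deaccent(u'My International Text: åäö')
    u'My International Text: aao'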
This does not do all you want, but it wraps a few nice tricks up in a convenience method. unicodedata.normalize('NFD', some_unicode_string) performs a canonical decomposition of the Unicode characters; for example, it breaks 'ä' into the two code points U+0061 (the plain letter 'a') and U+0308 (the combining diaeresis).
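You can see the decomposition directly in the interpreter:

    >>> import unicodedata
    >>> [hex(ord(c)) for c in unicodedata.normalize('NFD', u'ä')]
    ['0x61', '0x308']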
The other method, unicodedata.category(char), returns the Unicode character category for that particular char. Category Mn ("Mark, nonspacing") contains all combining accents, so deaccent removes all accents from the words.
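For instance, the combining diaeresis produced by the decomposition above falls into Mn, while the base letter does not:

    >>> unicodedata.category(u'\u0308')
    'Mn'
    >>> unicodedata.category(u'a')
    'Ll'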
But note that this is only a partial solution: it gets rid of accents, but you still need some sort of whitelist of characters you want to allow after this.
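As a rough sketch of that whitelist step (the slugify name, the ASCII-only policy, and the hyphen separator are my own assumptions here, not anything prescribed above), you could combine deaccent with a regular expression:

    import re
    import unicodedata

    def deaccent(some_unicode_string):
        # Same helper as above: drop combining marks after NFD decomposition.
        return u''.join(c for c in unicodedata.normalize('NFD', some_unicode_string)
                        if unicodedata.category(c) != 'Mn')

    def slugify(text):
        # Assumed policy: keep only ASCII letters and digits, collapse
        # everything else into single hyphens, and lowercase the result.
        text = deaccent(text)
        text = re.sub(r'[^A-Za-z0-9]+', '-', text)
        return text.strip('-').lower()

which would turn the question's example into:

    >>> slugify(u'My International Text: åäö')
    u'my-international-text-aao'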