Normalizing unicode text to filenames, etc. in Python

情书的邮戳 2021-02-01 05:42

Are there any standalone-ish solutions for normalizing international Unicode text to safe ids and filenames in Python?

E.g. turn My International Text: åäö

5 Answers
  •  被撕碎了的回忆
    2021-02-01 06:24

    I'll throw my own (partial) solution here too:

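    import unicodedata

    def deaccent(some_unicode_string):
        # NFD splits each accented character into a base character plus
        # combining marks; dropping category 'Mn' removes those marks.
        return ''.join(c for c in unicodedata.normalize('NFD', some_unicode_string)
                       if unicodedata.category(c) != 'Mn')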

    This does not do everything you want, but it wraps a couple of useful tricks in a convenience function. unicodedata.normalize('NFD', some_unicode_string) decomposes Unicode characters; for example, it breaks 'ä' into two code points, U+0061 (the base letter a) and U+0308 (combining diaeresis).
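    To see the decomposition for yourself, here is a small check using only the standard library. The variable names are mine, chosen for illustration.

```python
import unicodedata

# Decompose the precomposed character 'ä' (U+00E4) with NFD.
decomposed = unicodedata.normalize('NFD', '\u00e4')

# Format each resulting code point as U+XXXX.
code_points = ['U+%04X' % ord(c) for c in decomposed]
print(code_points)  # ['U+0061', 'U+0308']
```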

    The other function, unicodedata.category(char), returns the Unicode character category of that particular char. Category Mn (nonspacing marks) contains the combining accents, so deaccent removes all accents from the words.
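    A rough illustration of the categories involved. The helper nfd_categories is a hypothetical name, not part of unicodedata. It also shows one limit of the approach: letters such as 'ø' have no canonical decomposition, so there is no Mn mark to strip and they pass through unchanged.

```python
import unicodedata

def nfd_categories(text):
    """Return the Unicode category of each code point in the NFD form of text."""
    return [unicodedata.category(c) for c in unicodedata.normalize('NFD', text)]

# 'ä' becomes a lowercase letter (Ll) followed by a nonspacing mark (Mn).
print(nfd_categories('\u00e4'))  # ['Ll', 'Mn']

# 'ø' (U+00F8) does not decompose, so it stays a single lowercase letter.
print(nfd_categories('\u00f8'))  # ['Ll']
```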

    Note, though, that this is only a partial solution: it gets rid of accents. You still need some sort of whitelist of the characters you want to allow after this.
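    One possible way to add that whitelist, as a minimal sketch rather than a complete answer: strip the accents as above, lowercase, and collapse every run of characters outside a-z and 0-9 into a single hyphen. The function name slugify and the choice of allowed characters are my own assumptions; adjust them to whatever your ids or filenames need.

```python
import re
import unicodedata

def slugify(text):
    """Hypothetical helper: turn text into a lowercase ASCII slug."""
    # Step 1: strip accents by decomposing and dropping nonspacing marks.
    decomposed = unicodedata.normalize('NFD', text)
    stripped = ''.join(c for c in decomposed
                       if unicodedata.category(c) != 'Mn')
    # Step 2: whitelist. Anything that is not a-z or 0-9 becomes a hyphen,
    # and consecutive disallowed characters collapse into one hyphen.
    slug = re.sub(r'[^a-z0-9]+', '-', stripped.lower())
    # Step 3: drop hyphens left over at either end.
    return slug.strip('-')

print(slugify('My International Text: \u00e5\u00e4\u00f6'))
# my-international-text-aao
```

    Characters that do not decompose are simply replaced by the whitelist step, so 'smørbrød' comes out as 'sm-rbr-d'. If you want 'ø' mapped to 'o', you would need an explicit transliteration table on top of this.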
