What is the best way to remove accents (normalize) in a Python unicode string?

感情败类 2020-11-21 06:11

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the web an elegant way to do this (in Java):

  1. conve
8 answers
  •  一整个雨季
    2020-11-21 06:40

    This handles not only accents, but also "strokes" (as in ø etc.):

    import unicodedata as ud
    
    def rmdiacritics(char):
        '''
        Return the base character of char, by "removing" any
        diacritics like accents or curls and strokes and the like.
        '''
        desc = ud.name(char, '')  # '' default: unnamed characters (e.g. '\n') would otherwise raise ValueError
        cutoff = desc.find(' WITH ')
        if cutoff != -1:
            desc = desc[:cutoff]
            try:
                char = ud.lookup(desc)
            except KeyError:
                pass  # removing "WITH ..." produced an invalid name
        return char
    

    This is the most elegant way I can think of (alexis has also mentioned it in a comment on this page), although I don't think it is really elegant. In fact it is more of a hack, as pointed out in the comments: Unicode names are just names, and nothing guarantees that they are consistent.
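
    The function above works on one character at a time, so a whole string has to be processed character by character. A minimal sketch of that, under a few assumptions: `strip_diacritics` is a hypothetical helper name that is not part of the original answer, the input is normalized to NFC first so that letters arrive in precomposed form (combining marks on their own have no ' WITH ' in their name), and `rmdiacritics` is repeated here, with a '' default passed to `ud.name`, only so that the block runs on its own.

```python
import unicodedata as ud


def rmdiacritics(char):
    """Name-based approach from the answer: cut the Unicode name at ' WITH '."""
    # The '' default is an assumption added here so that unnamed
    # characters (such as '\n') pass through instead of raising ValueError.
    desc = ud.name(char, '')
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        try:
            char = ud.lookup(desc[:cutoff])
        except KeyError:
            pass  # the shortened name does not exist; keep the character
    return char


def strip_diacritics(text):
    """Hypothetical helper: apply rmdiacritics to every character of text."""
    # NFC composes 'e' + COMBINING ACUTE ACCENT into a single 'é',
    # whose name does contain ' WITH '.
    composed = ud.normalize('NFC', text)
    return ''.join(rmdiacritics(c) for c in composed)


print(strip_diacritics('Søren Ångström, café'))  # prints: Soren Angstrom, cafe
```

    Normalizing first is a design choice rather than a requirement of the approach; without it, text stored in decomposed form would keep its combining marks.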

    Some special letters are still not handled by this, such as turned and inverted letters, because their Unicode names do not contain 'WITH'. Whether that matters depends on what you want to do. I have sometimes needed accent stripping to achieve dictionary sort order.
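
    To illustrate the dictionary-sort use case, here is a rough sketch of a sort key built on the same name-based stripping. `sort_key` is a hypothetical name, the case folding and the tie-break on the original word are assumptions about what "dictionary order" should mean, and real locale-aware collation has rules this does not cover. `rmdiacritics` is repeated so the block is self-contained.

```python
import unicodedata as ud


def rmdiacritics(char):
    """Name-based approach from the answer: cut the Unicode name at ' WITH '."""
    desc = ud.name(char, '')  # '' default (an assumption) for unnamed characters
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        try:
            char = ud.lookup(desc[:cutoff])
        except KeyError:
            pass  # the shortened name does not exist; keep the character
    return char


def sort_key(word):
    """Hypothetical sort key: compare base letters case-insensitively."""
    base = ''.join(rmdiacritics(c) for c in ud.normalize('NFC', word))
    # The original word is a tie-breaker so the ordering stays deterministic.
    return (base.casefold(), word)


words = ['zebra', 'Émile', 'eagle', 'øre', 'apple']
print(sorted(words))                # code-point order puts 'Émile' and 'øre' last
print(sorted(words, key=sort_key))  # ['apple', 'eagle', 'Émile', 'øre', 'zebra']

# A turned letter has no ' WITH ' in its name, so it is returned unchanged.
print(rmdiacritics('\u01dd') == '\u01dd')  # True (LATIN SMALL LETTER TURNED E)
```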

    EDIT NOTE:

    Incorporated suggestions from the comments (handling lookup errors, Python 3 code).
