What is the best way to remove accents (normalize) in a Python unicode string?

感情败类 2020-11-21 06:11

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the web an elegant way to do this (in Java):

  1. conve
8 Answers
  •  梦毁少年i
    2020-11-21 06:56

    I just found this answer on the Web:

    import unicodedata
    
    def remove_accents(input_str):
        nfkd_form = unicodedata.normalize('NFKD', input_str)
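        # encode() returns bytes; decode again so the function also returns text in Python 3
        only_ascii = nfkd_form.encode('ASCII', 'ignore').decode('ASCII')
        return only_ascii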
    

    It works fine (for French, for example), but the second step (removing the accents) could be handled better than by dropping every non-ASCII character: that fails for languages written in a non-Latin script (Greek, for example), where the base letters themselves are discarded. The best solution would probably be to explicitly remove only the Unicode characters that are tagged as diacritics.
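
    To make the limitation concrete, here is a small sketch of the ASCII-only approach applied to French and Greek input. The helper name ascii_only is hypothetical and used only for this illustration; it decodes the result back to text so the comparison is between strings.

```python
import unicodedata

def ascii_only(text):
    # Hypothetical helper: decompose, then drop everything that is not ASCII.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

french = ascii_only("café")     # the base letter "e" is ASCII and survives
greek = ascii_only("Ελλάδα")    # every Greek base letter is non-ASCII and is dropped

print(repr(french))
print(repr(greek))
```

    The French word keeps its base letters, while the Greek word comes back empty because none of its letters are ASCII.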

    Edit: this does the trick:

    import unicodedata
    
    def remove_accents(input_str):
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
    

    unicodedata.combining(c) returns the canonical combining class of c as an integer: 0 if c is not a combining character, and a nonzero (truthy) value if c combines with the preceding character, which mainly means it is a diacritic.
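
    As a rough illustration, the sketch below inspects the values unicodedata.combining returns and applies the combining-based filter to a few sample strings. The function name strip_combining and the sample words are assumptions chosen for this example.

```python
import unicodedata

def strip_combining(text):
    # Decompose, then keep only characters whose combining class is 0.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

acute_class = unicodedata.combining("\u0301")  # COMBINING ACUTE ACCENT
plain_class = unicodedata.combining("e")       # ordinary letter

french = strip_combining("café")
greek = strip_combining("Ελλάδα")
polish = strip_combining("Łódź")

print(acute_class, plain_class, french, greek, polish)
```

    Note that a letter such as "Ł" has no decomposition into a base letter plus a combining mark, so it passes through unchanged; only characters that NFKD actually splits lose their accents this way.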

    Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

    encoding = "utf-8"  # or iso-8859-15, or cp1252, or whatever encoding you use
    # Python 3 bytes literals may only contain ASCII, so the "é" is written as its UTF-8 bytes.
    byte_string = b"caf\xc3\xa9"  # or simply "café" before Python 3
    unicode_string = byte_string.decode(encoding)
    
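    Putting the two steps together, a minimal sketch might decode the bytes first and then strip the combining marks. The wrapper name remove_accents_from_bytes and its encoding parameter are hypothetical, and the caller is assumed to know the real encoding of the data.

```python
import unicodedata

def remove_accents_from_bytes(raw, encoding="utf-8"):
    # Hypothetical wrapper: bytes -> text -> text without combining marks.
    text = raw.decode(encoding)
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

from_utf8 = remove_accents_from_bytes(b"caf\xc3\xa9")
from_latin9 = remove_accents_from_bytes(b"caf\xe9", encoding="iso-8859-15")

print(from_utf8, from_latin9)
```

    Decoding with the wrong encoding either raises UnicodeDecodeError or silently produces different characters, so the encoding argument has to match the source of the bytes.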
