What is the best way to remove accents (normalize) in a Python unicode string?

感情败类 2020-11-21 06:11

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the web an elegant way to do this (in Java):

  1. conve
8 Answers
  • 2020-11-21 06:56

    I just found this answer on the Web:

    import unicodedata
    
    def remove_accents(input_str):
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        # encode() returns bytes on Python 3, so decode to get a text string back
        only_ascii = nfkd_form.encode('ASCII', 'ignore').decode('ASCII')
        return only_ascii
    

    It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than by dropping all non-ASCII characters, because that fails for some languages (Greek, for example). The best solution would probably be to explicitly remove the Unicode characters that are tagged as diacritics.

    Edit: this does the trick:

    import unicodedata
    
    def remove_accents(input_str):
        nfkd_form = unicodedata.normalize('NFKD', input_str)
        return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
    

    unicodedata.combining(c) returns the canonical combining class of the character c as an integer: 0 for ordinary characters, and a nonzero value if c can be combined with the preceding character, which mainly means it is a diacritic.
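
    To make the difference between the two approaches concrete, here is a minimal sketch, assuming Python 3 (where every str is Unicode). The sample words are illustrative choices, not from the original answer.

```python
import unicodedata

def remove_accents(input_str):
    # Decompose, then drop every code point with a nonzero combining class.
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

# combining() returns the canonical combining class: 0 for base characters,
# nonzero for combining marks such as U+0301 COMBINING ACUTE ACCENT.
print(unicodedata.combining("e"))        # 0
print(unicodedata.combining("\u0301"))   # 230

print(remove_accents("Montréal"))        # Montreal
print(remove_accents("Αθήνα"))           # Αθηνα

# The ASCII-only variant throws away the Greek letters along with the accent.
ascii_only = unicodedata.normalize('NFKD', "Αθήνα").encode('ASCII', 'ignore')
print(ascii_only)                        # b''
```

    The Greek word keeps its letters and loses only the tonos, whereas encoding to ASCII leaves nothing at all.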

    Edit 2: remove_accents expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:

    encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
    byte_string = b"caf\xc3\xa9"  # UTF-8 bytes for "café"; before Python 3, simply "café"
    unicode_string = byte_string.decode(encoding)
    
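
    If the input may arrive as either bytes or text, the decode step can be wrapped in a small helper. This is only a sketch for Python 3; to_text is a hypothetical name, and the encodings shown are assumptions about where the bytes came from.

```python
def to_text(value, encoding="utf-8"):
    # Hypothetical helper: decode bytes, pass str through unchanged.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

utf8_bytes = "café".encode("utf-8")       # b'caf\xc3\xa9'
cp1252_bytes = "café".encode("cp1252")    # b'caf\xe9'

print(to_text(utf8_bytes))                # café
print(to_text(cp1252_bytes, "cp1252"))    # café
print(to_text("café"))                    # café (already text)

# Decoding with the wrong encoding raises instead of guessing.
try:
    to_text(cp1252_bytes, "utf-8")
    wrong_encoding_failed = False
except UnicodeDecodeError:
    wrong_encoding_failed = True
print(wrong_encoding_failed)              # True
```

    The encoding has to match whatever produced the bytes; it cannot be reliably inferred from the bytes alone.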
  • 2020-11-21 06:57

    In response to @MiniQuark's answer:

    I was trying to read a CSV file that was half French (containing accents) and that also contained some strings which would eventually become integers and floats. As a test, I created a test.txt file that looked like this:

    Montréal, über, 12.89, Mère, Françoise, noël, 889

    I had to include lines 2 and 3 (the reload(sys) and sys.setdefaultencoding calls, which I found in a Python ticket) to get it to work, as well as incorporate @Jabba's comment:

    import sys
    reload(sys)  # Python 2 only: re-exposes sys.setdefaultencoding
    sys.setdefaultencoding("utf-8")  # Python 2 only; not available in Python 3
    import csv
    import unicodedata
    
    def remove_accents(input_str):
        nfkd_form = unicodedata.normalize('NFKD', unicode(input_str))
        return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
    
    with open('test.txt') as f:
        read = csv.reader(f)
        for row in read:
            for element in row:
                print remove_accents(element)
    

    The result:

    Montreal
    uber
    12.89
    Mere
    Francoise
    noel
    889
    

    (Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)
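
    For readers on Python 3, where reload(sys) and sys.setdefaultencoding are gone, a rough equivalent might look like the sketch below. It assumes the file is UTF-8 encoded and uses an in-memory io.StringIO as a stand-in for test.txt so that it runs without a file on disk.

```python
import csv
import io
import unicodedata

def remove_accents(input_str):
    # Python 3 strings are already Unicode, so no unicode() call is needed.
    nfkd_form = unicodedata.normalize('NFKD', input_str)
    return "".join(c for c in nfkd_form if not unicodedata.combining(c))

# Stand-in for test.txt; with a real file, use
# open('test.txt', encoding='utf-8', newline='') instead.
sample = io.StringIO("Montréal, über, 12.89, Mère, Françoise, noël, 889\n")

cleaned = []
# skipinitialspace drops the blank that follows each comma in the sample.
for row in csv.reader(sample, skipinitialspace=True):
    for element in row:
        cleaned.append(remove_accents(element))

print(cleaned)
# ['Montreal', 'uber', '12.89', 'Mere', 'Francoise', 'noel', '889']
```

    Passing the encoding to open() explicitly replaces the process-wide default-encoding change used in the Python 2 version.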
