What is the best way to remove accents (normalize) in a Python unicode string?

感情败类 2020-11-21 06:11

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the web an elegant way to do this (in Java):

  1. conve
8 Answers
  • 2020-11-21 06:39

    I work on a project that must be compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free-form user input.

    Thanks to you, I have created this function that works wonders.

    import re
    import unicodedata
    
    def strip_accents(text):
        """
        Strip accents from input String.
    
        :param text: The input string.
        :type text: String.
    
        :returns: The processed String.
        :rtype: String.
        """
        try:
            text = unicode(text, 'utf-8')
        except (TypeError, NameError): # unicode is a default on python 3 
            pass
        text = unicodedata.normalize('NFD', text)
        text = text.encode('ascii', 'ignore')
        text = text.decode("utf-8")
        return str(text)
    
    def text_to_id(text):
        """
        Convert input text to id.
    
        :param text: The input string.
        :type text: String.
    
        :returns: The processed String.
        :rtype: String.
        """
        text = strip_accents(text.lower())
        text = re.sub('[ ]+', '_', text)
        text = re.sub('[^0-9a-zA-Z_-]', '', text)
        return text
    

    Result:

    >>> text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
    'montreal_uber_1289_mere_francoise_noel_889'
    
  • 2020-11-21 06:40

    This handles not only accents, but also "strokes" (as in ø etc.):

    import unicodedata as ud
    
    def rmdiacritics(char):
        '''
        Return the base character of char, by "removing" any
        diacritics like accents or curls and strokes and the like.
        '''
        desc = ud.name(char)
        cutoff = desc.find(' WITH ')
        if cutoff != -1:
            desc = desc[:cutoff]
            try:
                char = ud.lookup(desc)
            except KeyError:
                pass  # removing "WITH ..." produced an invalid name
        return char
    

    This is the most elegant way I can think of (and it has been mentioned by alexis in a comment on this page), although I don't think it is very elegant. In fact, it is more of a hack, as pointed out in the comments, since Unicode names are really just names and give no guarantee of being consistent.

    There are still special letters that this does not handle, such as turned and inverted letters, since their Unicode names do not contain 'WITH'. It depends on what you want to do anyway. I have sometimes needed accent stripping to achieve dictionary sort order.
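
    As a rough sketch of how the per-character function above could be applied to a whole string: the helper name strip_string is hypothetical, and passing a default to ud.name() so that unnamed characters (such as newlines) pass through unchanged is an assumption that goes beyond the original function. The input is normalized to NFC first, because the name trick only applies to precomposed characters.

```python
import unicodedata as ud


def rmdiacritics(char):
    """Return the base character of char by cutting its Unicode name at ' WITH '."""
    # Assumption: unnamed characters (e.g. control characters) yield '' here
    # instead of raising ValueError, so they are returned unchanged.
    desc = ud.name(char, '')
    cutoff = desc.find(' WITH ')
    if cutoff != -1:
        try:
            char = ud.lookup(desc[:cutoff])
        except KeyError:
            pass  # the truncated name is not a valid character name
    return char


def strip_string(text):
    """Hypothetical helper: apply rmdiacritics to every character of text."""
    # Compose first so that letters such as 'o' + combining acute become one
    # precomposed character whose name contains ' WITH '.
    return ''.join(rmdiacritics(c) for c in ud.normalize('NFC', text))


print(strip_string('Smørrebrød'))  # prints: Smorrebrod
```

    Characters whose names contain no ' WITH ' are left as they are, which matches the limitation described above.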

    EDIT NOTE:

    Incorporated suggestions from the comments (handling lookup errors, Python-3 code).

  • 2020-11-21 06:43

    Use gensim.utils.deaccent(text) from Gensim (topic modelling for humans). Example output:

    'Sef chomutovskych komunistu dostal postou bily prasek'
    

    Another solution is unidecode.

    Note that the suggested solution with unicodedata typically removes accents only from some characters (e.g. it turns 'ł' into '', rather than into 'l').
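
    To make that limitation concrete, here is a small sketch using only the standard library. The two function names are hypothetical; they stand for the two unicodedata-based approaches shown elsewhere on this page. The letter 'Ł' has no canonical decomposition, so one variant drops it and the other leaves it untouched; neither turns it into 'L'.

```python
import unicodedata


def strip_via_ascii(text):
    """Decompose, then drop every non-ASCII code point."""
    decomposed = unicodedata.normalize('NFD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def strip_via_category(text):
    """Decompose, then drop only nonspacing marks (category 'Mn')."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')


word = '\u0141\u00f3d\u017a'  # 'Łódź'
print(strip_via_ascii(word))     # prints: odz   (the 'Ł' is lost entirely)
print(strip_via_category(word))  # prints: Łodz  (the 'Ł' is kept as is)
```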

  • 2020-11-21 06:46

    Unidecode is the correct answer for this. It transliterates any Unicode string into the closest possible representation in ASCII text.

    Example:

    import unidecode
    
    accented_string = u'Málaga'
    # accented_string is of type 'unicode'
    unaccented_string = unidecode.unidecode(accented_string)
    # unaccented_string contains 'Malaga' and is of type 'str'
    
  • 2020-11-21 06:46

    How about this:

    import unicodedata
    
    def strip_accents(s):
        return ''.join(c for c in unicodedata.normalize('NFD', s)
                       if unicodedata.category(c) != 'Mn')
    

    This works on Greek letters, too:

    >>> strip_accents(u"A \u00c0 \u0394 \u038E")
    u'A A \u0394 \u03a5'
    

    The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
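
    For comparison, a minimal sketch of the unicodedata.combining variant mentioned above. The function name is hypothetical. unicodedata.combining returns the canonical combining class of a character, which is 0 for base characters, so this filter largely overlaps with the "Mn" test, although the two are not guaranteed to select exactly the same characters.

```python
import unicodedata


def strip_accents_combining(s):
    """Decompose s and drop every character with a non-zero combining class."""
    decomposed = unicodedata.normalize('NFD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))


# Same input as in the example above: the grave accent and the Greek tonos
# are removed, while the base letters are kept.
result = strip_accents_combining("A \u00c0 \u0394 \u038E")
print(result == "A A \u0394 \u03a5")  # prints: True
```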

    Keep in mind that these manipulations may significantly alter the meaning of the text. Accents, umlauts, etc. are not "decoration".

  • 2020-11-21 06:49

    In some languages, certain combining diacritics are an integral part of a letter, while other diacritics only mark the accent.

    I think it is safer to specify explicitly which diacritics you want to strip:

    import unicodedata
    
    def strip_accents(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE')):
        accents = set(map(unicodedata.lookup, accents))
        chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
        return unicodedata.normalize('NFC', ''.join(chars))
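
    A short usage sketch of the function above, repeated here so the example is self-contained. The sample strings are illustrative assumptions: in the Russian word, the acute accent is only a stress mark and is removed, while the breve that forms the letter 'й' is kept; in the French example, the diaeresis in 'ï' survives because it is not in the default accents tuple.

```python
import unicodedata


def strip_accents(string, accents=('COMBINING ACUTE ACCENT',
                                   'COMBINING GRAVE ACCENT',
                                   'COMBINING TILDE')):
    """Remove only the listed combining marks, then recompose the string."""
    accents = set(map(unicodedata.lookup, accents))
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))


# 'мо́й' with a stress mark (U+0301) on the 'о'; 'й' is U+0439.
print(strip_accents('\u043c\u043e\u0301\u0439') == '\u043c\u043e\u0439')  # prints: True

# 'naïve café': the acute accent is stripped, the diaeresis is kept.
print(strip_accents('na\u00efve caf\u00e9') == 'na\u00efve cafe')  # prints: True

# Passing a different tuple changes which marks are removed.
print(strip_accents('na\u00efve', accents=('COMBINING DIAERESIS',)))  # prints: naive
```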
    