What is the best way to remove accents (normalize) in a Python unicode string?

感情败类 2020-11-21 06:11

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the web an elegant way to do this (in Java):

  1. conve
8 answers
  •  遇见更好的自我
    2020-11-21 06:39

    I work on a project that has to be compatible with Python 2.6, 2.7 and 3.4, and I have to create IDs from free-form user input.

    Thanks to you, I have created this function that works wonders.

    import re
    import unicodedata
    
    def strip_accents(text):
        """
        Strip accents from input String.
    
        :param text: The input string.
        :type text: String.
    
        :returns: The processed String.
        :rtype: String.
        """
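        try:
            # Python 2: decode a byte string to unicode.
            text = unicode(text, 'utf-8')
        except (TypeError, NameError):
            # TypeError: text is already unicode (Python 2).
            # NameError: unicode() does not exist on Python 3, where str is already Unicode.
            pass
        # Decompose each accented character into a base character plus combining marks.
        text = unicodedata.normalize('NFD', text)
        # Encoding to ASCII with 'ignore' drops the combining marks
        # (and any other character that has no ASCII form).
        text = text.encode('ascii', 'ignore')
        text = text.decode("utf-8")
        return str(text)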
    
    def text_to_id(text):
        """
        Convert input text to id.
    
        :param text: The input string.
        :type text: String.
    
        :returns: The processed String.
        :rtype: String.
        """
        text = strip_accents(text.lower())
        text = re.sub('[ ]+', '_', text)
        text = re.sub('[^0-9a-zA-Z_-]', '', text)
        return text
    

    Result:

    >>> text_to_id("Montréal, über, 12.89, Mère, Françoise, noël, 889")
    'montreal_uber_1289_mere_francoise_noel_889'
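
One caveat with the function above: `encode('ascii', 'ignore')` discards every non-ASCII character, not only the accents, so letters with no ASCII decomposition (for example `ß` or `ł`) disappear entirely. That is fine for building IDs, but if you only want to remove diacritics and keep everything else, a minimal Python 3 only sketch could filter out the combining marks instead. The helper name `strip_combining_marks` is hypothetical and not part of the answer's code:

```python
import unicodedata

def strip_combining_marks(text):
    """Remove combining marks (accents) but keep all other characters.

    Hypothetical helper for illustration; assumes Python 3, where str is Unicode.
    """
    # Split accented characters into base character + combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose anything that NFD split apart but that was not an accent.
    return unicodedata.normalize('NFC', stripped)
```

With this variant, `strip_combining_marks("Françoise")` gives `'Francoise'`, while a character such as `ß` is left untouched rather than dropped.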
    
