Replace special characters with ASCII equivalents

星月不相逢 2020-12-08 10:16

Is there any lib that can replace special characters with their ASCII equivalents, e.g. turn:

"Cześć"

into:

"Czesc"

6 Answers
  • 2020-12-08 10:23

    I did it this way:

    import unicodedata

    # Map each Polish letter (as a single NFC code point) to its ASCII equivalent
    POLISH_CHARACTERS = {
        'ą':'a','ć':'c','ę':'e','ł':'l','ń':'n','ó':'o','ś':'s','ź':'z','ż':'z',
        'Ą':'A','Ć':'C','Ę':'E','Ł':'L','Ń':'N','Ó':'O','Ś':'S','Ź':'Z','Ż':'Z',}
    
    def encodePL(text):
        # NFC normalization ensures each accented letter is one code point
        nrmtxt = unicodedata.normalize('NFC', text)
        ret_str = []
        for ch in nrmtxt:
            if ord(ch) > 127: # non-ASCII character
                ret_str.append(POLISH_CHARACTERS.get(ch, ch))
            else: # pure ASCII character
                ret_str.append(ch)
        return ''.join(ret_str)
    

    when executed:

    encodePL(u'ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ')
    

    it will produce output like this:

    u'acelnoszz ACELNOSZZ'
    

    This works fine for me - ;D

  • 2020-12-08 10:26

    Try the trans package. Looks very promising. Supports Polish.

  • 2020-12-08 10:31
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    
    import unicodedata
    
    text = u'Cześć'
    # NFD splits each letter from its combining accent; 'ignore' drops the accents
    print(unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode('ascii'))
    
  • 2020-12-08 10:44

    The package unidecode worked best for me:

    from unidecode import unidecode
    text = "Björn, Łukasz and Σωκράτης."
    print(unidecode(text))
    # ==> Bjorn, Lukasz and Sokrates.
    

    You might need to install the package:

    pip install unidecode
    

    The above solution is easier and more robust than encoding (and decoding) the output of unicodedata.normalize(), as suggested by other answers.

    # This doesn't work as expected:
    ret = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
    print(ret)
    # ==> b'Bjorn, ukasz and .'
    # Besides not supporting all characters, the returned value is a
    # bytes object in python3. To yield a str type:
    ret = ret.decode("utf8") # (not required in python2)
    
  • 2020-12-08 10:48

    You can get most of the way by doing:

    import unicodedata
    
    def strip_accents(text):
        return ''.join(c for c in unicodedata.normalize('NFKD', text)
                       if unicodedata.category(c) != 'Mn')
    

    Unfortunately, some Latin letters cannot be decomposed into an ASCII letter plus combining marks, so NFKD alone will not convert them. You'll have to handle them manually. These include:

    • Æ → AE
    • Ð → D
    • Ø → O
    • Þ → TH
    • ß → ss
    • æ → ae
    • ð → d
    • ø → o
    • þ → th
    • Œ → OE
    • œ → oe
    • ƒ → f
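
    Putting the NFKD step and the manual substitutions together, a minimal sketch (the `SPECIAL` map and the `strip_accents_full` name are mine, not from the answer):

```python
import unicodedata

# Fallback substitutions for the letters listed above, which NFKD cannot decompose
SPECIAL = {
    'Æ': 'AE', 'Ð': 'D', 'Ø': 'O', 'Þ': 'TH', 'ß': 'ss',
    'æ': 'ae', 'ð': 'd', 'ø': 'o', 'þ': 'th',
    'Œ': 'OE', 'œ': 'oe', 'ƒ': 'f',
}

def strip_accents_full(text):
    # Substitute the non-decomposable letters first, then strip combining marks
    text = ''.join(SPECIAL.get(c, c) for c in text)
    return ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if unicodedata.category(c) != 'Mn')
```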
  • 2020-12-08 10:48

    The unicodedata.normalize gimmick can best be described as half-assci. Here is a robust approach which includes a map for letters with no decomposition. Note the additional map entries in the comments.
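
    One way such a combined approach can be sketched (assumed names and map entries, not the answerer's original code):

```python
import unicodedata

# Explicit map for letters with no NFKD decomposition (example entries)
NO_DECOMPOSITION = str.maketrans({
    'Ł': 'L', 'ł': 'l',    # Polish L with stroke
    'Đ': 'DJ', 'đ': 'dj',  # Croatian/Serbian D with stroke
    'ß': 'ss',             # German sharp s
})

def asciify(text):
    # Apply the explicit map first, then let NFKD handle the rest
    text = text.translate(NO_DECOMPOSITION)
    return (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('ascii'))
```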
