Is there any library that can replace special characters with their ASCII equivalents, like:
"Cześć"
to:
"Czesc"
I did it this way:
import unicodedata

# Map each Polish letter (as a single NFC-composed code point) to its ASCII equivalent
POLISH_CHARACTERS = {
    u'ą':'a', u'ć':'c', u'ę':'e', u'ł':'l', u'ń':'n', u'ó':'o', u'ś':'s', u'ź':'z', u'ż':'z',
    u'Ą':'A', u'Ć':'C', u'Ę':'E', u'Ł':'L', u'Ń':'N', u'Ó':'O', u'Ś':'S', u'Ź':'Z', u'Ż':'Z',
}

def encodePL(text):
    # NFC keeps every accented letter as one code point,
    # so each character can be looked up directly in the map
    nrmtxt = unicodedata.normalize('NFC', text)
    ret_str = []
    for ch in nrmtxt:
        if ord(ch) > 127:  # non-ASCII character: map it if we know it
            ret_str.append(POLISH_CHARACTERS.get(ch, ch))
        else:  # plain ASCII character
            ret_str.append(ch)
    return ''.join(ret_str)
When executed:
encodePL(u'ąćęłńóśźż ĄĆĘŁŃÓŚŹŻ')
it produces:
u'acelnoszz ACELNOSZZ'
This works fine for me ;D
Try the trans package. Looks very promising. Supports Polish.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import unicodedata

# Decompose accented letters (NFD), then drop the combining marks
# by encoding to ASCII with errors ignored (Python 2)
text = u'Cześć'
print unicodedata.normalize('NFD', text).encode('ascii', 'ignore')
# ==> Czesc
The package unidecode worked best for me:
from unidecode import unidecode
text = "Björn, Łukasz and Σωκράτης."
print(unidecode(text))
# ==> Bjorn, Lukasz and Sokrates.
You might need to install the package:
pip install unidecode
The above solution is easier and more robust than encoding (and decoding) the output of unicodedata.normalize(), as suggested by other answers.
import unicodedata

# This doesn't work as expected:
ret = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(ret)
# ==> b'Bjorn, ukasz and .'
# Besides not supporting all characters, the returned value is a
# bytes object in Python 3. To get a str back:
ret = ret.decode("utf8")  # (not required in Python 2)
You can get most of the way by doing:
import unicodedata

def strip_accents(text):
    # Decompose with NFKD, then drop the combining marks (category 'Mn')
    return ''.join(c for c in unicodedata.normalize('NFKD', text)
                   if unicodedata.category(c) != 'Mn')
Unfortunately, there exist accented Latin letters that cannot be decomposed into an ASCII letter plus combining marks. You'll have to handle them manually. These include: Ł/ł, Ø/ø, Đ/đ, Ħ/ħ, Ŧ/ŧ, Æ/æ, Œ/œ, Ð/ð, Þ/þ and ß.
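For example, using the strip_accents above:

print(strip_accents(u'Cześć'))   # ==> Czesc   ('ś' and 'ć' decompose into letter + mark)
print(strip_accents(u'Łukasz'))  # ==> Łukasz  ('ł' has no decomposition, so it survives)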
The unicodedata.normalize gimmick can best be described as half-assci. Here is a robust approach which includes a map for letters with no decomposition. Note the additional map entries in the comments.
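A minimal sketch of that approach (the map below is only a starting point; the commented entries hint at what else belongs in it):

import unicodedata

# Hand-written map for letters that NFKD cannot decompose.
# Extend it as needed, e.g.:
#   ord(u'ĸ'): u'k', ord(u'ŋ'): u'ng', ord(u'ŧ'): u't', ...
NO_DECOMPOSITION = {
    ord(u'ł'): u'l',  ord(u'Ł'): u'L',
    ord(u'ø'): u'o',  ord(u'Ø'): u'O',
    ord(u'đ'): u'd',  ord(u'Đ'): u'D',
    ord(u'ħ'): u'h',  ord(u'Ħ'): u'H',
    ord(u'ß'): u'ss',
    ord(u'æ'): u'ae', ord(u'Æ'): u'AE',
    ord(u'œ'): u'oe', ord(u'Œ'): u'OE',
    ord(u'ð'): u'd',  ord(u'Ð'): u'D',
    ord(u'þ'): u'th', ord(u'Þ'): u'Th',
}

def asciify(text):
    # Apply the hand map first, then decompose the rest with NFKD
    # and drop the combining marks.
    mapped = text.translate(NO_DECOMPOSITION)
    decomposed = unicodedata.normalize('NFKD', mapped)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(asciify(u'Björn, Łukasz and Øyvind'))
# ==> Bjorn, Lukasz and Oyvind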