I have a Unicode string in Python, and I would like to remove all the accents (diacritics).
I found on the web an elegant way to do this (in Java):
I just found this answer on the Web:
import unicodedata
def remove_accents(input_str):
nfkd_form = unicodedata.normalize('NFKD', input_str)
only_ascii = nfkd_form.encode('ASCII', 'ignore')
return only_ascii
It works fine (for French, for example), but I think the second step (removing the accents) could be handled better than dropping the non-ASCII characters, because this will fail for some languages (Greek, for example). The best solution would probably be to explicitly remove the unicode characters that are tagged as being diacritics.
Edit: this does the trick:
import unicodedata
def remove_accents(input_str):
nfkd_form = unicodedata.normalize('NFKD', input_str)
return u"".join([c for c in nfkd_form if not unicodedata.combining(c)])
unicodedata.combining(c)
will return true if the character c
can be combined with the preceding character, that is mainly if it's a diacritic.
Edit 2: remove_accents
expects a unicode string, not a byte string. If you have a byte string, then you must decode it into a unicode string like this:
encoding = "utf-8" # or iso-8859-15, or cp1252, or whatever encoding you use
byte_string = b"café" # or simply "café" before python 3.
unicode_string = byte_string.decode(encoding)
In response to @MiniQuark's answer:
I was trying to read in a csv file that was half-French (containing accents) and also some strings which would eventually become integers and floats.
As a test, I created a test.txt
file that looked like this:
Montréal, über, 12.89, Mère, Françoise, noël, 889
I had to include lines 2
and 3
to get it to work (which I found in a python ticket), as well as incorporate @Jabba's comment:
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
import csv
import unicodedata
def remove_accents(input_str):
nkfd_form = unicodedata.normalize('NFKD', unicode(input_str))
return u"".join([c for c in nkfd_form if not unicodedata.combining(c)])
with open('test.txt') as f:
read = csv.reader(f)
for row in read:
for element in row:
print remove_accents(element)
The result:
Montreal
uber
12.89
Mere
Francoise
noel
889
(Note: I am on Mac OS X 10.8.4 and using Python 2.7.3)