My string is Niệm Bồ Tát (Thiá»n sÆ° Nhất Hạnh)
and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh).
I have seen a site that can do this: ht
Try:
str.encode('ascii', 'ignore').decode('utf-8')
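This encodes the string as ASCII, silently dropping every character that ASCII cannot represent, and then decodes the result as UTF-8. It does not repair the text: the accented characters are removed entirely, so treat it as a lossy fallback rather than a real fix.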
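To see what that call does to accented text, here is a small Python 3 sketch. The sample string is only an illustration; note that the non-ASCII characters are dropped outright, not converted to their unaccented base letters.

```python
# "09. Bát Nhã Tâm Kinh", written with escapes so the literal is unambiguous
text = "09. B\u00e1t Nh\u00e3 T\u00e2m Kinh"

# errors="ignore" discards every character that has no ASCII encoding
stripped = text.encode("ascii", "ignore").decode("utf-8")

print(stripped)  # 09. Bt Nh Tm Kinh
```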
I'm not sure what you can do with this kind of data, but for the example in your original post, this works (Python 2):
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> s = mystr.decode('utf8').encode('latin1').decode('utf8')
>>> s
u'09. B\xe1t Nh\xe3 T\xe2m Kinh'
>>> print(s)
09. Bát Nhã Tâm Kinh
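In Python 3, where str is already Unicode, the same round trip needs no initial decode step. The following is a minimal sketch, assuming the text was UTF-8 that got mistakenly decoded as Latin-1; fix_mojibake is a hypothetical helper name, not part of any library.

```python
def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Latin-1.

    fix_mojibake is a hypothetical helper name, not a library function.
    """
    try:
        # Re-encoding as Latin-1 recovers the original bytes;
        # decoding those bytes as UTF-8 restores the intended text.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip failed, so the assumption does not hold
        # for this text; return it unchanged.
        return text


# "09. BÃ¡t NhÃ£ TÃ¢m Kinh", written with escapes so the literal is unambiguous
broken = "09. B\u00c3\u00a1t Nh\u00c3\u00a3 T\u00c3\u00a2m Kinh"
print(fix_mojibake(broken))
```

If the text was mis-decoded as Windows-1252 rather than Latin-1, swapping "latin-1" for "cp1252" may be needed. Text that has lost bytes along the way cannot be recovered by this kind of round trip.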
The only thing that helped me with a broken Cyrillic string was ftfy: https://github.com/LuminosoInsight/python-ftfy
This module fixes pretty much everything and works much better than online decoders.
>>> from ftfy import fix_encoding
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> fix_encoding(mystr)
'09. Bát Nhã Tâm Kinh'
It can be installed with pip install ftfy