Python and character normalization

Hello, I retrieve UTF-8 text data from a foreign source. It contains special characters such as u"ıöüç", and I want to normalize them to English, such as

4 Answers

挽巷

2020-12-01 09:56
I recommend using the Unidecode module:
```
>>> from unidecode import unidecode
>>> unidecode(u'ıöüç')
'iouc'
```
Note that you pass it a Unicode string and the output is guaranteed to contain only ASCII characters. Under Python 2 the result is a byte string; under Python 3 it is an ordinary str.
小鲜肉

2020-12-01 10:02

The simplest way I found:

unicodedata.normalize('NFKD', s).encode("ascii", "ignore")
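
To see what this one-liner does in practice, here is a minimal sketch. The helper name to_ascii is made up for illustration, and the final decode step is an addition that assumes Python 3, where encode() returns bytes. Note that characters with no Unicode decomposition, such as the dotless ı from the question, are dropped rather than transliterated.

```python
import unicodedata

def to_ascii(s):
    # Hypothetical helper wrapping the one-liner above.
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize('NFKD', s)
    # Encoding with "ignore" silently drops everything that is not ASCII;
    # decoding turns the resulting bytes back into a str (Python 3).
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(to_ascii('öüç'))   # ouc
print(to_ascii('ıöüç'))  # ouc (the dotless ı has no decomposition and is lost)
```

If losing characters like ı is not acceptable, this approach alone is probably not enough.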

北荒

2020-12-01 10:02
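```
import unicodedata

s = u'ıöüç'  # example input taken from the question
unicodedata.normalize('NFKD', s)  # form is one of 'NFC', 'NFKC', 'NFD', 'NFKD'
```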
http://docs.python.org/library/unicodedata.html
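
Since normalize() takes a form argument, a small sketch of how the four forms differ may help. The sample string is an assumption chosen for illustration: an accented é followed by the ﬁ ligature (U+FB01).

```python
import unicodedata

# Illustrative input: 'é' (U+00E9) followed by the 'ﬁ' ligature (U+FB01).
s = '\u00e9\ufb01'

for form in ('NFC', 'NFD', 'NFKC', 'NFKD'):
    result = unicodedata.normalize(form, s)
    # Print the code points so the decomposition is visible.
    print(form, [hex(ord(c)) for c in result])
```

The D forms split é into e plus a combining accent, and only the K (compatibility) forms also expand the ligature into f and i.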
你的背包

2020-12-01 10:12
It all depends on how far you want to go in transliterating the result. If you want to convert everything all the way to ASCII (αβγ to abg) then unidecode is the way to go.

If you just want to remove accents from accented letters, then you could try decomposing your string using normalization form NFKD (this converts the accented letter á to a plain letter a followed by U+0301 COMBINING ACUTE ACCENT) and then discarding the accents (which belong to the Unicode character class Mn — "Mark, nonspacing").
```
import unicodedata

def remove_nonspacing_marks(s):
    "Decompose the unicode string s and remove non-spacing marks."
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if unicodedata.category(c) != 'Mn')
```
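
A brief usage sketch of the function above, repeated here so the snippet runs on its own. The sample inputs are chosen for illustration. It shows that only the accents are removed: letters that are not built from a base letter plus a mark, such as the dotless ı or Greek letters, pass through unchanged.

```python
import unicodedata

def remove_nonspacing_marks(s):
    """Decompose the string s and remove non-spacing marks (category Mn)."""
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if unicodedata.category(c) != 'Mn')

print(remove_nonspacing_marks('á'))     # a
print(remove_nonspacing_marks('öüç'))   # ouc
print(remove_nonspacing_marks('ıöüç'))  # ıouc (ı is a letter, not a mark)
print(remove_nonspacing_marks('άβγ'))   # αβγ (accent removed, Greek kept)
```

So for the string in the question, this approach leaves ı in place; mapping it to i would need a transliteration step such as unidecode.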