How do I get the “visible” length of a combining Unicode string in Python?

If I have a Python Unicode string that contains combining characters, len reports a value that does not correspond to the number of characters "seen".

3 Answers

你的背包

2021-01-04 11:03
Combining characters are not the only zero-width characters:
```
>>> import unicodedata
>>> sum(1 for ch in u'\u200c' if unicodedata.combining(ch) == 0)
1
```
(U+200C, written "\u200c" in Python, is ZERO WIDTH NON-JOINER; it's a non-printing character.)
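
To see why the combining-class test counts this character, it can help to look at its properties in the standard unicodedata module. A minimal sketch (the variable names are only illustrative):

```python
import unicodedata

zwnj = "\u200c"

# Collect the character's name, general category and canonical combining class.
info = (
    unicodedata.name(zwnj),
    unicodedata.category(zwnj),
    unicodedata.combining(zwnj),
)
print(info)  # ('ZERO WIDTH NON-JOINER', 'Cf', 0)
```

The combining class is 0, so a filter on unicodedata.combining alone treats it like an ordinary character; its general category, Cf (format), is what marks it as non-printing.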

In this case the regex module does not work either:
```
>>> import regex
>>> len(regex.findall(r'\X', u'\u200c'))
1
```
I found wcwidth that handles the above case correctly:
```
>>> from wcwidth import wcswidth
>>> wcswidth(u'A\u0332\u0305BC')
3
>>> wcswidth(u'\u200c')
0
```
But it still doesn't seem to work with user 596219's example:
```
>>> wcswidth('각')
4
```
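
One possible workaround for strings like this is to normalize to NFC before measuring. The sketch below assumes the example string is the decomposed jamo sequence U+1100 U+1161 U+11A8 (the code points are not shown above, so this is an assumption) and uses only the standard unicodedata module:

```python
import unicodedata

# Assumption: the example is three conjoining jamo,
# not the precomposed syllable U+AC01.
decomposed = "\u1100\u1161\u11a8"

# NFC composes the jamo sequence into a single Hangul syllable code point.
composed = unicodedata.normalize("NFC", decomposed)

jamo_count = len(decomposed)      # code points before normalization
syllable_count = len(composed)    # code points after normalization
# East Asian Width property of the composed syllable ("W" means wide).
syllable_eaw = unicodedata.east_asian_width(composed)

print(jamo_count, syllable_count, syllable_eaw)  # 3 1 W
```

Whether a width function then reports the expected value for the composed string depends on the library and its version, so it is worth checking against your own input.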

天涯浪人

2021-01-04 11:08
If you have a regex flavor that supports matching grapheme clusters, you can use \X.

While the default Python re module does not support \X, Matthew Barnett's regex module does:
```
>>> import regex
>>> len(regex.findall(r'\X', u'A\u0332\u0305BC'))
3
```
On Python 2, the pattern needs the u prefix as well:
```
>>> regex.findall(u'\\X', u'A\u0332\u0305BC')
[u'A\u0332\u0305', u'B', u'C']
>>> len(regex.findall(u'\\X', u'A\u0332\u0305BC'))
3
```

闹比i

2021-01-04 11:16
The unicodedata module has a function, combining, that returns the canonical combining class of a single character. If it returns 0, the character is not a combining character, so you can count only those characters.
```
import unicodedata
len(u''.join(ch for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0))
```
or, slightly simpler:
```
sum(1 for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0)
```
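
If you also want to skip zero-width format characters such as U+200C, the same idea can be extended with a general-category check. This is only a rough sketch using the standard library; visible_length is a hypothetical helper name, and the result is an approximation rather than true grapheme-cluster segmentation:

```python
import unicodedata

def visible_length(text):
    """Approximate count of user-visible characters (hypothetical helper)."""
    # NFC first, so sequences with a precomposed form become one code point.
    text = unicodedata.normalize("NFC", text)
    count = 0
    for ch in text:
        if unicodedata.combining(ch) != 0:
            continue  # combining mark attached to the previous character
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # remaining zero-width marks and format characters
        count += 1
    return count

print(visible_length("A\u0332\u0305BC"))  # 3
print(visible_length("\u200c"))           # 0
```

Skipping the whole Cf category is a simplification: a few format characters do have a visible glyph, so check the characters you expect in your data.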