How do I get the “visible” length of a combining Unicode string in Python?

被撕碎了的回忆 2021-01-04 10:47

If I have a Python Unicode string that contains combining characters, len() reports a value that does not correspond to the number of characters actually "seen" on screen.
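To make the mismatch concrete, here is a minimal illustration. The sample string is the one used in the answers on this page, so treat it as an assumed example rather than part of the original question.

```python
# Sample string: "A" followed by two combining marks (U+0332, U+0305),
# then "B" and "C". It renders as three visible characters.
s = u'A\u0332\u0305BC'

# len() counts code points, so both combining marks are included.
code_points = len(s)
print(code_points)  # 5
```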

3 answers
  • 2021-01-04 11:03

    Combining characters are not the only zero-width characters:

    >>> import unicodedata
    >>> sum(1 for ch in u'\u200c' if unicodedata.combining(ch) == 0)
    1
    

    ("\u200c" or "‌" is zero-width non-joiner; it's a non-printing character.)

    In this case the regex module does not work either:

    >>> import regex
    >>> len(regex.findall(r'\X', u'\u200c'))
    1
    

    I found the wcwidth package, which handles the above case correctly:

    >>> from wcwidth import wcswidth
    >>> wcswidth(u'A\u0332\u0305BC')
    3
    >>> wcswidth(u'\u200c')
    0
    

    But it still doesn't seem to work with user 596219's example:

    >>> wcswidth('각')
    4
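
    For the zero-width non-joiner case above, a stdlib-only sketch is possible if you are willing to assume that every character in the general categories Mn, Me and Cf should count as invisible. The helper name visible_length is made up for this illustration. It counts characters, not terminal columns, so it is not a replacement for wcwidth.

```python
import unicodedata

# Assumption: marks (Mn, Me) and format characters (Cf) are never "seen".
INVISIBLE_CATEGORIES = ("Mn", "Me", "Cf")


def visible_length(text):
    """Count characters that are neither combining nor format characters."""
    count = 0
    for ch in text:
        # Non-zero canonical combining class: attaches to the previous character.
        if unicodedata.combining(ch) != 0:
            continue
        # Covers zero-width characters such as U+200C (category Cf).
        if unicodedata.category(ch) in INVISIBLE_CATEGORIES:
            continue
        count += 1
    return count


print(visible_length(u'A\u0332\u0305BC'))  # 3
print(visible_length(u'\u200c'))           # 0
```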
    
  • 2021-01-04 11:08

    If you have a regex flavor that supports matching grapheme clusters, you can use \X.

    While the default Python re module does not support \X, Matthew Barnett's regex module does:

    >>> import regex
    >>> len(regex.findall(r'\X', u'A\u0332\u0305BC'))
    3
    

    On Python 2, the pattern itself must be a Unicode literal (note the u prefix):

    >>> regex.findall(u'\\X', u'A\u0332\u0305BC')
    [u'A\u0332\u0305', u'B', u'C']
    >>> len(regex.findall(u'\\X', u'A\u0332\u0305BC'))
    3
    
  • 2021-01-04 11:16

    The unicodedata module has a function, combining(), that returns the canonical combining class of a single character. If it returns 0, you can count the character as non-combining.

    import unicodedata
    len(u''.join(ch for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0))
    

    or, slightly simpler:

    sum(1 for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0)
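
    If you also want the groups themselves rather than just the count, the same check can be used to attach each combining character to the character before it. This is only a rough sketch: split_clusters is a hypothetical helper name, and it does not implement full grapheme cluster segmentation the way \X does.

```python
import unicodedata


def split_clusters(text):
    """Group each base character with the combining characters that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            # Combining character: append it to the current cluster.
            clusters[-1] += ch
        else:
            # Base character (or a leading combining character): new cluster.
            clusters.append(ch)
    return clusters


clusters = split_clusters(u'A\u0332\u0305BC')
print(len(clusters))  # 3
```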
    