Truncating unicode so it fits a maximum size when encoded for wire transfer

前端未结

关注

 5  1497

借酒劲吻你 2020-12-29 21:57

Given a Unicode string and these requirements:

The string be encoded into some byte-sequence format (e.g. UTF-8 or JSON unicode escape)
The encoded st

5条回答

一整个雨季 (楼主)

2020-12-29 22:05

One of UTF-8's properties is that it is easy to resync, that is find the unicode character boundaries easily in the encoded bytestream. All you need to do is to cut the encoded string at max length, then walk backwards from the end removing any bytes that are > 127 -- those are part of, or the start of a multibyte character.

As written now, this is too simple -- will erase to last ASCII char, possibly the whole string. What we need to do is check for no truncated two-byte (start with 110yyyxx) three-byte (1110yyyy) or four-byte (11110zzz)

Python 2.6 implementation in clear code. Optimization should not be an issue -- regardless of length, we only check the last 1-4 bytes.

# coding: UTF-8

def decodeok(bytestr):
    try:
        bytestr.decode("UTF-8")
    except UnicodeDecodeError:
        return False
    return True

def is_first_byte(byte):
    """return if the UTF-8 @byte is the first byte of an encoded character"""
    o = ord(byte)
    return ((0b10111111 & o) != o)

def truncate_utf8(bytestr, maxlen):
    u"""

    >>> us = u"ウィキペディアにようこそ"
    >>> s = us.encode("UTF-8")

    >>> trunc20 = truncate_utf8(s, 20)
    >>> print trunc20.decode("UTF-8")
    ウィキペディ
    >>> len(trunc20)
    18

    >>> trunc21 = truncate_utf8(s, 21)
    >>> print trunc21.decode("UTF-8")
    ウィキペディア
    >>> len(trunc21)
    21
    """
    L = maxlen
    for x in xrange(1, 5):
        if is_first_byte(bytestr[L-x]) and not decodeok(bytestr[L-x:L]):
            return bytestr[:L-x]
    return bytestr[:L]

if __name__ == '__main__':
    # unicode doctest hack
    import sys
    reload(sys)
    sys.setdefaultencoding("UTF-8")
    import doctest
    doctest.testmod()

0 讨论(0)

查看其它5个回答