Truncating unicode so it fits a maximum size when encoded for wire transfer

借酒劲吻你 2020-12-29 21:57

Given a Unicode string and these requirements:

  • The string must be encoded into some byte-sequence format (e.g. UTF-8 or JSON Unicode escapes)
  • The encoded st
5 answers
  •  一整个雨季
    2020-12-29 22:05

    One of UTF-8's properties is that it is easy to resync, that is, to find the Unicode character boundaries in the encoded byte stream. All you need to do is cut the encoded string at the maximum length, then walk backwards from the end removing any bytes that are > 127 -- those are part of, or the start of, a multibyte character.

    As written now, this is too simple -- it will erase back to the last ASCII character, possibly the whole string. What we need to do instead is check that the cut did not truncate a two-byte (lead byte 110yyyxx), three-byte (1110yyyy) or four-byte (11110zzz) sequence.
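
    The lead-byte patterns listed above can be tested directly with bit shifts. The following is a minimal sketch in Python 3 (an assumption; the answer itself targets Python 2.6), and the helper names sequence_length and incomplete_tail are hypothetical. It assumes the input is valid UTF-8 that was cut at an arbitrary byte offset.

```python
def sequence_length(byte: int) -> int:
    """Length announced by a UTF-8 lead byte; 0 for a continuation byte."""
    if byte < 0b10000000:
        return 1  # 0xxxxxxx: single-byte (ASCII)
    if byte >> 5 == 0b110:
        return 2  # 110xxxxx: lead byte of a two-byte sequence
    if byte >> 4 == 0b1110:
        return 3  # 1110xxxx: lead byte of a three-byte sequence
    if byte >> 3 == 0b11110:
        return 4  # 11110xxx: lead byte of a four-byte sequence
    return 0      # 10xxxxxx continuation byte (or a byte invalid in UTF-8)


def incomplete_tail(chunk: bytes) -> int:
    """Number of trailing bytes of chunk that belong to a cut-off character."""
    # Only the last 1-4 bytes need to be inspected, whatever the length.
    for back in range(1, min(4, len(chunk)) + 1):
        announced = sequence_length(chunk[-back])
        if announced:  # reached the first byte of the last character
            return back if announced > back else 0
    return 0


encoded = "ウィキペディアにようこそ".encode("utf-8")
chunk = encoded[:20]
safe = chunk[:len(chunk) - incomplete_tail(chunk)]
print(len(safe), safe.decode("utf-8"))  # 18 ウィキペディ
```

    The slice uses len(chunk) - incomplete_tail(chunk) rather than a negative index, because chunk[:-0] would return an empty string when nothing needs to be removed.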

    Python 2.6 implementation in clear code. Optimization should not be an issue -- regardless of length, we only check the last 1-4 bytes.

    # coding: UTF-8
    
    def decodeok(bytestr):
        try:
            bytestr.decode("UTF-8")
        except UnicodeDecodeError:
            return False
        return True
    
    def is_first_byte(byte):
        """return True if the UTF-8 @byte starts an encoded character (i.e. it is not a 10xxxxxx continuation byte)"""
        o = ord(byte)
        return (o & 0b11000000) != 0b10000000
    
    def truncate_utf8(bytestr, maxlen):
        u"""
    
        >>> us = u"ウィキペディアにようこそ"
        >>> s = us.encode("UTF-8")
    
        >>> trunc20 = truncate_utf8(s, 20)
        >>> print trunc20.decode("UTF-8")
        ウィキペディ
        >>> len(trunc20)
        18
    
        >>> trunc21 = truncate_utf8(s, 21)
        >>> print trunc21.decode("UTF-8")
        ウィキペディア
        >>> len(trunc21)
        21
        """
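        if len(bytestr) <= maxlen:
            # nothing to cut; also avoids indexing past the end below
            return bytestr
        L = maxlen
        # only the last 1-4 bytes can belong to a cut-off character
        for x in xrange(1, min(4, L) + 1):
            if is_first_byte(bytestr[L-x]) and not decodeok(bytestr[L-x:L]):
                return bytestr[:L-x]
        return bytestr[:L]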
    
    if __name__ == '__main__':
        # unicode doctest hack
        import sys
        reload(sys)
        sys.setdefaultencoding("UTF-8")
        import doctest
        doctest.testmod()
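
    For readers on Python 3, where str and bytes are separate types, a shorter route is to let the decoder discard the cut-off tail. This is a different technique from the byte inspection above, offered only as a sketch: the function name truncate_encoded is hypothetical, and the approach assumes the text encodes to valid UTF-8, since errors="ignore" would also silently drop invalid bytes anywhere else in the input.

```python
def truncate_encoded(text: str, maxlen: int) -> bytes:
    """Encode text as UTF-8 and cut it to at most maxlen bytes on a character boundary."""
    encoded = text.encode("utf-8")
    if len(encoded) <= maxlen:
        return encoded
    cut = encoded[:max(maxlen, 0)]
    # errors="ignore" drops the incomplete trailing sequence left by the cut;
    # re-encoding yields bytes that end on a character boundary.
    return cut.decode("utf-8", errors="ignore").encode("utf-8")


for limit in (20, 21):
    out = truncate_encoded("ウィキペディアにようこそ", limit)
    print(limit, len(out), out.decode("utf-8"))
```

    Unlike the function above, this decodes the whole prefix rather than looking at the last few bytes only, so it trades some speed on long strings for brevity.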
    
