Warning raised by inserting 4-byte unicode to mysql

旧时难觅i 2020-12-02 00:16

Look at the following:

/home/kinka/workspace/py/tutorial/tutorial/pipelines.py:33: Warning: Incorrect string 
value: \'\\xF0\\x9F\\x91\\x8A\\xF0\\x9F...\' fo         


        
3 Answers
  • 2020-12-02 00:45

    A simple normalization for strings, without regex or translate:

    def normalize_unicode(s):
        # Replace every code point above U+FFFF with U+FFFD (the replacement character).
        return u''.join(c if ord(c) < 0x10000 else u'\ufffd' for c in s)
    
  • 2020-12-02 00:57

    If MySQL cannot handle 4-byte UTF-8 sequences, then you'll have to filter out all Unicode characters at or above code point \U00010000; UTF-8 encodes code points below that threshold in 3 bytes or fewer.

    You could use a regular expression for that:

    >>> import re
    >>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
    >>> example = u'Some example text with a sleepy face: \U0001f62a'
    >>> highpoints.sub(u'', example)
    u'Some example text with a sleepy face: '
    

    Alternatively, you could use the .translate() method with a mapping table that contains only None values:

    >>> nohigh = { i: None for i in xrange(0x10000, 0x110000) }
    >>> example.translate(nohigh)
    u'Some example text with a sleepy face: '
    

    However, creating the translation table will eat a lot of memory and take some time to generate; it is probably not worth your effort as the regular expression approach is more efficient.

    This all presumes you are using a UCS-4 (wide) build of Python. If your Python was compiled with UCS-2 support, then you can only use code points up to '\U0000ffff' in regular expressions and you'll never run into this problem in the first place.

    I note that as of MySQL 5.5.3, the newly added utf8mb4 character set supports the full Unicode range.
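
The 3-byte threshold described in the answer above can be checked directly. The following is a minimal sketch, assuming Python 3.3+ or a UCS-4 build of Python 2; the helper names `utf8_length`, `needs_utf8mb4`, and `strip_astral` are hypothetical and chosen only for illustration.

```python
import re

# Matches any code point from U+10000 up to U+10FFFF, i.e. everything
# that UTF-8 encodes in 4 bytes. Assumes Python 3.3+ or a UCS-4 Python 2.
_ASTRAL = re.compile(u'[\U00010000-\U0010ffff]')


def utf8_length(ch):
    """Return how many bytes a single character occupies in UTF-8."""
    return len(ch.encode('utf-8'))


def needs_utf8mb4(text):
    """Return True if text contains a character that UTF-8 encodes in 4 bytes."""
    return _ASTRAL.search(text) is not None


def strip_astral(text, replacement=u''):
    """Remove (or replace) every character a 3-byte utf8 column would reject."""
    return _ASTRAL.sub(replacement, text)
```

Checking with `needs_utf8mb4` before an insert lets you strip only the rows that would actually trigger the warning, and passing `u'\ufffd'` as `replacement` keeps a visible marker where a character was dropped.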

  • 2020-12-02 01:07

    I think you should use the utf8mb4 character set instead of utf8 and run

    SET NAMES UTF8MB4
    

    after connecting to the database.
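
As a rough sketch of the step above, the statement can be issued right after the connection is opened. This assumes a connection object that follows the Python DB-API 2.0 interface (`cursor()`, `execute()`, `close()`); the helper name `use_utf8mb4` is hypothetical, and the table columns themselves still need to use utf8mb4 for 4-byte characters to be stored.

```python
def use_utf8mb4(connection):
    """Issue SET NAMES utf8mb4 on a freshly opened DB-API 2.0 connection.

    `connection` is assumed to come from whichever MySQL driver the
    project already uses; nothing driver-specific is called here.
    """
    cursor = connection.cursor()
    try:
        # Tell the server that this client sends and expects utf8mb4.
        cursor.execute("SET NAMES utf8mb4")
    finally:
        cursor.close()
    return connection
```

Returning the connection allows the call to wrap the place where the connection is created, so every later query on it runs with the utf8mb4 setting.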
