S3 -> Redshift cannot handle UTF8

野的像风 asked on 2020-12-11 03:34

We have a file in S3 that is loaded into Redshift via the COPY command. The import is failing because a VARCHAR(20) value contains an Ä.

5 Answers
  • 2020-12-11 03:48

    Please check the link below on multi-byte character load errors:

    http://docs.aws.amazon.com/redshift/latest/dg/multi-byte-character-load-errors.html

    You should use ACCEPTINVCHARS in your COPY command. Details here:

    http://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html#acceptinvchars
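
    A minimal sketch of such a COPY (the table, bucket, and IAM role names here are hypothetical):

        -- Hypothetical table and S3 path. ACCEPTINVCHARS replaces each
        -- invalid UTF-8 byte with the given character ('?' by default)
        -- instead of failing the load.
        COPY my_schema.my_table
        FROM 's3://my-bucket/path/data.csv'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        CSV
        ACCEPTINVCHARS AS '?';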

  • 2020-12-11 03:53

    tl;dr

    The byte length of your VARCHAR column just needs to be larger.

    Detail

    Multi-byte (UTF-8) characters are supported in the VARCHAR data type; however, the length you provide is in bytes, NOT characters.

    AWS documentation for Multibyte Character Load Errors states the following:

    VARCHAR columns accept multibyte UTF-8 characters, to a maximum of four bytes.

    Therefore, if you want the character Ä to be allowed, you need to budget 2 bytes for it instead of 1.

    AWS documentation for VARCHAR or CHARACTER VARYING states the following:

    ... so a VARCHAR(120) column consists of a maximum of 120 single-byte characters, 60 two-byte characters, 40 three-byte characters, or 30 four-byte characters.

    For a list of UTF-8 characters and their byte lengths, this is a good reference: Complete Character List for UTF-8

    Detailed information for the Unicode Character 'LATIN CAPITAL LETTER A WITH DIAERESIS' (U+00C4) can be found here.
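
    You can see the character-vs-byte distinction in Redshift itself: LEN counts characters while OCTET_LENGTH counts bytes. A quick sanity check:

        -- 'Ä' is one character but two bytes in UTF-8.
        SELECT LEN('Ä')          AS char_count,  -- returns 1
               OCTET_LENGTH('Ä') AS byte_count;  -- returns 2

        -- So a VARCHAR(20) column can hold at most 20 bytes:
        -- 20 single-byte characters, but only 10 characters like 'Ä'.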

  • 2020-12-11 03:54

    I had a similar experience where only some characters, such as Ä, were not copied correctly when loading mysqldump data into our Redshift cluster. It was because the mysqldump output was encoded in latin1, which is MySQL's default character set. It's better to check the character encoding of the files before running COPY. If your files are not UTF-8, you have to re-encode them first.

  • 2020-12-11 03:57

    You need to increase the size of your VARCHAR column. Check the stl_load_errors table to see the actual field value length for the failed rows, and increase the size accordingly; see the query sketch below. EDIT: Just realized this is a very old post; anyway, in case someone needs it..
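
    For example, a query along these lines surfaces the failed rows, the offending values, and the column's declared length:

        -- Most recent load errors first; raw_field_value shows the
        -- actual data that overflowed, col_length the declared size.
        SELECT starttime, filename, line_number, colname,
               col_length, raw_field_value, err_reason
        FROM stl_load_errors
        ORDER BY starttime DESC
        LIMIT 10;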

  • 2020-12-11 03:59

    Using "ACCEPTINVCHARS ESCAPE" in the copy command solved the issue for us with minor data alteration.
