How much UTF-8 text fits in a MySQL “Text” field?

旧巷少年郎 2020-12-08 02:24

According to MySQL, a text column holds 65,535 bytes.

So if this is a legitimate boundary, it will actually only fit about 32k UTF-8 characters, right?

3 Answers
  •  有刺的猬
    2020-12-08 03:00

    UTF-8 characters can take up to 4 bytes each, not 2 as you are supposing. UTF-8 is a variable-width encoding, depending on the number of significant bits in the Unicode code point:

    • 7 bits and under in the Unicode code point: 1 byte in UTF-8
    • 8 to 11 bits: 2 bytes in UTF-8
    • 12 to 16 bits: 3 bytes
    • 17 to 21 bits: 4 bytes
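    The width table above can be checked directly. The following is a minimal Python sketch, not part of the original answer; the sample characters and the helper name utf8_width are chosen here purely for illustration.

```python
# Illustrative sample characters, one per UTF-8 width class.
samples = {
    "A": "U+0041",    # fits in 7 bits
    "é": "U+00E9",    # needs 8 bits
    "中": "U+4E2D",   # needs 15 bits
    "😀": "U+1F600",  # needs 17 bits, outside the BMP
}

def utf8_width(ch: str) -> int:
    """Return the number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

for ch, code_point in samples.items():
    bits = ord(ch).bit_length()  # significant bits in the code point
    print(f"{code_point}: {bits} significant bits -> {utf8_width(ch)} byte(s)")
```

    Each sample should land in the width class the list predicts for its bit count.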

    The original UTF-8 spec allowed encoding up to 31-bit values, taking as many as 6 bytes to encode in UTF-8 form. After UTF-8 became popular, the Unicode Consortium declared that it will never use code points beyond 2^21 - 1 (the actual ceiling is U+10FFFF). This limit is now standardized as RFC 3629.

    MySQL's utf8 character set (also called utf8mb3) only supports characters in the Unicode Basic Multilingual Plane, for which UTF-8 needs up to 3 bytes per character. With that character set, the answer to your question is that your TEXT field can hold at least 21,845 characters (65,535 / 3).

    Depending on how you look at it, the practical limit is higher or lower than that:

    • The BMP limitation does not apply to the utf8mb4 character set, available since MySQL 5.5.3, which stores up to 4 bytes per character. If the column uses utf8mb4, or may be converted to it later, you shouldn't count on being able to store more than 16,383 characters (65,535 / 4, rounded down) in that field if your MySQL client allows arbitrary Unicode text input.

    • On the other hand, you may be able to exploit the fact that UTF-8 is a variable-width encoding. If you know your text is mostly plain English with just the occasional non-ASCII character, your effective limit in practice could approach the maximum of 65,535 characters (64 KiB - 1), because each ASCII character takes only 1 byte.
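    The arithmetic behind these limits, and a length check you could run before inserting, might look like the sketch below. It assumes the 65,535-byte TEXT limit cited in the question; the names TEXT_MAX_BYTES, worst_case_chars, and fits_in_text are hypothetical helpers, not part of MySQL or any client library.

```python
TEXT_MAX_BYTES = 65_535  # 2**16 - 1, the TEXT limit cited in the question

def worst_case_chars(bytes_per_char: int) -> int:
    """Characters guaranteed to fit if every character uses bytes_per_char bytes."""
    return TEXT_MAX_BYTES // bytes_per_char

def fits_in_text(value: str) -> bool:
    """Check the encoded byte length, not the character count."""
    return len(value.encode("utf-8")) <= TEXT_MAX_BYTES

print(worst_case_chars(1))  # 65535 (pure ASCII)
print(worst_case_chars(3))  # 21845 (BMP only)
print(worst_case_chars(4))  # 16383 (any Unicode character)

print(fits_in_text("a" * 65_535))   # True: exactly 65,535 bytes
print(fits_in_text("😀" * 16_384))  # False: 65,536 bytes
```

    Measuring the encoded length per value, rather than assuming a fixed character budget, is what lets mostly-ASCII text use nearly the whole field.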
