MySQL - select first 10 bytes of a string

Question

Hello wise men & women,

How would you select the first x bytes of a string?

The use case: I'm optimizing product description texts for upload to Amazon. Amazon measures field lengths in bytes of the utf8 encoding (not latin1, as I stated earlier), not in characters. MySQL's string functions, on the other hand, seem to be character-based (e.g., left() counts characters, not bytes). For English, French, Spanish & German text the difference is roughly 10%, but it can vary widely.

Some tests concerning a field with #bytes < 250 (details: http://wiki.devliegendebrigade.nl/Format_inventarisbestanden_(Amazon)#Veldlengte):

OK, char_length: 248,   byte length latin1: 248,   byte length utf8: 248
OK, char_length: 249,   byte length latin1: 249,   byte length utf8: 249
OK, char_length: 249,   byte length latin1: 249,   byte length utf8: 249
OK, char_length: 249,   byte length latin1: 249,   byte length utf8: 249

Not OK, char_length: 250,   byte length latin1: 250,   byte length utf8: 250
Not OK, char_length: 249,   byte length latin1: 249,   byte length utf8: 252
Not OK, char_length: 248,   byte length latin1: 248,   byte length utf8: 252
Not OK, char_length: 249,   byte length latin1: 249,   byte length utf8: 252
Not OK, char_length: 249,   byte length latin1: 249,   byte length utf8: 257
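The three lengths in a comparison like the one above can be computed for a single string with a query along these lines. This is only a sketch: the sample text and the variable name `@sample` are made up for illustration, and it assumes a client that sends UTF-8 over a utf8mb4 connection.

```sql
# Assumption: the client sends UTF-8 and the connection uses utf8mb4.
SET NAMES utf8mb4;

# Hypothetical sample text: ü, ä and ø take 1 byte in latin1 but 2 bytes in UTF-8.
SET @sample = 'Prüfgerät ø 12 mm';

SELECT
    CHAR_LENGTH(@sample)                   AS n_chars,         # 17 characters
    LENGTH(CONVERT(@sample USING latin1))  AS n_bytes_latin1,  # 17 bytes
    LENGTH(CONVERT(@sample USING utf8mb4)) AS n_bytes_utf8;    # 20 bytes
```

Only the last figure is the one that matters for a limit measured in UTF-8 bytes.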

Illustration:

set @tekst="Jantje zag € pruimen hangen";

select
   char_length(@tekst),   # 27 characters
   length(@tekst);        # 29 bytes

select left(@tekst, 15);   # Result: "Jantje zag € pr"

# Ideally, I'm looking for something like this:

select left_bytes_utf8(@tekst, 15);   # Result: "Jantje zag € "

One approach might be a stored procedure that calls itself recursively, but I suspect there are more efficient solutions around.

Thanks already & regards, Jeroen

P.s.: Edited the question: Changed 2x "latin1" to "utf8". It's actually a bit more confusing: Uploads should be in Latin1, but field sizes are measured in bytes using utf8

P.p.s: Update: These uploads are for English, French, Spanish & German Amazon-sites. Characters don't get more exotic than 'ø' (diameter), '€', 'è', 'é', 'ü' and 'ö'. All within Latin1-encoding, but multibyte in utf8.

Answer 1:

SELECT CONVERT(LEFT(CONVERT(@tekst USING binary), 15) USING utf8);

will give you the string cut down to 15 bytes, provided the result is still valid UTF-8. If the cut falls inside a multibyte character, MySQL refuses to return an invalid string and gives you NULL instead. If that doesn't work for you, you can get the raw bytes by omitting the final re-conversion to UTF-8, but you will have to decode them into something useful yourself:

SELECT LEFT(CONVERT(@tekst USING binary), 15);

That said, Rick James gives a lot of good advice, though only you can judge how relevant it is to your particular situation.
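To see the NULL behaviour for yourself, you can cut the example string from the question at different byte positions around the "€". A sketch, assuming a UTF-8 client on a utf8mb4 connection; utf8mb4 is substituted for utf8 here as an assumption, and the column aliases are made up:

```sql
SET NAMES utf8mb4;
SET @tekst = 'Jantje zag € pruimen hangen';

# Bytes 1-11 are "Jantje zag ", bytes 12-14 are the three bytes of "€".
SELECT
    CONVERT(LEFT(CONVERT(@tekst USING binary), 11) USING utf8mb4) AS cut_11,  # ends just before "€"
    CONVERT(LEFT(CONVERT(@tekst USING binary), 13) USING utf8mb4) AS cut_13,  # ends inside "€"
    CONVERT(LEFT(CONVERT(@tekst USING binary), 14) USING utf8mb4) AS cut_14;  # ends right after "€"
```

On recent MySQL versions cut_13 should come back as NULL together with an invalid-character warning, while cut_11 and cut_14 are valid strings.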

Answer 2:

"How would you select the first x bytes of a string?"

Is that really what you want to do? That could (as already pointed out) mangle the string by splitting a multi-byte character up into garbage.

"Amazon calculates field lengths by bytes"

Please provide evidence to this effect.

"The difference is roughly 10%, but it can vary widely."

The max can be a factor of 4. Emoji and certain Chinese characters need 4 bytes for UTF-8 (utf8mb4) encoding.

If Amazon is encoding things in latin1 (which is not the same as "by bytes"), then first you need to check whether the string can be encoded in latin1. Western European text can be, but Asian text cannot be. Sure, you can get "bytes", but that leads to mangled text, especially if you truncate at a byte boundary that is not a character boundary.

SELECT CONVERT(CONVERT(@tekst USING latin1) USING utf8) = @tekst;

This returns 1 (true) if the string survives the round trip, i.e. if the conversion to latin1 loses nothing.

Then you can use CONVERT(@tekst USING latin1) with LEFT(..., 10) or whatever.
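Put together, the check and the truncation might look like the sketch below. It assumes a UTF-8 client on a utf8mb4 connection; the explicit utf8mb4_bin collation is an addition so that the comparison is exact rather than case- or accent-insensitive, and the column aliases are made up.

```sql
SET NAMES utf8mb4;
SET @tekst = 'Jantje zag € pruimen hangen';

SELECT
    # 1 if every character survives the round trip through latin1, else 0
    CONVERT(CONVERT(@tekst USING latin1) USING utf8mb4) COLLATE utf8mb4_bin = @tekst AS fits_latin1,
    # latin1 is a single-byte encoding, so 10 characters are also 10 bytes
    LEFT(CONVERT(@tekst USING latin1), 10) AS first_10_latin1_bytes;
```

Note that this limits the latin1 byte count; the same text can still be longer when measured in UTF-8 bytes.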

Better?

If Amazon is effectively using latin1, then you use latin1. That is, declare your string:

 for_amazon VARCHAR(10) CHARACTER SET latin1

and/or connect with SET NAMES latin1

or you could have a bigger field, then do LEFT(..., 10)

Either will provide the conversion (before storing versus while fetching) so that the bytes you provide to Amazon will be latin1.

Caveat: If you store Chinese (or Russian or Greek, etc) in the column, it will be messed up.
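As a rough sketch of the "declare the column as latin1" variant: the table name `amazon_export`, the `description` column and the sample text below are hypothetical; only the `for_amazon` declaration is taken from above. A UTF-8 client on a utf8mb4 connection is assumed.

```sql
SET NAMES utf8mb4;

# Hypothetical table; only the for_amazon declaration comes from the text above.
CREATE TABLE amazon_export (
    id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    description VARCHAR(2000) CHARACTER SET utf8mb4,  # full text
    for_amazon  VARCHAR(10)   CHARACTER SET latin1    # truncated copy, converted when stored
);

SET @d = 'Prüfgerät ø 12 mm';

# LEFT() truncates to 10 characters; MySQL converts to latin1 when storing.
INSERT INTO amazon_export (description, for_amazon) VALUES (@d, LEFT(@d, 10));

# In a latin1 column the character count and the byte count are the same.
SELECT for_amazon, CHAR_LENGTH(for_amazon) AS n_chars, LENGTH(for_amazon) AS n_bytes
FROM amazon_export;
```

Depending on the SQL mode, text that cannot be represented in latin1 is either rejected or stored with '?' substitutes, which is the mess the caveat warns about.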

Answer 3:

Thank you @Amadan & @Rick James! Thanks to your input, I was able to come up with a multibyte-safe byte-wise left function:

CREATE DEFINER=`root`@`localhost` FUNCTION `left_byte`(
    input_string text,
    input_position integer
) RETURNS text CHARSET utf8
BEGIN

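# Byte-wise left function
################################################################################
#
# * Multibyte-safe: backs off by up to 3 bytes, which covers characters of
#   up to 4 bytes (the maximum in UTF-8)
# * Note: MySQL's `utf8` is an alias for utf8mb3 (max 3 bytes per character);
#   use utf8mb4 throughout if 4-byte characters can occur
# * utf8 assumed to be the general encoding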

return 
ifnull
(
    ifnull
    (
        ifnull
        (
            convert(left(convert(input_string using binary), input_position) using utf8),
            convert(left(convert(input_string using binary), input_position-1) using utf8)
        ),
        convert(left(convert(input_string using binary), input_position-2) using utf8)
    ),
    convert(left(convert(input_string using binary), input_position-3) using utf8)
);    
END
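A quick way to try the function on the example string from the question is sketched below. It assumes the function was created in a database whose default character set is utf8 (the text parameter has no explicit character set, so it takes the database default) and that the client sends UTF-8. In the mysql command-line client the CREATE FUNCTION statement itself needs a changed DELIMITER.

```sql
SET NAMES utf8mb4;
SET @tekst = 'Jantje zag € pruimen hangen';

# Bytes 1-11 are "Jantje zag ", bytes 12-14 are "€", byte 15 is a space.
SELECT
    left_byte(@tekst, 15)         AS cut_15,    # expected: "Jantje zag € " (exactly 15 bytes)
    left_byte(@tekst, 13)         AS cut_13,    # 13 and 12 split "€", so expected: "Jantje zag "
    LENGTH(left_byte(@tekst, 13)) AS bytes_13;  # expected: 11
```

The expected values rely on the conversion returning NULL for an invalid byte sequence, as described in Answer 1.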

Source: https://stackoverflow.com/questions/51517927/mysql-select-first-10-bytes-of-a-string

Tags

mysql

stored-procedures

character-encoding

string-length

iso-8859-1