php mb_strlen return value is weird [closed]

Question

GB2312 is a double-byte character set. Calling mb_strlen() on a single Chinese character returns 2, but for two or more characters the result is sometimes weird. Does anybody know why? How can I get the right length?

<?php
header('Content-type: text/html;charset=utf-8');
$a="大";
echo mb_strlen($a,'gb2312'); // output 2
echo mb_strlen($a.$a,'gb2312'); // output 3, it should be 4
echo mb_strlen($a.'a','gb2312'); // output 2, it should be 3
echo mb_strlen('a'.$a,'gb2312'); // output 3
?>

Thanks, deceze. The article you linked, "What every programmer absolutely, positively needs to know about encodings and character sets to work with text", is very helpful; people who know as little about encodings as I do should read it.

Answer 1:

Your string is probably stored as UTF-8.

The UTF-8 code for "大" is E5 A4 A7 (according to this webpage), so:

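$a       // 3 bytes: one 2-byte pair + 1 leftover byte, counted as 2 chars in gb2312
$a . $a  // 6 bytes: three 2-byte pairs, counted as 3 chars in gb2312
$a . 'a' // 4 bytes: two 2-byte pairs, counted as 2 chars in gb2312
'a' . $a // 4 bytes: the first byte is < 128, so it is read as one
         // single-byte character; the rest is one pair + 1 leftover byte,
         // counted as 3 chars in gb2312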

This is just a guess, but it makes perfect sense to me when I think of it this way. You can probably refer to this Wikipedia page.
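The byte walk guessed at above can be sketched in plain PHP. The helper count_as_double_byte() is hypothetical and only illustrates the guess: it treats every byte below 0x80 as a single character and every other byte as the start of a two-byte pair. It is not how mbstring is actually implemented.

```php
<?php
// Hypothetical helper: counts "characters" the way the guess above describes.
function count_as_double_byte(string $bytes): int
{
    $count = 0;
    $i = 0;
    $n = strlen($bytes);
    while ($i < $n) {
        // ASCII bytes stand alone; any other byte is treated as the lead
        // byte of a two-byte pair, even if the pair runs past the end.
        $i += (ord($bytes[$i]) < 0x80) ? 1 : 2;
        $count++;
    }
    return $count;
}

$a = "\xE5\xA4\xA7"; // UTF-8 bytes of "大"
$counts = [
    count_as_double_byte($a),
    count_as_double_byte($a . $a),
    count_as_double_byte($a . 'a'),
    count_as_double_byte('a' . $a),
];
echo implode(' ', $counts), "\n"; // prints "2 3 2 3"
```

These are the same four numbers reported in the question, which is consistent with the string being UTF-8 bytes counted under GB2312 rules.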

If you really want to test this, I recommend creating a separate file saved in GB2312 encoding and reading it with fopen or similar. Then you can be sure the string is in the desired encoding.
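As an alternative to a separate file, a minimal sketch that produces real GB2312 bytes in code, assuming the mbstring extension is loaded. The UTF-8 bytes are written as escapes so the result does not depend on how this file itself is saved.

```php
<?php
$utf8 = "\xE5\xA4\xA7"; // "大" as UTF-8

// Convert to GB2312 so the bytes match the encoding passed to mb_strlen.
$gb = mb_convert_encoding($utf8, 'GB2312', 'UTF-8');

echo bin2hex($gb), "\n";            // b4f3
echo strlen($gb), "\n";             // 2 (bytes)
echo mb_strlen($gb, 'GB2312'), "\n"; // 1 (characters)
```

With matching bytes and encoding name, mb_strlen reports one character, while strlen reports the two bytes the question expected.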

Answer 2:

Try setting the mbstring internal encoding to UTF-8:

/* Set internal character encoding to UTF-8 */
mb_internal_encoding("UTF-8");

http://www.php.net/manual/en/function.mb-internal-encoding.php
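A short sketch of what that setting changes, assuming the string really is UTF-8: once the internal encoding is set, mb_strlen can be called without its second argument.

```php
<?php
// Set the default encoding used by mb_* functions when none is passed.
mb_internal_encoding('UTF-8');

$a = "\xE5\xA4\xA7"; // UTF-8 bytes of "大", written as escapes

$charCount = mb_strlen($a);          // no second argument: uses the internal encoding
$explicit  = mb_strlen($a, 'UTF-8'); // same result, stated explicitly

echo $charCount, ' ', $explicit, "\n"; // prints "1 1"
```

This only helps when the bytes are UTF-8; it does not convert anything, it just changes which encoding mb_strlen assumes by default.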

Answer 3:

I think you have to use UTF-8 instead of GB2312.

Try this:

<?php
header('Content-type: text/html;charset=utf-8');
// The outputs below assume this file is saved as UTF-8.
$a="大";
echo mb_strlen($a,'utf8'); // output 1
echo mb_strlen($a.$a,'utf8'); // output 2
echo mb_strlen($a.'a','utf8'); // output 2
echo mb_strlen('a'.$a,'utf8'); // output 2
?>

Answer 4:

When you write $a = "大"; in a PHP file, the variable $a contains whatever byte sequence sits between the quotes in your source code file. If that file was saved in UTF-8, the string is the UTF-8 byte sequence representing "大" (E5 A4 A7). If the file was saved in GB2312, it is the GB2312 byte sequence representing "大" (B4 F3). PHP only needs the source file to be in an ASCII-compatible encoding, and GB2312 in its usual EUC-CN form is one, so such a file still parses; what differs is which bytes end up in the string.

mb_strlen is supposed to give you the number of characters in the given string in the specified encoding. I.e. mb_strlen('大', 'gb2312') expects the string to be a GB2312 byte sequence representation and is supposed to return 1. You're wrong in expecting it to return 2, even if GB2312 is a double byte encoding. mb_strlen returns the number of characters.

strlen('大') would give you the number of bytes, because it's a naïve old-style function which doesn't know anything about encodings and only counts bytes.

The bottom line: your expectation was wrong, and you have a mismatch between what the "大" is actually encoded in (whatever you saved your source code as) and what you tell mb_strlen it is encoded in (gb2312). Therefore mb_strlen cannot do its job correctly and gives you results that look inconsistent from one string to the next.
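One way to catch this kind of mismatch early is to validate the bytes before counting. The sketch below assumes the mbstring extension; checked_mb_strlen() is a hypothetical helper, not part of mbstring.

```php
<?php
// Hypothetical helper: refuses to count characters when the bytes are
// not valid in the encoding the caller claims.
function checked_mb_strlen(string $s, string $encoding): int
{
    if (!mb_check_encoding($s, $encoding)) {
        throw new InvalidArgumentException("String is not valid $encoding");
    }
    return mb_strlen($s, $encoding);
}

$utf8 = "\xE5\xA4\xA7"; // "大" as UTF-8
$gb   = "\xB4\xF3";     // "大" as GB2312

echo checked_mb_strlen($utf8, 'UTF-8'), "\n"; // 1
echo checked_mb_strlen($gb, 'GB2312'), "\n";  // 1

try {
    checked_mb_strlen($gb, 'UTF-8'); // GB2312 bytes are not valid UTF-8
} catch (InvalidArgumentException $e) {
    echo $e->getMessage(), "\n";
}
```

A validity check is only a safety net: some byte sequences are valid in more than one encoding, so knowing how the data was actually saved remains the real fix.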

Source: https://stackoverflow.com/questions/13015317/php-mb-strlen-return-value-is-weird

Tags

php

strlen