PHP Multibyte String Functions

Posted by 做~自己de王妃 on 2019-12-29 08:46:09

Question


Today I ran into a problem with the PHP function strpos(): it returned FALSE even though the correct result was obviously 0. This was because one parameter was encoded in UTF-8, but the other (which originated from an HTTP GET parameter) obviously was not.

Now I have noticed that using the mb_strpos function solved my problem.

My question now is: is it wise to use the PHP multibyte string functions in general to avoid these problems in the future? Should I avoid the traditional strpos, strlen, ereg, etc. functions altogether?

Note: I don't want to set mbstring.func_overload globally in php.ini, because this leads to other problems when using the PEAR library. I am using PHP 4.


Answer 1:


It depends on the character encoding you are using. In a single-byte character encoding, or in UTF-8 (where a byte inside a multi-byte character can never be mistaken for another character), you can continue to use the regular string search functions, as long as the string you are searching in and the string you are searching for are in the same encoding.
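
As a minimal sketch of the UTF-8 case, assuming the mbstring extension is loaded (the sample strings are made up for illustration): strpos finds the needle correctly when both strings are UTF-8, but it reports a byte offset, whereas mb_strpos reports a character offset.

```php
<?php
// "héllo wörld" in UTF-8, written with byte escapes so that the encoding
// of the source file itself does not matter.
$haystack = "h\xC3\xA9llo w\xC3\xB6rld";
$needle   = "w\xC3\xB6rld";

// Byte-wise search: correct match, position counted in bytes ("é" is 2 bytes).
$bytePos = strpos($haystack, $needle);                 // int(7)

// Character-wise search: same match, position counted in characters.
$charPos = mb_strpos($haystack, $needle, 0, 'UTF-8');  // int(6)

var_dump($bytePos, $charPos);
```

Either result is fine as long as the offset is later used with functions of the same family (substr with strpos, mb_substr with mb_strpos).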

If you are using a multi-byte encoding other than UTF-8, one that does not prevent single bytes within a character from looking like other characters, then it is never safe to do a string search using the regular string search functions: you may get false positives. This is because PHP's string comparison in functions such as strpos works byte by byte. With the exception of UTF-8, which is specifically designed to prevent this problem, multi-byte encodings have the problem that any trailing byte of a multi-byte character may match part or all of a different character.
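
The false-positive case can be reproduced with Shift_JIS, where the katakana character ソ is encoded as the bytes 0x83 0x5C and the second byte is also the code for a backslash. A minimal sketch, assuming the mbstring extension is loaded with its usual Shift_JIS support:

```php
<?php
// The single character "ソ" in Shift_JIS: lead byte 0x83, trail byte 0x5C.
$haystack = "\x83\x5C";
// A backslash, which is the single byte 0x5C.
$needle = "\\";

// Byte-wise search matches the trail byte of "ソ": a false positive.
$byteSearch = strpos($haystack, $needle);                // int(1)

// Encoding-aware search sees one character, "ソ", and no backslash.
$charSearch = mb_strpos($haystack, $needle, 0, 'SJIS');  // bool(false)

var_dump($byteSearch, $charSearch);
```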

If the string you are searching in and the string you are searching for are in different character encodings, then conversion will always be necessary. Otherwise, for any search string that is represented differently in the other encoding, the search will always return false. You should do this conversion on input: decide on a character encoding your app will use, and be consistent within the application. Any time you receive input in a different encoding, convert it on the way in.
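
Converting on the way in might look like the following sketch. The helper name to_utf8 and the fallback encoding ISO-8859-1 are assumptions made for illustration, not part of the original question; the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper: returns $input as UTF-8. Anything that is not already
// valid UTF-8 is assumed to be in $assumedSource (ISO-8859-1 is a guess here;
// use the encoding your input really arrives in).
function to_utf8($input, $assumedSource = 'ISO-8859-1')
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input;
    }
    return mb_convert_encoding($input, 'UTF-8', $assumedSource);
}

$haystack  = "\xC3\xA9t\xC3\xA9"; // "été" in UTF-8
$rawNeedle = "\xE9";              // "é" in ISO-8859-1, e.g. from a GET parameter

// Mixed encodings: the byte 0xE9 never occurs in the UTF-8 haystack.
$before = strpos($haystack, $rawNeedle);          // bool(false)

// Same encoding on both sides: the match at position 0 is found.
$after = strpos($haystack, to_utf8($rawNeedle));  // int(0)

var_dump($before, $after);
```

The validity check is only a heuristic: some ISO-8859-1 byte sequences happen to be valid UTF-8 as well, so knowing the real source encoding is more reliable than guessing it.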




Answer 2:


There were some problems with the mb_* functions in PHP versions prior to 5.2, so if your code runs on multiple platforms with different PHP versions, strange behavior can occur. Furthermore, the mb_strpos function is rather slow: it has to skip the number of characters specified by the offset parameter to get the real byte position used internally. In loops that depend on strpos/mb_strpos, this can become a major bottleneck.
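
One possible way around the repeated offset skipping, for UTF-8 data only, is to search by bytes with strpos and translate byte offsets into character offsets incrementally. This is a sketch under assumptions, not a measured optimization: the function name utf8_find_all is hypothetical, both strings are assumed to be valid UTF-8, and it needs PHP 5.4 or later with mbstring.

```php
<?php
// Hypothetical helper: returns the character offsets of every non-overlapping
// occurrence of $needle in $haystack. Both are assumed to be valid UTF-8.
function utf8_find_all($haystack, $needle)
{
    $positions = [];
    if ($needle === '') {
        return $positions;
    }

    $searchFrom = 0; // byte offset where the next search starts
    $lastMatch  = 0; // byte offset of the previous match (or 0)
    $charPos    = 0; // character offset corresponding to $lastMatch

    while (($found = strpos($haystack, $needle, $searchFrom)) !== false) {
        // Count only the characters between the previous match and this one.
        $charPos += mb_strlen(substr($haystack, $lastMatch, $found - $lastMatch), 'UTF-8');
        $positions[] = $charPos;

        $lastMatch  = $found;
        $searchFrom = $found + strlen($needle);
    }

    return $positions;
}

// "éaéa" in UTF-8: "a" is the second and the fourth character.
$result = utf8_find_all("\xC3\xA9a\xC3\xA9a", "a");
var_dump($result); // offsets 1 and 3
```

Whether this is actually faster depends on the PHP version and the data, so it is worth profiling before adopting it.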




Answer 3:


If you use the same encoding everywhere it generally isn't a problem. I use UTF-8 for all my pages, and have never actually encountered this problem. In the end it really comes down to specifying the same encoding for the pages and the database.

For example:

header('Content-type: text/html;charset=utf-8');
mysql_query('SET NAMES utf8');

In most cases this means that all the data sources for the application will deliver data in the same encoding, and thus you'll avoid this kind of problem.

This will all be much better with the advent of PHP 6, by the way, since it will include full Unicode support.




Answer 4:


You don't necessarily have to use mb_strpos, but you do need to make sure that all the data in your app is the same: either an mb_string, or a plain string in one particular encoding. (Usually UTF-8.)

If you make sure your pages are UTF-8, and your form submissions are interpreted as UTF-8, and your database stores UTF-8, you'll generally be OK. Indexed string operations (in particular truncations) can break a UTF-8 sequence, which is annoying but not generally disastrous. If you do need that level of support, mb_strings are your only option (but of course you have to make sure that all parts of your app and libraries and PHP version can cope with them properly).
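
The truncation problem is easy to see in a small sketch, assuming the mbstring extension is loaded (the sample string is made up for illustration):

```php
<?php
// "café" in UTF-8: 4 characters, 5 bytes ("é" takes 2 bytes).
$text = "caf\xC3\xA9";

// Byte-based truncation cuts "é" in half and leaves a dangling lead byte.
$byteCut = substr($text, 0, 4);              // "caf\xC3"

// Character-based truncation keeps the sequence intact.
$charCut = mb_substr($text, 0, 4, 'UTF-8');  // "caf\xC3\xA9"

$byteCutValid = mb_check_encoding($byteCut, 'UTF-8'); // bool(false)
$charCutValid = mb_check_encoding($charCut, 'UTF-8'); // bool(true)

var_dump($byteCutValid, $charCutValid);
```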

Developing sites that handle Unicode correctly in PHP isn't too much fun right now: its Unicode support is very poor compared to languages like Python and .NET. It is hoped PHP6 will improve matters.




Answer 5:


I would recommend using the following PHP UTF-8 library:

http://sourceforge.net/projects/phputf8

Bundling it with your application loosens your application's requirements by not requiring the mbstring extension, but you still get UTF-8 string functions.



Source: https://stackoverflow.com/questions/661832/php-multibyte-string-functions
