Migrating a PHP application to handle UTF-8

混江龙づ霸主 submitted on 2019-11-28 12:48:59
Danack

There's a little more to it than just replacing those functions.

Regular expressions

You should add the u (UTF-8) modifier to every PCRE regular expression whose pattern or subject can contain non-ASCII characters, so that both are interpreted as characters rather than bytes.

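$subject = "Helló";
$pattern = '/(l|ó){2,3}/u'; // The u modifier makes PCRE treat the pattern and subject as UTF-8
// mb_substr counts characters, so the subject is never cut in the middle of a multibyte character
preg_match($pattern, mb_substr($subject, 3, null, 'UTF-8'), $matches, PREG_OFFSET_CAPTURE);
// Note: the offsets stored in $matches are still byte offsets, not character offsets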

You should also use the Unicode character classes rather than the standard Perl ones if you want your regular expressions to be correct for non-Latin alphabets:

  • \p{L} instead of \w for any 'letter' character.
  • \p{Z} instead of \s for any 'space' character.
  • \p{N} instead of \d for any 'number' character, e.g. Arabic-Indic digits such as ٣ (\p{Nd} matches decimal digits only)
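As a rough illustration of the points above, the sketch below compares a byte-oriented match with a /u match and then uses \p{L} and \p{Nd} on non-Latin text. The sample strings and variable names are made up for the example.

```php
<?php
// Sample text mixing Latin, Cyrillic and Greek letters (made up for this example).
$text = "Helló Привет αβγ 123";

// Without /u the dot matches one byte at a time; with /u it matches one character.
preg_match_all('/./', "ó", $byteMatches);
preg_match_all('/./u', "ó", $charMatches);
$byteCount = count($byteMatches[0]); // "ó" is two bytes in UTF-8
$charCount = count($charMatches[0]); // but a single character

// \p{L}+ with /u matches runs of letters in any script; "123" is skipped.
preg_match_all('/\p{L}+/u', $text, $wordMatches);
$letterRuns = $wordMatches[0];

// \p{Nd} matches decimal digits from any script, here Arabic-Indic digits.
$isAllDigits = preg_match('/^\p{Nd}+$/u', "٣٤٥") === 1;
```

Here $letterRuns ends up holding the three words and $isAllDigits is true, while the same digit string would not be matched by a plain [0-9] class.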

There are a lot of different Unicode character classes, some of which are quite unusual to someone used to reading and writing in a Latin alphabet. For example some characters combine with the previous character to make a new glyph. More explanation of them can be read here.

Although there are regular expression functions in the mbstring extension, they are not recommended for use. The standard PCRE functions work fine with the UTF8 flag.

Function replacements

Although your list is a start, the list of functions I have found so far that need to be replaced with multibyte versions is longer. This is the list of functions with their replacements, some of which are not defined in PHP but are available from here on GitHub as mb_extra.

$unsafeFunctions = array(
    'mail'      => 'mb_send_mail',
    'split'     => null, //'mb_split', deprecated function - just don't use it
    'stripos'   => 'mb_stripos',
    'stristr'   => 'mb_stristr',
    'strlen'    => 'mb_strlen',
    'strpos'    => 'mb_strpos',
    'strrpos'   => 'mb_strrpos',
    'strrchr'   => 'mb_strrchr',
    'strripos'  => 'mb_strripos',
    'strstr'    => 'mb_strstr',
    'strtolower'    => 'mb_strtolower',
    'strtoupper'    => 'mb_strtoupper',
    'substr_count'  => 'mb_substr_count',
    'substr'        => 'mb_substr',
    'str_ireplace'  => null,
    'str_split'     => 'mb_str_split', //TODO - check this works
    'strcasecmp'    => 'mb_strcasecmp', //TODO - check this works
    'strcspn'       => null, //TODO - implement alternative
    'strrev'        => 'mb_strrev', //TODO - check this works
    'strspn'        => null, //TODO - implement alternative
    'substr_replace'=> 'mb_substr_replace',
    'lcfirst'       => null,
    'ucfirst'       => 'mb_ucfirst',
    'ucwords'       => 'mb_ucwords',
    'wordwrap'      => null,
);
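To see why these replacements matter, here is a minimal sketch comparing a few byte-oriented functions with their mb_ counterparts. The helper example_mb_lcfirst is hypothetical, written only to show how one of the entries mapped to null could be filled in; it is not taken from mb_extra.

```php
<?php
$s = "Helló wörld";

// strlen counts bytes, mb_strlen counts characters.
$byteLength = strlen($s);
$charLength = mb_strlen($s, 'UTF-8');

// substr cuts at a byte offset and can split "ó" in half, leaving invalid UTF-8.
$byteCut = substr($s, 0, 5);
$charCut = mb_substr($s, 0, 5, 'UTF-8');
$byteCutIsValid = mb_check_encoding($byteCut, 'UTF-8');

// mb_strtoupper knows how to upper-case the accented letters.
$upper = mb_strtoupper($s, 'UTF-8');

// Hypothetical helper (an assumption for this example): a multibyte-aware lcfirst.
function example_mb_lcfirst(string $value): string
{
    $first = mb_substr($value, 0, 1, 'UTF-8');
    $rest  = mb_substr($value, 1, null, 'UTF-8');
    return mb_strtolower($first, 'UTF-8') . $rest;
}

$lowered = example_mb_lcfirst("Ödön");
```

Passing 'UTF-8' explicitly keeps the sketch independent of whatever mb_internal_encoding happens to be set to.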

MySQL

Although you would have thought that setting the character set to utf8 would give you full UTF-8 support in MySQL, it does not.

It only gives you support for UTF-8 characters that are encoded in up to 3 bytes, i.e. the Basic Multilingual Plane. However, people actively use characters that require 4 bytes to encode, such as most emoji, which live in the Supplementary Multilingual Plane.
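If you want to find out whether your data already contains such characters, something like the following sketch could help. The function name example_needs_utf8mb4 is hypothetical; it simply looks for code points above U+FFFF, which are exactly the ones that take 4 bytes in UTF-8.

```php
<?php
// Hypothetical helper: true if $s contains a character outside the
// Basic Multilingual Plane, i.e. one that needs 4 bytes in UTF-8.
function example_needs_utf8mb4(string $s): bool
{
    // preg_match returns false for invalid UTF-8, which is treated as "no" here.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $s) === 1;
}

$bmpOnly   = "Helló €";       // the euro sign is U+20AC and takes 3 bytes
$withEmoji = "Hi \u{1F600}";  // U+1F600 (grinning face) takes 4 bytes

$bmpResult   = example_needs_utf8mb4($bmpOnly);
$emojiResult = example_needs_utf8mb4($withEmoji);
$emojiBytes  = strlen("\u{1F600}");
```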

To support these you should in general use:

  • utf8mb4 - for your character encoding.
  • utf8mb4_unicode_ci - for your character collation.

For specific scenarios there are alternative collations that may be appropriate for you (utf8mb4_general_ci, for example, is faster but sorts some characters less accurately), but in general stick to the collation that is most correct.

The places where you should set the character set and collation in your MySQL config file are:

[mysql]
default-character-set=utf8mb4

[client]
default-character-set=utf8mb4

[mysqld]
init-connect='SET NAMES utf8mb4'
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci

The SET NAMES may not be required in all circumstances, but it is safer to leave it on, at only a small speed penalty.
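On the PHP side you can also ask for utf8mb4 explicitly when connecting, so the connection does not depend on the server defaults alone. A minimal sketch, assuming you connect through PDO; the helper name, host, database name and credentials are all placeholders.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that requests utf8mb4
// for the connection via the charset parameter.
function example_mysql_dsn(string $host, string $dbName): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=utf8mb4', $host, $dbName);
}

// Placeholder connection details, not real ones.
$dsn = example_mysql_dsn('localhost', 'example_db');

// Connecting is left commented out because it needs a running MySQL server:
// $pdo = new PDO($dsn, 'example_user', 'example_password', [
//     PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
// ]);
```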

PHP INI File

Although you said you have set mb_internal_encoding in your bootstrap script, it is much better to do this in the PHP ini file, and also set all the recommended parameters:

mbstring.language   = Neutral   ; Set default language to Neutral(UTF-8) (default)
mbstring.internal_encoding  = UTF-8 ; Set default internal encoding to UTF-8
mbstring.encoding_translation = On  ;  HTTP input encoding translation is enabled
mbstring.http_input     = auto  ; Set HTTP input character set detection to auto
mbstring.http_output    = UTF-8 ; Set HTTP output encoding to UTF-8
mbstring.detect_order   = auto  ; Set default character encoding detection order to auto
mbstring.substitute_character = none ; Do not print invalid characters
default_charset      = UTF-8 ; Default character set for auto content type header
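If you cannot be sure the ini file was actually picked up, a small check at bootstrap can report what PHP is really using. This is only a sketch; the function name example_encoding_report is hypothetical, and the explicit mb_internal_encoding call is there so the example behaves the same regardless of the local ini settings.

```php
<?php
// Hypothetical helper: collects the encoding settings PHP is currently using.
function example_encoding_report(): array
{
    return [
        'default_charset'         => ini_get('default_charset'),
        'mb_internal_encoding'    => mb_internal_encoding(),
        'mb_substitute_character' => mb_substitute_character(),
    ];
}

// Set explicitly for the example; with the ini settings above this is redundant.
mb_internal_encoding('UTF-8');

$report = example_encoding_report();
$internalIsUtf8 = $report['mb_internal_encoding'] === 'UTF-8';
```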

Helping browsers choose UTF-8 for forms

  • You need to set accept-charset on your forms to be UTF-8 to tell browsers to submit them as UTF8.

  • Add a UTF8 character to your form in a hidden field, to stop Internet Explorer (5, 6, 7 and 8) from submitting a form as something other than UTF8.
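Put together, the two points above might look like the sketch below. The helper names, the field names and the choice of a check mark as the hidden UTF-8 character are assumptions made for the example. The second helper is an extra safeguard I would add on the receiving side, since a browser can still send bytes that are not valid UTF-8.

```php
<?php
// Hypothetical helper: renders a form that asks the browser to submit UTF-8
// and carries a non-ASCII character (U+2713, a check mark) in a hidden field.
function example_render_form(string $action): string
{
    $action = htmlspecialchars($action, ENT_QUOTES, 'UTF-8');
    return '<form method="post" action="' . $action . '" accept-charset="UTF-8">'
        . '<input type="hidden" name="utf8" value="&#x2713;">'
        . '<input type="text" name="comment">'
        . '<input type="submit">'
        . '</form>';
}

// Hypothetical helper: checks that a submitted value really is valid UTF-8.
function example_is_valid_utf8(string $value): bool
{
    return mb_check_encoding($value, 'UTF-8');
}

$html      = example_render_form('/comments?page=1&sort=new');
$goodInput = example_is_valid_utf8("Helló");
$badInput  = example_is_valid_utf8("\xC3"); // a truncated two-byte sequence
```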

Misc

  • If you're using Apache set "AddDefaultCharset utf-8"

  • As you said you're doing, but just to remind anyone reading the answer, set the meta content-type as well in the header.

That should be about it. Although it's worth reading the "What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text" page, I think it is preferable to use UTF-8 everywhere and so not have to spend any mental effort on handling different character sets.
