These past few days I've been working toward converting my PHP code base from latin1 to UTF-8. I've read the two main solutions are to either replace the single byte funct
They aren't "necessary" unless you're using any of the functions they replace (and it's likely that you are using at least one of these) or otherwise explicitly need a feature of the extension such as HTTP handling.
When working towards UTF-8 compliance, I always fall back to the PHP UTF-8 Cheatsheet with one addition: PCRE patterns need to be updated to use the u modifier.
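To illustrate what the u modifier changes, here is a minimal sketch. The test string and variable names are my own, and the character is written as raw UTF-8 bytes so the result does not depend on how the file itself is saved.

```php
<?php
// "é" as its two UTF-8 bytes (0xC3 0xA9).
$e = "\xC3\xA9";

// Without /u, PCRE works byte by byte: the dot matches a single byte,
// so a two-byte character does not match "exactly one character".
$withoutU = preg_match('/^.$/', $e);

// With /u, pattern and subject are treated as UTF-8 and the dot
// matches one whole code point.
$withU = preg_match('/^.$/u', $e);

var_dump($withoutU); // int(0)
var_dump($withU);    // int(1)
```

The same applies to character classes and quantifiers: without u they operate on bytes, which can split a multibyte character in the middle.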
You could use the mbfunctions library that extends the multibyte functions in PHP:
http://code.google.com/p/mbfunctions/
thomasrutter indicates that searching does not need special handling. Other operations do, though: if you need to check the length of a UTF-8 string, I don't see how you can do that using plain strlen().
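A quick sketch of the difference, assuming the mbstring extension is loaded and that mbstring function overloading is not enabled (the sample string is just an illustration):

```php
<?php
// "héllo", with é written as its two UTF-8 bytes.
$s = "h\xC3\xA9llo";

$bytes = strlen($s);             // counts bytes
$chars = mb_strlen($s, 'UTF-8'); // counts characters

var_dump($bytes); // int(6)
var_dump($chars); // int(5)
```

Passing the encoding explicitly avoids relying on whatever mb_internal_encoding() happens to be set to.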
Functions such as mb_strtoupper may be necessary, too. strtoupper won't convert á to Á.
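For example, something along these lines (a sketch, assuming mbstring is available; what strtoupper does with non-ASCII bytes depends on the PHP version and, before PHP 8, on the locale, so I only show that it does not produce the right result):

```php
<?php
// "á" (U+00E1) and "Á" (U+00C1) as UTF-8 bytes.
$lower    = "\xC3\xA1";
$expected = "\xC3\x81";

// mb_strtoupper understands the encoding and maps á to Á.
$upper = mb_strtoupper($lower, 'UTF-8');

// strtoupper works on single bytes and has no idea that these two
// bytes form one character, so it cannot produce Á here.
$plain = strtoupper($lower);

var_dump($upper === $expected); // bool(true)
var_dump($plain === $expected); // bool(false)
```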
As far as I understand the issue, as long as all your data is 100% in UTF-8 - and that means user input, database, and also the encoding of the PHP files themselves if you have special characters in them - this is true for search and comparison operations. As @ntd points out, a non-multibyte strlen() will produce wrong results when run on a string that contains multibyte characters.
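A small sketch of what "works for search and comparison" means in practice, assuming both strings really are UTF-8 (the strings are made up for the example). The byte-based functions find the right substring, but any offset they return is in bytes, not characters:

```php
<?php
// "naïve café" and "café", with ï and é written as UTF-8 bytes.
$haystack = "na\xC3\xAFve caf\xC3\xA9";
$needle   = "caf\xC3\xA9";

// Byte-wise search still finds the needle, because a valid UTF-8
// sequence can only match at a real character boundary.
$bytePos = strpos($haystack, $needle);

// mb_strpos reports the position in characters instead.
$charPos = mb_strpos($haystack, $needle, 0, 'UTF-8');

// Comparison is a plain byte comparison and works as-is.
$same = ($needle === "caf\xC3\xA9");

var_dump($bytePos); // int(7)
var_dump($charPos); // int(6)
var_dump($same);    // bool(true)
```

So a found/not found check with strpos is fine, but don't feed its offset into a function that counts characters.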
This is a great article on the basics of encoding.
There are a number of functions that expect strings to be single byte (and some even presume that the encoding is ISO-8859-1). In these cases, you need to be aware of what you're doing and possibly use replacement functions. There is a fairly comprehensive list at: http://www.phpwact.org/php/i18n/utf-8
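substr is a typical case. A minimal sketch, assuming mbstring is loaded (the sample string is my own):

```php
<?php
// "été", with each é written as its two UTF-8 bytes.
$s = "\xC3\xA9t\xC3\xA9";

// substr cuts after one byte, leaving half a character behind.
$cut = substr($s, 0, 1);

// mb_substr cuts after one character.
$ok = mb_substr($s, 0, 1, 'UTF-8');

var_dump(mb_check_encoding($cut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($ok, 'UTF-8'));  // bool(true)
var_dump($ok === "\xC3\xA9");               // bool(true)
```

A string cut like $cut is no longer valid UTF-8, which tends to show up later as broken output or as errors in functions that validate their input.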