PHP source code in UTF-8 files; how to interpret properly?

I build tools to analyze source code. Such tools have to read the source code files correctly, especially as regards character encodings. For example, a tool must be able to answer: "What is the precise sequence of bytes in a string literal?" (for both PHP literals and HTML text).

My perhaps erroneous understanding is that PHP source files contain 8-bit characters only (that is, the PHP engine reads them that way, right?). But 8-bit characters in which encoding? I presume they are intended to match ISO-8859-1 (or ISO-8859-x?); can somebody quote chapter and verse? That is, an umlaut is intended to be an umlaut, right? Following this, one can straightforwardly write PHP scripts with HTML and strings for most European languages and character sets.

But it is clear this is problematic with Unicode. As far as I can tell, most PHP applications deal with Unicode essentially by having strings containing UTF-8 byte sequences which can be inserted in 8-bit PHP strings. Following this, one can generate scripts whose HTML contains Unicode UTF-8 sequences, if you tell your server you are generating UTF-8 text.

For the above situations, one can read the PHP file as 8-bit character text, and this seems to me to match the language.

What puzzles me are PHP source files encoded as UTF-8 (the Joomla package has ~1800 source files, of which some 10 are UTF-8 and the rest are not). Any non-ASCII European characters that show correctly in a UTF-8 rendering are actually encoded as multibyte sequences. I suppose such pages served as UTF-8 will have the HTML rendered correctly. But any string comparisons for European characters or other Unicode characters that apparently render correctly in a text editor simply won't work. And string literals will not contain what they appear to contain. Do programmers use UTF-8 files because that's what editors offer? Are they doing this on purpose? Or is it just an accident that doesn't matter for most work?

So, how should one read a PHP source file, and in particular, in what character encoding? One possible answer is: always as ISO-8859-1 8-bit codes, regardless of the actual content or BOMs (I see a lot of UTF-8 BOM-marked PHP files). Another answer is: as UTF-8, if so marked.

[Our tools read and write arbitrary encodings. A "trivial" tool reads a file in one character encoding and writes the identical code points in another encoding. Reading UTF-8 PHP files that way gets us into trouble when writing ISO-8859-1 equivalent files, because many Unicode code points (e.g., the euro symbol) cannot be encoded in ISO-8859-1.]
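To make the transcoding problem concrete, here is a minimal Python sketch of such a "trivial" tool. The function names `transcode` and `_encodable` are made up for this illustration, and replacing unencodable characters with `?` is just one possible policy, not a recommendation.

```python
from typing import List, Tuple


def _encodable(ch: str, encoding: str) -> bool:
    """Return True if the single character ch can be represented in encoding."""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def transcode(data: bytes, src: str, dst: str) -> Tuple[bytes, List[str]]:
    """Decode data as src, re-encode as dst, and report characters dst lacks.

    Unencodable characters are replaced with '?' (an assumed policy).
    """
    text = data.decode(src)
    lost = sorted({ch for ch in text if not _encodable(ch, dst)})
    return text.encode(dst, errors="replace"), lost


source = "<?php echo 'ü costs 5 €';".encode("utf-8")
converted, lost = transcode(source, "utf-8", "iso-8859-1")
# 'ü' survives as the single byte 0xFC; '€' has no ISO-8859-1 code and is lost.
```

The umlaut converts cleanly, but the euro sign ends up in `lost`, which is exactly the case where a byte-preserving strategy is safer than a code-point-preserving one.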

EDIT Aug 30: We now check PHP files to see if they have UTF-8 BOMs, or appear to contain only legal UTF-8 sequences. In either of these cases, we read the file as UTF-8; otherwise we read it as ISO-8859-1 by default. We now preserve the file encoding if we modify the file. (Getting all this right is quite a lot of work.) This seems to be a safe strategy, but it may differ from what PHP programmers expect.
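The strategy in this edit can be sketched roughly as follows in Python. The function name `guess_php_file_encoding` is hypothetical, and this sketch makes one assumption beyond the text: a file with a BOM whose remaining bytes are not valid UTF-8 falls back to ISO-8859-1.

```python
import codecs


def guess_php_file_encoding(data: bytes) -> str:
    """Guess how to read a PHP file: UTF-8 (with or without BOM) or ISO-8859-1."""
    # 'utf-8-sig' strips the BOM on decode and writes it back on encode,
    # so the BOM is preserved when the file is rewritten.
    candidate = "utf-8-sig" if data.startswith(codecs.BOM_UTF8) else "utf-8"
    try:
        data.decode(candidate)
    except UnicodeDecodeError:
        # Not legal UTF-8: treat every byte as one ISO-8859-1 character.
        return "iso-8859-1"
    return candidate


def rewrite_preserving_encoding(data: bytes) -> bytes:
    """Decode and re-encode with the guessed encoding (a no-op round trip here)."""
    encoding = guess_php_file_encoding(data)
    return data.decode(encoding).encode(encoding)
```

Note that a pure-ASCII file is reported as `utf-8`, which is harmless because ASCII bytes are identical in both encodings.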

TL;DR

ASCII

Until PHP 5.4, the PHP interpreter did not care at all about the charset of PHP files, as evidenced by the fact that the zend.script_encoding ini directive only appeared in that version. It basically always treated the file as ASCII.

When PHP needs to identify, for example, a function name, that happens to contain characters beyond ASCII-7bit (well, any labeled entity with any label really, but you get my point...), it merely looks for a function in the symbol table with the same byte sequence - an umlaut (or whatever...) written in one way would be treated differently than an umlaut written in another way. Try it. For backwards compatibility, if zend.script_encoding is not set, this is still the default behavior. Also take note of the regex showing what is a valid identifier, which you can see is charset neutral (well... except latin letters, which are in the ASCII-7bit range), but shows you bytes instead.
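A short Python sketch of what "same byte sequence" means for an analysis tool. The pattern below is modeled on the identifier regex in the PHP manual; the high-byte range `\x80-\xff` is my assumption about the exact bounds, so check it against the PHP version you target.

```python
import re

# Byte-oriented label pattern: ASCII letters, digits, underscore, and any high byte.
PHP_LABEL = re.compile(rb"[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*")

# The same visible name, saved by two editors with different encodings.
name_utf8 = "größe".encode("utf-8")         # 7 bytes
name_latin1 = "größe".encode("iso-8859-1")  # 5 bytes

both_are_labels = bool(PHP_LABEL.fullmatch(name_utf8)) and bool(
    PHP_LABEL.fullmatch(name_latin1)
)
same_symbol = name_utf8 == name_latin1
```

Both byte strings are acceptable labels, yet they are different keys in a byte-wise symbol table, so `same_symbol` is false.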

This leads us also to the declare(encoding) construct. If you see THAT in a file, that's the definitive charset to honor for that particular file (ONLY). Use something else until you encounter one, and if you see more than one - honor the second one after its declare statement.
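As a rough illustration of honoring declare(encoding) per file, here is a Python sketch that scans the raw bytes for such statements. It is regex-based and therefore naive: it would also match text inside comments or string literals, which a real tool would rule out with a proper lexer. The names `find_declared_encodings` and `encoding_at` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Matches e.g. declare(encoding='ISO-8859-1') on raw bytes, case-insensitively.
DECLARE_ENCODING = re.compile(
    rb"declare\s*\(\s*encoding\s*=\s*(['\"])([A-Za-z0-9._:-]+)\1\s*\)",
    re.IGNORECASE,
)


def find_declared_encodings(source: bytes) -> List[Tuple[int, str]]:
    """Return (end_offset, encoding_name) for each declare(encoding=...) found."""
    return [
        (m.end(), m.group(2).decode("ascii"))
        for m in DECLARE_ENCODING.finditer(source)
    ]


def encoding_at(source: bytes, offset: int, default: Optional[str] = None) -> Optional[str]:
    """Encoding in effect at a byte offset: the last declaration ending before it."""
    current = default
    for end, name in find_declared_encodings(source):
        if end > offset:
            break
        current = name
    return current
```

Before the first declaration, `encoding_at` returns the caller's default, which matches the "use something else until you encounter one" rule above.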

If there's none...

In a static context (i.e. when you don't know the effective ini settings), you'd need to fall back to something else (something that's user defined, ideally) when the charset is important, or otherwise just treat characters beyond ASCII-7bit as pure binary, and display them in some uniform code-point-like fashion.
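One way to get such a uniform display in Python, as a sketch: decode as ASCII and let the standard `backslashreplace` error handler render every other byte as a `\xNN` escape. The helper name `show_bytes` is made up for this example.

```python
def show_bytes(data: bytes) -> str:
    """Render ASCII bytes as-is and every byte above 0x7F as a \\xNN escape."""
    return data.decode("ascii", errors="backslashreplace")


utf8_literal = show_bytes("echo 'ü';".encode("utf-8"))
latin1_literal = show_bytes("echo 'ü';".encode("iso-8859-1"))
```

The UTF-8 version shows two escaped bytes and the ISO-8859-1 version shows one, so the reader sees exactly what the file contains. The rendering is for display only: a literal backslash already present in the source is not escaped, so the output cannot be reversed reliably.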

In a dynamic context (e.g. if you could rename the file for a moment, create a temporary file with that name in its place, have it echo the value of zend.script_encoding, and then restore the original file), you should use the zend.script_encoding value if available, and fall back to something else (just as in a static context) otherwise.

The same treatment applies to strings, HTML fragments and any other contents of a PHP file - it's just read as a binary string, except certain ASCII characters (i.e. bytes) that are important to the PHP lexer, such as the sequence "<?php" (notice that all are ASCII characters...); an apostrophe within a single quoted string; etc. - The interpreter itself doesn't care about a string's charset, and if you must display a string's contents on screen, you should use the above means to figure out the best way to do so.

Edge cases (requested in comments):

Is there a restriction on which encodings are allowed?

There doesn't seem to be any list of allowed encodings anywhere, or at least I can't find one. Given that this is the successor of the --enable-zend-multibyte compile setting, UTF encodings of all flavors are sure to be in that list. Even if other (ANSI) encodings don't have an effect on PHP itself, that shouldn't deter you from using that value as a hint.

How does "declare(encoding)" work if the source file is UTF-16 (null 8-bit bytes between 8-bit ASCII chars for the declaration)?

zend.script_encoding is used until a declare(encoding) is encountered. If it's not set, ASCII is assumed. This shouldn't be a problem even in a UTF-16 file... right? (I don't use UTF-16)

If the .ini or the file setting is UTF-8 or otherwise, then identifiers are presumably taken only from code points in range x41-xFF, but not from code points x100 up?

I haven't tried supplying invalid UTF-8 bytes, so I can't tell you the answer to that one, nor does the manual state anything on the question. I would assume that PHP execution will fail with a parse error on that. Or at least it should. As far as your tool is concerned, it should report the invalid UTF-8 sequence anyway, since even if PHP allows it, that's still a QA problem.

For UTF encodings, are characters in strings represented as their UTF code point (that makes no sense, since PHP strings seem to have only 8-bit characters)?

No. Characters in strings and non-PHP content are still treated as just a sequence of bytes, which you can confirm by looking at the output of strlen(), and seeing how it differs from mb_strlen(), which is the one that respects encoding (well... it respects the mbstring.internal_encoding setting to be exact, but still).
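For a tool written outside PHP, the same distinction can be reproduced in Python as a sketch: the length of the raw bytes plays the role of strlen(), and the length of the decoded text plays the role of mb_strlen() under an assumed UTF-8 internal encoding.

```python
# The bytes of a string literal as they would sit in a UTF-8 encoded PHP file.
literal = "Müller".encode("utf-8")

byte_length = len(literal)                  # what a byte-oriented count sees
char_length = len(literal.decode("utf-8"))  # what an encoding-aware count sees
```

Here `byte_length` is 7 and `char_length` is 6, because the umlaut occupies two bytes in UTF-8.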
If not, what does it mean to set the encoding to UTF something?

AFAIK, it affects lookups in the symbol table. With a UTF encoding set, umlauts written in different ways, or in different UTF flavors that end up as the same code points, would all converge on the same symbol, as opposed to the case without declare(encoding), where a byte-by-byte comparison is done instead. And I say "AFAIK" here because, frankly, I've never run such experiments myself... I'm a "do-goody 'everything-as-valid-UTF-8'-er".

Lorenz Meyer

As repeated already many times, PHP files do not have any encoding for bytes above x7f. All you can tell is that the bytes x00 to x7f are ascii.

A file with a BOM marker at the beginning is not valid PHP. So there is nothing like a PHP file in iso-8859-1 or utf-8. It is plain 8-bit.

A PHP file is not iso-8859-x, because those encodings do not contain all possible byte values. As you know x7f to x9f are not valid in iso-8859-1, but any PHP file can possibly contain them.

A PHP file is not utf-8 either, because it might contain invalid utf-8 sequences, without being invalid.
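A small Python sketch of what "plain 8-bit" implies for a reader. Python's `latin-1` codec maps every byte value 0x00 to 0xFF to the code point with the same number, so it can serve as a lossless container for arbitrary bytes even when they are not valid UTF-8. The helper name `is_valid_utf8` is made up here.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A PHP string literal holding arbitrary binary data, e.g. a key or a magic number.
source = b"<?php $key = '\xff\xfe\x80';"

valid = is_valid_utf8(source)           # these bytes are not legal UTF-8
as_text = source.decode("latin-1")      # one character per byte, never fails
round_trip = as_text.encode("latin-1")  # gives back exactly the original bytes
```

This is only a transport trick: the resulting characters are meaningful as text only if the author really did write the file in ISO-8859-1.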

The big picture

Charset by convention at writing

A PHP file can have an encoding by convention, but this is at the discretion of the programmer, who tells the editor that a given project is in utf-8, iso-8859-1, or something else.

But again, this is only a convention of the programmer. The editor treats the PHP file as if it were in such and such an encoding. The encoding merely serves the purpose of displaying the file in the editor so that the programmer can edit it.

No charset during compilation

As explained above, the compiler does not need to know the encoding the programmer assumed. The only thing that matters is what are the byte sequences in the file.

Implicit or explicit charset defined on consumption

PHP generates some data that is sent over the internet to the browser. At the time the browser displays the data, the encoding is definitely defined, but how?

The encoding can be defined in the HTTP header, like this: Content-Type: text/html; charset=utf-8
It can be defined in the HTML output itself: <meta charset="utf-8">
Or, if the charset is not defined explicitly, the browser makes an educated guess based on the byte sequences present in the document (e.g. valid utf-8 sequences or a BOM).
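The three sources above can be sketched as a lookup in Python. This is a simplification for an analysis tool, not a description of how any browser actually prioritizes them; the function names and the order of checks are assumptions that simply follow the list.

```python
import re
from email.message import Message
from typing import Optional

# Naive pattern for <meta charset="..."> on raw bytes.
META_CHARSET = re.compile(
    rb"<meta\s+charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def charset_from_header(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def effective_charset(content_type: Optional[str], body: bytes) -> Optional[str]:
    """Try the HTTP header, then a <meta charset>, then a UTF-8 BOM."""
    if content_type:
        charset = charset_from_header(content_type)
        if charset:
            return charset
    match = META_CHARSET.search(body[:1024])  # only look near the start
    if match:
        return match.group(1).decode("ascii").lower()
    if body.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    return None  # nothing declared: the consumer would have to guess
```

A return value of `None` corresponds to the last case in the list, where the browser is left to guess.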

Of course it is good practice for a PHP application never to let the browser choose, but there is no requirement that the encoding be defined anywhere.

More details

Normally, the encoding the programmer chooses will be the same which will be used at the end of the chain in the browser, and all strings in the PHP-files will use this same encoding.

But this need not be the case, and there are valid reasons why it may not be. Let's look at some examples:

Different languages, different encodings

I have used Joomla since version 1.0. In that version, each language file had its own encoding: the French files were iso-8859-1, the Arabic files windows-1256, and the Russian files koi8-r. For those files the encoding mattered, but not for all the other files, which could equally be treated as utf-8 or iso-8859-1. (Joomla has since switched to utf-8.)

Heterogeneous databases

One of our web applications connects to two different databases; one happens to be in utf-8, the other in windows-1252. This means that not all the strings in this project are in the same encoding. I use utf-8 as much as possible, but I need to translate the encodings back and forth using the mb_* group of functions in PHP.
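For readers following along outside PHP, the back-and-forth translation looks roughly like this in Python. The helper `convert_encoding` is a made-up analogue that mirrors the argument order of mb_convert_encoding (data, target, source); `cp1252` is Python's name for windows-1252.

```python
def convert_encoding(data: bytes, to_encoding: str, from_encoding: str) -> bytes:
    """Re-encode bytes from one charset to another; raises if a character is lost."""
    return data.decode(from_encoding).encode(to_encoding)


# A value as it might come out of the windows-1252 database: 0x80 is the euro sign.
from_legacy_db = b"Preis: 5 \x80"

as_utf8 = convert_encoding(from_legacy_db, "utf-8", "cp1252")
back_again = convert_encoding(as_utf8, "cp1252", "utf-8")
```

The single byte 0x80 becomes the three-byte UTF-8 sequence E2 82 AC, and converting back restores the original bytes.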

PHP's conversion functions

The mere presence of the encoding conversion functions mb_convert_encoding, iconv, utf8_encode, etc. suggests that strings in different encodings can be present in the same project.

Good practice

Define your encoding and stick to it! The best choice is utf-8. If strings in other encodings are needed, you can always write something like $s=mb_convert_encoding('Уровень','ucs-2','utf8');

Here again: you cannot use BOM markers in PHP. The reason is simple: a UTF-8 BOM is three bytes (EF BB BF) that come before the opening tag <?php. They are therefore sent to the browser as output. If one then tries to send a header(), PHP raises a "headers already sent" warning and the header is not sent (unless output buffering is active).
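A checker can flag this situation with a few lines of Python. This is a sketch under simplifying assumptions: it only looks for the lowercase `<?php` tag and ignores short tags such as `<?=`; the function name `leading_output` is hypothetical.

```python
import codecs


def leading_output(source: bytes) -> bytes:
    """Bytes before the first '<?php' tag, which would be emitted as output."""
    pos = source.find(b"<?php")
    return source if pos == -1 else source[:pos]


with_bom = codecs.BOM_UTF8 + b"<?php header('X-Test: 1');"
without_bom = b"<?php header('X-Test: 1');"

bom_leak = leading_output(with_bom)    # the UTF-8 BOM, EF BB BF
clean = leading_output(without_bom)    # nothing precedes the tag
```

Any non-empty result means output starts before the first PHP statement runs, whether it is a BOM, stray whitespace, or a blank line.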

Conclusion

In general, there is no need to determine the encoding of a PHP file. Only the encoding of the finally rendered HTML-file is important.
It is good practice to edit all files in the same encoding that is used to display the final results. But it really only matters for the language files (if you use any system of i18n at all).
While in practice all the strings in one file are in the same encoding, nothing would keep an ill-minded programmer from writing strings in different encodings in the same file and still getting a working program.

Finally, encoding in PHP is only a matter of the convention used at writing time and the charset used in the browser to render the page. In between, a PHP file has no specific encoding; it's just plain 8-bit.

Pekka 웃

There is really no way to reliably tell a PHP source file's encoding. It could be anything really. As you know, the only generic identifier is the BOM, but most people will remove those from their source files as they can cause trouble at output time.

How to deal with this depends on what you want to do. Usually, it doesn't matter because the PHP file will take care of declaring its encoding itself, e.g. by sending a Content-type header (or it is defined implicitly, e.g. because it's part of a project whose convention it is to use a certain encoding). The issue of encoding doesn't really come up because the file sorts it out itself at execution time.

If you're building a tool that manipulates or analyzes PHP source files in some form, chances are the encoding doesn't really matter, but we'd have to know more about your situation to assess that.

The way most IDEs deal with this uncertainty is they ask the developer to manually specify which encoding the project, folder, and / or file are in. Maybe that is an option for you as well.
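A tool could mimic that IDE behavior with a user-supplied table of encodings, resolved from the most specific path to the least specific. The sketch below is in Python; the function `configured_encoding`, the settings layout, and the example paths are all invented for illustration.

```python
from pathlib import PurePosixPath
from typing import Dict


def configured_encoding(path: str, settings: Dict[str, str], default: str) -> str:
    """Look up the encoding for a file: file entry, then nearest folder, then default."""
    p = PurePosixPath(path)
    for candidate in [p, *p.parents]:
        key = str(candidate)
        if key in settings:
            return settings[key]
    return default


# Hypothetical user configuration: project, folder, and single-file overrides.
settings = {
    "project": "utf-8",
    "project/language/ru": "koi8-r",
    "project/legacy/old.php": "iso-8859-1",
}

russian = configured_encoding("project/language/ru/site.php", settings, "iso-8859-1")
```

The file-level entry wins over the folder entry, which wins over the project entry, so a single odd file can be overridden without touching the rest.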

Source: https://stackoverflow.com/questions/17872046/php-source-code-in-utf-8-files-how-to-interpret-properly

Tags: php, utf-8, code-analysis, iso-8859-1