PHP source code in UTF-8 files; how to interpret properly?

Posted by 核能气质少年 on 2019-11-28 09:30:36

TL;DR

ASCII


Until PHP 5.4, the PHP interpreter did not care about the charset of PHP files at all, as evidenced by the fact that the zend.script_encoding ini directive only appeared in that version. It basically always treated the source as ASCII.

When PHP needs to identify, for example, a function name that happens to contain characters beyond 7-bit ASCII (or really any labeled entity, but you get my point), it merely looks for a function in the symbol table with the same byte sequence. An umlaut written in one way is therefore treated as a different name from an umlaut written in another way. Try it. For backwards compatibility, this is still the default behavior when zend.script_encoding is not set. Also take note of the regex in the manual showing what a valid identifier is: apart from the Latin letters, digits and underscore, which sit in the 7-bit ASCII range, it is charset neutral and is expressed in bytes rather than characters.
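The byte-wise lookup can be tried with a short sketch. The pattern below is adapted from the manual's description of valid labels (older manual versions start the high range at \x7f), and the helper names are made up for illustration. Variable variables are used because they let the name be supplied as a raw byte string.

```php
<?php
// Byte-oriented label pattern, adapted from the PHP manual's description of
// valid names. No charset is involved: the high range is just bytes 80..FF.
const LABEL_PATTERN = '/^[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*$/';

function isValidLabel(string $bytes): bool
{
    // No /u modifier, so PCRE compares single bytes rather than code points.
    return preg_match(LABEL_PATTERN, $bytes) === 1;
}

// Hypothetical helper: defines a variable under one name and looks it up
// under another. The names are plain byte strings.
function lookupFindsSameSymbol(string $definedAs, string $lookedUpAs): bool
{
    ${$definedAs} = true;
    return isset(${$lookedUpAs});
}

$precomposed = "\u{00FC}ber";   // U+00FC, UTF-8 bytes C3 BC
$decomposed  = "u\u{0308}ber";  // "u" + U+0308 combining diaeresis, bytes 75 CC 88

var_dump(isValidLabel($precomposed));                        // bool(true)
var_dump(isValidLabel($decomposed));                         // bool(true)
var_dump(lookupFindsSameSymbol($precomposed, $decomposed));  // bool(false)
```

Both spellings render as the same word in an editor, yet they are two different symbols because their bytes differ.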

This also leads us to the declare(encoding) construct. If you see THAT in a file, it is the definitive charset to honor for that particular file (and ONLY that file). Use something else until you encounter one, and if you see more than one, honor the second one from its declare statement onwards.
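For a tool, spotting these declarations could look roughly like the sketch below. It is a plain regex scan, not a real tokenizer, so it would also match text inside comments or strings, and it assumes the declaration itself is written in ASCII-compatible bytes. The function name is hypothetical.

```php
<?php
// Hypothetical helper for an analysis tool: list every declare(encoding=...)
// found in a source string, together with its byte offset.
function findEncodingDeclarations(string $source): array
{
    $pattern = '/declare\s*\(\s*encoding\s*=\s*([\'"])([^\'"]+)\1\s*\)/i';
    preg_match_all($pattern, $source, $matches, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $found = [];
    foreach ($matches as $match) {
        // With PREG_OFFSET_CAPTURE each group is a pair [text, byteOffset].
        $found[] = ['offset' => $match[0][1], 'encoding' => $match[2][0]];
    }
    return $found;
}

$source = "<?php\ndeclare(encoding='ISO-8859-1');\necho 'hi';\n";
print_r(findEncodingDeclarations($source));
```

The offsets let the tool apply each declared encoding only from the point where its declaration appears.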

If there's none...

In a static context (i.e. when you don't know the effective ini settings), you'd need to fall back to something else (ideally something user defined) when the charset is important, or otherwise just treat bytes beyond 7-bit ASCII as pure binary and display them in some uniform, code-point-like fashion.
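One possible "uniform fashion" is to leave 7-bit ASCII alone and print every other byte as a hex escape. A minimal sketch, with a made-up function name:

```php
<?php
// Hypothetical display helper: keep 7-bit ASCII as is and show every
// other byte as \xNN, without assuming any charset.
function escapeNonAscii(string $bytes): string
{
    return preg_replace_callback(
        '/[\x80-\xff]/',
        static fn(array $m): string => sprintf('\\x%02X', ord($m[0])),
        $bytes
    );
}

echo escapeNonAscii("na\xC3\xAFve"), "\n"; // prints na\xC3\xAFve
```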

In a dynamic context (e.g. if you could rename the file for a moment, create a temporary file with that name in its place, have it echo the value of zend.script_encoding, and then restore the original file), you should use the zend.script_encoding value if available, and fall back to something else (just as in a static context) otherwise.
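The order of precedence described so far can be written down as a small resolver. This is only a sketch: the function name and the convention that null means "treat high bytes as binary" are assumptions. Note that ini_get() reports the setting of the interpreter running the tool, which may differ from the one that serves the files.

```php
<?php
// Hypothetical resolver: a declare(encoding) value wins, then the
// zend.script_encoding ini value, then a caller-supplied fallback.
// A null result means "no charset known, treat high bytes as binary".
function resolveSourceEncoding(?string $declared, ?string $iniValue, ?string $fallback): ?string
{
    foreach ([$declared, $iniValue, $fallback] as $candidate) {
        if ($candidate !== null && $candidate !== '') {
            return $candidate;
        }
    }
    return null;
}

// In a dynamic context the ini value can be read from the running interpreter.
$ini = ini_get('zend.script_encoding');   // false if the directive is unknown
$iniValue = is_string($ini) ? $ini : null;

var_dump(resolveSourceEncoding(null, $iniValue, 'UTF-8'));
```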

The same treatment applies to strings, HTML fragments and any other contents of a PHP file. Everything is read as a binary string, except for certain ASCII characters (i.e. bytes) that matter to the PHP lexer, such as the sequence "<?php" (notice that these are all ASCII characters), an apostrophe within a single quoted string, etc. The interpreter itself doesn't care about a string's charset, and if you must display a string's contents on screen, you should use the above means to figure out the best way to do so.


Edge cases (requested in comments):

  1. Is there a restriction on which encodings are allowed?

    There doesn't seem to be any list of allowed encodings anywhere, or at least I can't find one. Given that this is the successor of the --enable-zend-multibyte compile setting, UTF encodings of all flavors are sure to be in that list. Even if other (ANSI) encodings don't have an effect on PHP itself, that shouldn't deter you from using that value as a hint.

  2. How does "declare(encoding)" work if the source file is UTF-16 (null 8 bit bytes between 8 bit ascii chars for the declaration)?

    zend.script_encoding is used until a declare(encoding) is encountered. If it's not set, ASCII is assumed. This shouldn't be a problem even in a UTF-16 file... right? (I don't use UTF-16)

  3. If the .ini or the file setting is UTF-8 or otherwise, then identifiers are presumably taken only from code points in range x41-xFF, but not from code points x100 up?

    I haven't tried supplying invalid UTF-8 bytes to tell you the answer to that one, nor does the manual ever state anything on the question. I would assume that PHP execution will fail with a parse error on that. Or at least it should. As far as your tool is concerned, it should report the invalid UTF-8 sequence anyway, since even if PHP allows it, that's still a QA problem.

  4. For UTF encodings, are characters in strings represented as their UTF code point (that makes no sense since PHP strings seem to have only 8 bit characters)?

    No. Characters in strings and non-PHP content are still treated as just a sequence of bytes, which you can confirm by looking at the output of strlen(), and seeing how it differs from mb_strlen(), which is the one that respects encoding (well... it respects the mbstring.internal_encoding setting to be exact, but still).

  5. If not, what does it mean to set the encoding to UTF something?

    AFAIK, it affects lookups in the symbol table. With UTF set, umlauts written in different ways, or in different UTF flavors that end up with the same UTF code points, would all converge on the same symbol, as opposed to without declare(encoding), where a byte-by-byte comparison is done instead. And I say "AFAIK" here because, frankly, I've never run such experiments myself... I'm a do-goody "everything as valid UTF-8" kind of person.
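The difference between byte counts and character counts mentioned under point 4 can be checked directly. This sketch assumes the mbstring extension is loaded and passes the encoding explicitly instead of relying on an ini setting.

```php
<?php
// Assumes the mbstring extension is loaded.
// "\u{...}" escapes always produce UTF-8 bytes, whatever the file's own encoding.
$word = "\u{00FC}ber";   // "über": the umlaut takes two bytes in UTF-8

$byteCount = strlen($word);              // counts bytes
$charCount = mb_strlen($word, 'UTF-8');  // counts code points in the named encoding

printf("%d bytes, %d characters\n", $byteCount, $charCount); // 5 bytes, 4 characters
```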

Lorenz Meyer

As repeated already many times, PHP files do not have any encoding for bytes above x7f. All you can tell is that the bytes x00 to x7f are ascii.

A file with a BOM at the beginning is not usable PHP in practice: the BOM bytes precede the opening tag and are emitted as output. So there is no such thing as a PHP file in iso-8859-1 or utf-8. It is plain 8-bit.

A PHP file is not iso-8859-x, because those encodings do not contain all possible byte values. As you know x7f to x9f are not valid in iso-8859-1, but any PHP file can possibly contain them.

A PHP file is not utf-8 either, because it might contain invalid utf-8 sequences and still be perfectly valid PHP.
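A tool can test whether a given run of bytes happens to be well-formed UTF-8. One way, sketched here with a made-up function name, relies on the /u modifier of the PCRE functions, which makes preg_match() return false for malformed input.

```php
<?php
// Hypothetical check for an analysis tool. With the /u modifier PCRE validates
// the subject as UTF-8, and preg_match() returns false when it is malformed.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

$latin1Literal = "caf\xE9";      // a legal PHP string, but a lone E9 is not UTF-8
$utf8Literal   = "caf\xC3\xA9";  // the same word encoded as UTF-8

var_dump(isValidUtf8($latin1Literal)); // bool(false)
var_dump(isValidUtf8($utf8Literal));   // bool(true)
```

PHP accepts both literals without complaint. Only the tool that wants to display them has to care.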

The big picture

Charset by convention at writing

A PHP file can have an encoding by convention, but this is at the discretion of the programmer, who tells the editor that a given project is in utf-8, iso-8859-1 or something else.

But again, this is only a convention of the programmer. The editor is treating the PHP file as if it were in such and such an encoding. The encoding merely serves the purpose of displaying the file in the editor and allowing the programmer to edit it.

No charset during compilation

As explained above, the compiler does not need to know the encoding the programmer assumed. The only thing that matters is what are the byte sequences in the file.

Implicit or explicit charset defined on consumption

PHP generates some data that is sent over the internet to the browser. By the time the browser displays the data, the encoding is definitely defined, but how?

  • The encoding can be defined in the HTTP header, like this Content-Type: text/html; charset=utf-8
  • It can be defined in the HTML output itself: <meta charset="utf-8">
  • Or if the charset is not defined explicitly, the browser makes an educated guess depending on the byte sequences present in the document (e.g. valid utf-8 sequences or BOM).

Of course it is good practice for a PHP application never to let the browser choose, but there is no requirement that the encoding be defined anywhere.
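Declaring the charset explicitly, both in the HTTP header and in the markup, might look like the following sketch. The helper name is made up, and the header has to be sent before any output.

```php
<?php
// Hypothetical helper: build the Content-Type header line for an HTML response.
function htmlContentType(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// Send it explicitly, before any output, so the browser does not have to guess.
if (!headers_sent()) {
    header(htmlContentType('utf-8'));
}
echo '<!DOCTYPE html><html><head><meta charset="utf-8"><title>Demo</title></head><body>ok</body></html>';
```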

More details

Normally, the encoding the programmer chooses will be the same which will be used at the end of the chain in the browser, and all strings in the PHP-files will use this same encoding.

But this need not be the case, and there are valid reasons why it may not be. Let's look at some examples:

Different languages, different encodings

I have used Joomla since its version 1.0. In that version, each language file had its own encoding. The French language files were iso-8859-1, while the Arabic files were windows-1256 and the Russian files koi8-r. For those files the encoding mattered, but not for all the other files, which could equally be treated as utf-8 or iso-8859-1. (Meanwhile, Joomla has switched to utf-8.)

Heterogeneous databases

One of our web applications connects to two different databases; one happens to be in utf-8, the other in windows-1252. This means that not all the strings in this project are in the same encoding. I use utf-8 as much as possible, but I need to translate the encodings back and forth using the mb_* group of functions in PHP.
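Such a back-and-forth translation could be wrapped in two small helpers. The function names are illustrative and the sketch assumes the mbstring extension is loaded.

```php
<?php
// Assumes the mbstring extension is loaded. Function names are illustrative.
function fromLegacyDb(string $bytes): string
{
    // Bytes read from the windows-1252 database become UTF-8.
    return mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
}

function toLegacyDb(string $utf8): string
{
    // UTF-8 strings are converted back before being written to that database.
    return mb_convert_encoding($utf8, 'Windows-1252', 'UTF-8');
}

$fromDb = "caf\xE9";            // "café" as stored in the windows-1252 database
$utf8   = fromLegacyDb($fromDb);
echo bin2hex($utf8), "\n";      // 636166c3a9
```

Characters that do not exist in windows-1252 cannot survive the trip back, so in practice the conversion is best kept at the database boundary.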

PHP's conversion functions

The mere presence of the encoding conversion functions mb_convert_encoding, iconv, utf8_encode (deprecated as of PHP 8.2), etc. suggests that strings in different encodings can be present in the same project.

Good practice

Define your encoding and stick to it! The best choice is utf-8. If strings in other encodings are needed, you can always write something like $s=mb_convert_encoding('Уровень','ucs-2','utf8');

Here again: you cannot use a BOM in PHP files. The reason is simple: a BOM is a short byte sequence (three bytes, EF BB BF, in utf-8) that comes before the opening tag <?php. These bytes are therefore sent to the browser as output. If one then tries to send a header() afterwards, PHP raises a "headers already sent" warning and the header is not sent (unless output buffering is active).
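Checking a file's contents for a leading UTF-8 BOM, and removing it, takes only a few lines. This is a sketch with made-up helper names. It covers the UTF-8 BOM only.

```php
<?php
// The UTF-8 byte order mark: three bytes, EF BB BF.
const UTF8_BOM = "\xEF\xBB\xBF";

function hasUtf8Bom(string $bytes): bool
{
    return strncmp($bytes, UTF8_BOM, 3) === 0;
}

function stripUtf8Bom(string $bytes): string
{
    return hasUtf8Bom($bytes) ? substr($bytes, 3) : $bytes;
}

$withBom = UTF8_BOM . "<?php echo 'hi';";
var_dump(hasUtf8Bom($withBom));                  // bool(true)
var_dump(hasUtf8Bom(stripUtf8Bom($withBom)));    // bool(false)
```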

Conclusion

  • In general, there is no need to determine the encoding of a PHP file. Only the encoding of the finally rendered HTML-file is important.
  • It is good practice to edit all files in the same encoding that is used to display the final results. But it really only matters for the language files (if you use any system of i18n at all).
  • While in practice all the strings in one file are in the same encoding, nothing would keep an ill-intentioned programmer from writing strings in different encodings in the same file and still getting a working program.

Finally, encoding in PHP is only a matter of the convention used at writing time and the charset used in the browser to render the page. In between, a PHP file has no specific encoding; it's just plain 8-bit.

Pekka 웃

There is really no way to reliably tell a PHP source file's encoding. It could be anything really. As you know, the only generic identifier is the BOM, but most people will remove those from their source files as they can cause trouble at output time.

How to deal with this depends on what you want to do. Usually, it doesn't matter because the PHP file will take care of declaring its encoding itself, e.g. by sending a Content-type header (or it is defined implicitly, e.g. because it's part of a project whose convention it is to use a certain encoding). The issue of encoding doesn't really come up because the file sorts it out itself at execution time.

If you're building a tool that manipulates or analyzes PHP source files in some form, chances are the encoding doesn't really matter, but we'd have to know more about your situation to assess that.

The way most IDEs deal with this uncertainty is they ask the developer to manually specify which encoding the project, folder, and / or file are in. Maybe that is an option for you as well.
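A tool could offer the same kind of manual setting. The sketch below assumes a hypothetical configuration that maps path prefixes (project, folder or file) to encoding names, with the most specific entry winning.

```php
<?php
// Hypothetical lookup: $overrides maps path prefixes (project, folder or file)
// to an encoding name. The longest matching prefix wins.
function encodingForPath(array $overrides, string $path, string $default): string
{
    $best = $default;
    $bestLength = -1;
    foreach ($overrides as $prefix => $encoding) {
        $prefix = (string) $prefix;
        if (strncmp($path, $prefix, strlen($prefix)) === 0 && strlen($prefix) > $bestLength) {
            $best = $encoding;
            $bestLength = strlen($prefix);
        }
    }
    return $best;
}

$overrides = [
    'project/'                 => 'UTF-8',
    'project/lang/ru/'         => 'KOI8-R',
    'project/lang/fr/main.php' => 'ISO-8859-1',
];
echo encodingForPath($overrides, 'project/lang/ru/menu.php', 'UTF-8'), "\n"; // KOI8-R
```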
