Exotic names for methods, constants, variables and fields - Bug or Feature?

死守一世寂寞 2020-11-28 07:26

After some confusion in the comments to

  • Is it safe to have 1 letter class names in PHP, e.g A, B, C

I thought I would make this into a question. According

4 Answers
  • 2020-11-28 07:27

    Your character is encoded as 0x80 0x90 0xe2 or something like that, so it matches your regexp when the name is read as single bytes rather than interpreted as Unicode.

  • 2020-11-28 07:29

    From the official documentation:

    The class name can be any valid label, provided it is not a PHP reserved word. A valid class name starts with a letter or underscore, followed by any number of letters, numbers, or underscores. As a regular expression, it would be expressed thus: ^[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*$.
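    To see why this pattern lets multi-byte characters through, here is a minimal sketch that applies it without the /u modifier, so the match works on single bytes. The helper name is_valid_label is made up for this example and is not a PHP built-in; the check also ignores the reserved-word rule.

```php
<?php
// Hypothetical helper: checks a name against the pattern quoted from the manual.
// No /u modifier, so PCRE compares single bytes, not Unicode code points.
function is_valid_label(string $name): bool
{
    return preg_match('/^[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*$/', $name) === 1;
}

var_dump(is_valid_label('foo_1'));         // bool(true)
var_dump(is_valid_label('1foo'));          // bool(false): starts with a digit
var_dump(is_valid_label("\xE2\x80\x90"));  // bool(true): UTF-8 bytes of U+2010, all >= 0x80
var_dump(is_valid_label('a-b'));           // bool(false): ASCII hyphen 0x2d is not in the class
```

    Any byte from 0x80 upwards is accepted, so every lead and trail byte of a UTF-8 sequence passes, whatever character it belongs to.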

  • 2020-11-28 07:45

    From my understanding, the current versions of PHP have some unicode support, but it is inconsistent. As others have suggested, this was going to be addressed in PHP6, which was canceled (not postponed). At the end of the day, some "exotic" characters will work, and others won't; and obviously, as you suggested, it is better to stick with A-Za-z0-9_.

    At the same time, I have heard rumors that the unicode discussion was recently restarted, presumably from scratch, as the original proposal for UTF-16 in PHP6 involved tons of effort with very little return.

    Side note: From what I have read, the next major PHP release will be PHP 5.4, which might feature horizontal reuse (traits), array shorthand, a built-in HTTP server, and some other much-needed functionality.

    http://www.mail-archive.com/internals@lists.php.net/msg35720.html

  • 2020-11-28 07:50

    This question mentions class names in the title, but then goes on to an example that includes exotic names for methods, constants, variables, and fields. There are actually different rules for these. Let's start with the case-insensitive ones.

    Case-insensitive identifiers (class and function/method names)

    The general guideline here would be to use only printable ASCII characters. The reason is that these identifiers are normalized to their lowercase version; however, this conversion is locale-dependent. Consider the following PHP file, encoded in ISO-8859-1:

    <?php
    function func_á() { echo "worked"; }
    func_Á();
    

    Will this script work? Maybe. It depends on what tolower(193) will return, which is locale-dependent:

    $ LANG=en_US.iso88591 php a.php
    worked
    $ LANG=en_US.utf8 php a.php
    
    Fatal error: Call to undefined function func_Á() in /home/glopes/a.php on line 3
    

    Therefore, it's not a good idea to use non-ASCII characters. However, even ASCII characters may give trouble in some locales (for example, in a Turkish locale the lowercase of "I" is not "i"). See this discussion. It's likely that this will be fixed in the future by doing a locale-independent lowercasing that only works with ASCII characters.

    In conclusion, if we use multi-byte encodings for these case-insensitive identifiers, we're looking for trouble. It's not just that we can't take advantage of the case insensitivity. We might actually run into unexpected collisions because all the bytes that compose a multi-byte character are individually turned into lowercase using locale rules. It's possible that two different multi-byte characters map to the same modified byte stream representation after applying the locale lowercase rules to each of the bytes.
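    The per-byte lowercasing can be sketched as follows. latin1_lower_bytes is a hypothetical stand-in for what a per-byte tolower() would do under an ISO-8859-1 locale; it is not how the engine is actually implemented, and it only shows the loss of case insensitivity, not a collision.

```php
<?php
// Hypothetical stand-in for a per-byte tolower() under an ISO-8859-1 locale.
function latin1_lower_bytes(string $name): string
{
    $out = '';
    for ($i = 0, $n = strlen($name); $i < $n; $i++) {
        $b = ord($name[$i]);
        $isAsciiUpper  = $b >= 0x41 && $b <= 0x5A;                 // A-Z
        $isLatin1Upper = $b >= 0xC0 && $b <= 0xDE && $b !== 0xD7;  // uppercase letters, skipping 0xD7
        $out .= chr(($isAsciiUpper || $isLatin1Upper) ? $b + 0x20 : $b);
    }
    return $out;
}

// ISO-8859-1 source: "Á" is 0xC1 and "á" is 0xE1, so both names fold to the same bytes.
var_dump(latin1_lower_bytes("func_\xC1") === latin1_lower_bytes("func_\xE1")); // bool(true)

// UTF-8 source: "Á" is 0xC3 0x81 and "á" is 0xC3 0xA1. Only the lead byte is changed.
echo bin2hex(latin1_lower_bytes("\xC3\x81")), "\n"; // e381
echo bin2hex(latin1_lower_bytes("\xC3\xA1")), "\n"; // e3a1
```

    With the UTF-8 bytes, the two spellings still differ after lowercasing, and the result is no longer the original character at all. Which bytes get rewritten depends entirely on the locale's table.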

    Case-sensitive identifiers (variables, constants, fields)

    The problem is less serious here, since these identifiers are case sensitive. However, they are just interpreted as bytestreams. This means that if we use Unicode, we must consistently use the same byte representation; we can't mix UTF-8 and UTF-16; we also can't use BOMs.

    In fact, among the Unicode encodings, we must stick to UTF-8. Outside of the ASCII range, UTF-8 uses lead bytes from 0xc2 to 0xf4 (0xc0 to 0xfd in the original definition, before RFC 3629 restricted it) and the trail bytes are in the range 0x80 to 0xbf, all of which are in the allowed range per the manual. Now let's say we use the character "Ġ" in a UTF-16BE encoded file. This will translate to 0x01 0x20, so the second byte will be interpreted as a space.
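    A small sketch of that comparison, checking each byte of "Ġ" against the byte class from the manual's pattern. The helper bytes_outside_label_range is invented for this illustration.

```php
<?php
// Hypothetical helper: lists the bytes that the manual's label pattern
// would reject at any position inside an identifier.
function bytes_outside_label_range(string $bytes): array
{
    $bad = [];
    foreach (str_split($bytes) as $byte) {
        if (preg_match('/^[a-zA-Z0-9_\x80-\xff]$/', $byte) !== 1) {
            $bad[] = sprintf('0x%02x', ord($byte));
        }
    }
    return $bad;
}

$utf8    = "\xC4\xA0";         // "Ġ" (U+0120) encoded as UTF-8
$utf16be = pack('n', 0x0120);  // the same character as UTF-16BE: 0x01 0x20

var_dump(bytes_outside_label_range($utf8));    // empty array: both bytes are >= 0x80
var_dump(bytes_outside_label_range($utf16be)); // 0x01 and 0x20 are both rejected
```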

    Having multi-byte characters being read as if they were single-byte characters is, of course, no Unicode support at all. PHP does have some multi-byte support in the form of the compilation switch "--enable-zend-multibyte" (as of PHP 5.4, multibyte support is compiled in by default, but disabled; you can enable it with zend.multibyte=On in php.ini). This allows you to declare the encoding of the script:

    <?php
    declare(encoding='ISO-8859-1');
    // code here
    ?>
    

    It will also handle BOMs, which are used to auto-detect the encoding and do not become part of the output. There are, however, a few downsides:

    • Performance hit, both memory and CPU. It stores a representation of the script in an internal multi-byte encoding, which takes more space (it also seems to keep the original version in memory), and it spends some CPU time converting the encoding.
    • Multi-byte support is usually not compiled in (or, as of PHP 5.4, not enabled), so it's less tested (more bugs).
    • Portability issues between installations that have the support compiled in and those that don't.
    • Refers only to the parsing stage; does not solve the problem outlined for case-insensitive identifiers.

    Finally, there is the problem of lack of normalization – the same character may be represented with different Unicode code points (independently of the encoding). This may lead to some very difficult to track bugs.
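    As a rough illustration of the normalization problem, the sketch below uses variable variables so that the raw byte strings serve as variable names. The word "café" and the variable names are just examples.

```php
<?php
$composed   = "caf\xC3\xA9";   // "café" using U+00E9 (precomposed e with acute)
$decomposed = "cafe\xCC\x81";  // "café" using "e" followed by U+0301 (combining acute)

// Variable variables: the byte string becomes the variable's name.
$$composed = 'set via the composed form';

var_dump($composed === $decomposed); // bool(false): the byte sequences differ
var_dump(isset($$composed));         // bool(true)
var_dump(isset($$decomposed));       // bool(false): PHP sees a different variable
```

    Both names look the same in most editors, yet PHP treats them as two separate identifiers because it compares bytes only.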
