Allowed characters for CSS identifiers

后端 未结 3 578
伪装坚强ぢ
伪装坚强ぢ 2020-11-27 15:54

What are the (full) valid / allowed charset characters for CSS identifiers id and class?

Is there a regular expression tha

相关标签:
3条回答
  • 2020-11-27 16:39

    This is merely a contribution to @BalusC answer. It is the PHP version of the Java code he provided, I converted it and I thought someone else could find it helpful.

    $h = "[0-9a-f]";
    $unicode = str_replace( "{h}", $h, "\{h}{1,6}(\r\n|[ \t\r\n\f])?" );
    $escape = str_replace( "{unicode}", $unicode, "({unicode}|\[^\r\n\f0-9a-f])");
    $nonascii = "[\240-\377]";
    $nmchar = str_replace( array( "{nonascii}", "{escape}" ), array( $nonascii, $escape ), "([_a-z0-9-]|{nonascii}|{escape})");
    $nmstart = str_replace( array( "{nonascii}", "{escape}" ), array( $nonascii, $escape ), "([_a-z]|{nonascii}|{escape})" );
    $ident = str_replace( array( "{nmstart}", "{nmchar}" ), array( $nmstart, $nmchar ), "-?{nmstart}{nmchar}*");
    
    
    echo $ident; // The full regex.
    
    0 讨论(0)
  • 2020-11-27 16:40

    The charset doesn't matter. The allowed characters matters more. Check the CSS specification. Here's a cite of relevance:

    In CSS, identifiers (including element names, classes, and IDs in selectors) can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code (see next item). For instance, the identifier "B&W?" may be written as "B\&W\?" or "B\26 W\3F".

    Update: As to the regex question, you can find the grammar here:

    ident      -?{nmstart}{nmchar}*
    

    Which contains of the parts:

    nmstart    [_a-z]|{nonascii}|{escape}
    nmchar     [_a-z0-9-]|{nonascii}|{escape}
    nonascii   [\240-\377]
    escape     {unicode}|\\[^\r\n\f0-9a-f]
    unicode    \\{h}{1,6}(\r\n|[ \t\r\n\f])?
    h          [0-9a-f]
    

    This can be translated to a Java regex as follows (I only added parentheses to parts containing the OR and escaped the backslashes):

    String h = "[0-9a-f]";
    String unicode = "\\\\{h}{1,6}(\\r\\n|[ \\t\\r\\n\\f])?".replace("{h}", h);
    String escape = "({unicode}|\\\\[^\\r\\n\\f0-9a-f])".replace("{unicode}", unicode);
    String nonascii = "[\\240-\\377]";
    String nmchar = "([_a-z0-9-]|{nonascii}|{escape})".replace("{nonascii}", nonascii).replace("{escape}", escape);
    String nmstart = "([_a-z]|{nonascii}|{escape})".replace("{nonascii}", nonascii).replace("{escape}", escape);
    String ident = "-?{nmstart}{nmchar}*".replace("{nmstart}", nmstart).replace("{nmchar}", nmchar);
    
    System.out.println(ident); // The full regex.
    

    Update 2: oh, you're more a PHP'er, well I think you can figure how/where to do str_replace?

    0 讨论(0)
  • 2020-11-27 16:57

    For anyone looking for something a little more turn-key. The full expression, replaced and all, from @BalusC's answer is:

    /-?([_a-z]|[\240-\377]|([0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?|[^\r\n\f0-9a-f]))([_a-z0-9-]|[\240-\377]|([0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?|[^\r\n\f0-9a-f]))*/
    

    And using DEFINE, which I find a little more readable:

    /(?(DEFINE)
        (?P<h>        [0-9a-f]                             )
        (?P<unicode>  (?&h){1,6}(\r\n|[ \t\r\n\f])?        )
        (?P<escape>   ((?&unicode)|[^\r\n\f0-9a-f])*       )
        (?P<nonascii> [\240-\377]                          )
        (?P<nmchar>   ([_a-z0-9-]|(?&nonascii)|(?&escape)) )
        (?P<nmstart>  ([_a-z]|(?&nonascii)|(?&escape))     )
        (?P<ident>    -?(?&nmstart)(?&nmchar)*             )
    ) (?:
        (?&ident)
    )/x
    

    Incidentally, the original regular expression (and @human's contribution) had a few rogue escape characters that allow [ in the name.

    Also, it should be noted that the raw regex without, DEFINE, runs about 2x as fast as the DEFINE expression, taking only ~23 steps to identify a single unicode character, while the later takes ~40.

    0 讨论(0)
提交回复
热议问题