How to get all Unicode characters from specific categories?

问题

How to get, for example..., a code point pattern like x-y\uxxxx\Uxxxxxxxxx from the Connector Punctuation (Pc) category, for scanning ECMAScript 3/JavaScript identifiers?

Original question

I need help for verifying a valid character (code point) of a ECMA-262 (3º edition, 7.6) identifier for a lexical scanner.

Syntax quote

Identifier ::

IdentifierName but not ReservedWord

IdentifierName ::

IdentifierStart

IdentifierName IdentifierPart

IdentifierStart ::

UnicodeLetter

$

_

~~\ UnicodeEscapeSequence~~ # no need to check this

IdentifierPart ::

IdentifierStart

UnicodeCombiningMark

UnicodeDigit

UnicodeConnectorPunctuation

UnicodeLetter ::

any character in the Unicode categories “Uppercase letter (Lu)”, “Lowercase > letter (Ll)”, “Titlecase letter (Lt)”, “Modifier letter (Lm)”, “Other letter (Lo)”, or “Letter number (Nl)”.

UnicodeCombiningMark ::

any character in the Unicode categories “Non-spacing mark (Mn)” or “Combining spacing mark (Mc)”

UnicodeDigit ::

any character in the Unicode category “Decimal number (Nd)”

UnicodeConnectorPunctuation ::

any character in the Unicode category “Connector punctuation (Pc)”

As you can see, it takes any character of certain categories.

I need to have all these possible characters, so my first step was to locate "Connector punctuation" on this Unicode 5.0 chart, but 0 matches and I believe I'm doing it the wrong way. So could someone help me?

回答1:

Unicode offers this tool for determining sets of characters. It uses regular expressions with property-value pairs enclosed in [::].

For all characters in Unicode 5 you want to do [:age=5.0:].

The rest are "general categories" (gc). So for example [:age=5.0:]&[:gc=Lu:] will find all uppercase letters in Unicode 5 (gc=L will find all letters in general).

For IdentifierStart you need [:age=5.0:]&[[:gc=L:][:gc=Nl:]\$_]. For IdentifierPart you need [:age=5.0:]&[[:gc=L:][:gc=Nl:][:gc=Mn:][:gc=Mc:][:gc=Nd:][:gc=Pc:]\$_].

Unicode also has properties called ID_Start and ID_Continue but they don't include the same characters as your specifications.

Here is also an overview of all Unicode character properties.

来源：https://stackoverflow.com/questions/43150498/how-to-get-all-unicode-characters-from-specific-categories

标签

javascript

unicode

ecmascript-3

ecmascript-4