How to get all Unicode characters from specific categories?

大兔子大兔子 提交于 2019-12-24 07:58:17

问题


How to get, for example..., a code point pattern like x-y\uxxxx\Uxxxxxxxxx from the Connector Punctuation (Pc) category, for scanning ECMAScript 3/JavaScript identifiers?

Original question

I need help for verifying a valid character (code point) of a ECMA-262 (3º edition, 7.6) identifier for a lexical scanner.

Syntax quote

Identifier ::

  • IdentifierName but not ReservedWord

IdentifierName ::

  • IdentifierStart
  • IdentifierName IdentifierPart
  • IdentifierStart ::
  • UnicodeLetter
  • $
  • _
  • \ UnicodeEscapeSequence # no need to check this

IdentifierPart ::

  • IdentifierStart
  • UnicodeCombiningMark
  • UnicodeDigit
  • UnicodeConnectorPunctuation

UnicodeLetter ::

  • any character in the Unicode categories “Uppercase letter (Lu)”, “Lowercase > letter (Ll)”, “Titlecase letter (Lt)”, “Modifier letter (Lm)”, “Other letter (Lo)”, or “Letter number (Nl)”.

UnicodeCombiningMark ::

  • any character in the Unicode categories “Non-spacing mark (Mn)” or “Combining spacing mark (Mc)”

UnicodeDigit ::

  • any character in the Unicode category “Decimal number (Nd)”

UnicodeConnectorPunctuation ::

  • any character in the Unicode category “Connector punctuation (Pc)”

As you can see, it takes any character of certain categories.

I need to have all these possible characters, so my first step was to locate "Connector punctuation" on this Unicode 5.0 chart, but 0 matches and I believe I'm doing it the wrong way. So could someone help me?


回答1:


Unicode offers this tool for determining sets of characters. It uses regular expressions with property-value pairs enclosed in [::].

For all characters in Unicode 5 you want to do [:age=5.0:].

The rest are "general categories" (gc). So for example [:age=5.0:]&[:gc=Lu:] will find all uppercase letters in Unicode 5 (gc=L will find all letters in general).

For IdentifierStart you need [:age=5.0:]&[[:gc=L:][:gc=Nl:]\$_]. For IdentifierPart you need [:age=5.0:]&[[:gc=L:][:gc=Nl:][:gc=Mn:][:gc=Mc:][:gc=Nd:][:gc=Pc:]\$_].

Unicode also has properties called ID_Start and ID_Continue but they don't include the same characters as your specifications.

Here is also an overview of all Unicode character properties.



来源:https://stackoverflow.com/questions/43150498/how-to-get-all-unicode-characters-from-specific-categories

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!