Why does Ruby /[[:punct:]]/ miss some punctuation characters?

北恋 2020-12-07 01:11

Ruby /[[:punct:]]/ is supposed to match all "punctuation characters". According to Wikipedia, this means /[\]\[!"#$%&'()*+,./:;<=>?@\^_`{

2 Answers
  • 2020-12-07 01:27

    The punctuation character class is defined by the locale. The Open Group LC_CTYPE definition for punct says:

    Define characters to be classified as punctuation characters. In the POSIX locale, neither the <space> nor any characters in classes alpha, digit, or cntrl shall be included. In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, cntrl, xdigit, or as the <space> shall be specified.

    Basically, it defines punct by excluding other character classes, but it doesn't actually list the punctuation symbols directly--that's the locale's job.

    I couldn't find a canonical reference for what is in each locale. Maybe someone else knows. Meanwhile, you can find an LC_CTYPE that matches the punct character class you want, or just specify the class directly.
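
    As a minimal sketch of "specify the class directly", the character class below spells out the 32 ASCII punctuation characters as code point ranges, so the result does not depend on how [[:punct:]] is interpreted. The names ASCII_PUNCT and ascii_punct? are hypothetical, chosen only for this illustration.

    ```ruby
    # Hypothetical constant: the ASCII punctuation characters written as
    # explicit code point ranges.
    #   \x21-\x2F  ! " # $ % & ' ( ) * + , - . /
    #   \x3A-\x40  : ; < = > ? @
    #   \x5B-\x60  [ \ ] ^ _ `
    #   \x7B-\x7E  { | } ~
    ASCII_PUNCT = /[\x21-\x2F\x3A-\x40\x5B-\x60\x7B-\x7E]/

    # Hypothetical helper: true if the string contains an ASCII punctuation
    # character.
    def ascii_punct?(str)
      !(str =~ ASCII_PUNCT).nil?
    end
    ```

    Because the class is spelled out, '<' and '>' are matched whatever the regex encoding is; non-ASCII punctuation is deliberately not covered.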

  • 2020-12-07 01:27

    The less-than sign ('<') is in the Unicode "Symbol, Math" (Sm) category, not in a punctuation category. You can see this if you force the regex's encoding to UTF-8 (it defaults to the source encoding, and presumably your source is UTF-8 encoded, while my default source is something else):

    2.1.2 :004 > /[[:punct:]]/u =~ '<'
     => nil 
    2.1.2 :005 > /[[:punct:]]/ =~ '<'
     => 0 
    

    If you force the regex to ASCII encoding (/n - more options here) you'll see it categorize '<' in punct, which I think is what you want. However, this will probably cause problems if your source contains characters outside the ASCII subset of UTF-8.

    2.1.2 :009 > /[[:punct:]]/n =~ '<'
     => 0 
    

    A better solution would be to use the 'Symbol' category in your regex instead of the 'punct' one; it matches '<' in UTF-8 encoding:

    2.1.2 :012 > /\p{S}/u =~ '<'
     => 0 
    

    There's a longer list of categories here.
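
    Note that \p{S} on its own covers only symbols, so a character such as ',' or '!' would no longer match. A minimal sketch, assuming a UTF-8 source file and UTF-8 strings, that combines the Unicode Punctuation (\p{P}) and Symbol (\p{S}) categories in one class; the names PUNCT_OR_SYMBOL and punct_or_symbol_chars are hypothetical:

    ```ruby
    # Hypothetical constant: matches any character in the Unicode
    # Punctuation (P) or Symbol (S) general categories.
    PUNCT_OR_SYMBOL = /[\p{P}\p{S}]/

    # Hypothetical helper: returns the punctuation and symbol characters
    # found in +text+, in order of appearance.
    def punct_or_symbol_chars(text)
      text.scan(PUNCT_OR_SYMBOL)
    end

    p punct_or_symbol_chars("a < b, c > d!")
    # prints ["<", ",", ">", "!"]
    ```

    Unlike an ASCII-only class, this also matches non-ASCII punctuation and symbols, which may or may not be what you want.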
