Unicode and :alpha:

误落风尘 2021-01-12 09:15

Why is this false:

iex(1)> String.match?("汉语漢語", ~r/^[[:alpha:]]+$/)
false

But this is true?:



        
1 answer
  •  执念已碎
    2021-01-12 10:04

    When you match the string against a regex in non-Unicode mode, it is treated as a sequence of bytes, not as a Unicode string. Compare IO.puts byte_size("汉语漢語"), which prints 12 (the bytes that make up the input: 230,177,137,232,175,173,230,188,162,232,170,158), with IO.puts String.length("汉语漢語"), which prints 4 (the Unicode "letters"). Some of those bytes cannot be matched by the [:alpha:] POSIX character class, so the first expression, which requires every byte from start to end to match, returns false. The second expression returns true because it only needs 1 character to produce a valid match.
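The byte/character difference described above can be checked directly. This is a minimal sketch; the variable name `s` is only illustrative, and `charlists: :as_lists` is passed so the byte list is always printed as integers.

```elixir
s = "汉语漢語"

# Size of the UTF-8 encoding in bytes (each of these characters takes 3 bytes).
IO.inspect(byte_size(s))      # 12

# Number of Unicode characters (graphemes).
IO.inspect(String.length(s))  # 4

# The raw bytes that a regex without the u modifier operates on.
IO.inspect(:binary.bin_to_list(s), charlists: :as_lists)
# [230, 177, 137, 232, 175, 173, 230, 188, 162, 232, 170, 158]

# The anchored pattern from the question has to match all 12 bytes, so it fails.
IO.inspect(String.match?(s, ~r/^[[:alpha:]]+$/))  # false
```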

    To match Unicode strings properly with the PCRE regex library (which Elixir uses), you need to enable Unicode mode with the /u modifier:

    IO.puts String.match?("汉语漢語", ~r/^[[:alpha:]]+$/u)
    

    See the IDEONE demo (prints true)

    See the Elixir regex reference:

    unicode (u) - enables unicode specific patterns like \p and changes modifiers like \w, \W, \s and friends to also match on unicode. It expects valid unicode strings to be given on match.
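As a rough illustration of what the quoted option enables, the sketch below uses Unicode property escapes and `\w` in Unicode mode. The `\p{L}` and `\p{Han}` patterns are not part of the original answer; they are added here only as examples of the "unicode specific patterns" the reference mentions.

```elixir
s = "汉语漢語"

# With the u modifier, the POSIX class also matches Unicode letters.
IO.inspect(String.match?(s, ~r/^[[:alpha:]]+$/u))  # true

# \p{L} matches any Unicode letter.
IO.inspect(String.match?(s, ~r/^\p{L}+$/u))        # true

# \p{Han} restricts the match to characters in the Han script.
IO.inspect(String.match?(s, ~r/^\p{Han}+$/u))      # true
IO.inspect(String.match?("abc", ~r/^\p{Han}+$/u))  # false

# \w is extended to Unicode word characters as well.
IO.inspect(String.match?(s, ~r/^\w+$/u))           # true
```

If only Chinese characters should be accepted, a script-specific class such as `\p{Han}` is narrower than `[[:alpha:]]`, which in Unicode mode accepts letters from any script.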
