Why is this false:
iex(1)> String.match?("汉语漢語", ~r/^[[:alpha:]]+$/)
false
But this is true?:
When you pass the string to a regex in non-Unicode mode, it is treated as a sequence of bytes, not as a Unicode string. Compare IO.puts byte_size("汉语漢語") (12, the bytes that the input consists of: 230,177,137,232,175,173,230,188,162,232,170,158) with IO.puts String.length("汉语漢語") (4, the Unicode "letters"). Some of the bytes in the string cannot be matched by the [:alpha:] POSIX character class. Thus, the first expression fails because it requires every byte to match, while the second succeeds because it only needs 1 character to return a valid match.
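A minimal sketch of the byte versus character difference described above, using only standard Elixir and Erlang functions. The variable names `s` and `bytes` are illustrative and not from the original post.

```elixir
s = "汉语漢語"

# Number of UTF-8 bytes in the binary
IO.inspect(byte_size(s))       # 12

# Number of Unicode graphemes ("letters")
IO.inspect(String.length(s))   # 4

# The raw bytes that a non-Unicode regex sees
bytes = :binary.bin_to_list(s)
IO.inspect(bytes)
# [230, 177, 137, 232, 175, 173, 230, 188, 162, 232, 170, 158]
```

Each of the four characters is encoded as three bytes, so an anchored pattern without the u modifier has to match all 12 bytes individually.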
To properly match Unicode strings with the PCRE regex library (which Elixir uses), you need to enable Unicode mode with the /u modifier:
IO.puts String.match?("汉语漢語", ~r/^[[:alpha:]]+$/u)
See the IDEONE demo (prints true).
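One way to see the effect of the modifier directly is to count what `.` matches in each mode. This is a small sketch rather than part of the original demo, and the names `s`, `per_byte` and `per_char` are illustrative.

```elixir
s = "汉语漢語"

# Without u, "." consumes one byte at a time
per_byte = Regex.scan(~r/./, s)
IO.inspect(length(per_byte))   # 12

# With u, "." consumes one Unicode code point at a time
per_char = Regex.scan(~r/./u, s)
IO.inspect(length(per_char))   # 4

# The anchored pattern from the question succeeds once u is enabled
IO.inspect(String.match?(s, ~r/^[[:alpha:]]+$/u))   # true
```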
See the Elixir regex reference:
unicode (u) - enables unicode specific patterns like \p and changes modifiers like \w, \W, \s and friends to also match on unicode. It expects valid unicode strings to be given on match.
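As a hedged illustration of the patterns the reference mentions, the sketch below uses \p{L} (any Unicode letter) and \w with the u modifier. Choosing \p{L} as the example property is an assumption for illustration; the original answer only uses [[:alpha:]].

```elixir
s = "汉语漢語"

# \p{L} matches any Unicode letter when u is enabled
IO.inspect(String.match?(s, ~r/^\p{L}+$/u))   # true

# \w also matches Unicode letters when u is enabled
IO.inspect(String.match?(s, ~r/^\w+$/u))      # true

# Each match is a whole character, not a single byte
letters = Regex.scan(~r/\p{L}/u, s)
IO.inspect(letters)   # [["汉"], ["语"], ["漢"], ["語"]]
```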