Unicode and :alpha:

Why is this false:

iex(1)> String.match?("汉语漢語", ~r/^[[:alpha:]]+$/)
false

But this is true?:

1 answer

执念已碎

2021-01-12 10:04
When a string is passed to a regex without the Unicode modifier, it is treated as a sequence of bytes, not as a Unicode string. Compare byte_size("汉语漢語"), which returns 12 (the bytes the input consists of: 230,177,137,232,175,173,230,188,162,232,170,158), with String.length("汉语漢語"), which returns 4 (the Unicode characters). Some of those bytes cannot be matched by the [:alpha:] POSIX character class, so the first expression, which has to match every byte from start to end, fails. The second expression succeeds because it needs only one matching character to return a valid match.
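The byte/character difference can be checked directly. This is only a small sketch; the variable names (s, bytes, chars, raw) are made up for illustration:

```elixir
s = "汉语漢語"

# Number of bytes in the UTF-8 encoding (3 bytes per character here).
bytes = byte_size(s)
# Number of Unicode characters.
chars = String.length(s)
# The raw bytes that a regex without the u modifier works on.
raw = :binary.bin_to_list(s)

IO.inspect(bytes)                      # 12
IO.inspect(chars)                      # 4
# charlists: :as_lists forces the integers to be printed as a plain list.
IO.inspect(raw, charlists: :as_lists)  # [230, 177, 137, 232, 175, 173, 230, 188, 162, 232, 170, 158]
```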

To match Unicode strings properly with the PCRE regex library (which Elixir uses), enable Unicode mode with the /u modifier:
```
IO.puts String.match?("汉语漢語", ~r/^[[:alpha:]]+$/u)
```
See the IDEONE demo (prints true)

See Elixir regex reference:

unicode (u) - enables unicode specific patterns like \p and changes modifiers like \w, \W, \s and friends to also match on unicode. It expects valid unicode strings to be given on match.
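As a rough illustration of that description, the sketch below uses \p{L} and \w with the u modifier on the string from the question. The variable names are hypothetical:

```elixir
s = "汉语漢語"

# With u, \p{L} (any Unicode letter) matches the Han characters.
letters_only = String.match?(s, ~r/^\p{L}+$/u)
# With u, \w also matches Unicode letters.
word_chars = String.match?(s, ~r/^\w+$/u)
# Each match is one whole character, not one byte.
per_char = Regex.scan(~r/\p{L}/u, s)

IO.inspect(letters_only)  # true
IO.inspect(word_chars)    # true
IO.inspect(per_char)      # [["汉"], ["语"], ["漢"], ["語"]]
```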