How do you specify a regex character range that will work in European languages other than English?

I'm working with Ruby's regex engine. I need to write a regex that does this:

WIKI_WORD = /\b([a-z][\w_]+\.)?[A-Z][a-z]+[A-Z]\w*\b/

b

2 Answers

星月不相逢

2021-01-12 20:31
```
WIKI_WORD = /\b(\p{Ll}\w+\.)?\p{Lu}\p{Ll}+\p{Lu}\w*\b/u
```
should work in Ruby 1.9. \p{Lu} and \p{Ll} are the Unicode properties for uppercase and lowercase letters. (\w already includes the underscore, so [\w_] is redundant.)

See also this answer - you might need to run Ruby in UTF-8 mode for this to work, and possibly your script must be encoded in UTF-8, too.
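
As a rough check of the pattern above, the following sketch compares it with the ASCII-only pattern from the question. The sample sentence and the constant name ASCII_WIKI_WORD are made up for illustration, and the script is assumed to be saved as UTF-8.

```ruby
# encoding: UTF-8

# ASCII-only pattern from the question (redundant "_" dropped).
ASCII_WIKI_WORD = /\b([a-z]\w+\.)?[A-Z][a-z]+[A-Z]\w*\b/
# Unicode-aware pattern from this answer.
WIKI_WORD = /\b(\p{Ll}\w+\.)?\p{Lu}\p{Ll}+\p{Lu}\w*\b/u

# Illustrative input; "MünchenStadt" is an invented wiki word.
sample = "See MünchenStadt for details."

# String#[] with a regex returns the matched text, or nil if nothing matches.
ascii_hit   = sample[ASCII_WIKI_WORD]  # "ü" is outside [a-z], so no match
unicode_hit = sample[WIKI_WORD]        # \p{Ll}+ accepts "ünchen"

puts ascii_hit.inspect
puts unicode_hit.inspect
```

The ASCII pattern finds nothing because the lowercase run after the first capital breaks at "ü", while the property-based pattern picks up the whole word.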
灰色年华

2021-01-12 20:46
James Edward Gray II wrote a series of articles on working with Unicode, UTF-8 and Ruby 1.8.7 and 1.9.2. They're important reading.

With Ruby 1.8.7, we could add:
```
#!/usr/bin/ruby -KU
require 'jcode'  # Ruby 1.8 only: adds multibyte-aware String methods
```
and get partial UTF-8 support.

With 1.9.2 you can use:
```
# encoding: UTF-8
```
as the first line of your source file (or the second, directly after a shebang line), and that will tell Ruby to treat the source as UTF-8. Gray's recommendation is that we do that with all source we write from now on.

That will not affect external encoding when reading/writing text, only the encoding of the source code.
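
To make the source-versus-external distinction concrete, here is a minimal sketch. The helper name read_as_utf8 and the scratch file are hypothetical, introduced only for this example; the "r:EXTERNAL:INTERNAL" mode string is the standard way to declare a file's encoding and have Ruby transcode it on read.

```ruby
# encoding: UTF-8
require 'tempfile'

# Hypothetical helper (not from the answer): read a file stored in
# `external` encoding and transcode its contents to UTF-8.
def read_as_utf8(path, external)
  File.open(path, "r:#{external}:UTF-8") { |io| io.read }
end

# The magic comment governs literals in this file only.
puts __ENCODING__
puts "é".encoding

# Write the ISO-8859-1 bytes for "café" to a scratch file.
file = Tempfile.new("latin1")
file.binmode
file.write("caf\xE9".dup.force_encoding(Encoding::BINARY))
file.close

# The file's encoding has to be stated explicitly when reading it back.
text = read_as_utf8(file.path, "ISO-8859-1")
puts text
puts text.encoding
file.unlink
```

Without the explicit external encoding, the byte 0xE9 would be read as invalid UTF-8, regardless of the magic comment.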

Ruby 1.9.2 doesn't extend the usual \w, \W and \s character classes to handle UTF-8 or Unicode; they match ASCII characters only. As the other comments and answers said, only the POSIX bracket expressions (such as [[:alpha:]]) and the Unicode properties (such as \p{L}) do that.
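
A small sketch of that difference, assuming a UTF-8 source file; the sample word and the constant name SAMPLE are arbitrary choices for illustration:

```ruby
# encoding: UTF-8

SAMPLE = "Größe"

ascii_runs   = SAMPLE.scan(/\w+/)           # \w stops at "ö" and "ß"
posix_runs   = SAMPLE.scan(/[[:alpha:]]+/)  # POSIX bracket expression
unicode_runs = SAMPLE.scan(/\p{L}+/)        # Unicode "Letter" property

puts ascii_runs.inspect
puts posix_runs.inspect
puts unicode_runs.inspect
```

The \w scan splits the word into its ASCII fragments, while the POSIX and Unicode forms each return it whole.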