Ruby regexp handling of nbsp

♀尐吖头ヾ posted on 2019-12-22 05:43:19

Question


In Ruby 1.9.3 the regex engine doesn't treat non-breaking spaces (nbsp, \u00A0) as whitespace (\s). This is often a bummer for me.

So my question is: will this change in 2.0? If not, is there any way to monkey-patch a solution?


Answer 1:


Use Unicode properties (you need to declare a matching source code encoding for this to work):

# encoding: utf-8
if subject =~ /\p{Z}/
    # subject contains whitespace or other separators
end

or use POSIX character classes:

if subject =~ /[[:space:]]/
    # subject contains whitespace, including a non-breaking space
end

According to the docs, \s will only match [ \t\r\n\f] now and in the future.
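To make the difference concrete, here is a minimal sketch that checks one string against the three patterns mentioned above. The helper name nbsp_match_report is made up for illustration and is not part of Ruby; the sketch assumes a UTF-8 source file and a UTF-8 input string.

```ruby
# encoding: utf-8

# Hypothetical helper (not a Ruby built-in): reports which of the three
# patterns from this answer match the given string.
def nbsp_match_report(str)
  {
    backslash_s: !!(str =~ /\s/),           # ASCII whitespace only
    posix_space: !!(str =~ /[[:space:]]/),  # POSIX bracket expression
    unicode_z:   !!(str =~ /\p{Z}/)         # Unicode "separator" property
  }
end

p nbsp_match_report("\u00A0")
# => {:backslash_s=>false, :posix_space=>true, :unicode_z=>true}
#    (Ruby 3.4+ prints the same hash as {backslash_s: false, ...})
```

An ordinary ASCII space matches all three patterns, so only \s needs replacing if non-breaking spaces should count as whitespace.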




Answer 2:


In Ruby, I recommend using the Unicode character class of "Space separators" \p{Zs}:

/\p{Zs}/u  =~  "\xC2\xA0"
/\p{Zs}/u  =~  "\u00A0"
/\p{Zs}/u  =~  HTMLEntities.new.decode('&nbsp;')  # needs the htmlentities gem

See the Ruby documentation for more Unicode character properties.

Note: Make sure that your input string is valid UTF-8. Other encodings have non-breaking spaces too, e.g. "\xA0" in ISO-8859-1 (Latin-1). More info on the "non-breaking space".

FYI: In most regex flavors and programming languages that support Unicode, the character class \s usually includes all characters with the Unicode "separator" property \p{Z} (as mentioned by Tim Pietzcker); however, Java and Ruby are popular exceptions, and there \s only matches [ \t\r\n\f].
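As a rough illustration of the note above, the sketch below transcodes Latin-1 bytes to UTF-8 before matching. The helper name to_utf8 is hypothetical, and the sketch assumes the input really is ISO-8859-1; if the actual encoding is unknown, this step would need adjusting.

```ruby
# encoding: utf-8

# Hypothetical helper: tags raw bytes with their assumed source encoding,
# then transcodes them to UTF-8. dup avoids mutating the caller's string.
def to_utf8(bytes, source_encoding = Encoding::ISO_8859_1)
  bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

raw  = [0x61, 0xA0, 0x62].pack("C*")  # "a", Latin-1 nbsp, "b" as raw bytes
text = to_utf8(raw)

p text.valid_encoding?  # => true
p text =~ /\p{Zs}/      # => 1 (the nbsp is the second character)
```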



Source: https://stackoverflow.com/questions/13287701/ruby-regexp-handling-of-nbsp
