Why does \w match only English words in JavaScript regex?

Question

I'm trying to find URLs in some text using JavaScript code. The problem is that the regular expression I'm using relies on \w to match letters and digits inside the URL, but \w doesn't match non-English characters (in my case, Hebrew letters).

So what can I use instead of \w to match all letters in all languages?

Answer 1:

Because \w only matches ASCII characters 48-57 ('0'-'9'), 65-90 ('A'-'Z'), 95 ('_') and 97-122 ('a'-'z'). Hebrew characters and other non-English characters (for example, umlaut-o or tilde-n) are outside of those ranges.

Instead of matching foreign language characters (there are so many of them, in many different Unicode ranges), you might be better off looking for the characters that delimit your words: spaces, quotation marks, and other punctuation.
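The delimiter idea can be sketched roughly as follows. The helper name `splitOnDelimiters` and the particular delimiter set are assumptions made for illustration; the period is deliberately left out of the set because URLs contain dots.

```javascript
// Hypothetical helper: split on delimiters rather than matching letters.
// Assumed delimiter set: whitespace, quotes, angle brackets, parentheses,
// and a few punctuation marks. Adjust it to your own text.
function splitOnDelimiters(text) {
  return text.split(/[\s"'<>(),;!?]+/).filter(Boolean);
}

const tokens = splitOnDelimiters('ראה "http://example.com/שלום" now');
console.log(tokens); // three tokens; the middle one is the URL
```

Each token can then be checked for a scheme such as `http://` without caring which alphabet the rest of it uses.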

Answer 2:

I think you are looking for this regex:

^[אבגדהוזחטיכלמנסעפצקרשתץףןםךa-zA-Z0-9\s\.\-_\\\/]+$

Answer 3:

The ECMA 262 v3 standard, which defines the programming language commonly known as JavaScript, stipulates that \w should be equivalent to [a-zA-Z0-9_] and that \d should be equivalent to [0-9]. \s on the other hand matches both ASCII and Unicode whitespace, according to the standard.
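A quick check of these definitions, as a minimal sketch (the `results` object is only there to collect the outcomes):

```javascript
// \w and \d are ASCII-only, while \s also covers Unicode whitespace.
const results = {
  asciiWord: /^\w+$/.test("hello_123"), // letters, digits, underscore
  hebrewWord: /^\w+$/.test("שלום"),     // Hebrew letters are not in \w
  arabicDigit: /\d/.test("\u0663"),     // ARABIC-INDIC DIGIT THREE is not in \d
  nbsp: /\s/.test("\u00A0"),            // NO-BREAK SPACE is in \s
};
console.log(results);
// { asciiWord: true, hebrewWord: false, arabicDigit: false, nbsp: true }
```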

JavaScript engines implementing that edition of the standard do not support the \p syntax for matching Unicode properties either, so there isn't a good way to do this there. You could match all Hebrew characters with:

[\u0590-\u05FF]

This simply matches any code point in the Hebrew block.

You can match any ASCII word character or any Hebrew character with:

[\w\u0590-\u05FF]
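As a rough illustration of how that class could be used for the original goal of finding URLs, here is a sketch. The pattern `urlPattern` is a deliberately simplified assumption, not a complete URL grammar.

```javascript
// Assumption: a simplified URL pattern for illustration only.
// The class accepts ASCII word characters, the Hebrew block,
// and a few URL punctuation characters (. - / %).
const urlPattern = /https?:\/\/[\w\u0590-\u05FF.\-\/%]+/g;

const text = "בקר ב http://example.com/שלום/עולם היום";
const matches = text.match(urlPattern);
console.log(matches); // ["http://example.com/שלום/עולם"]
```

The match stops at the first space, so the Hebrew word that follows the URL is not included.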

Answer 4:

I've just found XRegExp, which has not been mentioned yet, and I'm quite impressed with it. It is an alternative regular expression implementation, has a Unicode plugin, and is released under the MIT license.

According to the website, to match Unicode characters you'd use code like this:

var unicodeWord = XRegExp("^\\p{L}+$");

unicodeWord.test("Русский"); // true
unicodeWord.test("日本語"); // true
unicodeWord.test("العربية"); // true

Answer 5:

Try \p{L}, the Unicode property escape that matches letters.
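In engines that implement ECMAScript 2018 or later, \p{L} works natively as long as the regex carries the `u` flag. A minimal sketch, assuming such an engine; the name `unicodeWordLike` is illustrative:

```javascript
// Requires an ES2018+ engine; without the `u` flag, \p{L} is not a property escape.
// \p{L} = any letter, \p{N} = any number; the underscore is added to mimic \w.
const unicodeWordLike = /^[\p{L}\p{N}_]+$/u;

const checks = [
  unicodeWordLike.test("שלום"),        // true
  unicodeWordLike.test("Русский"),     // true
  unicodeWordLike.test("hello_123"),   // true
  unicodeWordLike.test("hello world"), // false: the space is not a letter or number
];
console.log(checks);
```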

Answer 6:

Have a look at http://www.regular-expressions.info/refunicode.html.

It looks like there is no \w equivalent for Unicode, but you can match individual Unicode letters, so you can build one yourself.

Answer 7:

Check out this SO question about JavaScript and Unicode. It looks like Jan Goyvaerts's answer there provides some hope for you.

Edit: But then it seems that no browser supports \p ... anyway. That question should still contain useful info.

Answer 8:

Note that URIs (a superset of URLs) are specified in RFC 3986 (URI: Generic Syntax) to allow only US-ASCII characters. All other characters should normally be represented in percent-encoded notation:

In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification. Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced. Such a definition should specify the character encoding used to map those characters to octets prior to being percent-encoded for the URI. // URI: Generic Syntax

This is what generally happens when you open a URL with non-ASCII characters in a browser: the characters get translated into %AB notation, which in turn is US-ASCII.

If you can influence the way the material is created, the best option would be to pass URLs through a urlencode()-type function when they are created.
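In JavaScript, the built-in `encodeURI` plays that role for a whole URL. A minimal sketch, assuming the URL is encoded as UTF-8 (which is what `encodeURI` does); the example URL is made up:

```javascript
// encodeURI percent-encodes non-ASCII characters as UTF-8 octets
// while leaving URL delimiters such as ":" and "/" intact.
const raw = "http://example.com/שלום";
const encoded = encodeURI(raw);
console.log(encoded); // "http://example.com/%D7%A9%D7%9C%D7%95%D7%9D"

// The encoded form is pure ASCII, so an ASCII-only class like \w plus "%" can match it.
const isAscii = /^[\x00-\x7F]+$/.test(encoded);
console.log(isAscii); // true

// decodeURI reverses the transformation.
console.log(decodeURI(encoded) === raw); // true
```

For a single path segment or query value, `encodeURIComponent` is the usual choice because it also escapes delimiters such as `/` and `?`.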

Answer 9:

Perhaps \S (non-whitespace).
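A sketch of what that might look like, with the caveat that \S also swallows trailing punctuation. The pattern, the sample text, and the cleanup step are all assumptions for illustration:

```javascript
// \S matches any non-whitespace character, including Hebrew letters.
const urlLike = /https?:\/\/\S+/g;

const sample = "See http://example.com/שלום, or https://example.org/path.";
const found = sample.match(urlLike);
console.log(found);
// ["http://example.com/שלום,", "https://example.org/path."]

// Assumed cleanup: strip punctuation that most likely belongs to the sentence.
const cleaned = found.map((u) => u.replace(/[.,;:!?)"']+$/, ""));
console.log(cleaned);
// ["http://example.com/שלום", "https://example.org/path"]
```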

Answer 10:

If you're the one generating URLs with non-English letters in them, you may want to reconsider.

If I'm interpreting the W3C correctly, URLs may only contain word characters from the Latin alphabet.

Source: https://stackoverflow.com/questions/397788/why-does-w-match-only-english-words-in-javascript-regex

Tags

javascript

regex

hebrew