I have an input of type text where I return true or false depending on a list of banned words. Everything works fine. My problem is that I don't know how to check against w
Let's see what's going on:
alert("băţ".match(/\w\b/));
This is [ "b" ] because the word boundary \b doesn't recognize word characters beyond ASCII. JavaScript's "word characters" are strictly [0-9A-Z_a-z], so aä, pπ, and zƶ each match \w\b\W, since they contain a word character, a word boundary, and a non-word character.
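To make that concrete, here is a small sketch you can paste into a console. The variable names are just for illustration, and the non-ASCII letters are written as escapes so the strings are unambiguous:

```javascript
// Each sample is one ASCII letter followed by one non-ASCII letter.
var samples = [
  "a\u00e4", // aä
  "p\u03c0", // pπ
  "z\u01b6"  // zƶ
];

// Without the u flag, \w only covers [0-9A-Z_a-z], so the second letter
// counts as \W and a \b is found between the two letters.
var results = samples.map(function (s) {
  return /^\w\b\W$/.test(s);
});

console.log(results); // true for all three samples
```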
I think the best you can do is something like this:
var bound = '[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]';
var regex = new RegExp('(?:^|' + bound + ')(?:'
+ bannedWords.join('|')
+ ')(?=' + bound + '|$)', 'i');
where bound is a negated character class covering all ASCII word characters plus most Latin-esque letters, used together with the start/end-of-line anchors to approximate an internationalized \b. (The second use of bound sits inside a zero-width lookahead, which better mimics \b and therefore works well with the g regex flag. The first use is not zero-width: it consumes the character before the word.)
Given ["bad", "mad", "testing", "băţ"], this becomes:
/(?:^|[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe])(?:bad|mad|testing|băţ)(?=[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]|$)/i
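As a rough sketch of how this could be used, the pattern can be wrapped in a function and tried against a few inputs. The name buildBannedRegex is made up for this example, and it assumes the banned words contain no regex metacharacters (they are not escaped here):

```javascript
// Hypothetical helper wrapping the pattern above.
// Assumption: bannedWords contains plain words with no regex metacharacters.
function buildBannedRegex(bannedWords) {
  // Anything that is NOT an ASCII word character or a Latin-esque letter.
  var bound = '[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]';
  return new RegExp('(?:^|' + bound + ')(?:'
    + bannedWords.join('|')
    + ')(?=' + bound + '|$)', 'i');
}

// "b\u0103\u0163" is băţ written with escapes.
var regex = buildBannedRegex(["bad", "mad", "testing", "b\u0103\u0163"]);

console.log(regex.test("this is bad"));           // true: space before, end after
console.log(regex.test("badminton"));             // false: "m" follows "bad"
console.log(regex.test("un b\u0103\u0163 lung")); // true: spaces on both sides
console.log(regex.test("b\u0103\u0163ul"));       // false: "u" follows the word
```

Because the leading group consumes the character before the word, a match's text may start with that boundary character; that only matters if you use the matched string rather than just test().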
This doesn't need anything like ….join('\\b|\\b')… because the parentheses already group the whole list. Joining that way would create things like \b(?:hey\b|\byou)\b, which is akin to \bhey\b\b|\b\byou\b, including the nonsensical \b\b, which JavaScript interprets as merely \b.
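A quick way to convince yourself is to build both variants and compare them. This is only a sketch using plain \b with ASCII words; the word list and sample strings are made up for illustration:

```javascript
var words = ["hey", "you"];

// Boundaries only around the group.
var grouped = new RegExp('\\b(?:' + words.join('|') + ')\\b');

// Boundaries also between alternatives, as in join('\\b|\\b').
var joined = new RegExp('\\b(?:' + words.join('\\b|\\b') + ')\\b');

console.log(joined.source); // \b(?:hey\b|\byou)\b

// Both patterns agree on every sample, so the extra \b adds nothing.
var inputs = ["hey you", "they", "youth", "say hey"];
var agree = inputs.every(function (s) {
  return grouped.test(s) === joined.test(s);
});
console.log(agree); // true
```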
You can also use var bound = '[\\s!-/:-@[-`{-~]' for a simpler ASCII-only list of acceptable non-word characters. Be careful to keep that order! The dashes indicate ranges between characters, so rearranging the class changes what it matches.
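Here is a sketch of the ASCII-only variant in use, with made-up sample inputs. Note that the [-` range spans [ \ ] ^ _ and `, so an underscore counts as a boundary here, unlike with \w:

```javascript
// Whitespace plus the four ASCII punctuation ranges:
// !-/ (0x21-0x2F), :-@ (0x3A-0x40), [-` (0x5B-0x60), {-~ (0x7B-0x7E).
var asciiBound = '[\\s!-/:-@[-`{-~]';

// "b\u0103\u0163" is băţ written with escapes.
var asciiRegex = new RegExp('(?:^|' + asciiBound + ')(?:'
  + ["bad", "b\u0103\u0163"].join('|')
  + ')(?=' + asciiBound + '|$)', 'i');

console.log(asciiRegex.test("so bad!"));         // true: space before, "!" after
console.log(asciiRegex.test("b\u0103\u0163ul")); // false: "u" is not in the list
console.log(asciiRegex.test("bad\u00e9"));       // false: "é" is not in the list
console.log(asciiRegex.test("bad_idea"));        // true: "_" falls inside [-`
```

Since only the listed ASCII characters count as boundaries, any other character next to a banned word (letters in any script, but also non-ASCII punctuation) prevents a match.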