How to ban words with diacritics using a blacklist array and regex?

前端未结

关注

 5  1223

醉话见心 2021-01-11 19:16

I have an input of type text where I return true or false depending on a list of banned words. Everything works fine. My problem is that I don\'t know how to check against w

5条回答

囚心锁ツ (楼主)

2021-01-11 19:56
Let's see what's going on:
```
alert("băţ".match(/\w\b/));
```
This is [ "b" ] because word boundary \b doesn't recognize word characters beyond ASCII. JavaScript's "word characters" are strictly [0-9A-Z_a-z], so aä, pπ, and zƶ match \w\b\W since they contain a word character, a word boundary, and a non-word character.

I think the best you can do is something like this:
```
var bound = '[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]';
var regex = new RegExp('(?:^|' + bound + ')(?:'
                       + bannedWords.join('|')
                       + ')(?=' + bound + '|$)', 'i');
```
where bound is a reversed list of all ASCII word characters plus most Latin-esque letters, used with start/end of line markers to approximate an internationalized \b. (The second of which is a zero-width lookahead that better mimics \b and therefore works well with the g regex flag.)

Given ["bad", "mad", "testing", "băţ"], this becomes:
```
/(?:^|[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe])(?:bad|mad|testing|băţ)(?=[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]|$)/i
```
This doesn't need anything like ….join('\\b|\\b')… because there are parentheses around the list (and that would create things like \b(?:hey\b|\byou)\b, which is akin to \bhey\b\b|\b\byou\b, including the nonsensical \b\b – which JavaScript interprets as merely \b).

You can also use var bound = '[\\s!-/:-@[-`{-~]' for a simpler ASCII-only list of acceptable non-word characters. Be careful about that order! The dashes indicate ranges between characters.
0 讨论(0)

查看其它5个回答
发布评论:

提交评论
- 加载中...