How to ban words with diacritics using a blacklist array and regex?

醉话见心 2021-01-11 19:16

I have an input of type text where I return true or false depending on a list of banned words. Everything works fine. My problem is that I don't know how to check against w

5 answers
  •  囚心锁ツ
    2021-01-11 19:56

    Let's see what's going on:

    alert("băţ".match(/\w\b/));
    

    This is [ "b" ] because the word boundary \b doesn't recognize word characters beyond ASCII. JavaScript's "word characters" are strictly [0-9A-Z_a-z], so a pair such as "aä" matches \w\b\W: it contains a word character, a word boundary, and a non-word character.
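
    To make that behaviour concrete, here is a small sketch. The variable names are illustrative only, and the accented letters are written as \u escapes so the code points are unambiguous:

```javascript
// \w is [0-9A-Z_a-z], so "ă" (U+0103) and "ţ" (U+0163) count as non-word characters.
var firstMatch = "b\u0103\u0163".match(/\w\b/);     // a boundary sits between "b" and "ă"
var pairMatches = /\w\b\W/.test("a\u00e4");          // "aä": word char, boundary, non-word char
var falsePositive = /\bbad\b/i.test("bad\u0103");    // "badă": "bad" is only a prefix here

console.log(firstMatch[0], pairMatches, falsePositive); // b true true
```

    The last line is the practical problem: a plain \bbad\b check flags "badă" even though the banned word is only part of a longer word.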

    I think the best you can do is something like this:

    var bound = '[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]';
    var regex = new RegExp('(?:^|' + bound + ')(?:'
                           + bannedWords.join('|')
                           + ')(?=' + bound + '|$)', 'i');
    

    Here bound is a negated character class covering all ASCII word characters plus most Latin-esque letters. Combined with start/end-of-line markers, it approximates an internationalized \b. The leading boundary consumes the separator character, while the trailing one is a zero-width lookahead. The lookahead mimics \b more closely because it leaves the separator unconsumed, so it works well with the g regex flag.

    Given ["bad", "mad", "testing", "băţ"], this becomes:

    /(?:^|[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe])(?:bad|mad|testing|băţ)(?=[^\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]|$)/i
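
    Putting the pieces together, a minimal sketch of a complete check might look like this. The helper names escapeRegExp, buildBannedRegex and containsBannedWord are hypothetical, and the escaping step is an addition not present in the answer (it only matters if a banned word contains regex metacharacters):

```javascript
// Hypothetical helper: escape regex metacharacters so words are matched literally.
function escapeRegExp(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build the regex using the `bound` class from the answer.
function buildBannedRegex(bannedWords) {
  var bound = '[^\\w\u00c0-\u02c1\u037f-\u0587\u1e00-\u1ffe]';
  return new RegExp('(?:^|' + bound + ')(?:'
                    + bannedWords.map(escapeRegExp).join('|')
                    + ')(?=' + bound + '|$)', 'i');
}

// Sample list from the answer; "b\u0103\u0163" is "băţ".
var bannedRegex = buildBannedRegex(["bad", "mad", "testing", "b\u0103\u0163"]);

// No g flag is used, so test() keeps no state between calls.
function containsBannedWord(text) {
  return bannedRegex.test(text);
}

console.log(containsBannedWord("a b\u0103\u0163 here")); // true: banned word between spaces
console.log(containsBannedWord("b\u0103\u0163ul"));       // false: part of a longer word
console.log(containsBannedWord("bad\u0103"));             // false: "ă" counts as a letter
console.log(containsBannedWord("Bad!"));                  // true: case-insensitive, "!" separates
```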
    

    This doesn't need anything like ….join('\\b|\\b')… because there are parentheses around the list (and that would create things like \b(?:hey\b|\byou)\b, which is akin to \bhey\b\b|\b\byou\b, including the nonsensical \b\b – which JavaScript interprets as merely \b).

    You can also use var bound = '[\\s!-/:-@[-`{-~]' for a simpler ASCII-only list of acceptable non-word characters. Be careful about the order of the characters! Each dash indicates a range between the characters on either side of it.
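
    As a rough illustration of the ASCII-only variant (the sample word list and the name asciiBoundRegex are assumptions made for this sketch):

```javascript
// Positive class of ASCII separators: whitespace plus the ranges
// !-/  :-@  [-`  {-~ . Any other character, from any script, counts as part of a word.
var bound = '[\\s!-/:-@[-`{-~]';
var bannedWords = ["bad", "mad"]; // sample list, assumed for illustration
var asciiBoundRegex = new RegExp('(?:^|' + bound + ')(?:'
                                 + bannedWords.join('|')
                                 + ')(?=' + bound + '|$)', 'i');

console.log(asciiBoundRegex.test("so bad, really")); // true: space before, comma after
console.log(asciiBoundRegex.test("bad\u0103"));       // false: "ă" is not a separator
console.log(asciiBoundRegex.test("bad1"));            // false: digits are not separators
console.log(asciiBoundRegex.test("bad_word"));        // true: "_" falls inside the [-` range
```

    One difference from \w worth noticing: the underscore lies in the [-` range, so this class treats it as a separator rather than a word character.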
