How to ban words with diacritics using a blacklist array and regex?

醉话见心 2021-01-11 19:16

I have an input of type text where I return true or false depending on a list of banned words. Everything works fine. My problem is that I don't know how to check against w

5 answers
  •  再見小時候
    2021-01-11 19:46

    Instead of using a word boundary, you could do it with

    (?:[^\w\u0080-\u02af]+|^)
    

    to check for start of word, and

    (?=[^\w\u0080-\u02af]|$)
    

    to check for the end of it.
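
    As a minimal sketch of why the custom tests are needed, the comparison below checks the word "bat" (chosen here purely for illustration) against an input containing a diacritic. The variable names are assumptions, not part of the original answer.

    ```javascript
    // \b only knows ASCII word characters, so it sees a boundary before "ă".
    const withWordBoundary = /\bbat\b/i;
    // Same word wrapped in the custom start-of-word and end-of-word tests.
    const withCustomBoundary = /(?:[^\w\u0080-\u02af]+|^)bat(?=[^\w\u0080-\u02af]|$)/i;

    console.log(withWordBoundary.test("bată"));     // true  (false positive)
    console.log(withCustomBoundary.test("bată"));   // false ("ă" counts as a letter)
    console.log(withCustomBoundary.test("a bat!")); // true  (space before, "!" after)
    ```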

    The [^\w\u0080-\u02af] matches any character that is not (^) a basic Latin word character (\w) or a character in the Unicode Latin-1 Supplement, Latin Extended-A, Latin Extended-B and IPA Extensions blocks. These blocks include some punctuation, but a pattern matching only the letters would get very long. The range may also have to be extended if other character sets need to be covered. See for example Wikipedia.
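
    A quick way to probe what the range treats as part of a word is to test single characters against the non-negated class. This is only an illustrative sketch; the name wordLike is an assumption.

    ```javascript
    // Non-negated version of the class: everything it matches is treated as a "letter".
    const wordLike = /[\w\u0080-\u02af]/;

    console.log(wordLike.test("é")); // true  (U+00E9, Latin-1 Supplement)
    console.log(wordLike.test("ß")); // true  (U+00DF, Latin-1 Supplement)
    console.log(wordLike.test("¿")); // true  (U+00BF is punctuation, but inside the range)
    console.log(wordLike.test("!")); // false (ASCII punctuation)
    console.log(wordLike.test("я")); // false (U+044F, Cyrillic, outside the range)
    ```

    The last line shows the kind of case where the range would need extending before banned words in another script could be delimited correctly.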

    Since older JavaScript engines don't support look-behinds (they were only added in ES2018), the start-of-word test consumes any of the aforementioned non-word characters, but I don't think that should be a problem. The important thing is that the end-of-word test doesn't.
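
    To make the consuming behaviour concrete, the sketch below prints what the match actually contains. For engines that implement ES2018 look-behind assertions, it also shows a non-consuming variant; that variant is an assumption about the target environment, not part of the original answer, and the variable names are illustrative.

    ```javascript
    // Start-of-word test as a consuming group: the leading space becomes part of the match.
    const consuming = /(?:[^\w\u00c0-\u02af]+|^)bad(?=[^\w\u00c0-\u02af]|$)/i;
    const consumed = "so bad!".match(consuming)[0];
    console.log(JSON.stringify(consumed)); // " bad"

    // ES2018 alternative: negative look-behind and look-ahead consume nothing.
    const lookbehind = /(?<![\w\u00c0-\u02af])bad(?![\w\u00c0-\u02af])/i;
    const exact = "so bad!".match(lookbehind)[0];
    console.log(JSON.stringify(exact)); // "bad"
    ```

    For a plain true/false check with test() the difference does not matter; it only shows up when the matched text or its position is used, for example when replacing the banned word.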

    Also, putting these tests outside a non-capturing group that alternates the words makes the regex significantly more efficient.

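    // Banned words; accented characters are written literally.
    var bannedWords = ["bad", "mad", "testing", "băţ", "båt", "süß"],
        // The start-of-word and end-of-word tests wrap a single non-capturing alternation of all words.
        // Note: the range here starts at \u00c0 (À), slightly narrower than the \u0080 used in the patterns above.
        regex = new RegExp('(?:[^\\w\\u00c0-\\u02af]+|^)(?:' + bannedWords.join("|") + ')(?=[^\\w\\u00c0-\\u02af]|$)', 'i');

    // Reads the input with id "word_to_check" and writes the verdict into the element with id "result".
    function myFunction() {
        document.getElementById('result').textContent = 'Banned = ' + regex.test(document.getElementById('word_to_check').value);
    }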
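
    The same check can be sketched without the DOM, which makes it easier to try in a console. The helper names escapeRegExp, buildBannedRegex and isBanned are hypothetical, and escaping the words is an addition that only matters if the list could contain regex metacharacters such as "+" or ".".

    ```javascript
    // Hypothetical helper: escape regex metacharacters so each word is matched literally.
    function escapeRegExp(text) {
        return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    }

    // Hypothetical helper: one alternation of all words, wrapped once in the boundary tests.
    function buildBannedRegex(words) {
        const notLetter = "[^\\w\\u00c0-\\u02af]";
        const alternation = words.map(escapeRegExp).join("|");
        return new RegExp("(?:" + notLetter + "+|^)(?:" + alternation + ")(?=" + notLetter + "|$)", "i");
    }

    // Hypothetical helper: true if any banned word occurs as a whole word in the text.
    function isBanned(text, words) {
        return buildBannedRegex(words).test(text);
    }

    const bannedList = ["bad", "mad", "testing", "băţ", "båt", "süß"];
    console.log(isBanned("this is bad", bannedList)); // true
    console.log(isBanned("badminton", bannedList));   // false
    console.log(isBanned("un băţ.", bannedList));     // true
    console.log(isBanned("båten", bannedList));       // false
    ```

    In real use the regex would be built once and reused rather than rebuilt on every call; it is rebuilt here only to keep the sketch short.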
    
    
    
    
    
    
    
