“bad words” filter [closed]

Not very technical, but... I have to implement a bad words filter in a new site we are developing. So I need a "good" bad words list to feed my db with... any hint / direction? Looking around with google I found this one, and it's a start, but nothing more.

Yes, I know that this kind of filters are easily escaped... but the client will is the client will !!! :-)

The site will have to filter out both english and italian words, but for italian I can ask my colleagues to help me with a community-built list of "parolacce" :-) - an email will do.

Thanks for any help.

I didn't see any language specified but you can use this for PHP it will generate a RegEx for each instered work so that even intentional mis-spellings (i.e. @ss, i3itch ) will also be caught.

<?php

/**
 * @author unkwntech@unkwndesign.com
 **/

if($_GET['act'] == 'do')
 {
    $pattern['a'] = '/[a]/'; $replace['a'] = '[a A @]';
    $pattern['b'] = '/[b]/'; $replace['b'] = '[b B I3 l3 i3]';
    $pattern['c'] = '/[c]/'; $replace['c'] = '(?:[c C (]|[k K])';
    $pattern['d'] = '/[d]/'; $replace['d'] = '[d D]';
    $pattern['e'] = '/[e]/'; $replace['e'] = '[e E 3]';
    $pattern['f'] = '/[f]/'; $replace['f'] = '(?:[f F]|[ph pH Ph PH])';
    $pattern['g'] = '/[g]/'; $replace['g'] = '[g G 6]';
    $pattern['h'] = '/[h]/'; $replace['h'] = '[h H]';
    $pattern['i'] = '/[i]/'; $replace['i'] = '[i I l ! 1]';
    $pattern['j'] = '/[j]/'; $replace['j'] = '[j J]';
    $pattern['k'] = '/[k]/'; $replace['k'] = '(?:[c C (]|[k K])';
    $pattern['l'] = '/[l]/'; $replace['l'] = '[l L 1 ! i]';
    $pattern['m'] = '/[m]/'; $replace['m'] = '[m M]';
    $pattern['n'] = '/[n]/'; $replace['n'] = '[n N]';
    $pattern['o'] = '/[o]/'; $replace['o'] = '[o O 0]';
    $pattern['p'] = '/[p]/'; $replace['p'] = '[p P]';
    $pattern['q'] = '/[q]/'; $replace['q'] = '[q Q 9]';
    $pattern['r'] = '/[r]/'; $replace['r'] = '[r R]';
    $pattern['s'] = '/[s]/'; $replace['s'] = '[s S $ 5]';
    $pattern['t'] = '/[t]/'; $replace['t'] = '[t T 7]';
    $pattern['u'] = '/[u]/'; $replace['u'] = '[u U v V]';
    $pattern['v'] = '/[v]/'; $replace['v'] = '[v V u U]';
    $pattern['w'] = '/[w]/'; $replace['w'] = '[w W vv VV]';
    $pattern['x'] = '/[x]/'; $replace['x'] = '[x X]';
    $pattern['y'] = '/[y]/'; $replace['y'] = '[y Y]';
    $pattern['z'] = '/[z]/'; $replace['z'] = '[z Z 2]';
    $word = str_split(strtolower($_POST['word']));
    $i=0;
    while($i < count($word))
     {
        if(!is_numeric($word[$i]))
         {
            if($word[$i] != ' ' || count($word[$i]) < '1')
             {
                $word[$i] = preg_replace($pattern[$word[$i]], $replace[$word[$i]], $word[$i]);
             }
         }
        $i++;
     }
    //$word = "/" . implode('', $word) . "/";
    echo implode('', $word);
 }

if($_GET['act'] == 'list')
 {
    $link = mysql_connect('localhost', 'username', 'password', '1');
    mysql_select_db('peoples');
    $sql = "SELECT word FROM filters";
    $result = mysql_query($sql, $link);
    $i=0;
    while($i < mysql_num_rows($result))
     {
        echo mysql_result($result, $i, 'word') . "<br />";
        $i++;
     }
     echo '<hr>';
 }
?>
<html>
    <head>
        <title>RegEx Generator</title>
    </head>
    <body>
        <form action='badword.php?act=do' method='post'>
            Word: <input type='text' name='word' /><br />
            <input type='submit' value='Generate' />
        </form>
        <a href="badword.php?act=list">List Words</a>
    </body>
</html>

AgentConundrum

Beware of clbuttic mistakes.

"Apple made the clbuttic mistake of forcing out their visionary - I mean, look at what NeXT has been up to!"

Hmm. "clbuttic".

Google "clbuttic" - thousands of hits!

There's someone who call his car 'clbuttic'.

There are "Clbuttic Steam Engine" message boards.

Webster's dictionary - no help.

Hmm. What can this be?

HINT: People who make buttumptions about their regex scripts, will be embarbutted when they repeat this mbuttive mistake.

Shutterstock has a Github repo with a list of bad words used for filtering.

You can check it out here: https://github.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

If anyone needs an API, google currently provide a bad word indicator.

http://www.wdyl.com/profanity?q=naughtyword

{
response: "false"
}

Update: Google has now removed this service.

I would say to just remove posts as you become aware of them, and block users who are overly explicit with their postings. You can say very offensive things without using any swear words. If you block the word ass (aka donkey), then people will just type a$$ or /\55, or whatever else they need to type to get past the filter.

+1 on the Clbuttic mistake, I think it is important for "bad word" filters to scan for both leading and trailing spaces (e.g., " ass ") as opposed for just the exact string so that we won't have words like clbuttic, clbuttes, buttert, buttess, etc.

Wikipedia ClueBot has a bad word filter, read its source.

http://en.wikipedia.org/wiki/User:ClueBot/Source#Score_list

You could always convince the client to have a session of users just constantly posting expletives and make an easy solution to add them to the system. It is a lot of work but it will probably be more representative of the community.

In researching this topic I determined that what was needed was more than just a list that does arbitrary replacements. I have built a web service that allows you to identify the level of 'cleanliness' you desire. It also makes an effort to identify false positives - i.e. where a word may be bad in one context but not in others. Take a look at http://filterlanguage.com

来源：https://stackoverflow.com/questions/24515/bad-words-filter

标签

list

dictionary

profanity