How do you implement a good profanity filter?

误落风尘 2020-11-22 04:27

Many of us need to deal with user input, search queries, and situations where the input text can potentially contain profanity or undesirable language. Oftentimes this needs

21 answers
  • 2020-11-22 05:26

    Frankly, I'd let them get the "trick the system" words out and then ban them instead, but that's just me. It also makes the programming simpler.

    What I'd do is implement a regex filter like so: /[\s]dooby (doo?)[\s]/i or, if the word is a prefix of others, /[\s]doob(er|ed|est)[\s]/. These would avoid filtering words like "assuaged", which is perfectly valid, but they also require knowing the other variants and updating the filter whenever you learn a new one. Obviously these are just examples; you'd have to decide how to do it yourself.
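
    A minimal PHP sketch of that idea, assuming a hypothetical helper named `containsBlockedWord()` and the placeholder words from the patterns above. It swaps `[\s]` for `\b` (a word boundary) so that a match at the very start or end of the input, or next to punctuation, is still caught; that swap is an assumption on top of the patterns as written.

    ```php
    <?php
    // Hypothetical helper: returns true if $text contains one of the
    // placeholder words as a whole word (case-insensitive).
    function containsBlockedWord(string $text): bool
    {
        $patterns = [
            '/\bdooby doo?\b/i',          // "dooby do" or "dooby doo"
            '/\bdoob(?:er|ed|est)\b/i',   // "doober", "doobed", "doobest"
        ];
        foreach ($patterns as $pattern) {
            // preg_match() returns 1 on a match, 0 on no match.
            if (preg_match($pattern, $text) === 1) {
                return true;
            }
        }
        return false;
    }
    ```

    Because the patterns are anchored on word boundaries, "doobers" or a word that merely contains "doob" is not flagged; each new variant still has to be added to the list by hand.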

    I'm not about to type out all the words I know, not when I don't actually want to know them.

  • 2020-11-22 05:28

    The only way to prevent offensive user input is to prevent all user input.

    If you insist on allowing user input and need moderation, then incorporate human moderators.

  • 2020-11-22 05:30

    Once you have a good MySQL table of bad words you want to filter (I started with one of the links in this thread), you can do something like this:

    $errors = array(); // Initialize error array (I use this with all my PHP form validations).

    // Escape the input data to prevent SQL injection when you query the profanity table.
    // Note: the mysql_* functions were removed in PHP 7; on current PHP, use mysqli or PDO prepared statements instead.
    $SCREENNAME = mysql_real_escape_string($_POST['SCREENNAME']);

    // Make the input string uppercase (so that 'BaDwOrD' is the same as 'BADWORD').
    // All your values in the profanity table will need to be UPPERCASE for this to work.
    $ProfanityCheckString = strtoupper($SCREENNAME);

    // I allow alphanumerics, underscores, and dashes, nothing else (I control this with PHP form validation).
    // Pull out the separators so 'B-A-D-W-O-R-D' shows up as 'BADWORD'.
    $ProfanityCheckString = preg_replace('/[_-]/', '', $ProfanityCheckString);

    // Replace common numeric representations of letters so '84DW0RD' shows up as 'BADWORD'.
    $ProfanityCheckString = preg_replace('/1/', 'I', $ProfanityCheckString);
    $ProfanityCheckString = preg_replace('/3/', 'E', $ProfanityCheckString);
    $ProfanityCheckString = preg_replace('/4/', 'A', $ProfanityCheckString);
    $ProfanityCheckString = preg_replace('/5/', 'S', $ProfanityCheckString);
    $ProfanityCheckString = preg_replace('/6/', 'G', $ProfanityCheckString);
    $ProfanityCheckString = preg_replace('/7/', 'T', $ProfanityCheckString);
    $ProfanityCheckString = preg_replace('/8/', 'B', $ProfanityCheckString);
    $ProfanityCheckString = preg_replace('/0/', 'O', $ProfanityCheckString); // Replace zeros with capital letter O's.

    // Replace Z's with S's, another common substitution. Make sure you replace Z's with S's in your
    // profanity table for this to work properly. Same with all the numbers too: having 'S3XY' in your
    // table won't work, since this code renders that input as 'SEXY'. The profanity table should hold
    // the "rendered" version of the bad words.
    $ProfanityCheckString = preg_replace('/Z/', 'S', $ProfanityCheckString);

    // Check your profanity table for the scrubbed input. You could get real crazy using LIKE and
    // wildcards, but I only want a simple profanity filter.
    $CheckProfanity = mysql_query("SELECT * FROM DATABASE.TABLE p WHERE p.WORD = '" . $ProfanityCheckString . "'");
    if (mysql_num_rows($CheckProfanity) > 0) {
        $errors[] = 'Please select another Screen Name.';
    }

    // Echo any PHP errors that come out of the validation, including any profanity flagging.
    if (count($errors) > 0) {
        $errorString = ''; // Initialize before appending to avoid an undefined-variable notice.
        foreach ($errors as $error) {
            $errorString .= "<span class='PHPError'>$error</span><br /><br />";
        }
        echo $errorString;
    }

    // You can also use these lines to troubleshoot.
    // echo $ProfanityCheckString;
    // echo "<br />";
    // echo mysql_error();
    // echo "<br />";
    

    I'm sure there is a more efficient way to do all those replacements, but I'm not smart enough to figure it out (and this seems to work okay, albeit inefficiently).
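
    One way to collapse that chain of replacements is a single `strtr()` call with a from/to character list, plus one `str_replace()` for the separators. This is only a sketch under the same substitution rules as the code above; `normalizeScreenName()` is a hypothetical name, not something from the original code.

    ```php
    <?php
    // Hypothetical helper: same normalization as the preg_replace() chain,
    // done in two calls. Assumes single-byte (ASCII) screen names.
    function normalizeScreenName(string $name): string
    {
        // Uppercase first so 'BaDwOrD' and 'BADWORD' compare equal.
        $name = strtoupper($name);

        // Drop the permitted separators: 'B-A-D_W-O-R-D' becomes 'BADWORD'.
        $name = str_replace(['_', '-'], '', $name);

        // Character-for-character translation: the n-th character of the
        // second argument is replaced by the n-th character of the third.
        return strtr($name, '13456780Z', 'IEASGTBOS');
    }
    ```

    None of the replacement letters is itself a digit or a Z, so translating all characters in one pass gives the same result as applying the replacements one after another.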

    I believe that you should err on the side of allowing users to register, and use humans to filter and add to your profanity table as required. It all depends, though, on the cost of a false positive (an okay word flagged as bad) versus a false negative (a bad word gets through). That should ultimately govern how aggressive or conservative you are in your filtering strategy.

    I would also be very careful if you want to use wildcards, since they can match far more than you intend.
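
    To illustrate the wildcard risk: a `LIKE '%WORD%'` query is a substring match, so it flags any name that merely contains a listed word. The sketch below imitates both behaviours in plain PHP without a database; the one-word list and the function names `exactMatch()` and `wildcardMatch()` are assumptions for illustration only.

    ```php
    <?php
    // Hypothetical stand-in for the profanity table (normalized, uppercase).
    $blocked = ['HELL'];

    // Imitates an exact lookup: WHERE p.WORD = '<name>'
    function exactMatch(string $name, array $blocked): bool
    {
        return in_array($name, $blocked, true);
    }

    // Imitates a wildcard lookup: WHERE '<name>' LIKE CONCAT('%', p.WORD, '%')
    function wildcardMatch(string $name, array $blocked): bool
    {
        foreach ($blocked as $word) {
            // strpos() returns false only when $word does not occur in $name.
            if (strpos($name, $word) !== false) {
                return true;
            }
        }
        return false;
    }
    ```

    With this list, `wildcardMatch('MICHELLE', $blocked)` returns true even though the name is harmless, while `exactMatch('MICHELLE', $blocked)` returns false. Which error is cheaper is the false-positive versus false-negative trade-off described above.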
