Many of us need to deal with user input, search queries, and situations where the input text can potentially contain profanity or undesirable language. Oftentimes this needs
I don't know of any good libraries for this, but whatever you do, make sure that you err in the direction of letting stuff through. I've dealt with systems that wouldn't allow me to use "mpassell" as a username, because it contains "ass" as a substring. That's a great way to alienate users!
During a job interview of mine, the company CTO who was interviewing me tried out a word/web game I wrote in Java. Out of a word list of the entire Oxford English dictionary, what was the first word that came up to be guessed?
Of course, the most foul word in the English language.
Somehow, I still got the job offer, but I then tracked down a profanity word list (not unlike this one) and wrote a quick script to generate a new dictionary without all of the bad words (without even having to look at the list).
For your particular case, I think comparing the search to real words sounds like the way to go with a word list like that. The alternative styles/punctuation require a bit more work, but I doubt users will use that often enough to be an issue.
Have a look at CDYNE's Profanity Filter Web Service
Testing URL
I agree with the futility of the subject, but if you have to have a filter, check out Ning's Boxwood:
Boxwood is a PHP extension for fast replacement of multiple words in a piece of text. It supports case-sensitive and case-insensitive matching. It requires that the text it operates on be encoded as UTF-8.
Also see this blog post for more details:
With Boxwood, you can have your list of search terms be as long as you like -- the search and replace algorithm doesn't get slower with more words on the list of words to look for. It works by building a trie of all the search terms and then scans your subject text just once, walking down elements of the trie and comparing them to characters in your text. It supports US-ASCII and UTF-8, case-sensitive or insensitive matching, and has some English-centric word boundary checking logic.
I agree with HanClinto's post higher up in this discussion. I generally use regular expressions to string-match input text. And this is a vain effort, as, like you originally mentioned you have to explicitly account for every trick form of writing popular on the net in your "blocked" list.
On a side note, while others are debating the ethics of censorship, I must agree that some form is necessary on the web. Some people simply enjoy posting vulgarity because it can be instantly offensive to a large body of people, and requires absolutely no thought on the author's part.
Thank you for the ideas.
HanClinto rules!
I concluded, in order to create a good profanity filter we need 3 main components, or at least it is what I am going to do. These they are:
A bonus, it will be to reward somehow those who contribute with accurate abuse reporters and punish the offender, e.g. suspend their accounts.