How do you implement a good profanity filter?


Many of us need to deal with user input, search queries, and situations where the input text can potentially contain profanity or undesirable language. Oftentimes this needs

21 answers

别跟我提以往

2020-11-22 05:12

I don't know of any good libraries for this, but whatever you do, make sure that you err in the direction of letting stuff through. I've dealt with systems that wouldn't allow me to use "mpassell" as a username, because it contains "ass" as a substring. That's a great way to alienate users!
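To make the "mpassell" problem concrete, here is a minimal Python sketch comparing a naive substring check with a whole-word check. The `BLOCKED` list and both function names are made up for illustration; this is not taken from any particular library.

```python
import re

# Hypothetical block list, using the substring from the username example.
BLOCKED = ["ass"]


def contains_substring(text, blocked=BLOCKED):
    """Naive check: flags text that contains a blocked term anywhere."""
    lowered = text.lower()
    return any(term in lowered for term in blocked)


def contains_whole_word(text, blocked=BLOCKED):
    """Stricter check: flags a blocked term only when it stands alone as a word."""
    if not blocked:
        return False
    pattern = r"\b(?:" + "|".join(re.escape(term) for term in blocked) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


if __name__ == "__main__":
    for name in ["mpassell", "compass", "you ass"]:
        print(name, contains_substring(name), contains_whole_word(name))
```

The whole-word version lets "mpassell" through, at the cost of missing terms glued to other text, which is the direction this answer suggests erring in.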

别那么骄傲

2020-11-22 05:12

During a job interview of mine, the company CTO who was interviewing me tried out a word/web game I wrote in Java. Out of a word list of the entire Oxford English dictionary, what was the first word that came up to be guessed?

Of course, the most foul word in the English language.

Somehow, I still got the job offer, but I then tracked down a profanity word list (not unlike this one) and wrote a quick script to generate a new dictionary without all of the bad words (without even having to look at the list).

For your particular case, I think comparing the search to real words sounds like the way to go with a word list like that. The alternative styles/punctuation require a bit more work, but I doubt users will use that often enough to be an issue.
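A rough Python sketch of that kind of quick script, assuming both the dictionary and the profanity list are plain text files with one word per line. The function names and file layout are assumptions for illustration, not the script I actually wrote (the game was in Java).

```python
def load_words(path):
    """Read one word per line, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def remove_blocked(words, blocked):
    """Return the words that are not on the block list (case-insensitive)."""
    blocked_set = {word.lower() for word in blocked}
    return [word for word in words if word.lower() not in blocked_set]


def write_clean_dictionary(dictionary_path, blocklist_path, output_path):
    """Write a copy of the dictionary with every blocked word left out."""
    clean = remove_blocked(load_words(dictionary_path), load_words(blocklist_path))
    with open(output_path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(clean) + "\n")
    return len(clean)
```

Because the comparison happens on whole entries, nobody has to read the block list, and ordinary words that merely contain a blocked substring stay in the dictionary.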

Happy的楠姐

2020-11-22 05:13

Have a look at CDYNE's Profanity Filter Web Service

Testing URL

走了就别回头了

2020-11-22 05:14
I agree with the futility of the subject, but if you have to have a filter, check out Ning's Boxwood:

Boxwood is a PHP extension for fast replacement of multiple words in a piece of text. It supports case-sensitive and case-insensitive matching. It requires that the text it operates on be encoded as UTF-8.

Also see this blog post for more details:
- Fast Multiple String Replacement in PHP
With Boxwood, you can have your list of search terms be as long as you like -- the search and replace algorithm doesn't get slower with more words on the list of words to look for. It works by building a trie of all the search terms and then scans your subject text just once, walking down elements of the trie and comparing them to characters in your text. It supports US-ASCII and UTF-8, case-sensitive or insensitive matching, and has some English-centric word boundary checking logic.
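For readers who want to see the idea without installing a PHP extension, below is a small Python sketch of trie-based masking in the spirit of that description. It is not Boxwood's code or API: `build_trie`, `replace_words`, the `whole_words` flag, and the simple `isalnum` boundary test are all assumptions made for illustration.

```python
END = ""  # marker key for "a word ends here"; cannot collide with a character


def build_trie(words):
    """Build a nested-dict trie from lowercase versions of the search terms."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def _is_boundary(text, start, end):
    """True if text[start:end] is not glued to letters or digits on either side."""
    before_ok = start == 0 or not text[start - 1].isalnum()
    after_ok = end == len(text) or not text[end].isalnum()
    return before_ok and after_ok


def replace_words(text, trie, mask="*", whole_words=True):
    """Scan text left to right, masking the longest trie match at each position."""
    out = []
    i = 0
    while i < len(text):
        node = trie
        j = i
        match_end = -1
        # Walk down the trie for as long as the text keeps matching.
        while j < len(text) and text[j].lower() in node:
            node = node[text[j].lower()]
            j += 1
            if END in node and (not whole_words or _is_boundary(text, i, j)):
                match_end = j
        if match_end != -1:
            out.append(mask * (match_end - i))
            i = match_end
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    trie = build_trie(["darn", "heck"])
    print(replace_words("Darn it, what the heck!", trie))
```

The work done at each position is bounded by the longest search term rather than by the number of terms, which is why adding more words to the list does not slow the scan down.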

时光取名叫无心

2020-11-22 05:15

I agree with HanClinto's post higher up in this discussion. I generally use regular expressions to match input text, but it is a losing battle: as you originally mentioned, the "blocked" list has to account explicitly for every trick spelling popular on the net.
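To give an idea of what accounting for trick spellings looks like, here is a minimal Python sketch that expands one blocked word into a regular expression tolerating a few common character swaps, repeated letters, and separators. The `SUBSTITUTIONS` table and the `obfuscation_pattern` name are made up for illustration, and the table is nowhere near complete, which is rather the point.

```python
import re

# Hypothetical table of look-alike characters; real abuse uses many more.
SUBSTITUTIONS = {
    "a": "a@4",
    "e": "e3",
    "i": "i1!",
    "o": "o0",
    "s": "s$5",
}


def obfuscation_pattern(word):
    """Compile a regex matching `word` despite simple obfuscation tricks."""
    parts = []
    for ch in word.lower():
        variants = SUBSTITUTIONS.get(ch, ch)
        # One or more of any variant, so "daaarn" and "d@rn" both match "darn".
        parts.append("[" + re.escape(variants) + "]+")
    # Allow punctuation, spaces, or underscores between letters ("d.a.r.n").
    body = r"[\W_]*".join(parts)
    # Require that the match is not embedded in a longer word.
    return re.compile(r"(?<!\w)" + body + r"(?!\w)", re.IGNORECASE)


if __name__ == "__main__":
    pattern = obfuscation_pattern("darn")
    for sample in ["d@rn", "D.A.R.N!", "daaarn it", "darning", "dahrn"]:
        print(sample, bool(pattern.search(sample)))
```

The last sample, "dahrn", slips straight through, and every new trick needs another entry or another rule, so the pattern list never stops growing.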

On a side note, while others are debating the ethics of censorship, I must agree that some form is necessary on the web. Some people simply enjoy posting vulgarity because it can be instantly offensive to a large body of people, and requires absolutely no thought on the author's part.

Thank you for the ideas.

HanClinto rules!

忘掉有多难

2020-11-22 05:15
I concluded that a good profanity filter needs three main components, or at least that is what I am going to build:
1. The filter: a background service that checks text against a blacklist, a dictionary, or something similar.
2. No anonymous accounts.
3. Abuse reporting.
As a bonus, reward users who submit accurate abuse reports and punish offenders, e.g. by suspending their accounts.