How do you implement a good profanity filter?

误落风尘 2020-11-22 04:27

Many of us need to deal with user input, search queries, and situations where the input text can potentially contain profanity or undesirable language. Oftentimes this needs

21 Answers
  • 2020-11-22 05:17

    I collected 2200 bad words in 25 languages: en, ar, cs, da, de, eo, es, fa, fi, fr, hi, hu, it, ja, ko, nl, no, pl, pt, ru, sv, th, tlh, tr, zh.

    MySQL dump, JSON, XML or CSV options are available.

    https://github.com/turalus/openDB

    I'd suggest importing this SQL into your database and checking against it every time a user submits something.
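
    A minimal sketch of that per-input check, written in Python with an in-memory SQLite table standing in for the imported dump. The table name `bad_words`, its columns, and the sample rows are assumptions made for illustration; the actual schema of the dump may differ.

```python
import re
import sqlite3

# Hypothetical schema and rows; the real dump's table and columns may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bad_words (word TEXT PRIMARY KEY, lang TEXT)")
conn.executemany(
    "INSERT INTO bad_words (word, lang) VALUES (?, ?)",
    [("boogers", "en"), ("snot", "en"), ("poop", "en")],  # placeholder rows
)


def find_bad_words(text, lang="en"):
    """Return the distinct words in `text` that are listed for `lang`."""
    tokens = sorted(set(re.findall(r"\w+", text.lower())))
    if not tokens:
        return []
    # One parameterized query per input; never interpolate user text into SQL.
    placeholders = ",".join("?" * len(tokens))
    rows = conn.execute(
        f"SELECT word FROM bad_words WHERE lang = ? AND word IN ({placeholders})",
        [lang, *tokens],
    )
    return sorted(row[0] for row in rows)
```

    Very long inputs may exceed SQLite's limit on bound parameters, so a real implementation would probably batch the tokens or cache the word list in memory.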

  • 2020-11-22 05:22

    Obscenity Filters: Bad Idea, or Incredibly Intercoursing Bad Idea?

    Also, one can't forget The Untold History of Toontown's SpeedChat, where even using a "safe-word whitelist" resulted in a 14-year-old quickly circumventing it with: "I want to stick my long-necked Giraffe up your fluffy white bunny."

    Bottom line: Ultimately, for any system that you implement, there is absolutely no substitute for human review (whether peer or otherwise). Feel free to implement a rudimentary tool to get rid of the drive-bys, but for the determined troll, you absolutely must have a non-algorithm-based approach.

    A system that removes anonymity and introduces accountability (something that Stack Overflow does well) is helpful also, particularly in order to help combat John Gabriel's G.I.F.T.

    You also asked where you can get profanity lists to get you started -- one open-source project to check out is DansGuardian -- check out the source code for their default profanity lists. There is also an additional third-party Phrase List that you can download for the proxy that may be a helpful gleaning point for you.

    Edit in response to the question edit: Thanks for the clarification on what you're trying to do. In that case, if you're just trying to do a simple word filter, there are two ways you can do it. One is to create a single long regexp with all of the banned phrases that you want to censor, and merely do a regex find/replace with it. A regex like:

    // Delimited pattern; matches any listed word anywhere in the input.
    $filterRegex = "/(boogers|snot|poop|shucks|argh)/";
    

    and run it on your input string using preg_match() to test wholesale for a hit, or preg_replace() to blank the matches out.

    You can also load those functions up with arrays rather than a single long regex, and for long word lists, it may be more manageable. See the preg_replace() for some good examples as to how arrays can be used flexibly.
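
    As a rough illustration of the single-regex approach, here is a sketch in Python, whose `re` module is used as a stand-in for PHP's preg_match() and preg_replace(). The function names are made up for this example, and the pattern is built from a list so that a longer word list stays manageable.

```python
import re

# Word list taken from the example regex above.
BANNED = ["boogers", "snot", "poop", "shucks", "argh"]

# One alternation built from the list. re.escape guards against regex
# metacharacters in entries. There are no word boundaries, as in the
# original pattern, so substrings inside longer words also match.
FILTER_RE = re.compile("(" + "|".join(map(re.escape, BANNED)) + ")", re.IGNORECASE)


def contains_banned(text):
    """Rough analogue of a preg_match() test for any hit."""
    return FILTER_RE.search(text) is not None


def blank_out(text):
    """Rough analogue of preg_replace(): delete every match."""
    return FILTER_RE.sub("", text)
```

    Because the pattern has no word boundaries, `contains_banned("snotty")` is also a hit; wrapping the alternation in `\b...\b` is one way to trade those false positives for easier evasion.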

    For additional PHP programming examples, see this page for a somewhat advanced generic class for word filtering that *'s out the center letters from censored words, and this previous Stack Overflow question that also has a PHP example (the main valuable part in there is the SQL-based filtered word approach -- the leet-speak compensator can be dispensed with if you find it unnecessary).
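
    The idea of starring out the center letters can be sketched as follows. This is not the class referred to above, only a small Python illustration with hypothetical names, using a replacement callback and whole-word matching.

```python
import re

BANNED = ["boogers", "snot", "poop", "shucks", "argh"]

# Whole-word, case-insensitive match on any listed word.
WORD_RE = re.compile(
    r"\b(" + "|".join(map(re.escape, BANNED)) + r")\b", re.IGNORECASE
)


def star_center(match):
    """Keep the first and last letter, replace the rest with asterisks."""
    word = match.group(0)
    if len(word) <= 2:
        return "*" * len(word)
    return word[0] + "*" * (len(word) - 2) + word[-1]


def censor(text):
    """Mask every banned word in `text`, leaving other text untouched."""
    return WORD_RE.sub(star_center, text)
```

    For instance, `censor("Shucks, snot!")` gives `"S****s, s**t!"`, while a word that merely contains a listed word, such as "snotty", is left alone because of the `\b` boundaries.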

    You also added: "Getting the list of words in the first place is the real question." -- in addition to some of the previous DansGuardian links, you may find this handy .zip of 458 words to be helpful.

  • 2020-11-22 05:22

    Don't. It just leads to problems. One clbuttic personal experience I have with profanity filters is the time where I was kick/banned from an IRC channel for mentioning that I was "heading over the bridge to Hancock for a couple hours" or something to that effect.

  • 2020-11-22 05:23

    Beware of localization issues: what is a swearword in one language might be a perfectly normal word in another.

    One current example of this: eBay uses a dictionary approach to filter "bad words" from feedback. If you try to enter the German translation of "this was a perfect transaction" ("das war eine perfekte Transaktion"), eBay will reject the feedback due to bad words.

    Why? Because the German word for "was" is "war", and "war" is in eBay's dictionary of "bad words".

    So beware of localization issues.
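
    One way to reduce this kind of false positive is to keep a separate word list per language and apply only the list for the language of the text. The sketch below, in Python, assumes the application already knows that language; the dictionary contents and function names are placeholders, not any real site's lists.

```python
import re

# Hypothetical per-language lists. "war" appears only in the English list;
# the German entry is a placeholder word.
WORDS_BY_LANG = {
    "en": {"war"},
    "de": {"beispielwort"},
}


def flagged(text, lang):
    """Return tokens of `text` that are listed for `lang` only."""
    banned = WORDS_BY_LANG.get(lang, set())
    return [t for t in re.findall(r"\w+", text.lower()) if t in banned]


def flagged_any_language(text):
    """The single-dictionary approach: every list applied to every text."""
    all_words = set().union(*WORDS_BY_LANG.values())
    return [t for t in re.findall(r"\w+", text.lower()) if t in all_words]
```

    With these placeholder lists, the German sentence above is flagged by `flagged_any_language()` because of "war", but passes `flagged(..., "de")`.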

  • 2020-11-22 05:26

    A profanity filtering system will never be perfect, even if the programmer is cocksure and keeps abreast of all nude developments.

    That said, any list of 'naughty words' is likely to perform as well as any other list, since the underlying problem is language understanding, which is pretty much intractable with current technology.

    So the only practical solution is twofold:

    1. be prepared to update your dictionary frequently
    2. hire a human editor to correct false positives (e.g. "clbuttic" instead of "classic") and false negatives (oops! missed one!)
  • 2020-11-22 05:26

    Regarding your "trick the system" subquestion, you can handle that by normalizing both the "bad word" list and the user-entered text before doing your search. For example, use a series of regexes (or a character-translation function such as PHP's strtr()) to convert [z$5] to "s", [4@] to "a", etc., then compare the normalized "bad word" list against the normalized text. Note that the normalization could potentially lead to additional false positives, although I can't think of any actual cases at the moment.
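
    A minimal sketch of that normalization step, in Python, using only the two character classes mentioned above. The word list and function names are placeholders; a real table would need many more substitutions.

```python
import re

# Substitutions from the text above: [z$5] -> "s", [4@] -> "a".
LEET_TABLE = str.maketrans({"z": "s", "$": "s", "5": "s", "4": "a", "@": "a"})

BANNED = ["boogers", "snot", "poop", "shucks", "argh"]  # placeholder list


def normalize(text):
    """Lowercase the text and fold look-alike characters to letters."""
    return text.lower().translate(LEET_TABLE)


# Normalize the list too, so both sides are compared in the same form.
NORMALIZED_BANNED = {normalize(word) for word in BANNED}


def hits(text):
    """Return normalized tokens of `text` that match the normalized list."""
    tokens = re.findall(r"\w+", normalize(text))
    return [t for t in tokens if t in NORMALIZED_BANNED]
```

    Tokenizing after normalization matters here: "$" is not a word character, so "$not" only becomes a single token once it has been rewritten to "snot".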

    The larger challenge is to come up with something that will let people quote "The pen is mightier than the sword" while blocking "p e n i s".
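
    One possible heuristic for this, sketched in Python: collapse only runs of three or more single letters separated by spaces, dots, or hyphens, and leave ordinary multi-letter words untouched. The regex and names are assumptions for illustration; the approach still misses variants such as "pen is" used deliberately, and it can join legitimate runs of initials.

```python
import re

BANNED = {"penis"}  # the example word from the text above

# Three or more single letters separated by whitespace, dots, or hyphens.
SPACED_RUN = re.compile(r"\b(?:[A-Za-z][\s.-]+){2,}[A-Za-z]\b")
SEPARATORS = re.compile(r"[\s.-]+")


def collapse_spaced_letters(text):
    """Join runs like 'a b c' into 'abc'; leave normal words untouched."""
    return SPACED_RUN.sub(lambda m: SEPARATORS.sub("", m.group(0)), text)


def hits(text):
    """Return banned words found after collapsing spaced-out letters."""
    tokens = re.findall(r"[a-z]+", collapse_spaced_letters(text).lower())
    return [t for t in tokens if t in BANNED]
```

    Because "pen" and "is" are multi-letter words, the proverb passes through unchanged, while the letter-by-letter spelling is joined and caught.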
