Natural Language Processing: Find obscenities in English?

自闭症患者 2021-02-09 21:15

Given a set of words tagged for part of speech, I want to find those that are obscenities in mainstream English. How might I do this? Should I just make a huge list, and check f

11 Answers
  • 2021-02-09 21:56

    At Melissa Data, when my manager (the director of Massachusetts Research and Development) and I refactored a Data Profiler targeted at relational databases, we counted profanities by the number of Levenshtein distance matches. The allowed number of insertions, deletions and substitutions was tunable by the user, to allow for spelling mistakes, Germanic equivalents of English words, plurals, and whitespace and non-whitespace punctuation. We sped up the Levenshtein distance calculation by looking only at the diagonal bands of the n by n matrix.
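
    To illustrate the diagonal-band idea, here is a minimal sketch in Java. It is not the Melissa Data code; the class name `BandedLevenshtein`, the `maxEdits` parameter and the convention of returning -1 when the threshold is exceeded are all assumptions made for the example. Only cells within `maxEdits` of the main diagonal are filled in, because any cell further out already costs more than the threshold.

```java
import java.util.Arrays;

// Hypothetical helper, not from the original answer: Levenshtein distance
// restricted to a diagonal band of half-width maxEdits.
final class BandedLevenshtein {

    /** Returns the edit distance if it is <= maxEdits, otherwise -1. */
    static int distance(String a, String b, int maxEdits) {
        if (maxEdits < 0) {
            throw new IllegalArgumentException("maxEdits must be >= 0");
        }
        int n = a.length();
        int m = b.length();
        // The final cell lies outside the band, so the answer must exceed maxEdits.
        if (Math.abs(n - m) > maxEdits) {
            return -1;
        }
        final int inf = maxEdits + 1; // sentinel meaning "more than allowed"
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        for (int j = 0; j <= m; j++) {
            prev[j] = (j <= maxEdits) ? j : inf;
        }
        for (int i = 1; i <= n; i++) {
            Arrays.fill(curr, inf);
            if (i <= maxEdits) {
                curr[0] = i; // deleting the first i characters of a
            }
            // Only visit columns within maxEdits of the diagonal.
            int lo = Math.max(1, i - maxEdits);
            int hi = Math.min(m, i + maxEdits);
            for (int j = lo; j <= hi; j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int best = prev[j - 1] + cost;         // match or substitution
                best = Math.min(best, prev[j] + 1);     // deletion
                best = Math.min(best, curr[j - 1] + 1); // insertion
                curr[j] = Math.min(best, inf);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return (prev[m] <= maxEdits) ? prev[m] : -1;
    }
}
```

    With this sketch, `BandedLevenshtein.distance(token, listedWord, 1)` would treat a token as a match for a listed word when they differ by at most one edit. The work per comparison is roughly proportional to the word length times the band width rather than to the full matrix.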

  • 2021-02-09 21:57

    Make a huge list, and think about the target audience. Is there a third-party service specialising in this that you could use rather than rolling your own?

    Some quick thoughts:

    • The Scunthorpe problem (and follow the links to "Swear filter" for more)
    • British or American English? Words such as "fanny" and "fag" mean different things in each
    • Political correctness: "black" or "Afro-American"?

    Edit:

    • Be very careful and again here. Normal words can offend, whether by choice or ignorance
  • 2021-02-09 21:57

    Consider Bayesian analysis for this problem. Bayesian classification is the technique spam filters use to detect spam and phishing messages in your email inbox. You can train the classifier so that it improves over time. Telling a legitimate email from a spam email is very similar to the problem you describe.
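
    As a rough illustration of the idea, here is a minimal multinomial naive Bayes sketch in Java with add-one smoothing. It is neither the scoring formula from "A Plan For Spam" nor the Classifier4J API; the class `NaiveBayesSketch`, its method names and the crude tokenizer are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical two-class naive Bayes classifier: "flagged" versus "clean" text.
final class NaiveBayesSketch {
    private final Map<String, Integer> flaggedCounts = new HashMap<>();
    private final Map<String, Integer> cleanCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int flaggedTokens, cleanTokens, flaggedDocs, cleanDocs;

    /** Adds one labelled example to the model. */
    void train(String text, boolean flagged) {
        if (flagged) flaggedDocs++; else cleanDocs++;
        for (String tok : tokenize(text)) {
            vocabulary.add(tok);
            if (flagged) {
                flaggedCounts.merge(tok, 1, Integer::sum);
                flaggedTokens++;
            } else {
                cleanCounts.merge(tok, 1, Integer::sum);
                cleanTokens++;
            }
        }
    }

    /** True when the flagged class has the higher log score. */
    boolean isFlagged(String text) {
        return logScore(text, true) > logScore(text, false);
    }

    /** Log prior plus the sum of smoothed log likelihoods for one class. */
    double logScore(String text, boolean flagged) {
        Map<String, Integer> counts = flagged ? flaggedCounts : cleanCounts;
        int totalTokens = flagged ? flaggedTokens : cleanTokens;
        int docs = flagged ? flaggedDocs : cleanDocs;
        double score = Math.log((docs + 1.0) / (flaggedDocs + cleanDocs + 2.0));
        int v = vocabulary.size() + 1; // one extra slot for unseen tokens
        for (String tok : tokenize(text)) {
            int c = counts.getOrDefault(tok, 0);
            score += Math.log((c + 1.0) / (totalTokens + v));
        }
        return score;
    }

    /** Lower-cases and splits on anything that is not a letter or apostrophe. */
    static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z']+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }
}
```

    You would call `train` with labelled examples and then `isFlagged` on new text. Unlike a fixed list, the model changes as more examples are added, which is the "improve over time" part.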

    Here are a couple of useful links:

    A Plan For Spam - The first proposal to use Bayesian analysis to combat spam.

    Data Mining (ppt) - This was written by a colleague of mine.

    Classifier4J - A text classifier library written in Java (they exist for every language, but you tagged this question with Java).

  • 2021-02-09 21:58

    I'd make a huge list.

    Regexes tend to misfire when applied to natural language, especially given the number of exceptions English has.
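
    Since the words in the question are already tokenized, a list lookup can be an exact whole-word match. Here is a minimal sketch in Java; the class `WordListFilter` and the mild placeholder words are assumptions for the example. The `substringHit` method is included only to show the kind of misfire that pattern or substring matching produces.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical list-based filter: case-insensitive, whole tokens only.
final class WordListFilter {
    private final Set<String> blocked = new HashSet<>();

    WordListFilter(Iterable<String> words) {
        for (String w : words) {
            blocked.add(w.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns the tokens that exactly match a listed word, ignoring case. */
    List<String> flagged(List<String> tokens) {
        List<String> hits = new ArrayList<>();
        for (String t : tokens) {
            if (blocked.contains(t.toLowerCase(Locale.ROOT))) {
                hits.add(t);
            }
        }
        return hits;
    }

    /** Naive substring check, shown for contrast: it fires inside innocent words. */
    boolean substringHit(String text) {
        String lower = text.toLowerCase(Locale.ROOT);
        for (String w : blocked) {
            if (lower.contains(w)) {
                return true;
            }
        }
        return false;
    }
}
```

    With a list containing "hell", the whole-word lookup leaves the tokens "Hello" and "shell" alone, while the substring check fires on "a shell script".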

  • 2021-02-09 21:59

    Is the phrase "I want to stick my long-necked Giraffe up your fluffy white bunny" obscene?

  • 2021-02-09 22:02

    Note that any NLP logic like this will be subject to attacks of "character replacement":

    For example, I can write "hello" as "he11o", replacing the L's with ones. The same works for obscenities. So while there's no perfect answer, a blacklist of bad words might work. Watch out for false positives (I'd run the blacklist against a large book to see what comes up).
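
    A minimal sketch of both ideas in Java: undo common character replacements before the lookup, then count which listed words a body of text triggers so you can inspect them for false positives. The class `LeetNormalizer`, the substitution table and the placeholder word are assumptions for the example. The table is deliberately small and ambiguous in places ("1" could stand for "i" as well as "l"), and it will also turn real numbers into letters.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical normalizer for simple character-replacement tricks.
final class LeetNormalizer {
    // Assumed substitution table; a real one would need tuning.
    private static final Map<Character, Character> SUBS = Map.of(
            '0', 'o', '1', 'l', '3', 'e', '4', 'a',
            '5', 's', '7', 't', '@', 'a', '$', 's');

    /** Lower-cases the token and maps look-alike characters back to letters. */
    static String normalize(String token) {
        StringBuilder sb = new StringBuilder(token.length());
        for (char c : token.toLowerCase(Locale.ROOT).toCharArray()) {
            char mapped = SUBS.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    /** Counts blacklist hits per word in a text, for reviewing false positives. */
    static Map<String, Integer> blocklistHits(String text, Set<String> blocklist) {
        Map<String, Integer> hits = new TreeMap<>();
        for (String raw : text.split("\\s+")) {
            // Normalize first, then drop any remaining punctuation.
            String tok = normalize(raw).replaceAll("[^a-z]", "");
            if (blocklist.contains(tok)) {
                hits.merge(tok, 1, Integer::sum);
            }
        }
        return hits;
    }
}
```

    Running `blocklistHits` over the text of a large book and reading through the resulting counts is one way to do the false-positive check suggested above.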
