Given a set of words tagged for part of speech, I want to find those that are obscenities in mainstream English. How might I do this? Should I just make a huge list, and check f
At Melissa Data, when my manager, the director of Massachusetts Research and Development, and I refactored a data profiler targeted at relational databases, we counted profanities by the number of Levenshtein distance matches. The number of insertions, deletions and substitutions allowed is tunable by the user, so as to allow for spelling mistakes, Germanic equivalents of English words, plurals, as well as whitespace and non-whitespace punctuation. We sped up the Levenshtein distance calculation by looking only in the diagonal bands of the n by n matrix.
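To illustrate the diagonal-band idea, here is a minimal sketch of a thresholded Levenshtein distance. It is not the Melissa Data implementation; the class name `BandedLevenshtein`, the `maxEdits` parameter and the `-1` "too far" return value are assumptions made for this example. With a tolerance of `maxEdits`, only cells within `maxEdits` of the main diagonal can lie on an acceptable alignment, so the rest of the matrix is skipped.

```java
import java.util.Arrays;

final class BandedLevenshtein {
    /**
     * Returns the edit distance between a and b if it is at most maxEdits,
     * otherwise -1. Only cells with |i - j| <= maxEdits are evaluated.
     */
    static int distance(String a, String b, int maxEdits) {
        int n = a.length();
        int m = b.length();
        // The length difference alone is a lower bound on the distance.
        if (Math.abs(n - m) > maxEdits) {
            return -1;
        }
        final int inf = Integer.MAX_VALUE / 2; // "unreachable", safe to add 1 to
        int[] prev = new int[m + 1];
        int[] curr = new int[m + 1];
        Arrays.fill(prev, inf);
        for (int j = 0; j <= Math.min(m, maxEdits); j++) {
            prev[j] = j; // row 0: insert the first j characters of b
        }
        for (int i = 1; i <= n; i++) {
            // Resetting the whole row keeps the sketch simple; a tuned
            // version would reset only the cells bordering the band.
            Arrays.fill(curr, inf);
            if (i <= maxEdits) {
                curr[0] = i; // column 0: delete the first i characters of a
            }
            int lo = Math.max(1, i - maxEdits);
            int hi = Math.min(m, i + maxEdits);
            for (int j = lo; j <= hi; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = prev[j - 1] + cost;           // match or substitution
                best = Math.min(best, prev[j] + 1);      // deletion from a
                best = Math.min(best, curr[j - 1] + 1);  // insertion into a
                curr[j] = best;
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[m] <= maxEdits ? prev[m] : -1;
    }
}
```

A candidate word would then count as a match for a listed word when `BandedLevenshtein.distance(candidate, listed, maxEdits)` is not `-1`. The right value of `maxEdits` depends on word length and on how many false positives are tolerable.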
Make a huge list, and think of the target audience. Is there a third-party service that specialises in this that you could use rather than rolling your own?
Some quick thoughts:
Edit:
You want to use Bayesian analysis to solve this problem. Bayesian probability is a powerful technique used by spam filters to detect spam and phishing messages in your email inbox. You can train your analysis engine so that it improves over time. Telling a legitimate email from a spam email is very similar to the problem you are describing.
Here are a couple of useful links:
A Plan For Spam - The first proposal to use Bayesian analysis to combat spam.
Data Mining (ppt) - This was written by a colleague of mine.
Classifier4J - A text classifier library written in Java (they exist for every language, but you tagged this question with Java).
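To make the Bayesian suggestion concrete, here is a minimal naive Bayes sketch in plain Java. It does not use the Classifier4J API; the class `NaiveBayesSketch`, its method names and the add-one smoothing are assumptions chosen for illustration, and a real filter would need far more training data than a few sentences.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class NaiveBayesSketch {
    private final Map<String, Integer> flaggedCounts = new HashMap<>();
    private final Map<String, Integer> cleanCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int flaggedTokens;
    private int cleanTokens;
    private int flaggedDocs;
    private int cleanDocs;

    /** Adds one labelled example; flagged = true means "objectionable". */
    void train(String text, boolean flagged) {
        if (flagged) {
            flaggedDocs++;
        } else {
            cleanDocs++;
        }
        for (String token : tokenize(text)) {
            if (token.isEmpty()) {
                continue;
            }
            vocabulary.add(token);
            if (flagged) {
                flaggedCounts.merge(token, 1, Integer::sum);
                flaggedTokens++;
            } else {
                cleanCounts.merge(token, 1, Integer::sum);
                cleanTokens++;
            }
        }
    }

    /** Estimated probability in [0, 1] that the text belongs to the flagged class. */
    double flaggedProbability(String text) {
        int docs = flaggedDocs + cleanDocs;
        // Smoothed class priors, accumulated in log space to avoid underflow.
        double logFlagged = Math.log((flaggedDocs + 1.0) / (docs + 2.0));
        double logClean = Math.log((cleanDocs + 1.0) / (docs + 2.0));
        // Add-one smoothing; the extra +1 reserves mass for unseen tokens.
        double slots = vocabulary.size() + 1.0;
        for (String token : tokenize(text)) {
            if (token.isEmpty()) {
                continue;
            }
            logFlagged += Math.log((flaggedCounts.getOrDefault(token, 0) + 1.0) / (flaggedTokens + slots));
            logClean += Math.log((cleanCounts.getOrDefault(token, 0) + 1.0) / (cleanTokens + slots));
        }
        return 1.0 / (1.0 + Math.exp(logClean - logFlagged));
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z0-9']+");
    }
}
```

After calling `train` on labelled examples, `flaggedProbability` can be compared against a threshold such as 0.5. Because the model learns from whatever it is fed, newly reported examples can be added over time, which is the "improves over time" property mentioned above.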
I'd make a huge list.
Regexes tend to misfire when applied to natural language, especially given the number of exceptions English has.
Is the phrase "I want to stick my long-necked Giraffe up your fluffy white bunny" obscene?
Note that any NLP logic like this will be subject to attacks of "character replacement":
For example, I can write "hello" as "he11o", replacing the L's with ones. The same goes for obscenities. So while there's no perfect answer, a "blacklist" approach of "bad words" might work. Watch out for false positives (I'd run my blacklist against a large book to see what comes up).
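A minimal sketch of that blacklist idea, with a normalisation pass to undo common character replacements, might look like the following. The class `BlacklistSketch`, the `LOOKALIKES` table and the sample word "heck" are all assumptions for illustration; the substitution table is far from complete, and "1" is mapped to "l" even though it can also stand for "i".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

final class BlacklistSketch {
    // Hypothetical, deliberately small table of look-alike characters.
    private static final Map<Character, Character> LOOKALIKES = Map.of(
            '1', 'l', '0', 'o', '3', 'e', '4', 'a', '5', 's', '@', 'a', '$', 's');

    /** Lower-cases the text and maps look-alike characters back to letters. */
    static String normalize(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (char c : text.toLowerCase(Locale.ROOT).toCharArray()) {
            out.append(LOOKALIKES.getOrDefault(c, c));
        }
        return out.toString();
    }

    /** Returns the normalised tokens of text that appear in the blacklist. */
    static List<String> findMatches(String text, Set<String> blacklist) {
        List<String> hits = new ArrayList<>();
        // Whole-token comparison: a listed word buried inside a longer,
        // innocent word does not count as a match.
        for (String token : normalize(text).split("[^a-z]+")) {
            if (!token.isEmpty() && blacklist.contains(token)) {
                hits.add(token);
            }
        }
        return hits;
    }
}
```

For instance, `BlacklistSketch.findMatches("What the h3ck", Set.of("heck"))` reports a hit, while "Checkered flag" does not, because only whole tokens are compared. Running `findMatches` over a large book, as suggested above, is a cheap way to spot list entries that trigger on ordinary text.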