Given a set of words tagged for part of speech, I want to find those that are obscenities in mainstream English. How might I do this? Should I just make a huge list, and check f
There are webservices that do this kind of thing in English.
I'm sure there are others, but I've used WebPurify in a project for precisely this reason before.
One problem with filters of this kind is their tendency to flag entirely proper English town names like Scunthorpe. While that can be reduced by checking the whole word rather than parts, you then find people taking advantage by merging their offensive words with adjacent text.
Use the morphy lemmatizer built into WordNet, and then determine whether the lemma is an obscenity. This will solve the problem of different verb forms, plurals, etc...
I would advocate a large list of simple regex's. Smaller than a list of the variants, but not trying to capture anything more than letter alternatives in any given expression: like "f[u_-@#$%^&*.]ck".
It depends what your text source is, but I'd go for some kind of established and proven pattern matching algorithm, using a Trie for example.