Natural Language Processing: Find obscenities in English?

后端未结


Given a set of words tagged for part of speech, I want to find those that are obscenities in mainstream English. How might I do this? Should I just make a huge list, and check f

11 answers

执念已碎

2021-02-09 21:56

At Melissa Data, my manager, the director of Massachusetts Research and Development, and I refactored a data profiler targeted at relational databases. We counted profanities by the number of Levenshtein distance matches, where the number of insertions, deletions and substitutions is tunable by the user so as to allow for spelling mistakes, Germanic equivalents of English words, plurals, and whitespace and non-whitespace punctuation. We sped up the Levenshtein distance calculation by looking only at the diagonal bands of the n-by-n matrix.
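
The banding idea can be sketched roughly as follows. This is not the Melissa Data code, only a minimal illustration under my own assumptions: the function name `banded_levenshtein` and the `max_edits` parameter are hypothetical, and the sketch still allocates a full row per iteration for simplicity, even though it only fills in cells within `max_edits` of the main diagonal.

```python
from typing import Optional


def banded_levenshtein(a: str, b: str, max_edits: int) -> Optional[int]:
    """Return the edit distance if it is <= max_edits, otherwise None.

    Hypothetical helper: only cells within max_edits of the main
    diagonal are computed; everything outside the band counts as
    "too far".
    """
    if abs(len(a) - len(b)) > max_edits:
        return None  # the length gap alone already exceeds the budget

    too_far = max_edits + 1  # sentinel for "outside the budget"
    prev = [j if j <= max_edits else too_far for j in range(len(b) + 1)]

    for i in range(1, len(a) + 1):
        cur = [too_far] * (len(b) + 1)
        if i <= max_edits:
            cur[0] = i  # deleting the first i characters of a
        lo = max(1, i - max_edits)
        hi = min(len(b), i + max_edits)
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(
                prev[j] + 1,         # deletion
                cur[j - 1] + 1,      # insertion
                prev[j - 1] + cost,  # substitution or match
                too_far,             # cap so the sentinel never grows
            )
        prev = cur

    return prev[len(b)] if prev[len(b)] <= max_edits else None
```

A word would then count as a match against a list entry when the function returns a number rather than `None`, with `max_edits` playing the role of the user-tunable tolerance described above.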

攒了一身酷

2021-02-09 21:57
Use a huge list, and think about the target audience. Is there a third-party service that specialises in this that you could use rather than rolling your own?

Some quick thoughts:
- The Scunthorpe problem (and follow the links to "Swear filter" for more)
- British or American English? "fanny", "fag", etc.
- Political correctness: "black" or "Afro-American"?
Edit:
- Be very careful here. Normal words can offend, whether by choice or ignorance.
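
To make the Scunthorpe problem concrete, here is a small sketch comparing naive substring matching with whole-word matching. The `BLOCKLIST` contents are deliberately mild placeholders, and both function names are made up for this illustration.

```python
import re

# Hypothetical, deliberately mild stand-ins for a real word list.
BLOCKLIST = {"hell", "damn"}


def substring_hits(text):
    """Naive check: flags any list entry found anywhere in the text."""
    lowered = text.lower()
    return {word for word in BLOCKLIST if word in lowered}


def whole_word_hits(text):
    """Tokenise first, then flag only tokens that equal a list entry."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return BLOCKLIST.intersection(tokens)
```

The substring version flags "Hello from Shell Beach" because "hell" is embedded in innocent words, while the whole-word version does not. Whole-word matching does nothing for the dialect and context issues in the list above, though.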

深忆病人

2021-02-09 21:57

You want to use Bayesian Analysis to solve this problem. Bayesian probability is a powerful technique used by spam filters to detect spam/phishing messages in your email inbox. You can train your analysis engine so that it can improve over time. The ability to detect a legitimate email vs. a spam email sounds identical to the problem you are experiencing.

Here are a couple of useful links:

A Plan For Spam - The first proposal to use Bayesian analysis to combat spam.

Data Mining (ppt) - This was written by a colleague of mine.

Classifier4J - A text classifier library written in Java (they exist for every language, but you tagged this question with Java).
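
For a feel of how such a classifier works, here is a minimal sketch of a multinomial naive Bayes classifier with add-one smoothing, written from scratch. It is neither the exact scheme from "A Plan For Spam" nor the Classifier4J API; the class name, the labels, and the tiny training set are all invented for illustration, and a real system would need far more data.

```python
import math
import re
from collections import Counter


class NaiveBayes:
    """Toy multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token counts
        self.doc_counts = Counter()  # label -> number of training texts
        self.vocab = set()           # every token seen during training

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        tokens = self.tokenize(text)
        self.word_counts.setdefault(label, Counter()).update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def classify(self, text):
        tokens = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus summed log likelihoods avoids underflow.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data with mild placeholder words.
model = NaiveBayes()
for sample in ("have a lovely day", "thanks for the help", "see you at lunch"):
    model.train(sample, "clean")
for sample in ("you darn fool", "what the heck you fool", "darn this heck"):
    model.train(sample, "offensive")
```

Because `train` can be called again at any time with newly labelled text, the model can keep improving as moderators correct its mistakes.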

无人共我

2021-02-09 21:58

I'd make a huge list.

Regexes tend to misfire when applied to natural language, especially given the number of exceptions English has.

-上瘾入骨i

2021-02-09 21:59

Is the phrase "I want to stick my long-necked Giraffe up your fluffy white bunny" obscene?

生来不讨喜

2021-02-09 22:02

Note that any NLP logic like this will be subject to "character replacement" attacks:

For example, I can write "hello" as "he11o", replacing the l's with 1's. The same goes for obscenities. So while there's no perfect answer, a "blacklist" approach of "bad words" might work. Watch out for false positives (I'd run my blacklist against a large book to see what comes up).
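
A rough sketch of both ideas, undoing common character replacements before the lookup and counting what a list flags in a body of text, might look like this. The substitution table, the mild placeholder `BLOCKLIST`, and the function names are all assumptions of mine; note that mapping "1" to "l" is itself a guess, since it could just as well stand for "i".

```python
import re
from collections import Counter

# Hypothetical, deliberately mild stand-ins for a real word list.
BLOCKLIST = {"hell", "damn"}

# Assumed substitution table; real-world obfuscation is far more varied.
LEET = str.maketrans({
    "1": "l", "0": "o", "3": "e", "4": "a",
    "5": "s", "7": "t", "@": "a", "$": "s",
})


def normalize(token):
    """Lower-case a token and undo the substitutions listed in LEET."""
    return token.lower().translate(LEET)


def flagged_words(text):
    """Count the original spellings whose normalised form is on the list."""
    tokens = re.findall(r"[\w@$]+", text)
    return Counter(t for t in tokens if normalize(t) in BLOCKLIST)
```

Running `flagged_words` over the full text of a large book and reading through the most common entries is one way to spot false positives before the list goes live.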

