I'm new to PHP.
I have an array like this:
$suspiciousList = array(
array("word" => "badword1", "score" => 400, "type" => 1),
array
As Jirka Helmich suggested, you could remove whitespace (and maybe other special characters) and then search the string for the words from your array.
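// $suspiciousList is passed in here because it is not visible inside the method otherwise;
// if it is a class property, use $this->suspiciousList instead.
public function searchForBadWords($strippedText, array $suspiciousList) {
    $counts = array();
    foreach ($suspiciousList as $suspiciousPart) {
        // Count how often this word occurs in the whitespace-stripped text.
        $counts[$suspiciousPart['word']] = substr_count($strippedText, $suspiciousPart['word']);
        // You can use str_replace here or something else; it depends on what you want to achieve.
    }
    return $counts;
}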
The problem is that if you have text like "blablabad wordblabla" and you remove the spaces, normal words can turn into bad words: "blablabadwordblabla"
(know what I mean?) :D
Cheers
Edit: So Ahmad, I see that you pick out words by the " " at their beginning and end (in short). Maybe you should try to implement both methods: yours with single words and the one above with substring searching. It also depends on how much you care about performance. Maybe you should do some research or measurements to see how effective each one is? :D
Anyway, you can strip whitespace characters and use (mb_)substr_count(), but it leads to false positives.
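To make the false-positive problem concrete, here is a minimal sketch of the strip-and-count idea. The helper name countInStrippedText is made up for this illustration, and it assumes the mbstring extension is enabled:

```php
<?php
// Hypothetical helper (not from the original posts): counts how often $word
// occurs in $text once all whitespace has been removed.
function countInStrippedText(string $text, string $word): int
{
    // Strip every whitespace character (Unicode-aware), then lowercase.
    $stripped = mb_strtolower((string) preg_replace('/\s+/u', '', $text), 'UTF-8');
    return mb_substr_count($stripped, mb_strtolower($word, 'UTF-8'), 'UTF-8');
}

// Catches a word that was spaced out to dodge the filter...
echo countInStrippedText('b a d w o r d', 'badword'), "\n";
// ...but also flags harmless text, because "bad wording" becomes "badwording".
echo countInStrippedText('that is bad wording', 'badword'), "\n";
```

Both calls report one hit, which is exactly the trade-off: the same step that catches the obfuscated word also produces the false positive.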
@f1ames: I'm using the following code to turn it into an array.
$words = mb_strtolower($words, 'UTF-8');
$words = $this->removeUniCharCategories($words);
$words = explode(" ",$words);
// Remove empty elements and reindex the array.
// Note: array_filter() without a callback also drops the string "0".
$words = array_values(array_filter($words));
But I'm still looking for the best solution.
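For reference, a rough sketch of how the resulting word array could be checked against $suspiciousList by exact match. The function name scoreWords, the second list entry, and the idea of summing the scores are assumptions made for this illustration only; they are not part of the code above:

```php
<?php
// Sample data in the shape shown in the question.
// The second entry is invented for illustration.
$suspiciousList = array(
    array("word" => "badword1", "score" => 400, "type" => 1),
    array("word" => "badword2", "score" => 200, "type" => 1),
);

// Hypothetical helper: sums the scores of all tokens that exactly match a listed word.
function scoreWords(array $words, array $suspiciousList): int
{
    // Build a word => score map once so each token needs only one lookup.
    $scores = array_column($suspiciousList, 'score', 'word');
    $total = 0;
    foreach ($words as $word) {
        if (isset($scores[$word])) {
            $total += $scores[$word];
        }
    }
    return $total;
}

$text  = mb_strtolower("some BadWord1 and badword1 again", 'UTF-8');
$words = array_values(array_filter(explode(" ", $text)));
echo scoreWords($words, $suspiciousList), "\n";
```

Exact matching like this avoids the substring false positives, but it will miss words that were split up with spaces.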
This question is a good start: How do you implement a good profanity filter? I agree with its conclusion, i.e. the detection will always give poor results.
I would try these approaches:
1) Simply detect words that are vulgar according to your dictionary.
2) Come up with a few heuristics like "continuous sequence of 'words' composed of one letter" (b a d w o r d) and use them to evaluate users' posts. Then you can compute the expected number of vulgar words: $\sum_{i=1}^{n} P_i \cdot N_i$, where $n$ is the number of your heuristics, $P_i$ is the probability that a word found with heuristic $i$ is really a vulgar one, and $N_i$ is the number of words found by heuristic $i$. I think the probabilistic approach is better than simply stating "this post does (not) contain a vulgar word".
3) Let a moderator decide if a post is really vulgar or not. Otherwise the imperfection of your automatic replacement method will most likely make your users mad.
4) I think it's useless to look up words in an English (or Turkish?) dictionary in order to find words that are not really English words because people misspell words too much these days.
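The expected-count idea from point 2 could be sketched roughly as follows. Everything here is an assumption for illustration: the function names, the regular expression used for the "single letters separated by spaces" heuristic, and the probabilities, which would have to be estimated from real, hand-checked posts:

```php
<?php
// Hypothetical heuristic: counts runs of three or more single letters
// separated by single spaces, e.g. "b a d w o r d".
function countSpacedOutRuns(string $text): int
{
    $pattern = '/(?<!\p{L})\p{L}(?: \p{L}){2,}(?!\p{L})/u';
    return (int) preg_match_all($pattern, $text);
}

// Expected number of vulgar words: the sum over all heuristics of P_i * N_i.
// Each entry holds 'p' (probability a hit is really vulgar) and 'n' (number of hits).
function expectedVulgarCount(array $heuristics): float
{
    $expected = 0.0;
    foreach ($heuristics as $heuristic) {
        $expected += $heuristic['p'] * $heuristic['n'];
    }
    return $expected;
}

$post = 'you are such a b a d w o r d';
$heuristics = array(
    // Plain dictionary hits; the count is assumed here rather than computed.
    array('p' => 0.95, 'n' => 0),
    // Spaced-out letters; the probability is a made-up placeholder.
    array('p' => 0.60, 'n' => countSpacedOutRuns($post)),
);
echo expectedVulgarCount($heuristics), "\n";
```

The result can then be compared with a threshold, or simply shown to a moderator as in point 3, instead of forcing a yes/no decision.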