Question
I'm using this bad-word detector/obfuscator in PHP (to be AdSense compliant). It shows the first letter of the bad word and replaces the remaining letters with this character: ▪
It works fine, except when I'm using words that contain special characters in Spanish, for example: ñ, á, ó, etc.
This is my current code:
<?php
function badwords_full($string, &$bad_references) {
static $bad_counter;
static $bad_list;
static $bad_list_q;
if(!isset($bad_counter)) {
$bad_counter = 0;
$bad_list = badwords_list();
$bad_list_q = array_map('preg_quote', $bad_list);
}
return preg_replace_callback('~('.implode('|', $bad_list_q).')~',
function($matches) use (&$bad_counter, &$bad_references) {
$bad_counter++;
$bad_references[$bad_counter] = $matches[0];
return substr($matches[0], 0, 1).str_repeat('▪', strlen($matches[0]) - 1);
}, $string);
}
function badwords_list() {
# spanish
$es = array(
"gallina",
"ñoño"
);
# english
$en = array(
"chicken",
"horse"
);
# join all languages
$list = array_merge($es, $en);
usort($list, function($a,$b) {
return strlen($b) - strlen($a); // longest first, so longer words are tried before shorter ones
});
return $list;
}
$bad = []; //holder for bad words
Test 1:
echo badwords_full('Hello, you are a chicken!', $bad);
Result 1:
Hello, you are a c▪▪▪▪▪▪! (works fine)
Test 2:
echo badwords_full('Hola en español eres un ñoño!', $bad);
Result 2:
Hola en español eres un �▪▪▪▪▪!
Any ideas on how to solve this issue? Thanks!
Answer 1:
You are splitting a multibyte character in half. Use mb_substr in place of substr.
return mb_substr($matches[0], 0, 1).str_repeat('▪', strlen($matches[0]) - 1);
https://3v4l.org/AnPJl
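You also probably want to use mb_strlen in place of strlen, so that the number of ▪ characters matches the number of letters rather than the number of bytes (in UTF-8, "ñoño" is 4 characters but 6 bytes, which is why Result 2 shows five ▪).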
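A minimal sketch of both changes together, assuming the input is UTF-8 and the mbstring extension is enabled. The helper name mask_word is hypothetical and not part of the original code; its return expression is what the callback inside badwords_full could return for $matches[0].

```php
<?php
// Hypothetical helper (not in the original code): keep the first character
// of a word and mask the rest, counting characters rather than bytes.
// Assumes UTF-8 input and the mbstring extension.
function mask_word(string $word, string $mask = '▪'): string {
    $length = mb_strlen($word, 'UTF-8');   // characters, not bytes
    if ($length === 0) {
        return '';
    }
    // mb_substr never cuts a multibyte character in half
    return mb_substr($word, 0, 1, 'UTF-8') . str_repeat($mask, $length - 1);
}

echo mask_word('ñoño'), "\n";     // ñ▪▪▪
echo mask_word('chicken'), "\n";  // c▪▪▪▪▪▪
```

Passing the encoding explicitly avoids depending on the server's default mbstring setting.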
Source: https://stackoverflow.com/questions/52461009/php-using-special-characters-in-bad-word-obfuscator