PHP preg_replace: highlight whole words matching a key in a case- and diacritic-insensitive way

Question

I need to highlight single words or phrases matching $key (whole words, not substrings) in a UTF-8 $text. The match has to be both case-insensitive and diacritic-insensitive. The highlighted text must remain as it was (including uppercase/lowercase characters and diacritical marks, if present).

The following expression achieved half the goal:

$text = preg_replace( "/\b($key)\b/i", '<div class="highlight">$1</div>', $text );

It is case-insensitive and matches whole words, but it won't highlight the portions of $text that match $key when those portions contain diacritical marks not present in $key. For example, I'd like "Björn Källström" to be highlighted in $text when $key = "bjorn kallstrom".

Any brilliant idea (using preg_replace or another PHP function) is welcome.

Answer 1:

One idea is to transform the key into a pattern by replacing each problematic character with a character class:

$corr = ['a' => '[aàáâãäå]', 'o' => '[oòóôõö]',/* etc. */];

$key = 'bjorn kallstrom';

$pattern = '/\b' . strtr($key, $corr) . '\b/iu';

$text = preg_replace($pattern, '<em class="highlight">$0</em>', $text);

Note that since you are dealing with Unicode characters, you need the u modifier to avoid unexpected behaviour, in particular with word boundaries.
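A minimal, self-contained sketch of the approach above. The function name highlight_key and the partial character map are assumptions made for illustration; the map would need extending for the letters in your data. It also passes the key through preg_quote, which the snippet above does not do, so that regex metacharacters in a key are matched literally.

```php
<?php
// Hypothetical helper (name and map are illustrative, not from the answer).
function highlight_key(string $text, string $key): string
{
    // Partial map: extend it with the accented letters your data needs.
    $corr = [
        'a' => '[aàáâãäå]', 'e' => '[eèéêë]', 'i' => '[iìíîï]',
        'o' => '[oòóôõöø]', 'u' => '[uùúûü]', 'c' => '[cç]', 'n' => '[nñ]',
    ];
    // The key is assumed to be plain ASCII here, so strtolower is safe.
    // preg_quote only escapes metacharacters, never letters, so strtr
    // still sees every a, e, i, o, u, c and n of the key.
    $body = strtr(preg_quote(strtolower($key), '/'), $corr);
    // i: case-insensitive; u: treat pattern and subject as UTF-8.
    $pattern = '/\b' . $body . '\b/iu';
    return preg_replace($pattern, '<em class="highlight">$0</em>', $text);
}

$out = highlight_key('Meet Björn Källström and BJORN KALLSTROM.', 'bjorn kallstrom');
echo $out, "\n";
```

Because the replacement uses $0, the original spelling (case and diacritics) is kept inside the highlight.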

If your keys already contain accented characters, convert them to ASCII first:

$key = 'björn kallstrom';
$key = iconv('UTF-8', 'ASCII//TRANSLIT', $key);

(If you get ? in place of letters, your locale is set to C or POSIX. In that case, change it to en_US.UTF-8 or another UTF-8 locale available on your system; see setlocale.)

Also take a look at the very useful intl classes: Normalizer and Transliterator.
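As a rough illustration of the intl route, the sketch below folds a key to lowercase ASCII with Transliterator. The function name fold_key is hypothetical, and the small fallback map used when the intl extension is missing is an assumption added only so the example runs everywhere; it is not part of the answer.

```php
<?php
// Hypothetical helper: folds a key to lowercase ASCII.
function fold_key(string $key): string
{
    if (class_exists('Transliterator')) {
        // ICU compound transform: romanize, strip diacritics, lowercase.
        $tr = Transliterator::create('Any-Latin; Latin-ASCII; Lower()');
        if ($tr !== null) {
            $folded = $tr->transliterate($key);
            if ($folded !== false) {
                return $folded;
            }
        }
    }
    // Fallback without intl: a tiny hand-written map (assumption; extend as needed).
    $map = [
        'ä' => 'a', 'Ä' => 'A', 'å' => 'a', 'Å' => 'A',
        'ö' => 'o', 'Ö' => 'O', 'é' => 'e', 'É' => 'E',
        'ü' => 'u', 'Ü' => 'U',
    ];
    return strtolower(strtr($key, $map));
}

echo fold_key('Björn Källström'), "\n"; // bjorn kallstrom
```

Unlike iconv with //TRANSLIT, this path does not depend on the process locale.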

Note: if you have several keys to highlight, do it all in one pass. Sort the array by length (longest first, using mb_strlen), use array_map to transliterate the keys to ASCII and turn them into patterns, and implode the array with |. The goal is to obtain the pattern '/\b(?:' . implode('|', $keys) . ')\b/iu', with bj[oòóôõö]rn k[aàáâãäå]llstr[oòóôõö]m before bj[oòóôõö]rn alone (for instance).
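The multi-key variant could look like the sketch below. The function name highlight_keys and the partial map are assumptions; the keys are assumed to be already transliterated to lowercase ASCII, which is why plain strlen is used for sorting here instead of mb_strlen.

```php
<?php
// Hypothetical helper; expects keys that are already lowercase ASCII.
function highlight_keys(string $text, array $keys): string
{
    if ($keys === []) {
        return $text; // an empty alternation would match at every word boundary
    }
    // Partial map: extend it with the accented letters your data needs.
    $corr = ['a' => '[aàáâãäå]', 'e' => '[eèéêë]', 'o' => '[oòóôõöø]', 'u' => '[uùúûü]'];

    // Longest key first, so "bjorn kallstrom" is tried before "bjorn".
    usort($keys, function (string $a, string $b): int {
        return strlen($b) <=> strlen($a);
    });

    // Escape each key, then swap plain letters for character classes.
    $alternatives = array_map(function (string $key) use ($corr): string {
        return strtr(preg_quote($key, '/'), $corr);
    }, $keys);

    $pattern = '/\b(?:' . implode('|', $alternatives) . ')\b/iu';
    return preg_replace($pattern, '<em class="highlight">$0</em>', $text);
}

$out = highlight_keys('Björn Källström met Björn.', ['bjorn', 'bjorn kallstrom']);
echo $out, "\n";
```

Without the sort, the shorter alternative would win at the first position and only "Björn" would be wrapped, leaving "Källström" outside the highlight.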

Answer 2:

This is not possible with a single function call; you will have to implement it yourself:

1. Extract the text from the HTML ($document->documentElement->textContent).
2. Split the text into words and normalize them, keeping the originals ($words[$normalized][] = $original). This gives you a list of variants for each normalized word.
3. Split and normalize the search query.
4. Compile RegEx patterns from the search query, one to match ((word1_v1|word1_v2)\s*(word2_v1|word2_v2))u and one to validate (^(word1_v1|word1_v2)\s*(word2_v1|word2_v2)$)u.
5. Iterate over the text nodes in your HTML document: $xpath->evaluate('//text()').
6. Use preg_split() to separate the text by the search strings, capturing the delimiters (the search matches).
7. Iterate over that list and add the parts as text nodes if they are not a search string match; otherwise add the HTML structure for a highlight.
8. Remove the original text node.
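The DOM part of these steps (iterating text nodes, splitting, rebuilding) might be sketched as follows. The function name highlight_text_nodes, the em element and the sample pattern are assumptions for illustration; building the pattern from normalized word variants is left out. The pattern must have exactly one capturing group around the whole match so that preg_split returns alternating non-match and match parts.

```php
<?php
// Hypothetical helper: wraps every match inside text nodes in <em class="highlight">.
// Modifies $document in place and returns the number of highlights added.
function highlight_text_nodes(DOMDocument $document, string $pattern): int
{
    $xpath = new DOMXPath($document);
    $count = 0;
    // Copy the node list first because nodes are replaced inside the loop.
    foreach (iterator_to_array($xpath->evaluate('//text()')) as $node) {
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if ($parts === false || count($parts) < 2) {
            continue; // no match in this text node
        }
        $fragment = $document->createDocumentFragment();
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) {
                // Odd indexes hold the captured delimiters, i.e. the matches.
                $em = $document->createElement('em');
                $em->setAttribute('class', 'highlight');
                $em->appendChild($document->createTextNode($part));
                $fragment->appendChild($em);
                $count++;
            } else {
                $fragment->appendChild($document->createTextNode($part));
            }
        }
        // Swap the original text node for the rebuilt sequence.
        $node->parentNode->replaceChild($fragment, $node);
    }
    return $count;
}

$html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
      . '<body><p title="bjorn kallstrom">Hello Björn Källström!</p></body></html>';
$document = new DOMDocument();
$document->loadHTML($html);
$count = highlight_text_nodes($document, '/\b(bj[oö]rn k[aä]llstr[oö]m)\b/iu');
$hits = (new DOMXPath($document))->evaluate('//em[@class="highlight"]');
```

Working on text nodes means attribute values and tag names are never touched, which a plain preg_replace over the HTML string cannot guarantee. A phrase split across several elements is not matched by this sketch.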

Source: https://stackoverflow.com/questions/51931818/php-preg-replace-highlight-whole-words-matching-a-key-in-case-diacritic-insensi

Tags

php

regex

preg-replace