Question
I need to highlight single words or phrases matching the $key (whole words, not substrings) in a UTF-8 $text. The match has to be both case-insensitive and diacritic-insensitive. The highlighted text must remain as it was (including uppercase/lowercase characters and diacritical marks, if present).
The following expression achieved half the goal:
$text = preg_replace( "/\b($key)\b/i", '<div class="highlight">$1</div>', $text );
It's case insensitive and matches whole words but won't highlight the $text portions matching $key if such portions contain diacritical marks not present in $key. E.g. I'd like to have "Björn Källström" highlighted in $text passing $key = "bjorn kallstrom".
Any brilliant idea (using preg_replace or another PHP function) is welcome.
Answer 1:
One idea is to transform the keys into patterns by replacing each problematic character with a character class:
$corr = ['a' => '[aàáâãäå]', 'o' => '[oòóôõö]',/* etc. */];
$key = 'bjorn kallstrom';
$pattern = '/\b' . strtr($key, $corr) . '\b/iu';
$text = preg_replace($pattern, '<em class="highlight">$0</em>', $text);
Note that since you are dealing with Unicode characters, you need the u modifier to avoid unexpected behaviour, in particular with word boundaries.
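A minimal runnable sketch of this idea follows. The helper name buildHighlightPattern and the five-entry map are assumptions made for illustration (the map is far from complete), and preg_quote is an addition not present in the answer, included so that keys containing regex metacharacters do not break the pattern. The key is assumed to be plain ASCII already.

```php
<?php
// Hypothetical helper: turns a plain ASCII key into a diacritic-tolerant pattern.
function buildHighlightPattern(string $key): string
{
    // Partial map for illustration only; extend it for the letters you expect.
    $corr = [
        'a' => '[aàáâãäå]',
        'e' => '[eèéêë]',
        'i' => '[iìíîï]',
        'o' => '[oòóôõö]',
        'u' => '[uùúûü]',
    ];
    // Lowercase so the map keys apply, and escape regex metacharacters
    // before swapping letters for character classes.
    $quoted = preg_quote(strtolower($key), '/');
    return '/\b' . strtr($quoted, $corr) . '\b/iu';
}

$text = 'Yesterday Björn Källström met BJORN KALLSTROM.';
$pattern = buildHighlightPattern('bjorn kallstrom');
// $0 is the whole match, so the original case and diacritics are preserved.
$highlighted = preg_replace($pattern, '<em class="highlight">$0</em>', $text);
echo $highlighted, PHP_EOL;
```

Both spellings in $text end up wrapped in the em element, each exactly as it was written, because the i and u modifiers together apply Unicode case folding inside the character classes.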
If your keys already contain accented characters, convert them to ASCII first:
$key = 'björn kallstrom';
$key = iconv('UTF-8', 'ASCII//TRANSLIT', $key);
(If you obtain ? in place of letters, that means your locale is set to C or POSIX. In that case change it to en_US.UTF-8, or another one available on your system; see setlocale.)
Also take a look at the very useful intl classes: Normalizer and Transliterator.
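As a sketch of how Normalizer could stand in for iconv here: decompose the string to NFD, then drop the combining marks. The helper name foldDiacritics is hypothetical, and the approach only covers letters that actually decompose (ø, ł or ß, for example, stay as they are). When ext-intl is not loaded, the sketch falls back to the iconv call shown above, with the locale caveat that comes with it.

```php
<?php
// Hypothetical helper: strips diacritics so keys can be matched in plain ASCII.
function foldDiacritics(string $s): string
{
    if (class_exists('Normalizer')) {
        // NFD splits "ö" into "o" plus a combining diaeresis (U+0308).
        $decomposed = Normalizer::normalize($s, Normalizer::FORM_D);
        if ($decomposed !== false) {
            // Remove the combining marks and keep the base letters.
            return preg_replace('/\p{Mn}+/u', '', $decomposed) ?? $s;
        }
    }
    // Fallback taken from the answer; its output depends on the locale
    // and on the iconv implementation in use.
    $ascii = iconv('UTF-8', 'ASCII//TRANSLIT', $s);
    return $ascii === false ? $s : $ascii;
}

$key = foldDiacritics('björn kallstrom');
echo $key, PHP_EOL;
```

With ext-intl loaded this should print bjorn kallstrom, which can then be fed to strtr as before.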
Notice: if you have several keys to highlight, do it all in one shot. Sort the array by length (longest first, using mb_strlen), use array_map to transliterate the keys to ASCII, and implode the array with |. The goal is to obtain the pattern '/\b(?:' . implode('|', $keys) . ')\b/iu' with bj[oòóôõö]rn k[aàáâãäå]llstr[oòóôõö]m before bj[oòóôõö]rn alone (for instance).
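The multi-key variant can be sketched as below. The names CORR and buildMultiKeyPattern are hypothetical, the map is again deliberately partial, and the keys are assumed to be lowercase ASCII already (so strlen gives the same ordering as mb_strlen; use mb_strlen if you sort before transliterating).

```php
<?php
// Deliberately partial character map, for illustration only.
const CORR = ['a' => '[aàáâãäå]', 'o' => '[oòóôõö]'];

// Hypothetical helper: one pattern for all keys, longest alternative first.
function buildMultiKeyPattern(array $keys): string
{
    // Longest first, so "bjorn kallstrom" is tried before "bjorn" alone.
    usort($keys, fn($a, $b) => strlen($b) <=> strlen($a));
    $alternatives = array_map(
        fn($k) => strtr(preg_quote($k, '/'), CORR),
        $keys
    );
    return '/\b(?:' . implode('|', $alternatives) . ')\b/iu';
}

$pattern = buildMultiKeyPattern(['bjorn', 'bjorn kallstrom']);
$text = 'Björn Källström and Björn Borg';
$result = preg_replace($pattern, '<em class="highlight">$0</em>', $text);
echo $result, PHP_EOL;
```

The ordering matters because the regex engine takes the first alternative that matches at a given position: with the shorter key first, only "Björn" would be highlighted in "Björn Källström".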
Answer 2:
This is not possible with just a function call; you will have to implement it.
- extract the text from the HTML ($document->documentElement->textContent)
- split the text into words and normalize them
- keep the originals ($words[$normalized][] = $original); basically this provides you with a list of variants for each normalized word
- split and normalize the search query
- compile RegEx patterns from the search query to match ((word1_v1|word1_v2)\s*(word2_v1|word2_v2))u and to validate (^(word1_v1|word1_v2)\s*(word2_v1|word2_v2)$)u
- iterate over the text nodes in your HTML document: $xpath->evaluate('//text()')
- use preg_split() to separate the text by the search strings, capturing the delimiters (the search matches)
- iterate over that list and add the pieces as text nodes if they are not a search string match, otherwise add the HTML structure for a highlight
- remove the original text node.
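The steps above can be sketched roughly as follows. Everything named here is an assumption for illustration: normalizeWord uses a tiny hand-written map where a real implementation would use Normalizer or Transliterator, highlightDocument is a hypothetical function name, and the em element is just one possible highlight structure. The sketch also departs from the patterns above in two small ways, using \s+ and \b so that only whole words match, as the question requires.

```php
<?php
// Hypothetical normalizer with a deliberately partial map.
function normalizeWord(string $word): string
{
    $map = ['ä' => 'a', 'Ä' => 'A', 'å' => 'a', 'Å' => 'A', 'ö' => 'o', 'Ö' => 'O'];
    return strtolower(strtr($word, $map));
}

function highlightDocument(DOMDocument $document, string $query): void
{
    // Collect the original spellings found in the text, keyed by normalized form.
    $words = [];
    $text = $document->documentElement->textContent;
    foreach (preg_split('/[^\p{L}\p{M}\p{N}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY) as $original) {
        $words[normalizeWord($original)][$original] = true;
    }

    // Build one alternation group per query word from the variants seen.
    $groups = [];
    foreach (preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY) as $queryWord) {
        $variants = array_keys($words[normalizeWord($queryWord)] ?? []);
        if ($variants === []) {
            return; // a query word never occurs, so nothing can match
        }
        $quoted = array_map(fn($v) => preg_quote((string) $v, '~'), $variants);
        $groups[] = '(?:' . implode('|', $quoted) . ')';
    }
    if ($groups === []) {
        return;
    }
    $phrase = implode('\s+', $groups);
    $split = '~\b(' . $phrase . ')\b~u';  // finds and captures matches
    $validate = '~^' . $phrase . '$~u';   // tells matches from plain text

    // Walk a copy of the text node list, since the loop edits the tree.
    $xpath = new DOMXPath($document);
    foreach (iterator_to_array($xpath->evaluate('//text()')) as $textNode) {
        if (!preg_match($split, $textNode->nodeValue)) {
            continue;
        }
        $pieces = preg_split($split, $textNode->nodeValue, -1,
            PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
        $fragment = $document->createDocumentFragment();
        foreach ($pieces as $piece) {
            if (preg_match($validate, $piece)) {
                $em = $document->createElement('em');
                $em->setAttribute('class', 'highlight');
                $em->appendChild($document->createTextNode($piece));
                $fragment->appendChild($em);
            } else {
                $fragment->appendChild($document->createTextNode($piece));
            }
        }
        // Swap the original text node for the rebuilt pieces.
        $textNode->parentNode->replaceChild($fragment, $textNode);
    }
}

// The meta tag tells libxml that the markup is UTF-8.
$html = '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
      . '<body><p>Meet <b>Björn Källström</b> and BJORN KALLSTROM today.</p></body></html>';
$document = new DOMDocument();
$document->loadHTML($html);
highlightDocument($document, 'bjorn kallstrom');
echo $document->saveHTML();
```

Because the patterns are built from spellings that actually occur in the document, no i modifier or character classes are needed; the trade-off is an extra pass over the text to collect the variants.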
Source: https://stackoverflow.com/questions/51931818/php-preg-replace-highlight-whole-words-matching-a-key-in-case-diacritic-insensi