PHP preg_replace: highlight whole words matching a key in a case- and diacritic-insensitive way

Submitted by 只谈情不闲聊 on 2020-01-16 15:47:04

Question


I need to highlight single words or phrases matching $key (whole words, not substrings) in a UTF-8 $text. The match has to be both case-insensitive and diacritic-insensitive. The highlighted text must remain as it was (including uppercase/lowercase characters and diacritical marks, if present).

The following expression achieved half the goal:

$text = preg_replace( "/\b($key)\b/i", '<div class="highlight">$1</div>', $text );

It's case-insensitive and matches whole words, but it won't highlight the portions of $text matching $key if those portions contain diacritical marks not present in $key. E.g., I'd like "Björn Källström" to be highlighted in $text when passing $key = "bjorn kallstrom".

Any brilliant idea (using preg_replace or another PHP function) is welcome.


Answer 1:


One idea is to transform the keys into patterns by replacing each problematic character with a character class:

$corr = ['a' => '[aàáâãäå]', 'o' => '[oòóôõö]',/* etc. */];

$key = 'bjorn kallstrom';

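// preg_quote() escapes any regex metacharacters in the key before the letters are widened
$pattern = '/\b' . strtr(preg_quote($key, '/'), $corr) . '\b/iu';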

$text = preg_replace($pattern, '<em class="highlight">$0</em>', $text);

Note that since you are dealing with Unicode characters, you need the u modifier to avoid unexpected behaviour, in particular with word boundaries.

If your keys already contain accented characters, convert them to ASCII first:

$key = 'björn kallstrom';
$key = iconv('UTF-8', 'ASCII//TRANSLIT', $key);

(If you get ? in place of letters, your locale is set to C or POSIX. In that case, change it to en_US.UTF-8 or another UTF-8 locale available on your system; see setlocale.)

Also take a look at the very useful intl classes: Normalizer and Transliterator.
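As a rough illustration of the Normalizer route, here is a minimal sketch that strips diacritics without depending on the locale. It assumes the intl extension is loaded; strip_diacritics is a hypothetical helper name, not part of any library.

```php
<?php
// Assumption: the intl extension is available (it provides Normalizer).

/**
 * Hypothetical helper: decompose to NFD, then drop the combining
 * marks (Unicode category Mn) that carry the diacritics.
 */
function strip_diacritics(string $s): string
{
    $decomposed = Normalizer::normalize($s, Normalizer::FORM_D);
    if ($decomposed === false) {
        return $s; // not valid UTF-8: leave the input untouched
    }
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

echo strip_diacritics('Björn Källström'), "\n"; // prints "Bjorn Kallstrom"
```

This only handles letters that decompose into a base letter plus a combining mark. Letters such as ø or ł do not decompose that way; for those, something like transliterator_transliterate('Any-Latin; Latin-ASCII', $s) should be a better fit.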

Note: if you have several keys to highlight, do them all in one pass. Sort the array by length (longest first, using mb_strlen), use array_map to transliterate the keys to ASCII and apply the character-class substitution, then implode the array with |. The goal is to obtain the pattern '/\b(?:' . implode('|', $keys) . ')\b/iu', in which bj[oòóôõö]rn k[aàáâãäå]llstr[oòóôõö]m comes before bj[oòóôõö]rn alone (for instance), so that the longer alternative is tried first.
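The multi-key variant could be sketched as follows. The map is deliberately partial, build_highlight_pattern is a hypothetical helper name, and the keys are assumed to be ASCII already (transliterate them beforehand if they are not). It needs PHP 7.4+ for arrow functions and the mbstring extension.

```php
<?php
// Partial map for illustration only; extend it with the letters you need.
$corr = [
    'a' => '[aàáâãäå]',
    'o' => '[oòóôõö]',
];

// Hypothetical helper: build a single pattern from several ASCII keys.
function build_highlight_pattern(array $keys, array $corr): string
{
    // Longest key first, so "bjorn kallstrom" is tried before "bjorn".
    usort($keys, fn($a, $b) => mb_strlen($b) <=> mb_strlen($a));

    $alternatives = array_map(
        // Escape regex metacharacters first, then widen the letters.
        fn($key) => strtr(preg_quote($key, '/'), $corr),
        $keys
    );

    return '/\b(?:' . implode('|', $alternatives) . ')\b/iu';
}

$pattern = build_highlight_pattern(['bjorn', 'bjorn kallstrom'], $corr);
$text = 'Björn Källström met another Björn.';
echo preg_replace($pattern, '<em class="highlight">$0</em>', $text), "\n";
```

Because the alternatives are ordered by length, the full name is wrapped in one element instead of only its first word.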




Answer 2:


This is not possible with a single function call; you will have to implement it yourself.

  1. Extract the text from the HTML ($document->documentElement->textContent).
  2. Split the text into words and normalize them, keeping the originals ($words[$normalized][] = $original). This gives you a list of variants for each normalized word.
  3. Split and normalize the search query.
  4. Compile RegEx patterns from the search query: one to match, ((word1_v1|word1_v2)\s*(word2_v1|word2_v2))u, and one to validate, (^(word1_v1|word1_v2)\s*(word2_v1|word2_v2)$)u.
  5. Iterate over the text nodes in your HTML document ($xpath->evaluate('//text()')).
  6. Use preg_split() to separate the text by the search strings, capturing the delimiters (the search matches).
  7. Iterate over that list and add each item as a text node if it is not a search string match; otherwise add the HTML structure for a highlight.
  8. Remove the original text node.
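The steps above can be sketched roughly as follows. This is one possible reading of the outline, not the answerer's code: normalize_word and highlight_document are hypothetical names, the dom, intl and mbstring extensions are assumed, \s+ and \b are used to enforce whole words, and the position in the preg_split() result is used in place of a separate validation pattern. The demo loads a well-formed fragment with loadXML() to keep encoding handling out of the picture.

```php
<?php
// Hypothetical helper: lowercase, decompose to NFD, drop combining marks.
function normalize_word(string $w): string
{
    $d = Normalizer::normalize(mb_strtolower($w), Normalizer::FORM_D);
    return preg_replace('/\p{Mn}+/u', '', (string) $d);
}

// Hypothetical helper: wrap matches of $query in <span class="highlight">.
function highlight_document(DOMDocument $doc, string $query): void
{
    $splitWords = '/[^\p{L}\p{M}\p{N}]+/u';

    // Steps 1-2: collect the original spellings for each normalized word.
    $variants = [];
    $text = $doc->documentElement->textContent;
    foreach (preg_split($splitWords, $text, -1, PREG_SPLIT_NO_EMPTY) as $w) {
        $variants[normalize_word($w)][$w] = true;
    }

    // Steps 3-4: one non-capturing group of variants per query word.
    $groups = [];
    foreach (preg_split($splitWords, $query, -1, PREG_SPLIT_NO_EMPTY) as $w) {
        $found = $variants[normalize_word($w)] ?? [];
        if (!$found) {
            return; // a query word never occurs, so nothing can match
        }
        $quoted = array_map(fn($v) => preg_quote((string) $v, '/'), array_keys($found));
        $groups[] = '(?:' . implode('|', $quoted) . ')';
    }
    if (!$groups) {
        return;
    }
    $pattern = '/\b(' . implode('\s+', $groups) . ')\b/u';

    // Steps 5-8: rebuild every text node that contains a match.
    $xpath = new DOMXPath($doc);
    foreach (iterator_to_array($xpath->evaluate('//text()')) as $node) {
        $parts = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($parts) < 2) {
            continue; // no match in this node
        }
        foreach ($parts as $i => $part) {
            if ($part === '') {
                continue;
            }
            if ($i % 2 === 1) { // odd positions hold the captured matches
                $new = $doc->createElement('span');
                $new->setAttribute('class', 'highlight');
                $new->appendChild($doc->createTextNode($part));
            } else {
                $new = $doc->createTextNode($part);
            }
            $node->parentNode->insertBefore($new, $node);
        }
        $node->parentNode->removeChild($node);
    }
}

$doc = new DOMDocument();
$doc->loadXML('<p>Björn Källström met <b>BJÖRN KÄLLSTRÖM</b> and Bjornson.</p>');
highlight_document($doc, 'bjorn kallstrom');
foreach ($doc->getElementsByTagName('span') as $span) {
    echo $span->textContent, "\n";
}
```

Two limitations of this sketch: a phrase that is split across two text nodes (for example by an inline tag between the words) is not found, and for real HTML you would probably want to skip text nodes inside script and style elements.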


Source: https://stackoverflow.com/questions/51931818/php-preg-replace-highlight-whole-words-matching-a-key-in-case-diacritic-insensi
