match whole word only without regex


Since I can't use preg_match (UTF-8 support is somehow broken; it works locally but breaks in production), I want to find another way to match a word against a blacklist. Problem is, I...


4 Answers

故里飘歌

2021-01-26 07:46
A simple way to emulate word boundaries using Unicode properties:
```
preg_match('/(?:^|[^\pL\pN_])(badword)(?:[^\pL\pN_]|$)/u', $string);
```
In fact, it's much more complicated than that; have a look here.
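
If the `/u` modifier does work in your environment, a variation on the pattern above uses lookarounds, so the characters on either side of the word are checked but not consumed by the match. This is only a sketch: the helper name `containsWholeWord` is made up for illustration, and it assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper (not from the original answer): Unicode-aware whole-word test.
function containsWholeWord(string $text, string $word): bool
{
    // (?<!...) and (?!...) assert that no letter, digit or underscore
    // sits directly before or after the word, without consuming anything.
    $pattern = '/(?<![\pL\pN_])' . preg_quote($word, '/') . '(?![\pL\pN_])/u';

    // preg_match() returns 1 on a match, 0 on no match, false on error
    // (for example invalid UTF-8), so only 1 counts as "found".
    return preg_match($pattern, $text) === 1;
}

var_dump(containsWholeWord('a badword here', 'badword'));     // bool(true)
var_dump(containsWholeWord('some badwords here', 'badword')); // bool(false)
```

Because nothing around the word is consumed, the same pattern should also work with `preg_replace` or `preg_match_all` without swallowing neighbouring spaces.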

[愿得一人]

2021-01-26 07:49

If you want to mimic the regex \b word-boundary assertion, you can try something like this:

```
// $string is the text to search (assumed to be defined elsewhere)
$offset = 0;
$word = 'badword';
$wordLength = strlen($word);
// Characters that count as a word boundary
$boundaries = array(' ', '-' /* , ... */);
$matched = array();
while (($pos = strpos($string, $word, $offset)) !== false) {
    $end = $pos + $wordLength; // index just past the matched word

    $leftBoundary = false;
    // If the match starts the string, it has a boundary on the left
    if ($pos === 0) {
        $leftBoundary = true;
    // Else we must check the char before the match
    } elseif (in_array($string[$pos - 1], $boundaries, true)) {
        $leftBoundary = true;
    }

    $rightBoundary = false;
    // If the match ends the string, it has a boundary on the right
    if ($end === strlen($string)) {
        $rightBoundary = true;
    // Else we must check the char right after the whole word
    } elseif (in_array($string[$end], $boundaries, true)) {
        $rightBoundary = true;
    }

    // If it has both boundaries, we add the index to the matched ones...
    if ($leftBoundary && $rightBoundary) {
        $matched[] = $pos;
    }

    $offset = $end;
}
```
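
To try this approach on real input, the loop can be wrapped in a function. The following is a minimal sketch, not the answerer's code: the name `findWholeWordPositions` and the default boundary list are assumptions for illustration. It checks the character just past the end of the word for the right boundary, and it works on bytes, which is safe for UTF-8 text as long as the boundary characters are ASCII.

```php
<?php
// Hypothetical helper: returns the byte offsets of whole-word matches.
// The default boundary list is an assumption; extend it as needed.
function findWholeWordPositions(
    string $string,
    string $word,
    array $boundaries = [' ', '-', '.', ',', '!', '?']
): array {
    $matched = [];
    $length = strlen($string);
    $wordLength = strlen($word);
    if ($wordLength === 0) {
        return $matched; // nothing to search for
    }

    $offset = 0;
    while (($pos = strpos($string, $word, $offset)) !== false) {
        $end = $pos + $wordLength; // index just past the match
        // Boundary on the left: start of string or a boundary character
        $left = $pos === 0 || in_array($string[$pos - 1], $boundaries, true);
        // Boundary on the right: end of string or a boundary character
        $right = $end === $length || in_array($string[$end], $boundaries, true);
        if ($left && $right) {
            $matched[] = $pos;
        }
        $offset = $end;
    }
    return $matched;
}

// Offsets 0 and 26 match; "badwords" at offset 9 is skipped.
print_r(findWholeWordPositions('badword, badwords and one-badword', 'badword'));
```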


小鲜肉

2021-01-26 08:01
Assuming you can do some pre-processing, you could replace all punctuation marks with white space, put everything in lowercase, and then either:
- Use strpos with something like strpos($string, ' badword ') in a while loop to keep iterating through your entire document;
- Split the string at white space and compare each word with your list of bad words.
So if you were trying the first option, it would look something like this (untested):
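```
// $text is the body of text to process (assumed to be defined elsewhere).
// Pad both ends with a space so words at the start or end can still match.
$document = ' ' . $text . ' ';
// Replace punctuation with spaces (extend the list as needed)
$document = str_replace(str_split('!@#$%^&*(),./'), ' ', $document);
// mb_strtolower() needs the mbstring extension; strtolower() only folds ASCII reliably
$document = mb_strtolower($document, 'UTF-8');

$badWords = array(/* ... */);
$found = array();
foreach ($badWords as $word) {
    $needle = ' ' . $word . ' ';
    $index = strpos($document, $needle);
    while ($index !== false) {
        // Record the offset of the needle within $document
        $found[$word][] = $index;
        // Resume at the trailing space so adjacent repeats are found too
        $index = strpos($document, $needle, $index + strlen($needle) - 1);
    }
}
```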
EDIT: As per @jonhopkins' suggestion, adding a white space at the end should cater for the scenario where the wanted word is at the end of the document and is not followed by a punctuation mark.
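
The second option (splitting at white space and comparing each word with the blacklist) could be sketched as follows. The function name `findBadWords` and the punctuation list are assumptions for illustration, and the blacklist entries are assumed to be lowercase already.

```php
<?php
// Hypothetical helper: returns the blacklisted words that occur as whole words in $text.
function findBadWords(string $text, array $badWords): array
{
    // Assumed punctuation list, plus other whitespace; extend it to suit your input.
    $separators = array_merge(str_split('!@#$%^&*(),./?;:'), ["\t", "\r", "\n"]);

    // Normalise: separators become spaces, then lowercase.
    // strtolower() only folds ASCII reliably; swap in mb_strtolower()
    // if mbstring is available and non-ASCII case matters.
    $normalized = strtolower(str_replace($separators, ' ', $text));

    // explode() may yield empty strings for repeated spaces; they never
    // equal a real blacklist entry, so they can be left in.
    $tokens = explode(' ', $normalized);

    // Keep each blacklist entry that appears among the tokens, once.
    return array_values(array_unique(array_intersect($badWords, $tokens)));
}

// Only 'badword' is reported; "badwords" is a different token.
print_r(findBadWords('This has a BadWord, and some badwords.', ['badword', 'evil']));
```

Compared with the strpos loop, this variant makes a single pass over the text regardless of how many words are on the blacklist, at the cost of building an array of every token.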

盖世英雄少女心

2021-01-26 08:01
You can use strrpos() instead of strpos:

strrpos — Find the position of the last occurrence of a substring in a string
```
$string = "This is a string containing badwords and one badword";
var_dump(strrpos($string, 'badword'));
```
Output:
```
int(45)
```