str_word_count() for non-latin words?

我寻月下人不归 2021-01-05 14:12

I'm trying to count the number of words in a variable written in a non-Latin language (Bulgarian), but it seems that str_word_count() does not count non-Latin words. The encodin

4 answers
  • 2021-01-05 14:22

    Here is the solution that came to my mind:

    $var = "текст на кирилица с пет думи";
    $array = explode(" ", $var);
    
    $i = 0;
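    foreach ($array as $item) {
        // strlen() counts bytes, and a Cyrillic letter takes 2 bytes in UTF-8,
        // so for Cyrillic text this skips empty strings and one-letter words such as "с".
        if (strlen($item) > 2) {
            $i++;
        }
    }
    
    echo $i; // prints 5 (the one-letter word "с" is not counted)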
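
The length check above depends on byte counts, which is easy to miss with UTF-8 text. As a minimal sketch, the same idea can be written with a character-based length instead. The helper name `count_tokens` and its `$minChars` parameter are made up for illustration, and the code assumes UTF-8 input and that the mbstring extension is available.

```php
<?php
// Hypothetical helper: counts space-separated tokens that contain at least
// $minChars characters (characters, not bytes).
function count_tokens(string $text, int $minChars = 1): int
{
    $count = 0;
    foreach (explode(' ', $text) as $token) {
        // mb_strlen() counts UTF-8 characters; empty tokens produced by
        // repeated spaces have length 0 and are skipped.
        if (mb_strlen($token, 'UTF-8') >= $minChars) {
            $count++;
        }
    }
    return $count;
}

$var = "текст на кирилица с пет думи";
echo strlen("с"), " ", mb_strlen("с", 'UTF-8'), "\n"; // prints "2 1"
echo count_tokens($var), "\n";    // prints 6: every token is counted
echo count_tokens($var, 2), "\n"; // prints 5: the one-letter "с" is skipped
```

With the default minimum of one character, one-letter words such as "с" are counted too, which may or may not be what you want.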
    
  • 2021-01-05 14:24

    As stated in the str_word_count description:

    'word' is defined as a locale dependent string

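    Specify the Bulgarian locale before calling str_word_count. The manual also notes that multibyte locales are not supported, so this is likely to work only for text in a single-byte encoding that matches the locale: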

    setlocale(LC_ALL, 'bg_BG');
    echo str_word_count($content);
    

    Read more about setlocale in the PHP manual.
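
Whether a Bulgarian locale is installed differs from system to system, and setlocale() returns false when the requested locale cannot be set. A small sketch of checking that before relying on it is shown here. The helper name `select_locale` and the candidate locale names are assumptions, not values taken from the question.

```php
<?php
// Hypothetical helper: tries several locale names and reports which one,
// if any, the system accepted.
function select_locale(array $candidates): ?string
{
    // setlocale() accepts an array of names and returns the first locale
    // that could be set, or false when none of them is available.
    // LC_CTYPE is the category that controls character classification.
    $chosen = setlocale(LC_CTYPE, $candidates);
    return $chosen === false ? null : $chosen;
}

// The candidate names below are guesses; check what your system provides.
$locale = select_locale(['bg_BG.UTF-8', 'bg_BG', 'bulgarian']);
echo $locale === null ? "no Bulgarian locale available\n" : "using $locale\n";
```

If the helper returns null, str_word_count() keeps running under whatever locale was active before the call.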

  • 2021-01-05 14:30

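    You can do it with a regex:

    $str = "текст на кирилица";
    // PREG_SPLIT_NO_EMPTY drops the empty pieces produced by leading or trailing whitespace
    echo 'Number of words: '.count(preg_split('/\s+/', $str, -1, PREG_SPLIT_NO_EMPTY));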
    

    Here I'm defining the word delimiter as whitespace characters. If anything else should be treated as a word delimiter, you'll need to add it to the regex.

    Also note that since there are no UTF-8 characters in the regex itself (only in the string), the /u modifier isn't required. If you want some UTF-8 characters to act as delimiters, you'll need to add this modifier.

    Update:

    If you want only Cyrillic letters to count as parts of words, you can use:

    $str = "текст 
    на 12453
    кирилица";
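    // PREG_SPLIT_NO_EMPTY avoids counting empty pieces when the string starts or ends with a non-Cyrillic character
    echo 'Number of words: '.count(preg_split('/[^А-Яа-яЁё]+/u', $str, -1, PREG_SPLIT_NO_EMPTY));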
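
Instead of splitting on delimiters, the words themselves can be matched. The following is a rough sketch that counts runs of Unicode letters using the `\p{L}` property, so it is not limited to one alphabet. The helper name `count_unicode_words` is made up for illustration, and UTF-8 input is assumed.

```php
<?php
// Hypothetical helper: counts runs of Unicode letters as words.
function count_unicode_words(string $text): int
{
    // \p{L} matches any Unicode letter; the /u modifier makes PCRE treat
    // both the pattern and the subject as UTF-8.
    $n = preg_match_all('/\p{L}+/u', $text);
    // preg_match_all() returns false on failure, for example invalid UTF-8.
    return $n === false ? 0 : $n;
}

// The digits are not letters, so "12453" is not counted as a word.
echo count_unicode_words("текст \nна 12453\nкирилица and english"), "\n"; // prints 5
```

Matching rather than splitting also avoids the empty array elements that appear when the string begins or ends with a delimiter.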
    
  • 2021-01-05 14:33

    The best solution I found is to provide a list of additional characters to the word count function:

    $text = 'текст на кирилице and on english too';
    $count = str_word_count($text, 0, 'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя');
    echo $count; // => 7
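
The character list above is the Russian alphabet. Since the question is about Bulgarian, here is a sketch of the same approach with the 30 letters of the Bulgarian alphabet. The constant `BG_LETTERS` and the helper `count_words_bg` are names invented for this example, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Bulgarian alphabet, upper and lower case.
const BG_LETTERS = 'АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЬЮЯ'
                 . 'абвгдежзийклмнопрстуфхцчшщъьюя';

// Hypothetical helper wrapping str_word_count() with the extra letters.
function count_words_bg(string $text): int
{
    // str_word_count() works on bytes, so passing these letters adds every
    // byte of their UTF-8 encodings to the set of word characters.
    return str_word_count($text, 0, BG_LETTERS);
}

echo count_words_bg('текст на кирилица с пет думи'), "\n"; // prints 6
echo count_words_bg('текст and text'), "\n";               // prints 3
```

Because the matching is byte-based, other Cyrillic letters whose bytes happen to be covered by the list are accepted as well, which is usually harmless for counting.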
    