Creating an effective word counter including Chinese/Japanese and other accented languages

冷暖自知 submitted on 2019-12-05 19:41:57

You can take a look at the mbstring extension to work with UTF-8 strings.

mb_split() splits a multibyte string using a regex pattern.

<?php
// Split the first CLI argument on spaces and count the pieces.
printf("Counting words in: %s\n", $argv[1]);
mb_regex_encoding('UTF-8');
mb_internal_encoding('UTF-8');
$r = mb_split(' ', $argv[1]);
print_r($r);
printf("Word count: %d\n", count($r));

$ php mb.php "foo bar"
Counting words in: foo bar
Array
(
    [0] => foo
    [1] => bar
)
Word count: 2


$ php mb.php "最適な ツール"
Counting words in: 最適な ツール
Array
(
    [0] => 最適な 
    [1] => ツール
)
Word count: 2

Note: originally I had to add two spaces between the words to get a correct count; this was fixed by setting mb_regex_encoding() and mb_internal_encoding() to UTF-8.
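Splitting on a single literal space is fragile once full-width (ideographic) spaces or repeated separators appear. A minimal sketch that splits on runs of whitespace instead (the sample string and the explicit U+3000 in the character class are my additions, not from the original answer):

```php
<?php
// Split on runs of whitespace, including the full-width ideographic
// space (U+3000), so repeated separators don't inflate the count.
mb_regex_encoding('UTF-8');
mb_internal_encoding('UTF-8');

$input = "最適な　 ツール";            // mixed full-width and ASCII spaces
$parts = mb_split('[\s　]+', $input);  // 　 is U+3000 IDEOGRAPHIC SPACE
$parts = array_values(array_filter($parts, 'strlen'));

printf("Word count: %d\n", count($parts)); // 2
```

The `array_filter()` pass drops the empty strings that mb_split() returns for leading or trailing separators.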

However, Chinese has no concept of space-delimited "words" (and the same can be true of Japanese in some cases), so you may never get a meaningful result this way.

You may need to write an algorithm that uses a dictionary to determine which groups of characters form a "word".
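A minimal sketch of that dictionary idea using greedy longest-match segmentation. The tiny dictionary below is purely illustrative; a real implementation needs a comprehensive word list, and greedy matching is known to make mistakes that proper segmenters avoid:

```php
<?php
// Greedy longest-match segmentation against a word dictionary.
// Unmatched positions fall back to single characters.
function segment(string $text, array $dict): array
{
    $maxLen = 0;
    foreach ($dict as $w) {
        $maxLen = max($maxLen, mb_strlen($w));
    }

    $words = [];
    $i = 0;
    $n = mb_strlen($text);
    while ($i < $n) {
        $match = mb_substr($text, $i, 1); // fallback: one character
        // Try the longest dictionary candidate first.
        for ($len = min($maxLen, $n - $i); $len > 1; $len--) {
            $candidate = mb_substr($text, $i, $len);
            if (in_array($candidate, $dict, true)) {
                $match = $candidate;
                break;
            }
        }
        $words[] = $match;
        $i += mb_strlen($match);
    }
    return $words;
}

$dict = ['最適', 'ツール']; // illustrative entries only
print_r(segment('最適なツール', $dict)); // 最適 / な / ツール => 3 "words"
```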

There's the Kuromoji morphological analyzer for Japanese, which can be used for word counting. Unfortunately it's written in Java, not PHP. Since porting it all to PHP would be a huge task, I'd suggest writing a small wrapper around it so you can call it on the command line, or looking into other PHP-Java bridges.
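A rough sketch of what such a command-line wrapper call could look like from PHP. The jar name `kuromoji-cli.jar` is hypothetical: Kuromoji is a library, so you would first write a small Java main class yourself that reads text and prints one token per line, then invoke it like this:

```php
<?php
// Parse one-token-per-line tokenizer output into an array of tokens.
function parseTokens(string $output): array
{
    $lines = array_map('trim', explode("\n", $output));
    return array_values(array_filter($lines, 'strlen'));
}

$text = '最適なツール';
// Hypothetical wrapper jar around Kuromoji; prints one token per line.
$raw = shell_exec('java -jar kuromoji-cli.jar ' . escapeshellarg($text));

if ($raw !== null) {
    $tokens = parseTokens($raw);
    printf("Word count: %d\n", count($tokens));
}
```

Shelling out per request is slow; for high volume a long-running Java process or a proper PHP-Java bridge would be a better design.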

I don't know how applicable it is to languages other than Japanese. You may want to look into the Apache Tika project for similar such libraries.

I've had good results using the Intl extension's break iterator, which tokenizes strings using locale-aware word boundaries. For example:

<?php
$words = IntlBreakIterator::createWordInstance('zh');
$words->setText('最適なツール');

$count = 0;
foreach ($words as $offset) {
    if (IntlBreakIterator::WORD_NONE !== $words->getRuleStatus()) {
        $count++;
    }
}

printf("%u words", $count); // 3 words

As I don't understand Chinese I can't verify that "3" is the correct answer. However, it produces accurate results for scripts I do understand, and I am trusting in the ICU library to be solid.

I also note that passing the "zh" locale seems to make no difference to the result, but the argument is mandatory.

I'm running Intl PECL-3.0.0 and ICU version is 55.1. I discovered that my CentOS servers were running older versions than these and they didn't work for Chinese. So make sure you have the latest versions.
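A quick way to check which versions you actually have at runtime (both `phpversion('intl')` and the `INTL_ICU_VERSION` constant are provided by the Intl extension):

```php
<?php
// Report the Intl extension version and the ICU release it was built
// against; older ICU data produces poor CJK word breaking.
printf("Intl extension: %s\n", phpversion('intl'));
printf("ICU version:    %s\n", INTL_ICU_VERSION);
```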
