What's the best way to determine the total number of words of a file in Java?

一向 2021-01-14 01:28

What is the best way to find the total number of words in a text file in Java? I'm thinking Perl is the best at finding things such as this. If this is true, then calling a

6 answers
  •  时光说笑
    2021-01-14 02:17

    Congratulations, you have stumbled upon one of the biggest problems in linguistics: what is a word? It is said that "word" is the only word that actually means what it is. There is an entire field of linguistics devoted to words and units of meaning: morphology.

    I assume that your question pertains to counting words in English. However, creating a language-neutral word counter/parser is next to impossible due to linguistic differences. For example, one might think that just processing the groups of characters separated by whitespace is sufficient. However, if you look at the following example in Japanese, you will see that this approach does not work:

    これは日本語の例文です。

    This example contains 3 distinct words, and none of them are separated by spaces. Typically, Japanese word boundaries are found using a dictionary-based approach, and there are a number of commercial libraries available for this. How lucky we are to have spaces in English! I believe that Indic languages, Chinese and Korean have similar problems.

    If this solution is actually going to be deployed anywhere that multilingual input is possible, it will be important to be able to plug in different word-counting methods depending on the language being parsed.
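One way to keep the counting method pluggable is sketched below. This is only an illustration under stated assumptions: the names `WordCounter` and `WordCounterRegistry` are hypothetical, and the fallback uses the JDK's `java.text.BreakIterator`, whose word boundaries for languages such as Japanese may be much cruder than those of a dictionary-based segmenter. A dedicated library would be registered for those languages instead.

```java
import java.text.BreakIterator;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical strategy interface: one implementation per counting method.
interface WordCounter {
    long count(String text);
}

// Hypothetical registry that picks a counting method by language code.
final class WordCounterRegistry {
    private final Map<String, WordCounter> byLanguage = new HashMap<>();

    // Registers a language-specific counter, e.g. a dictionary-based segmenter.
    void register(Locale locale, WordCounter counter) {
        byLanguage.put(locale.getLanguage(), counter);
    }

    // Returns the registered counter, or a BreakIterator-based fallback.
    WordCounter forLocale(Locale locale) {
        WordCounter registered = byLanguage.get(locale.getLanguage());
        return registered != null
                ? registered
                : text -> countWithBreakIterator(text, locale);
    }

    // Counts boundary-delimited segments that contain a letter or digit,
    // so runs of whitespace and punctuation are not counted as words.
    static long countWithBreakIterator(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);
        long count = 0;
        int start = boundaries.first();
        for (int end = boundaries.next();
             end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
            if (containsLetterOrDigit(text, start, end)) {
                count++;
            }
        }
        return count;
    }

    private static boolean containsLetterOrDigit(String text, int start, int end) {
        for (int i = start; i < end; ) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint)) {
                return true;
            }
            i += Character.charCount(codePoint);
        }
        return false;
    }
}
```

A caller would ask the registry for `forLocale(Locale.ENGLISH)` or `forLocale(Locale.JAPANESE)` and call `count` on the result, so swapping in a better segmenter for one language does not touch the rest of the code.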

    I think the first answer was a good one because it uses Java's knowledge of Unicode whitespace values as delimiters. It tokenizes by matching the following regex: \p{javaWhitespace}+
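For reference, a minimal sketch of whitespace-delimited counting with that regex might look like the following. The class name `WhitespaceWordCounter` is hypothetical, the file is assumed to be UTF-8, and this is not necessarily the code from the first answer. It treats any run of non-whitespace characters as one word, which is a reasonable approximation for English only.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

// Hypothetical helper: counts whitespace-delimited tokens.
final class WhitespaceWordCounter {
    // One or more characters for which Character.isWhitespace returns true.
    private static final Pattern WHITESPACE = Pattern.compile("\\p{javaWhitespace}+");

    // Counts the non-empty tokens in a string.
    static long countWords(String text) {
        long count = 0;
        for (String token : WHITESPACE.split(text)) {
            // split() can yield an empty leading token when the text starts
            // with whitespace, and a single empty token for empty input.
            if (!token.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    // Counts the words in a file, reading it line by line (UTF-8 assumed).
    static long countWords(Path file) throws IOException {
        long total = 0;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                total += countWords(line);
            }
        }
        return total;
    }
}
```

Because the Japanese sentence above contains no whitespace, `countWords` reports it as a single word, which is exactly the limitation described earlier.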
