Word break in languages without spaces between words (e.g., Asian)?

时光毁灭记忆、已成空白 posted on 2019-12-02 18:11:09

Word breaking for the languages mentioned requires a linguistic approach, for example one that uses a dictionary along with an understanding of basic stemming rules.
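To make the dictionary idea concrete, here is a minimal sketch of forward maximum matching, one common dictionary-based technique (not necessarily the one a production word breaker would use). The function name `max_match`, the `max_len` parameter and the toy dictionary are all made up for illustration; a real system would need a large lexicon and rules for unknown words.

```python
def max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest dictionary entry that starts there;
    fall back to a single character when nothing matches.
    `dictionary` is any container of known words (hypothetical toy data here).
    """
    tokens = []
    i = 0
    while i < len(text):
        longest = min(max_len, len(text) - i)
        for size in range(longest, 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy dictionary, for illustration only.
toy_dictionary = {"北京", "大学", "北京大学", "学生"}
print(max_match("我在北京", toy_dictionary))
print(max_match("北京大学生", toy_dictionary))
```

The second call shows why a dictionary alone is not enough: the greedy match takes 北京大学 and leaves 生 stranded, so some linguistic knowledge (or statistics) is still needed to choose between competing segmentations.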

I've heard of relatively successful full-text search applications for Chinese which simply treat every single character as a separate word, and apply the same "tokenization" to the search criteria supplied by the end users. The search engine then gives a better ranking to the documents which contain these character-words in the same order as the search criteria. I'm not sure this could be extended to languages such as Japanese, as the Hiragana and Katakana character sets make the text more akin to European languages with a short alphabet.
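A rough sketch of that character-per-word approach, assuming a very simple scoring rule that I made up for illustration (one point per query character found in the document, two extra points for each adjacent query pair that also appears adjacent and in the same order). The names `char_tokens`, `score` and `search` are hypothetical; a real engine would use an inverted index and a proper ranking function.

```python
def char_tokens(text):
    """Treat every non-whitespace character as its own 'word'."""
    return [ch for ch in text if not ch.isspace()]


def score(query, document):
    """Illustrative score: character hits plus a bonus for in-order pairs."""
    q = char_tokens(query)
    d = char_tokens(document)
    present = set(d)
    hits = sum(1 for ch in q if ch in present)
    # Adjacent character pairs of the document, in document order.
    doc_pairs = set(zip(d, d[1:]))
    in_order = sum(1 for pair in zip(q, q[1:]) if pair in doc_pairs)
    return hits + 2 * in_order


def search(query, documents):
    """Return matching documents, best score first."""
    scored = [(score(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: -pair[0])
    return [doc for s, doc in scored if s > 0]


docs = ["我喜欢北京烤鸭", "京北是一个地方", "上海很大"]
print(search("北京", docs))
```

Both of the first two documents contain 北 and 京, but only the first has them in query order, so it ranks first; the third contains neither character and is dropped.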

EDIT:
Resources
This word breaking problem, as well as related issues, is so non-trivial that whole books are written about it. See for example CJKV Information Processing (CJKV stands for Chinese, Japanese, Korean and Vietnamese; you may also use the CJK keyword, for in many texts, Vietnamese is not discussed). See also Word Breaking in Japanese is hard for a one-pager on this topic.
Understandably, the majority of the material covering this topic is written in one of the underlying native languages, and is therefore of limited use to people without reasonable fluency in these languages. For that reason, and also to help you validate the search engine once you start implementing the word breaker logic, you should seek the help of a native speaker or two.

Various ideas
Your idea of identifying characters which systematically imply a word break (say quotes, parentheses, hyphen-like characters and such) is good, and that is probably one heuristic used by some of the professional-grade word breakers. Yet, you should seek an authoritative source for such a list, rather than assembling one from scratch based on anecdotal findings.
A related idea is to break words at Kana-to-Kanji transitions (but I'm guessing not the other way around), and possibly at Hiragana-to-Katakana or vice-versa transitions.
Unrelated to word-breaking proper, the index may [ -or may not- ;-)] benefit from the systematic conversion of every, say, hiragana character to the corresponding katakana character. Just an uneducated idea! I do not know enough about the Japanese language to know if that would help; intuitively, it would be loosely akin to the systematic conversion of accented letters and such to the corresponding non-accented letters, as practiced with several European languages.
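The three ideas above can be sketched together in a few lines of Python. The assumptions here are mine: punctuation is taken from the Unicode general categories (via the standard `unicodedata` module) instead of a hand-made list, the scripts are recognized by their main Unicode blocks only (Hiragana U+3040 to U+309F, Katakana U+30A0 to U+30FF, CJK Unified Ideographs U+4E00 to U+9FFF), and the function names are hypothetical. Half-width katakana, rare ideographs and so on are ignored.

```python
import unicodedata

KANA = ("hiragana", "katakana")


def script_of(ch):
    """Classify one character; punctuation and whitespace count as breaks."""
    if ch.isspace() or unicodedata.category(ch).startswith("P"):
        return "break"
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def split_on_transitions(text):
    """Break at punctuation, at kana-to-kanji transitions (not kanji-to-kana),
    and at hiragana/katakana switches."""
    chunks, current, prev = [], "", None
    for ch in text:
        kind = script_of(ch)
        if kind == "break":
            if current:
                chunks.append(current)
            current, prev = "", None
            continue
        kana_to_kanji = prev in KANA and kind == "kanji"
        kana_switch = prev in KANA and kind in KANA and prev != kind
        if (kana_to_kanji or kana_switch) and current:
            chunks.append(current)
            current = ""
        current += ch
        prev = kind
    if current:
        chunks.append(current)
    return chunks


def fold_to_katakana(text):
    """Map hiragana letters (U+3041 to U+3096) to katakana by adding 0x60."""
    return "".join(
        chr(ord(ch) + 0x60) if 0x3041 <= ord(ch) <= 0x3096 else ch
        for ch in text
    )


print(split_on_transitions("私は東京でラーメンを食べた。"))
print(fold_to_katakana("らーめん"))
```

On the sample sentence this yields 私は / 東京で / ラーメン / を / 食べた, which keeps particles glued to the preceding word, so it is only a coarse first cut. Whether folding hiragana into katakana actually helps relevance is something a native speaker should judge.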

Maybe the idea I mentioned earlier, of systematically indexing individual characters (and of ranking the search results based on how closely their order matches the search criteria), can be slightly altered, for example by keeping consecutive kana characters together, plus some other rules... and produce an imperfect but practical enough search engine.
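A minimal sketch of such a hybrid tokenizer, assuming the simplest possible rule set: each kanji is its own token, while runs of hiragana, runs of katakana and runs of Latin letters or digits stay together. The pattern and the name `hybrid_tokens` are mine, and the character ranges cover only the main Unicode blocks.

```python
import re

TOKEN_PATTERN = re.compile(
    r"[\u3041-\u3096\u309D\u309E]+"     # run of hiragana letters / iteration marks
    r"|[\u30A1-\u30FA\u30FC-\u30FF]+"   # run of katakana, including the long-vowel mark
    r"|[\u4E00-\u9FFF]"                 # one kanji at a time
    r"|[A-Za-z0-9]+"                    # run of Latin letters or digits
)


def hybrid_tokens(text):
    """Index kanji individually but keep kana runs together.

    Everything else (punctuation, whitespace) is skipped.
    """
    return TOKEN_PATTERN.findall(text)


print(hybrid_tokens("東京でラーメンを食べた。"))
```

The same function would be applied to both the documents and the search criteria, so that, for instance, ラーメン is matched as one unit while 東 and 京 are matched character by character and ranked by order as described earlier.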

Do not be disappointed if this is not the case... As stated, this is far from trivial, and you may save time and money in the long term by taking a pause and reading a book or two. Another reason to try and learn more of the "theory" and best practices is that at the moment you seem to be focused on word breaking, but soon the search engine may also benefit from stemming-awareness; indeed these two issues are, linguistically at least, related, and may benefit from being handled in tandem.

Good luck on this vexing but worthy endeavor.

B_W

One year later, and you probably don't need this any more, but the code on the following page might have some hints for what you want(ed) to do:

http://www.geocities.co.jp/SiliconValley-PaloAlto/7043/spamfilter/japanese-tokenizer.el.txt

If you made any progress after the above posts in your own search I am sure others would be interested to know.

(Edited to say there is a better answer here: How to classify Japanese characters as either kanji or kana?)
