Word break in languages without spaces between words (e.g., Asian)?

孤城傲影 2021-01-31 11:25

I'd like to make MySQL full text search work with Japanese and Chinese text, as well as any other language. The problem is that these languages and probably others do not norma

2 Answers
  • 2021-01-31 11:36

    Word breaking for the languages mentioned requires a linguistic approach, for example one that uses a dictionary along with an understanding of basic stemming rules.

    I've heard of relatively successful full-text search applications for Chinese that simply treat every single character as a separate word and apply the same "tokenization" to the search criteria supplied by end users. The search engine then ranks higher the documents that contain these character-words in the same order as the search criteria. I'm not sure this could be extended to languages such as Japanese, as the Hiragana and Katakana character sets make the text more akin to European languages with a short alphabet.
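
    As a rough illustration of this character-as-word idea, here is a minimal Python sketch. The function names and the scoring rule (longest run of query characters that appears contiguously in the document) are assumptions made for the example, not what any particular search engine does.

```python
def char_tokens(text: str) -> list[str]:
    """Treat every non-whitespace character as its own 'word'."""
    return [ch for ch in text if not ch.isspace()]


def order_score(query: str, document: str) -> int:
    """Length of the longest run of query characters that occurs
    contiguously, in the same order, in the document."""
    q = char_tokens(query)
    d = char_tokens(document)
    best = 0
    for i in range(len(q)):
        for j in range(len(d)):
            k = 0
            while i + k < len(q) and j + k < len(d) and q[i + k] == d[j + k]:
                k += 1
            best = max(best, k)
    return best


def search(query: str, documents: list[str]) -> list[str]:
    """Return documents sharing at least one character with the query,
    ranked so that documents preserving the query order come first."""
    q_set = set(char_tokens(query))
    hits = [doc for doc in documents if q_set & set(char_tokens(doc))]
    return sorted(hits, key=lambda doc: order_score(query, doc), reverse=True)
```

    With this scoring, a document containing 北京 as written ranks above one that merely contains 京 and 北 somewhere, which is the ordering effect described above. A real index would of course store positions rather than rescan every document.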

    EDIT:
    Resources
    This word breaking problem, as well as related issues, is so non-trivial that whole books are written about it. See for example CJKV Information Processing (CJKV stands for Chinese, Japanese, Korean and Vietnamese; you may also use the CJK keyword, for in many texts, Vietnamese is not discussed). See also Word Breaking in Japanese is hard for a one-pager on this topic.
    Understandably, the majority of the material covering this topic is written in one of the underlying native languages, and is therefore of limited use to people without relative fluency in these languages. For that reason, and also to help you validate the search engine once you start implementing the word breaker logic, you should seek the help of a native speaker or two.

    Various ideas
    Your idea of identifying characters which systematically imply a word break (say quotes, parentheses, hyphen-like characters and such) is good, and that is probably one heuristic used by some of the professional-grade word breakers. Yet you should seek an authoritative source for such a list, rather than assembling one from scratch based on anecdotal findings.
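
    One candidate for such a source is the Unicode general category of each character, which Python exposes through the standard unicodedata module. The sketch below assumes that every punctuation (P*) or separator (Z*) character is a break; whether that list is right for a given corpus is something to verify, and the helper names are made up for the example.

```python
import unicodedata


def is_break_char(ch: str) -> bool:
    """True for whitespace and for characters whose Unicode general
    category is punctuation (P*) or separator (Z*)."""
    return ch.isspace() or unicodedata.category(ch)[0] in ("P", "Z")


def split_on_breaks(text: str) -> list[str]:
    """Cut the text at break characters and drop the break characters."""
    chunks = []
    current = []
    for ch in text:
        if is_break_char(ch):
            if current:
                chunks.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        chunks.append("".join(current))
    return chunks
```

    This only yields coarse chunks between punctuation marks; each chunk still needs further breaking by some other rule.
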
    A related idea is to break words at Kana-to-Kanji transitions (but I'm guessing not the other way around), and possibly at Hiragana-to-Katakana or vice-versa transitions.
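
    A minimal sketch of that transition rule, assuming scripts can be told apart by Unicode block alone (Hiragana U+3040..U+309F, Katakana U+30A0..U+30FF, CJK ideographs U+4E00..U+9FFF plus Extension A). The function names are hypothetical, and the output is a set of chunks produced by the heuristic, not linguistically verified words.

```python
KANA = {"hiragana", "katakana"}


def script_of(ch: str) -> str:
    """Classify a character by the Unicode block it falls in."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF:
        return "kanji"
    return "other"


def is_break(prev: str, cur: str) -> bool:
    """Break at kana-to-kanji and hiragana/katakana switches,
    but not at kanji-to-kana transitions."""
    if prev == cur:
        return False
    if prev in KANA and cur == "kanji":
        return True
    if prev in KANA and cur in KANA:
        return True
    if "other" in (prev, cur):
        return True  # Latin letters, digits, punctuation, etc.
    return False  # kanji followed by kana stays together


def break_at_transitions(text: str) -> list[str]:
    chunks = []
    current = ""
    prev = ""
    for ch in text:
        cur = script_of(ch)
        if current and is_break(prev, cur):
            chunks.append(current)
            current = ""
        current += ch
        prev = cur
    if current:
        chunks.append(current)
    return chunks
```

    For 私は東京タワーに行きました this produces 私は / 東京タワー / に / 行きました: a particle stays glued to the preceding kanji and a kanji-plus-katakana compound is kept whole, which shows both the appeal and the limits of the rule.
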
    Unrelated to word-breaking proper, the index may [ -or may not- ;-)] benefit from the systematic conversion of every, say, hiragana character to the corresponding katakana character. Just an uneducated idea! I do not know enough about the Japanese language to know if that would help; intuitively, it would be loosely akin to the systematic conversion of accented letters and such to the corresponding unaccented letter, as practiced with several European languages.
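
    Mechanically, this folding is simple, because the basic hiragana letters U+3041..U+3096 sit exactly 0x60 code points below their katakana counterparts. A small sketch (the name fold_kana is made up here); it would have to be applied to both the indexed text and the search terms:

```python
# Map each basic hiragana code point to the katakana code point 0x60 above it.
_HIRAGANA_TO_KATAKANA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}


def fold_kana(text: str) -> str:
    """Replace hiragana letters with the corresponding katakana letters;
    every other character is left untouched."""
    return text.translate(_HIRAGANA_TO_KATAKANA)
```

    Whether this folding actually improves search quality is the open question raised above; the code only shows that it is cheap to try.
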

    Maybe the idea I mentioned earlier, of systematically indexing individual characters (and of ranking the search results based on how closely their order matches the search criteria), can be slightly altered, for example by keeping consecutive kana characters together, plus some other rules, to produce an imperfect but practical enough search engine.
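
    A sketch of that altered tokenization, under the assumption that "kana" means anything in the Unicode Hiragana and Katakana blocks (U+3040..U+30FF) and that every other non-space character is indexed on its own. The function names are hypothetical.

```python
def is_kana(ch: str) -> bool:
    """True for characters in the Hiragana or Katakana Unicode blocks."""
    return 0x3040 <= ord(ch) <= 0x30FF


def mixed_tokens(text: str) -> list[str]:
    """Emit each run of consecutive kana as one token and every other
    non-whitespace character as a single-character token."""
    tokens = []
    run = ""
    for ch in text:
        if is_kana(ch):
            run += ch
            continue
        if run:
            tokens.append(run)
            run = ""
        if not ch.isspace():
            tokens.append(ch)
    if run:
        tokens.append(run)
    return tokens
```

    For 東京タワーに行く this gives 東, 京, タワーに, 行, く. The query must be tokenized the same way, so a search for タワー alone would not match the token タワーに, which is one of the "other rules" that would need working out.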

    Do not be disappointed if this is not the case... As stated, this is far from trivial, and you may save time and money in the long term by pausing to read a book or two. Another reason to learn more of the "theory" and best practices is that at the moment you seem to be focused on word breaking, but soon the search engine may also benefit from stemming-awareness; indeed these two issues are, linguistically at least, related, and may benefit from being handled in tandem.

    Good luck on this vexing but worthy endeavor.

  • 2021-01-31 11:46

    One year later you probably don't need this any more, but the code on the following page might offer some hints for what you want(ed) to do:

    http://www.geocities.co.jp/SiliconValley-PaloAlto/7043/spamfilter/japanese-tokenizer.el.txt

    If you made any progress after the above posts in your own search I am sure others would be interested to know.

    (Edited to say there is a better answer here: How to classify Japanese characters as either kanji or kana?)
