A Viable Solution for Word Splitting Khmer?

既然无缘 2021-01-01 19:52

I am working on a solution to split long lines of Khmer (the Cambodian language) into individual words (in UTF-8). Khmer does not use spaces between words. There are a few

3 Answers
  •  有刺的猬
    2021-01-01 20:41

    I think this is a good idea, as it is.

    Once you have some experience with it, I suggest adding rules, which can be very specific: for example, rules that depend on the word before, the word after, the surrounding words, or a sequence of words before the current word, to name only the most frequent kinds. You can find a set of such rules in the gposttl.sf.net project, a POS tagging project, in the file data/contextualrulefile.

    Rules should be applied AFTER the statistical evaluation is finished; they do some fine-tuning and can improve accuracy remarkably.
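
To make the idea concrete, here is a minimal sketch of such a rule pass, run over the word list that the statistical splitter has already produced. The `MergeRule` class, its fields, and `apply_rules` are hypothetical names made up for illustration; they are not taken from gposttl or from the question's code, and the Latin letters stand in for real Khmer words.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class MergeRule:
    """Hypothetical rule: join `left` + `right` into one word if the context matches."""
    left: str
    right: str
    before: Optional[str] = None  # required preceding word, or None for "any"
    after: Optional[str] = None   # required following word, or None for "any"


def apply_rules(words: Sequence[str], rules: Sequence[MergeRule]) -> List[str]:
    """Post-process a statistical segmentation with contextual merge rules."""
    out: List[str] = []
    i = 0
    while i < len(words):
        merged = False
        if i + 1 < len(words):
            # Context is the last word already emitted and the word after the pair.
            prev_word = out[-1] if out else None
            next_word = words[i + 2] if i + 2 < len(words) else None
            for rule in rules:
                if (words[i] == rule.left
                        and words[i + 1] == rule.right
                        and (rule.before is None or rule.before == prev_word)
                        and (rule.after is None or rule.after == next_word)):
                    out.append(words[i] + words[i + 1])
                    i += 2
                    merged = True
                    break
        if not merged:
            out.append(words[i])
            i += 1
    return out


# Placeholder tokens: merge "a" + "b" only when the next word is "c".
rules = [MergeRule("a", "b", after="c")]
fixed = apply_rules(["a", "b", "c", "a", "b"], rules)
```

Only the first "a", "b" pair is merged here, because only that pair is followed by "c". Rules that split a word, or that look further back than one word, could be added in the same way.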
