A Viable Solution for Word Splitting Khmer?


既然无缘 2021-01-01 19:52

I am working on a solution to split long lines of Khmer (the Cambodian language) into individual words (in UTF-8). Khmer does not use spaces between words. There are a few

3 Answers

有刺的猬 (OP)

2021-01-01 20:41

I think this is a good idea as it stands.

Once you have some experience with it, I suggest adding rules. These can be very specific: for example, rules that depend on the preceding word, the following word, the surrounding words, or a sequence of words before the current one, to name the most frequent kinds. You can find a set of such rules in the gposttl.sf.net project, a POS-tagging project, in the file data/contextualrulefile.

Apply the rules AFTER the statistical evaluation is finished. They act as fine-tuning and can improve accuracy remarkably.
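
To make the idea concrete, here is a minimal sketch of such a rule pass in Python. It assumes the statistical step has already produced a list of tokens, and it only handles one kind of correction (merging two adjacent tokens when the neighbouring words match). The `MergeRule` class, the `apply_rules` function, and the placeholder tokens are all hypothetical names made up for illustration; this is not the gposttl rule format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MergeRule:
    """Hypothetical rule: join `left` + `right` into one word in a given context."""
    left: str
    right: str
    before: Optional[str] = None  # required word before `left`; None means any
    after: Optional[str] = None   # required word after `right`; None means any


def apply_rules(tokens: List[str], rules: List[MergeRule]) -> List[str]:
    """Run contextual merge rules over the output of a statistical segmenter."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        merged = False
        if i + 1 < len(tokens):
            # The "word before" is the last token already emitted (post-correction).
            before = out[-1] if out else None
            after = tokens[i + 2] if i + 2 < len(tokens) else None
            for rule in rules:
                if (tokens[i] == rule.left
                        and tokens[i + 1] == rule.right
                        and (rule.before is None or rule.before == before)
                        and (rule.after is None or rule.after == after)):
                    out.append(tokens[i] + tokens[i + 1])
                    i += 2
                    merged = True
                    break
        if not merged:
            out.append(tokens[i])
            i += 1
    return out


# Placeholder tokens stand in for Khmer words produced by the statistical step.
segmented = ["a", "b", "c", "a", "b", "d"]
rules = [MergeRule(left="a", right="b", after="c")]
corrected = apply_rules(segmented, rules)
print(corrected)
```

With these placeholder inputs, only the first "a", "b" pair is merged, because only that pair is followed by "c". Keeping the rules in a separate pass means they can be added or removed without retraining or retuning the statistical step.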
