A Viable Solution for Word Splitting Khmer?

既然无缘 2021-01-01 19:52

I am working on a solution to split long lines of Khmer (the Cambodian language) into individual words (in UTF-8). Khmer does not use spaces between words. There are a few

3 Answers
  • 2021-01-01 20:32

    The Python example that uses the input "filesaveas" appears to recurse through the entire input string (for i in xrange(1, len(text) + 1)), stuffing the best results into the cache along the way. At each potential word, it then starts looking at the next word (which will in turn look at the word after that, and so on), and if that second word doesn't look very good, it won't save that particular one. It feels like O(N!) runtime, where N is the length of the input string.
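
    For reference, here is a minimal sketch of the kind of memoized recursion described above. It is not the original example: the probability table, the unknown-word penalty, and the names word_prob and segment are all assumptions made for illustration, and range replaces xrange so it runs on Python 3. One thing the sketch makes visible is that, with the cache in place, each distinct suffix of the input is solved only once.

```python
from functools import lru_cache

# Hypothetical unigram probabilities; a real table would be built from a corpus.
WORD_PROB = {"file": 0.4, "save": 0.3, "as": 0.2, "files": 0.05, "a": 0.05}
UNKNOWN_PROB = 1e-10  # assumed per-character penalty for strings not in the table


def word_prob(word):
    """Probability of a single candidate word; unknown strings are penalized by length."""
    return WORD_PROB.get(word, UNKNOWN_PROB ** len(word))


@lru_cache(maxsize=None)  # the cache: each distinct suffix is segmented once
def segment(text):
    """Return (probability, words) for the most probable split of text."""
    if not text:
        return 1.0, ()
    best = (0.0, (text,))
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        rest_prob, rest_words = segment(rest)
        candidate = (word_prob(first) * rest_prob, (first,) + rest_words)
        if candidate[0] > best[0]:
            best = candidate
    return best
```

    With this toy table, segment("filesaveas") picks the split file / save / as.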

    Super clever, but probably horrible for anything but simple tasks. What's the longest Khmer word you've got? I'm hoping < 20 characters.

    Maybe if you feed input into that example 20 characters at a time you can keep the runtime down to something approaching reasonable. Feed in the first 20 chars, peel off the first word, and then feed in the remaining input. If you re-use the cache it might do something silly like store partial words along the way.
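
    The windowing idea could look roughly like the sketch below. The function segment_in_windows, the toy vocabulary, and the stand-in toy_segment are all hypothetical; any segmenter that returns a list of words for a short string could be passed in instead.

```python
# Toy vocabulary, assumed for illustration only.
VOCAB = {"file", "save", "as", "new"}


def toy_segment(chunk):
    """Stand-in segmenter: greedy longest match, falling back to single characters."""
    out, i = [], 0
    while i < len(chunk):
        for j in range(len(chunk), i, -1):
            if chunk[i:j] in VOCAB:
                break
        else:
            j = i + 1  # no known word starts here; emit one character
        out.append(chunk[i:j])
        i = j
    return out


def segment_in_windows(text, segment, window=20):
    """Segment a fixed-size window, keep only its first word, then slide forward."""
    words, pos = [], 0
    while pos < len(text):
        chunk = text[pos:pos + window]
        first = segment(chunk)[0] or chunk[0]  # guard against an empty first word
        words.append(first)
        pos += len(first)
    return words
```

    Each call only ever sees at most window characters, so the cost per word is bounded by the window size rather than by the length of the whole line. The trade-off is that a word boundary is decided without looking beyond the window.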

    On a completely different tack, how many Khmer words are formed by concatenating two or more legal Khmer words? (similar to 'penknife' or 'basketball') If not too many, it might make sense to create a set of dictionaries, segregated by length of word, mapping from word to probability of use.

    Say, the longest Khmer word is 14 chars long; feed in 14 characters of input into the len14 dictionary, store the probability. Feed in 13 characters into len13, store the probability. Feed in 12 characters ... all the way down to 1 into len1. Then pick the interpretation with the highest probability, save the word, strip off that many characters, and try again.

    So that it doesn't fail badly on inputs like "I" vs. "Image", maybe longer candidates should get automatically inflated probabilities?
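
    A rough sketch of the per-length dictionaries, with an optional bias toward longer words, might look like this. The probabilities, the function names, and the length_bonus parameter are assumptions made for illustration; English stands in for Khmer.

```python
from collections import defaultdict

# Hypothetical word probabilities; real values would come from a corpus.
WORD_PROB = {"i": 0.30, "image": 0.05, "in": 0.20, "age": 0.10, "mage": 0.01}


def build_length_tables(word_prob):
    """Split one word->probability dict into one dict per word length."""
    tables = defaultdict(dict)
    for word, p in word_prob.items():
        tables[len(word)][word] = p
    return tables


def split_greedy(text, tables, length_bonus=1.0):
    """Repeatedly take the best-scoring word at the current position.

    Score is probability * length_bonus ** length, so length_bonus > 1
    inflates longer candidates (an assumed weighting, not a tuned one).
    """
    max_len = max(tables)  # longest word length in the dictionaries
    words, pos = [], 0
    while pos < len(text):
        best_word, best_score = text[pos], 0.0  # fallback: a single character
        for n in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + n]
            p = tables.get(n, {}).get(candidate)
            if p is None:
                continue
            score = p * length_bonus ** n
            if score > best_score:
                best_word, best_score = candidate, score
        words.append(best_word)
        pos += len(best_word)
    return words
```

    With these toy numbers, split_greedy("image", tables) with no bonus takes "i" first and is left with "mage", while length_bonus=2.0 is enough to make it prefer "image" as a whole.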

    Thanks for the fun question ;) I didn't know of any languages like this, pretty cool.

  • 2021-01-01 20:41

    I think this is a good idea, as it is.

    Once you have some experience with it, I suggest adding some rules, which can be very specific: for example, rules that depend on the word before, the word after, the surrounding words, or a sequence of words before the current word, to name just the most frequent kinds. You can find a set of such rules in the gposttl.sf.net project, which is a POS tagging project, in the file data/contextualrulefile.

    Rules should be applied AFTER the statistical evaluation is finished; they do some fine-tuning and can improve accuracy remarkably.
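
    As a rough illustration of such a post-processing pass, here is a minimal sketch. The rule signature (previous word, current word, next word) and the example rule are hypothetical; this is not the format used in gposttl's contextualrulefile, and English tokens stand in for Khmer.

```python
def apply_rules(words, rules):
    """Run each rule once, left to right, over an already segmented word list."""
    for rule in rules:
        out, i = [], 0
        while i < len(words):
            prev = out[-1] if out else None
            nxt = words[i + 1] if i + 1 < len(words) else None
            # A rule returns the replacement words and how many inputs it consumed.
            replacement, consumed = rule(prev, words[i], nxt)
            out.extend(replacement)
            i += consumed
        words = out
    return words


def merge_pen_knife(prev, cur, nxt):
    """Hypothetical rule: join 'pen' followed by 'knife' into one compound."""
    if cur == "pen" and nxt == "knife":
        return ["penknife"], 2
    return [cur], 1
```

    Keeping the rules separate from the statistical pass means each one can be added or removed without touching the probabilities.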

  • 2021-01-01 20:45

    The ICU library (which has Python and Java bindings) has a DictionaryBasedBreakIterator class that can be used for this.
