How can I split a Thai sentence into words? In English, we can split words on spaces.
Example: I go to school, split = ['I', 'go', 'to', 'school']
There are multiple ways to tokenize Thai words. One is a dictionary-based or pattern-based approach: the algorithm scans through the characters, and whenever a sequence of characters appears in the dictionary, it is counted as a word.
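To make the dictionary-based idea concrete, here is a minimal sketch. It assumes a greedy longest-match strategy and a tiny hand-written word list; the function name `tokenize_longest_match` and `TOY_DICTIONARY` are made up for illustration and do not come from any library. A real tokenizer would need a full Thai dictionary and better handling of ambiguous or unknown words.

```python
def tokenize_longest_match(text, dictionary):
    """Greedy left-to-right segmentation: always take the longest dictionary word."""
    max_len = max(map(len, dictionary))
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a dictionary word matches.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character as a fallback.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy dictionary for illustration only (an assumption, not a real word list).
TOY_DICTIONARY = {"ฉัน", "จะ", "ไป", "โรง", "เรียน", "โรงเรียน"}

print(tokenize_longest_match("ฉันจะไปโรงเรียน", TOY_DICTIONARY))
# ['ฉัน', 'จะ', 'ไป', 'โรงเรียน']
```

Because the match is greedy, the compound 'โรงเรียน' (school) is kept as one token whenever it is in the dictionary. Other strategies, such as maximal matching over the whole sentence, can give different splits.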
There are also recent libraries that tokenize Thai text with deep learning models trained on the BEST corpus, including rkcosmos/deepcut, pucktada/cutkum, and more.
Example usage of deepcut:
import deepcut
deepcut.tokenize('ฉันจะไปโรงเรียน')
# output as ['ฉัน', 'จะ', 'ไป', 'โรง', 'เรียน']