How to split a Thai sentence, which does not use spaces, into words?

不思量自难忘° 2021-02-19 06:46

How can I split a Thai sentence into words? In English, words can be split on spaces.

Example: splitting "I go to school" gives ['I', 'go', 'to', 'school'].

4 answers
  •  独厮守ぢ
    2021-02-19 07:12

    There are several ways to tokenize Thai text into words. One is a dictionary-based or pattern-based approach: the algorithm scans through the characters, and whenever a run of characters appears in the dictionary, it is counted as a word.
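As a rough illustration of the dictionary-based idea, here is a minimal sketch of greedy longest matching in plain Python. The function name `tokenize_longest_match` and the six-word `toy_dictionary` are made up for this example; a real tokenizer would load a full Thai word list and handle ambiguous segmentations more carefully than this greedy scan does.

```python
def tokenize_longest_match(text, dictionary):
    """Greedy left-to-right segmentation: take the longest dictionary word at each position."""
    # Longest dictionary entry bounds how far ahead we need to look.
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first and shrink until something is in the dictionary.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is None:
            # Nothing in the dictionary starts here: emit the single character as its own token.
            match = text[i]
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary, just large enough for the example sentence.
toy_dictionary = {"ฉัน", "จะ", "ไป", "โรง", "เรียน", "โรงเรียน"}
tokens = tokenize_longest_match("ฉันจะไปโรงเรียน", toy_dictionary)
print(tokens)  # ['ฉัน', 'จะ', 'ไป', 'โรงเรียน']
```

Because the scan always prefers the longest entry, "โรงเรียน" (school) comes out as one token here rather than as "โรง" + "เรียน". The same greediness can pick the wrong split when a long word overlaps the start of the next one, which is one reason trained models are often used instead.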

    There are also more recent libraries that tokenize Thai text with deep learning models trained on the BEST corpus, including rkcosmos/deepcut, pucktada/cutkum, and others.

    Example usage of deepcut:

    import deepcut

    # Tokenize a Thai sentence ("I will go to school") into a list of words.
    deepcut.tokenize('ฉันจะไปโรงเรียน')
    # output: ['ฉัน', 'จะ', 'ไป', 'โรง', 'เรียน']
    
