Tokenization of Arabic words using NLTK

↘锁芯ラ 提交于 2019-11-29 00:51:28

问题


I'm using NLTK word_tokenizer to split a sentence into words.

I want to tokenize this sentence:

في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء 

The code I'm writing is:

import re
import nltk

lex = u" في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء"

wordsArray = nltk.word_tokenize(lex)
print " ".join(wordsArray)

The problem is that the word_tokenize function doesn't split by words. Instead, it splits by letters so that the output is:

"ف ي _ ب ي ت ن ا ك ل ش ي ل م ا ت ح ت ا ج ه ي ض ي ع ... ا د و ر ع ل ى ش ا ح ن ف ج أ ة ي خ ت ف ي .. ل د ر ج ة ا ن ي ا س و ي ن ف س ي ا د و ر ش ي ء"

Any ideas ?

What I've reached so far:

By trying the text in here, it appeared to be tokenized by letters. Also, however, other tokenizers tokenized it correctly. Does that mean that word_tokenize is for English only? Does that go for most of NLTK functions?


回答1:


I always recommend using nltk.tokenize.wordpunct_tokenize. You can try out many of the NLTK tokenizers at http://text-processing.com/demo/tokenize/ and see for yourself.




回答2:


this is the output i get with my code, but i recall unicode doesn't go well in python 2 and I used 3.5

nltk.word_tokenize('في_بيتنا كل شي لما تحتاجه يضيع ...ادور على شاحن فجأة يختفي ..لدرجة اني اسوي نفسي ادور شيء ')

['في_بيتنا', 'كل', 'شي', 'لما', 'تحتاجه', 'يضيع', '...', 'ادور', 'على', 'شاحن', 'فجأة', 'يختفي', '..لدرجة', 'اني', 'اسوي', 'نفسي', 'ادور', 'شيء']



来源:https://stackoverflow.com/questions/13035595/tokenization-of-arabic-words-using-nltk

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!