Question
I need to read a text document in Python that contains both English and non-English (specifically Malayalam) text. This is what I see:
>>> text_english = 'Today is a good day'
>>> text_non_english = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
If I extract the first letter of the English text with
>>> print(text_english[0])
T
it works as expected, but when I run
>>> print(text_non_english[0])
�
To get the first letter, I have to write the following:
>>> print(text_non_english[0:3])
ആ
Why does this happen? My aim is to extract the words in the text so that I can feed them to the TF-IDF transformer. When I build the TF-IDF vocabulary from the Malayalam text, it contains two-letter entries that are not real words; they are fragments of full words. What should I do so that the TF-IDF transformer uses whole Malayalam words instead of two-letter fragments?
I used the following code:
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> useful_text_1[1:3] # contains both English and Malayalam text
>>> vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
>>> # Learn vocabulary and idf, return term-document matrix
>>> vect_2 = vectorizer.fit_transform(useful_text_1[1:3])
>>> vectorizer.vocabulary_
Some of the entries in the vocabulary are shown below:
ഷമ
സന
സഹ
ർക
ർത
The vocabulary is not correct: it does not contain whole words. How can I fix this?
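Answer 1:
You have to decode the text from UTF-8. In Python 2 a plain string literal is a byte string, and each Malayalam letter takes 3 bytes in UTF-8, so indexing returns a single byte rather than a letter. Use the unicode function to get a Unicode string: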
In[36]: tn = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
In[37]: tne=unicode(tn, encoding='utf-8')
In[38]: print(tne[0])
ആ
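On Python 3, where str is already a Unicode string, indexing returns whole characters and the problem above only appears when working with bytes. The sketch below reproduces the effect by encoding explicitly; the variable names are illustrative and not part of the original answer.

```python
# Python 3: str holds code points, bytes holds the UTF-8 encoding.
word = 'ആരാണു'              # first word of the sample text
raw = word.encode('utf-8')   # the byte string that Python 2 would index into

# Characters in the Malayalam block (U+0D00-U+0D7F) take 3 bytes each in UTF-8.
print(len(word), len(raw))

# Three bytes decode back to the first character, matching text[0:3] in the question.
first_char = raw[:3].decode('utf-8')
print(first_char == word[0])

# A single byte on its own is an incomplete sequence and cannot be decoded.
try:
    raw[:1].decode('utf-8')
    partial_ok = True
except UnicodeDecodeError:
    partial_ok = False
print(partial_ok)
```

This is why text_non_english[0] printed a replacement character in the question while text_non_english[0:3] printed the letter.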
Answer 2:
Using a dummy tokenizer actually worked for me:
vectorizer = TfidfVectorizer(tokenizer=lambda x: x.split(), min_df=1)
>>> tn = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'
>>> vectorizer = TfidfVectorizer(tokenizer=lambda x: x.split(),min_df=1)
>>> vect_2 = vectorizer.fit_transform(tn.split())
>>> for x in vectorizer.vocabulary_:
...     print(x)
...
സന്തോഷമാഗ്രഹിക്കാത്തത
ആരാണു
>>>
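A likely reason the dummy tokenizer helps is that the default token_pattern of TfidfVectorizer, documented as r"(?u)\b\w\w+\b", relies on \w, and Python's \w does not match combining marks such as Malayalam vowel signs and the virama. The sketch below uses only the re module to compare that pattern with plain whitespace splitting; the constant and variable names are illustrative.

```python
import re

# Documented default of TfidfVectorizer(token_pattern=...); copied here so the
# comparison runs without scikit-learn installed.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

text = 'ആരാണു സന്തോഷമാഗ്രഹിക്കാത്തത'

# \w matches letters but not combining marks, so every vowel sign or virama
# acts as a word boundary and each word is cut into short fragments.
default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, text)

# Splitting on whitespace keeps each word intact.
whitespace_tokens = text.split()

print(default_tokens)
print(whitespace_tokens)
```

Note that whitespace splitting keeps any punctuation attached to the neighbouring word, so real documents may need an extra cleanup step before vectorizing.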
Source: https://stackoverflow.com/questions/36498617/can-i-use-tfidfvectorizer-in-scikit-learn-for-non-english-language-also-how-do