CountVectorizer with Pandas dataframe

Submitted by £可爱£侵袭症+ on 2021-02-19 01:06:38

Question


I am using scikit-learn for text processing, but my CountVectorizer isn't giving the output I expect.

My CSV file looks like:

"Text";"label"
"Here is sentence 1";"label1"
"I am sentence two";"label2"
...

and so on.

I want to start with a bag-of-words representation in order to understand how SVMs work in Python:

import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

data = pd.read_csv(open('myfile.csv'),sep=';')

target = data["label"]
del data["label"]

# Creating Bag of Words
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data)
X_train_counts.shape 
count_vect.vocabulary_.get(u'algorithm')

But when I do print(X_train_counts.shape) I see the output is only (1,1), whereas I have 1048 rows with sentences.

What am I doing wrong? I am following this tutorial.

(Also the output of count_vect.vocabulary_.get(u'algorithm') is None.)


Answer 1:


The problem is in count_vect.fit_transform(data). The function expects an iterable that yields strings, one per document. A DataFrame is iterable and does yield strings, but they are the wrong ones, which can be verified with a simple example.

for x in data:
    print(x)
# Text

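Only the column name is printed: iterating over a DataFrame yields its column labels, not the values of data['Text']. The vectorizer therefore sees a single document, the string 'Text', which explains the (1, 1) shape. Pass the column itself instead: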

X_train_counts = count_vect.fit_transform(data.Text)
X_train_counts.shape 
# (2, 5)
count_vect.vocabulary_
# {'am': 0, 'here': 1, 'is': 2, 'sentence': 3, 'two': 4}
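
For anyone who wants to reproduce both behaviours without the original file, here is a minimal sketch. It assumes the two sample rows from the question, loaded from an in-memory string (csv_text is a hypothetical stand-in for myfile.csv), and the default CountVectorizer settings, which drop one-character tokens such as "I" and "1".

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical in-memory stand-in for myfile.csv (two sample rows from the question).
csv_text = (
    '"Text";"label"\n'
    '"Here is sentence 1";"label1"\n'
    '"I am sentence two";"label2"\n'
)
data = pd.read_csv(io.StringIO(csv_text), sep=";")
target = data.pop("label")  # set the labels aside, as the question does with del

# Iterating over a DataFrame yields its column labels, not its rows.
docs_seen = list(data)

# Fitting on the DataFrame therefore sees one "document": the string 'Text'.
wrong_shape = CountVectorizer().fit_transform(data).shape

# Fitting on the column gives one document per row.
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data["Text"])
right_shape = X_train_counts.shape

print(docs_seen)    # ['Text']
print(wrong_shape)  # (1, 1)
print(right_shape)  # (2, 5)
```

The same column can be written as data.Text or data["Text"]; the bracket form also works when a column name is not a valid Python identifier.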


Source: https://stackoverflow.com/questions/44083683/countvectorizer-with-pandas-dataframe
