How can accuracy differ between one_hot_encode and count_vectorizer for the same dataset?

Submitted by 荒凉一梦 on 2019-12-13 03:30:43

Question


onehot_enc, BernoulliNB:

Here, I have used two separate files for the reviews and the labels, and I've used train_test_split to randomly split the data into 80% train data and 20% test data.

reviews.txt:

Colors & clarity is superb
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung
The picture is clear and beautiful
Picture is not clear

labels.txt:

positive
negative
positive
negative

My Code:

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix

with open("/Users/abc/reviews.txt") as f:
    reviews = f.read().split("\n")
with open("/Users/abc/labels.txt") as f:
    labels = f.read().split("\n")

reviews_tokens = [review.split() for review in reviews]

onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)


X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=1)


bnbc = BernoulliNB(binarize=None)
bnbc.fit(onehot_enc.transform(X_train), y_train)

score = bnbc.score(onehot_enc.transform(X_test), y_test)
print("score of Naive Bayes algo is :", score)  # 90%

predicted_y = bnbc.predict(onehot_enc.transform(X_test))
tn, fp, fn, tp = confusion_matrix(y_test, predicted_y).ravel()
precision_score = tp / (tp + fp)
recall_score = tp / (tp + fn)

print("precision_score :", precision_score)  # 92%
print("recall_score :", recall_score)  # 97%

CountVectorizer, MultinomialNB:

Here, I've manually split the same data into train (80%) and test (20%) sets, and I'm supplying these two CSV files to the algorithm.

But this gives lower accuracy than the method above. Can anyone help me understand why?

train_data.csv:

review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative

test_data.csv:

review,label
The picture is clear and beautiful,positive
Picture is not clear,negative

My Code:

from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score


def load_data(filename):
    reviews = list()
    labels = list()
    with open(filename) as file:
        file.readline()  # skip the header row
        for line in file:
            line = line.strip().split(',')
            labels.append(line[1])
            reviews.append(line[0])

    return reviews, labels

X_train, y_train = load_data('/Users/abc/Sep_10/train_data.csv')
X_test, y_test = load_data('/Users/abc/Sep_10/test_data.csv')

vec = CountVectorizer() 

X_train_transformed =  vec.fit_transform(X_train) 

X_test_transformed = vec.transform(X_test)

clf= MultinomialNB()
clf.fit(X_train_transformed, y_train)

score = clf.score(X_test_transformed, y_test)
print("score of Naive Bayes algo is :", score)  # 46%

y_pred = clf.predict(X_test_transformed)
print(confusion_matrix(y_test,y_pred))

print("Precision Score : ", precision_score(y_test, y_pred, average='micro'))  # 46%
print("Recall Score : ", recall_score(y_test, y_pred, average='micro'))  # 46%

Answer 1:


The issue here is that you fit the MultiLabelBinarizer:

onehot_enc.fit(reviews_tokens)

before splitting into train and test sets, so information about the test data leaks into the model, which inflates the accuracy.

CountVectorizer, on the other hand, sees only the training data, and at transform time it ignores the words that don't appear in the training vocabulary, even though those words may be valuable to the model for classification.
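A minimal sketch of that transform-time behavior, using one review from the question as a stand-in training set: words that were never seen during fit simply produce zeros.

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
# Vocabulary is learned from the training text only ("&" is dropped by the
# default token pattern, which requires 2+ word characters).
vec.fit(["Colors & clarity is superb"])

# "picture" and "clear" never appeared during fit, so they are ignored;
# only "is" lands in the count vector.
row = vec.transform(["the picture is clear"]).toarray()[0]
print(sorted(vec.vocabulary_))  # ['clarity', 'colors', 'is', 'superb']
print(row)                      # [0 0 1 0]
```

This silent dropping of out-of-vocabulary words is exactly why fitting on the full dataset (train + test) looks better on paper but leaks information.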

So depending on the quantity of your data, this can make a huge difference. In any case, your second technique (using CountVectorizer) is the correct one and should be used for text data. MultiLabelBinarizer, and one-hot encoding in general, should only be used for categorical data, not text data.
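To make the first approach leak-free, the binarizer has to be fit on the training split only. A sketch under that assumption, reusing the four reviews from the question (the tiny 50/50 split is purely illustrative; unseen test tokens trigger an "unknown class(es)" warning from scikit-learn and are ignored, mirroring CountVectorizer's behavior):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.naive_bayes import BernoulliNB

reviews_tokens = [
    "Colors & clarity is superb".split(),
    "Sadly the picture is not nearly as clear or bright as my 40 inch Samsung".split(),
    "The picture is clear and beautiful".split(),
    "Picture is not clear".split(),
]
labels = ["positive", "negative", "positive", "negative"]

X_train, X_test, y_train, y_test = train_test_split(
    reviews_tokens, labels, test_size=0.5, random_state=1)

# Fit on the training split only -- no test information leaks into the
# vocabulary, so the reported score is an honest estimate.
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(X_train)

clf = BernoulliNB(binarize=None)
clf.fit(onehot_enc.transform(X_train), y_train)
print("leak-free score:", clf.score(onehot_enc.transform(X_test), y_test))
```

With a dataset this small the score itself is meaningless; the point is only where the fit happens relative to the split.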

Can you share your complete data?



Source: https://stackoverflow.com/questions/52273090/how-can-accuracy-differs-between-one-hot-encode-and-count-vectorizer-for-the-sam
