Question
onehot_enc, BernoulliNB:
Here I keep the reviews and the labels in two separate files, and I use train_test_split to randomly split the data into 80% training data and 20% test data.
reviews.txt:
Colors & clarity is superb
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung
The picture is clear and beautiful
Picture is not clear
labels.txt:
positive
negative
positive
negative
My Code:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
with open("/Users/abc/reviews.txt") as f:
    reviews = f.read().split("\n")
with open("/Users/abc/labels.txt") as f:
    labels = f.read().split("\n")
reviews_tokens = [review.split() for review in reviews]
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)
X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=1)
bnbc = BernoulliNB(binarize=None)
bnbc.fit(onehot_enc.transform(X_train), y_train)
score = bnbc.score(onehot_enc.transform(X_test), y_test)
print("score of Naive Bayes algo is :", score)  # 90%
predicted_y = bnbc.predict(onehot_enc.transform(X_test))
tn, fp, fn, tp = confusion_matrix(y_test, predicted_y).ravel()
precision_score = tp / (tp + fp)
recall_score = tp / (tp + fn)
print("precision_score :", precision_score)  # 92%
print("recall_score :", recall_score)  # 97%
CountVectorizer, MultinomialNB:
Here I manually split the same data into train (80%) and test (20%) sets, and I supply these two CSV files to the algorithm.
However, this gives lower accuracy than the method above. Can anyone help me understand why?
train_data.csv:
review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative
test_data.csv:
review,label
The picture is clear and beautiful,positive
Picture is not clear,negative
My Code:
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
def load_data(filename):
    reviews = list()
    labels = list()
    with open(filename) as file:
        file.readline()  # skip the header row
        for line in file:
            line = line.strip().split(',')
            labels.append(line[1])
            reviews.append(line[0])
    return reviews, labels
X_train, y_train = load_data('/Users/abc/Sep_10/train_data.csv')
X_test, y_test = load_data('/Users/abc/Sep_10/test_data.csv')
vec = CountVectorizer()
X_train_transformed = vec.fit_transform(X_train)
X_test_transformed = vec.transform(X_test)
clf= MultinomialNB()
clf.fit(X_train_transformed, y_train)
score = clf.score(X_test_transformed, y_test)
print("score of Naive Bayes algo is :", score)  # 46%
y_pred = clf.predict(X_test_transformed)
print(confusion_matrix(y_test,y_pred))
print("Precision Score : ", precision_score(y_test, y_pred, average='micro'))  # 46%
print("Recall Score : ", recall_score(y_test, y_pred, average='micro'))  # 46%
Answer 1:
The issue here is that you fit MultiLabelBinarizer on the full dataset:
onehot_enc.fit(reviews_tokens)
before splitting into train and test, so information from the test data leaks into the model, which explains the higher accuracy.
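To make the leak concrete, here is a minimal sketch that compares an encoder fitted on everything with one fitted on the training part only. It reuses the four sample reviews from the question. The fixed two/two split and the names leaky_enc and clean_enc are assumptions made for illustration, not part of the original code. In recent scikit-learn versions, transform() warns about unknown tokens and ignores them, which is why the warning is silenced here.

```python
import warnings

from sklearn.preprocessing import MultiLabelBinarizer

reviews = [
    "Colors & clarity is superb",
    "Sadly the picture is not nearly as clear or bright as my 40 inch Samsung",
    "The picture is clear and beautiful",
    "Picture is not clear",
]
tokens = [review.split() for review in reviews]

# Hypothetical fixed split for illustration: first two reviews train, last two test.
train_tokens, test_tokens = tokens[:2], tokens[2:]

# Fitted on all reviews, as in the question: the vocabulary includes test-only tokens.
leaky_enc = MultiLabelBinarizer().fit(tokens)
# Fitted on the training reviews only: the vocabulary is limited to what training saw.
clean_enc = MultiLabelBinarizer().fit(train_tokens)

leaked_tokens = sorted(str(t) for t in set(leaky_enc.classes_) - set(clean_enc.classes_))
print("tokens known only because of the test set:", leaked_tokens)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # transform() warns about tokens unseen during fit
    X_test_clean = clean_enc.transform(test_tokens)
print("test matrix shape with the train-only encoder:", X_test_clean.shape)
```

Note that str.split() is case-sensitive, so "The" and "Picture" count as different tokens from "the" and "picture".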
On the other hand, when you use CountVectorizer, it only sees the training data and ignores words that do not appear there, even though those words may be valuable to the model for classification.
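A small sketch of this behaviour, using the two-row train and test samples from the question: it fits CountVectorizer on the training reviews and lists which test words fall outside the learned vocabulary. The variable names are illustrative assumptions. The default CountVectorizer lowercases text and keeps only tokens of two or more word characters.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_reviews = [
    "Colors & clarity is superb",
    "Sadly the picture is not nearly as clear or bright as my 40 inch Samsung",
]
test_reviews = [
    "The picture is clear and beautiful",
    "Picture is not clear",
]

vec = CountVectorizer()
vec.fit(train_reviews)  # the vocabulary comes from the training reviews only
vocabulary = set(vec.vocabulary_)

# Reuse the vectorizer's own preprocessing and tokenization to inspect the test text.
analyzer = vec.build_analyzer()
dropped_per_review = []
for review in test_reviews:
    dropped = [token for token in analyzer(review) if token not in vocabulary]
    dropped_per_review.append(dropped)
    print(repr(review), "-> ignored at transform time:", dropped)
```

With more training data the vocabulary grows and fewer test words are dropped, which is one reason the gap between the two approaches depends on dataset size.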
So, depending on how much data you have, this can make a huge difference. In any case, your second technique (using CountVectorizer) is the correct one and should be used for text data. MultiLabelBinarizer, and one-hot encoding in general, should be used only for categorical data, not text data.
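One way to keep the vectorizer from ever seeing test data is to wrap it and the classifier in a single scikit-learn pipeline, so that fit() learns the vocabulary from the training reviews only. The following is a sketch under that assumption, using the sample rows from the question. With only two training rows the predictions themselves are not meaningful.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

X_train = [
    "Colors & clarity is superb",
    "Sadly the picture is not nearly as clear or bright as my 40 inch Samsung",
]
y_train = ["positive", "negative"]
X_test = [
    "The picture is clear and beautiful",
    "Picture is not clear",
]
y_test = ["positive", "negative"]

# fit() runs CountVectorizer.fit_transform on X_train, then trains MultinomialNB.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)

# predict() and score() only call CountVectorizer.transform on X_test,
# so the vocabulary is never updated with test words.
predictions = model.predict(X_test)
print("predictions:", list(predictions))
print("accuracy:", model.score(X_test, y_test))
```

The same pipeline object can be passed to cross_val_score, in which case the vocabulary is relearned inside each training fold.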
Can you share your complete data?
Source: https://stackoverflow.com/questions/52273090/how-can-accuracy-differs-between-one-hot-encode-and-count-vectorizer-for-the-sam