Why did NLTK NaiveBayes classifier misclassify one record?

Submitted by 孤街浪徒 on 2019-12-02 13:32:11

Question


This is the first time I am building a sentiment analysis machine learning model using the nltk NaiveBayesClassifier in Python. I know it is too simple of a model, but it is just a first step for me and I will try tokenized sentences next time.

The real issue I have with my current model is: I have clearly labeled the word 'bad' as negative in the training data set (as you can see from the 'negative_vocab' variable). However, when I ran the NaiveBayesClassifier on each sentence (lower case) in the list ['awesome movie', ' i like it', ' it is so bad'], the classifier mistakenly labeled 'it is so bad' as positive.

INPUT:

from nltk.classify import NaiveBayesClassifier

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]

train_set = negative_features_1 + positive_features_1 + neutral_features_1

classifier = NaiveBayesClassifier.train(train_set) 

# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad"
sentence = sentence.lower()
words = sentence.split('.')

def word_feat(word):
    return dict([(word,True)])
# Note: the word_feat(word) function defined here differs from the word_feats(words) function defined earlier. It is applied to each of the three elements in the list ['awesome movie', ' i like it', ' it is so bad'].

for word in words:
    classResult = classifier.classify(word_feat(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1
    print(str(word) + ' is ' + str(classResult))
    print() 

OUTPUT:

awesome movie is pos
 i like it is pos
 it is so bad is pos

To make sure the function word_feat(word) is applied to each sentence instead of each word or letter, I ran some diagnostic code to see what each call to word_feat(word) returns:

for word in words:
    print(word_feat(word))

And it printed out:

{'awesome movie': True}
{' i like it': True}
{' it is so bad': True} 

So it seems like the function 'word_feat(word)' is correct?

Does anyone know why the classifier classified 'It is so bad' as positive? As mentioned before, I had clearly labeled the word 'bad' as negative in my training data.


Answer 1:


Here is the modified code for you

from nltk.classify import NaiveBayesClassifier
from nltk.corpus import stopwords

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)' ]
negative_vocab = [ 'bad', 'terrible','useless', 'hate', ':(' ]
neutral_vocab = [ 'movie','the','sound','was','is','actors','did','know','words','not','it','so','really' ]

def word_feats(words):
    return dict([(word, True) for word in words])

positive_features_1 = [(word_feats(positive_vocab), 'pos')]
negative_features_1 = [(word_feats(negative_vocab), 'neg')]
neutral_features_1 = [(word_feats(neutral_vocab), 'neu')]

train_set = negative_features_1 + positive_features_1 + neutral_features_1

classifier = NaiveBayesClassifier.train(train_set) 

# Predict
neg = 0
pos = 0
sentence = "Awesome movie. I like it. It is so bad."
sentence = sentence.lower()
sentences = sentence.split('.')   # these are actually list of sentences

for sent in sentences:
    if sent != "":
        words = [word for word in sent.split(" ") if word not in stopwords.words('english')]
        classResult = classifier.classify(word_feats(words))
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
        print(str(sent) + ' --> ' + str(classResult))
        print()

I changed the part where you were treating a 'list of words' as a single input to your classifier. In fact you need to classify one sentence at a time, which means iterating over a 'list of sentences'.

Also, for each sentence you need to pass its words as features, which means splitting the sentence on whitespace.

Also, if you want your classifier to work properly for sentiment analysis, you should give less weight to stop words like 'it', 'they', and 'is', since such words are not sufficient to decide whether a sentence is positive, negative, or neutral.
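Note that stopwords.words('english') requires the stop-word corpus to be downloaded once; this is an assumption about your environment, so skip the download if the corpus is already installed:

import nltk
nltk.download('stopwords')  # one-time download of the stop-word lists

from nltk.corpus import stopwords
print(stopwords.words('english')[:5])  # e.g. ['i', 'me', 'my', 'myself', 'we']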

The above code gives the output below:

awesome movie --> pos
 i like it --> pos
 it is so bad --> neg

For any classifier, the input format used for training and the one used for prediction should be the same. Since you train on a list of words converted to a feature dict, use the same conversion for your test set as well.
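As a minimal sketch of that last point (hypothetical two-sentence training data, the same word_feats() as above), the identical feature extraction is applied at training and at prediction time:

from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return dict([(word, True) for word in words])

# Train and test sentences go through the exact same conversion
train_set = [(word_feats('awesome movie'.split()), 'pos'),
             (word_feats('it is so bad'.split()), 'neg')]
classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(word_feats('so bad'.split())))  # expected: neg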




Answer 2:


This particular failure is because your word_feats() function expects a list of words (a tokenized sentence), but you pass it each word separately... so word_feats() iterates over its letters. You've built a classifier that classifies strings as positive or negative on the basis of the letters they contain.
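You can see this letter-by-letter behaviour by calling the function on a bare string; iterating over a string in Python yields its characters:

def word_feats(words):
    return dict([(word, True) for word in words])

print(word_feats('bad'))  # {'b': True, 'a': True, 'd': True}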

You're probably in this predicament because you pay no attention to what you name your variables. In your main loop, none of the variables sentence, words, or word contain what their name claims. To understand and improve your program, start by naming things properly.

Bugs aside, this is not how you build a sentiment classifier. The training data should be a list of tokenized sentences (each labeled with its sentiment), not a list of individual words. Similarly, you classify tokenized sentences.




Answer 3:


Let me show a rewriting of your code. All I changed near the top was adding import re, as it is easier to tokenize with regexes. Everything else up to defining classifier is the same as your code.

I added one more test case (something really, really negative), but more importantly I used proper variable names - then it is much harder to get confused about what is going on:

test_data = "Awesome movie. I like it. It is so bad. I hate this terrible useless movie."
sentences = test_data.lower().split('.')

So sentences now contains the four sentences, plus a trailing empty string from the final full stop (skipped in the loop below). I left your word_feat() function unchanged.

For using the classifier I did quite a big rewrite:

for sentence in sentences:
    if len(sentence) == 0:
        continue
    neg = 0
    pos = 0
    for word in re.findall(r"[\w']+", sentence):
        classResult = classifier.classify(word_feat(word))
        print(word, classResult)
        if classResult == 'neg':
            neg = neg + 1
        if classResult == 'pos':
            pos = pos + 1
    print("\n%s: %d vs -%d\n" % (sentence, pos, neg))

The outer loop again uses a descriptive name, so sentence holds one sentence at a time.

I then have an inner loop where we classify each word in the sentence; I am using a regex to split the sentence up on whitespace and punctuation marks:

 for word in re.findall(r"[\w']+", sentence):
     classResult = classifier.classify(word_feat(word))

The rest is just basic adding up and reporting. I get this output:

awesome pos
movie neu

awesome movie: 1 vs -0

i pos
like pos
it pos

 i like it: 3 vs -0

it pos
is neu
so pos
bad neg

 it is so bad: 2 vs -1

i pos
hate neg
this pos
terrible neg
useless neg
movie neu

 i hate this terrible useless movie: 2 vs -3

I still get the same as you - "it is so bad" is considered positive. And with the extra debug lines we can see it is because "it" and "so" are considered positive words, and "bad" is the only negative word, so overall it is positive.

I suspect this is because it hadn't seen those words in its training data.

...yes, if I add "it" and "so" to the list of neutral words, I get "it is so bad: 0 vs -1".

As next things to try, I'd suggest:

  • Try with more training data; toy examples like this carry the risk that the noise will swamp the signal. (A sketch follows this list.)
  • Look into removing stop words.
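For the first suggestion, here is a minimal sketch using the movie_reviews corpus bundled with NLTK (an assumption about your setup: the corpus must be fetched once with nltk.download('movie_reviews'); word_feats() is the same function as in the question):

from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    return dict([(word, True) for word in words])

# One labeled feature set per review document (about 2000 in total)
train_set = [(word_feats(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(word_feats(['it', 'is', 'so', 'bad'])))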



Answer 4:


You can try this code

from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    return dict([(word, True) for word in words])

positive_vocab = [ 'awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)', 'love' ]
negative_vocab = [ 'bad', 'terrible', 'useless', 'hate', ':(', 'kill', 'steal' ]
neutral_vocab = [ 'movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not' ]

# Each vocabulary entry is a single string, so word_feats() iterates
# over its characters: this classifier is trained on letter features.
positive_features = [(word_feats(pos), 'pos') for pos in positive_vocab]
negative_features = [(word_feats(neg), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats(neu), 'neu') for neu in neutral_vocab]

train_set = negative_features + positive_features + neutral_features

classifier = NaiveBayesClassifier.train(train_set)

# Predict
neg = 0
pos = 0

sentence = " Awesome movie, I like it :)"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
    # word is a single string, so word_feats() again yields letter features
    classResult = classifier.classify(word_feats(word))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1

print('Positive: ' + str(float(pos)/len(words)))
print('Negative: ' + str(float(neg)/len(words)))

The results are:

Positive: 0.7142857142857143
Negative: 0.14285714285714285



Source: https://stackoverflow.com/questions/48335460/why-did-nltk-naivebayes-classifier-misclassify-one-record
