Question
I'm doing some work on gender classification for a class. I've been using SVMLight with decent results, but I wanted to try some Bayesian methods on my data as well. My dataset consists of text data, and I've done feature reduction to pare the feature space down to a more reasonable size for some of the Bayesian methods. All of the instances are run through tf-idf and then normalized (through my own code).
I grabbed the sklearn toolkit because it was easy to integrate with my current codebase, but the results I'm getting from GaussianNB are all of one class (-1 in this case), and the predicted probabilities are all [nan].
I've pasted some relevant code below; I don't know if this is enough to go on, but I'm hoping I'm just overlooking something obvious in using the sklearn API. I've tried pushing a couple of different feature sets through it, with the same results, and the same thing happens whether I test on the training set or use cross-validation. Any thoughts? Could it be that my feature space is simply too sparse for this to work? I have 300-odd instances, most of which have several hundred non-zero features.
import numpy as np
from sklearn.naive_bayes import GaussianNB

class GNBLearner(BaseLearner):  # BaseLearner is defined elsewhere in my codebase
    def __init__(self, featureCount):
        self.gnb = GaussianNB()
        self.featureCount = featureCount

    def train(self, instances, params):
        # Build a dense matrix from each instance's 1-based (index, value) pairs.
        X = np.zeros((len(instances), self.featureCount))
        Y = [0] * len(instances)
        for i, inst in enumerate(instances):
            for idx, val in inst.data:
                X[i, idx - 1] = val
            Y[i] = inst.c
        self.gnb.fit(X, Y)

    def test(self, instances, params):
        X = np.zeros((len(instances), self.featureCount))
        for i, inst in enumerate(instances):
            for idx, val in inst.data:
                X[i, idx - 1] = val
        return self.gnb.predict(X)

    def conf_mtx(self, res, test_set):
        # 2x2 confusion matrix for the -1/+1 class labels.
        conf = [[0, 0], [0, 0]]
        for r, x in zip(res, test_set):  # was `xzip`, which doesn't exist
            print("pred: %d, act: %d" % (r, x.c))
            conf[(x.c + 1) // 2][(r + 1) // 2] += 1
        return conf
Answer 1:
GaussianNB is not a good fit for document classification at all, since tf-idf values are non-negative frequencies; use MultinomialNB instead, and maybe try BernoulliNB. scikit-learn comes with a document classification example that, incidentally, uses tf-idf weighting via the built-in TfidfTransformer.
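For example, here's a minimal sketch of that swap (the random arrays below are just stand-ins for your own tf-idf matrix and -1/+1 labels; MultinomialNB requires non-negative features, which tf-idf gives you):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Stand-in data: replace with your own dense tf-idf matrix and labels.
X = np.random.rand(300, 500)             # non-negative values, like tf-idf
y = np.random.choice([-1, 1], size=300)  # binary gender labels

clf = MultinomialNB()  # alpha=1.0 (Laplace smoothing) by default
clf.fit(X, y)
print(clf.predict(X[:5]))
print(clf.predict_proba(X[:5]))  # finite probabilities, unlike the nan you saw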
Don't expect miracles, though, as 300 samples is quite small for a training set (although for binary classification, it might just be enough to beat a "most frequent" baseline). YMMV.
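If you want a concrete number for that baseline, scikit-learn's DummyClassifier produces it directly; a quick sketch with the same kind of stand-in data:

import numpy as np
from sklearn.dummy import DummyClassifier

X = np.random.rand(300, 500)
y = np.random.choice([-1, 1], size=300, p=[0.6, 0.4])  # imbalanced stand-in labels

baseline = DummyClassifier(strategy='most_frequent')  # always predicts the majority class
baseline.fit(X, y)
print(baseline.score(X, y))  # the accuracy a real classifier needs to beat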
Full disclosure: I'm one of the scikit-learn core devs and the main author of the current MultinomialNB and BernoulliNB code.
Source: https://stackoverflow.com/questions/16240721/sklearn-gaussiannb-bad-results-nan-probabilities