Question
I'm doing some work on gender classification for a class. I've been using SVMLight with decent results, but I wanted to try some Bayesian methods on my data as well. My dataset consists of text data, and I've done feature reduction to pare the feature space down to a more reasonable size for some of the Bayesian methods. All of the instances are run through tf-idf and then normalized (through my own code).
I grabbed the sklearn toolkit because it was easy to integrate with my current codebase, but the results I'm getting from GaussianNB are all of one class (-1 in this case), and the predicted probabilities are all [nan].
I've pasted some relevant code below; I don't know if this is enough to go on, but I'm hoping I'm just overlooking something obvious in using the sklearn API. I've tried pushing a couple of different feature sets through it, with the same results, and the same thing happens whether I test on the training set or use cross-validation. Any thoughts? Could it be that my feature space is simply too sparse for this to work? I have 300-odd instances, most of which have several hundred non-zero features.
import numpy as np
from sklearn.naive_bayes import GaussianNB

class GNBLearner(BaseLearner):  # BaseLearner is defined elsewhere in my codebase
    def __init__(self, featureCount):
        self.gnb = GaussianNB()
        self.featureCount = featureCount

    def train(self, instances, params):
        # Build a dense matrix from each instance's 1-based (index, value) pairs.
        X = np.zeros((len(instances), self.featureCount))
        Y = [0] * len(instances)
        for i, inst in enumerate(instances):
            for idx, val in inst.data:
                X[i, idx - 1] = val
            Y[i] = inst.c
        self.gnb.fit(X, Y)

    def test(self, instances, params):
        X = np.zeros((len(instances), self.featureCount))
        for i, inst in enumerate(instances):
            for idx, val in inst.data:
                X[i, idx - 1] = val
        return self.gnb.predict(X)

    def conf_mtx(self, res, test_set):
        # 2x2 confusion matrix for the -1/+1 class labels.
        conf = [[0, 0], [0, 0]]
        for r, x in zip(res, test_set):  # was `xzip`, which doesn't exist
            print("pred: %d, act: %d" % (r, x.c))
            conf[(x.c + 1) // 2][(r + 1) // 2] += 1
        return conf
Answer 1:
GaussianNB is not a good fit for document classification at all, since tf-idf values are non-negative frequencies; use MultinomialNB instead, and maybe try BernoulliNB. scikit-learn comes with a document classification example that, incidentally, uses tf-idf weighting via the built-in TfidfTransformer.
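For example, here's a minimal sketch of that swap (the random arrays below are just stand-ins for your own tf-idf matrix and -1/+1 labels; MultinomialNB requires non-negative features, which tf-idf gives you):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Stand-in data: replace with your own dense tf-idf matrix and labels.
X = np.random.rand(300, 500)             # non-negative values, like tf-idf
y = np.random.choice([-1, 1], size=300)  # binary gender labels

clf = MultinomialNB()  # alpha=1.0 (Laplace smoothing) by default
clf.fit(X, y)
print(clf.predict(X[:5]))
print(clf.predict_proba(X[:5]))  # finite probabilities, unlike the nan you saw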
Don't expect miracles, though, as 300 samples is quite small for a training set (although for binary classification, it might just be enough to beat a "most frequent" baseline). YMMV.
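If you want a concrete number for that baseline, scikit-learn's DummyClassifier produces it directly; a quick sketch with the same kind of stand-in data:

import numpy as np
from sklearn.dummy import DummyClassifier

X = np.random.rand(300, 500)
y = np.random.choice([-1, 1], size=300, p=[0.6, 0.4])  # imbalanced stand-in labels

baseline = DummyClassifier(strategy='most_frequent')  # always predicts the majority class
baseline.fit(X, y)
print(baseline.score(X, y))  # the accuracy a real classifier needs to beat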
Full disclosure: I'm one of the scikit-learn core devs and the main author of the current MultinomialNB and BernoulliNB code.
Source: https://stackoverflow.com/questions/16240721/sklearn-gaussiannb-bad-results-nan-probabilities