Question
I am trying out a multilabel classification problem. My data looks like this:
DocID Content Tags
1 some text here... [70]
2 some text here... [59]
3 some text here... [183]
4 some text here... [173]
5 some text here... [71]
6 some text here... [98]
7 some text here... [211]
8 some text here... [188]
.     .............     .....
Here is my code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

traindf = pd.read_csv("mul.csv")
print "This is what our training data looks like:"
print traindf

# TF-IDF features from the document text
t = TfidfVectorizer()
X = traindf["Content"]
y = traindf["Tags"]
print "Original Content"
print X
X = t.fit_transform(X)
print "Content After transformation"
print X

# Binarize the tag lists into an indicator matrix (one column per tag)
print "Original Tags"
print y
y = MultiLabelBinarizer().fit_transform(y)
print "Tags After transformation"
print y

print "Features extracted:"
print t.get_feature_names()
print "Scores of features extracted"
idf = t.idf_
print dict(zip(t.get_feature_names(), idf))

print "Splitting into training and validation sets..."
Xtrain, Xvalidate, ytrain, yvalidate = train_test_split(X, y, test_size=.5)
print "Training Set Content and Tags"
print Xtrain
print ytrain
print "Validation Set Content and Tags"
print Xvalidate
print yvalidate

# One-vs-rest: fits one binary logistic regression per tag
print "Creating classifier"
clf = OneVsRestClassifier(LogisticRegression(penalty='l2', C=0.01))
clf.fit(Xtrain, ytrain)
predictions = clf.predict(Xvalidate)
print "Predicted Tags are:"
print predictions
print "Correct Tags on Validation Set are :"
print yvalidate
print "Accuracy on validation set: %.3f" % clf.score(Xvalidate, yvalidate)
The code runs fine, but I keep getting these messages:
X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 288 is present in all training examples.
str(classes[c]))
X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 304 is present in all training examples.
str(classes[c]))
X:\Anaconda2\lib\site-packages\sklearn\multiclass.py:70: UserWarning: Label not 340 is present in all training examples.
What does this mean? Does it show that my data is not diverse enough?
Answer 1:
Some data mining algorithms have problems when some items are present in all or many records. This is, for example, an issue when doing association rule mining with the Apriori algorithm.
Whether it is a problem or not depends on the classifier. I don't know the particular classifier you're using, but here's an example of when it could matter: fitting a decision tree with a maximum depth.
Say you are fitting a decision tree with a maximum depth using Hunt's algorithm and the Gini index to determine the best split (see here for an explanation, slide 35 onwards). A first split could be on whether or not the record has label 288. If every record has this label, the Gini index will be optimal for such a split. This means the first several splits are useless, because you are not actually partitioning the training set (you split it into an empty set, without 288, and the set itself, with 288). So the first several levels of the tree are wasted. If you then set a maximum depth, this can result in a low-accuracy decision tree.
In any case, the warning you get is not a problem with your code; at most it points to a quirk of your data set. You should check whether the classifier you're using is sensitive to this kind of thing. If it is, it may give better results when you filter out the labels that occur everywhere.
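One way to act on that advice is to drop label columns that are constant (all 0s or all 1s) in the binarized training matrix before fitting, since those one-vs-rest subproblems are degenerate either way. A sketch with made-up data; the helper name is an assumption, not part of scikit-learn:

```python
import numpy as np

def informative_label_columns(Y):
    """Indices of label columns containing both 0s and 1s (hypothetical helper)."""
    col_sums = Y.sum(axis=0)
    return np.flatnonzero((col_sums > 0) & (col_sums < Y.shape[0]))

Y = np.array([[1, 1, 0],
              [1, 0, 0],
              [1, 1, 1]])  # column 0 is all ones -> degenerate

keep = informative_label_columns(Y)
print(keep)               # → [1 2]
Y_filtered = Y[:, keep]   # shape (3, 2)
```

Remember to keep `keep` around so predictions can be mapped back to the original tag indices.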
Source: https://stackoverflow.com/questions/34342122/python-sklearn-multilabel-classification-userwarning-label-not-226-is-present