How to get Top N predictions using sklearn's SGDClassifier

Submitted by 半腔热情 on 2019-12-11 04:13:51

Question


I am trying to set up a simple text classification task with scikit-learn's SGDClassifier and want to get the top N predictions back, including their probabilities. As sample training data I have three classes:

  • apples
  • lemons
  • oranges

with one document per class:

  • in apples: 'apple and lemon'
  • in lemons: 'lemon and orange'
  • in oranges: 'orange and apple'

I now want to classify the three test docs 'apple', 'lemon' and 'orange' and get the top 2 predictions per document, including their probabilities. My code so far looks like this:

from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
import numpy as np

train = load_files('data/test/')

text_clf_svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                     ('clf-svm', SGDClassifier(loss='modified_huber', penalty='l2',alpha=1e-3, n_iter=5, random_state=42))])
text_clf_svm = text_clf_svm.fit(train.data, train.target)

docs=['apple', 'orange', 'lemon']
predicted = text_clf_svm.predict(docs)
#Perform a Top 1 prediction
for doc, category in zip(docs, predicted):
    print('%r => %s' % (doc, train.target_names[category]))

# Perform a top 2 prediction
print(np.argsort(text_clf_svm.predict_proba(docs), axis=1)[-2:])

My output is as follows:

'apple' => apples
'orange' => lemons
'lemon' => lemons
[[1 2 0]
[0 1 2]]

I am having difficulty interpreting this output. What I actually want is:

'apple' => apples (0.54...), lemons (0.43...)
'orange' => apples (0.48...), oranges (0.43...)
'lemon' => lemons (0.48...), oranges (0.43...)

Can somebody tell me how I can do this? Thank you in advance for your help!


Answer 1:


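np.argsort returns the indices that would sort the array, not the sorted values, so use those indices to look up both the class and its probability. Note that the slice has to be applied to the columns ([:, -2:]); a plain [-2:] selects the last two rows (documents) instead:

preds = text_clf_svm.predict_proba(docs)
# Indices of the two most probable classes per document (ascending probability)
preds_idx = np.argsort(preds, axis=1)[:, -2:]

for i, d in enumerate(docs):
    print(d, "=>")
    for p in preds_idx[i]:
        # classes_ holds the integer targets; use train.target_names[p] for the folder names
        print(text_clf_svm.classes_[p], "(", preds[i][p], ")")

Reformat the print calls to match your preferred output style.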
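
To see how a slice placed after np.argsort behaves, here is a minimal sketch on a small probability matrix. The values in probs are made up for illustration and are not the classifier's actual output; only NumPy is assumed.

```python
import numpy as np

# Hypothetical probabilities: 3 documents (rows) x 3 classes (columns).
probs = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.1, 0.7],
    [0.1, 0.5, 0.4],
])

# Per-row class indices, ordered from lowest to highest probability.
order = np.argsort(probs, axis=1)

# Slicing without a leading ':' acts on rows: this keeps the last two
# documents and all three columns, which explains a 2 x 3 result.
last_two_rows = order[-2:]

# Slicing the second axis keeps the two most probable classes per document.
top2_per_doc = order[:, -2:]

# Reverse each row so the most probable class comes first.
top2_desc = top2_per_doc[:, ::-1]

print(last_two_rows.shape)
print(top2_per_doc.shape)
print(top2_desc)
```

The 2 x 3 array printed in the question has the same shape as last_two_rows here: two documents, each with all three class indices.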




Answer 2:


A quick add-on to @Imtinan's answer: that approach lists the labels in ascending order of probability (the second most probable first, then the most probable). If you want descending order instead, negate the probabilities and take the first two columns of each row:

preds_idx = np.argsort(-preds, axis=1)[:, :2]
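
Putting both answers together, one possible way to get the output format the question asks for is a small helper. This is a sketch, not part of scikit-learn: top_n_predictions is a hypothetical name, and the probs matrix is made-up data standing in for text_clf_svm.predict_proba(docs). It assumes the columns of the probability matrix line up with class_names, which holds for data from load_files because its integer targets index train.target_names.

```python
import numpy as np

def top_n_predictions(probs, class_names, n=2):
    """Return, for each row of probs, the n most probable
    (class name, probability) pairs, most probable first."""
    probs = np.asarray(probs)
    # Negating the values sorts in descending order; [:, :n] keeps the
    # first n columns of every row.
    top_idx = np.argsort(-probs, axis=1)[:, :n]
    return [
        [(class_names[j], float(row[j])) for j in idx]
        for row, idx in zip(probs, top_idx)
    ]

# Hypothetical values standing in for text_clf_svm.predict_proba(docs)
# and train.target_names.
class_names = ['apples', 'lemons', 'oranges']
docs = ['apple', 'orange', 'lemon']
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.1, 0.7],
    [0.1, 0.5, 0.4],
]

results = top_n_predictions(probs, class_names, n=2)
for doc, top in zip(docs, results):
    print('%r => %s' % (doc, ', '.join('%s (%.2f)' % pair for pair in top)))
```

With the real pipeline, the call would presumably look like top_n_predictions(text_clf_svm.predict_proba(docs), train.target_names, n=2).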



Source: https://stackoverflow.com/questions/52698815/how-to-get-top-n-predictions-using-sklearns-sgdclassifier
