I am trying to set up a simple text classification task with scikit-learn's SGDClassifier and want to get the top N predictions back, including their probabilities. As sample training data I have three classes:
- apples
- lemons
- oranges
with one document per class:
- in apples: 'apple and lemon'
- in lemons: 'lemon and orange'
- in oranges: 'orange and apple'
I now want to classify the three test documents 'apple', 'lemon' and 'orange' and get the top 2 predictions per document, including their probabilities. My code so far looks like this:
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
import numpy as np
train = load_files('data/test/')
text_clf_svm = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()),
                         ('clf-svm', SGDClassifier(loss='modified_huber', penalty='l2', alpha=1e-3, n_iter=5, random_state=42))])
text_clf_svm = text_clf_svm.fit(train.data, train.target)
docs=['apple', 'orange', 'lemon']
predicted = text_clf_svm.predict(docs)
#Perform a Top 1 prediction
for doc, category in zip(docs, predicted):
    print('%r => %s' % (doc, train.target_names[category]))
# Perform a top 2 prediction
print(np.argsort(text_clf_svm.predict_proba(docs), axis=1)[-2:])
My output is as follows:
'apple' => apples
'orange' => lemons
'lemon' => lemons
[[1 2 0]
[0 1 2]]
I now have difficulties interpreting the data. What I actually want to get out is:
'apple' => apples (0.54...), lemons (0.43...)
'orange' => apples (0.48...), oranges (0.43...)
'lemon' => lemons (0.48...), oranges (0.43...)
Can somebody tell me how I can do this? Thank you in advance for your help!
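np.argsort returns the indices that would sort the array, not the sorted values themselves, so you have to use those indices to look up both the class names and the probabilities. Note also that [-2:] selects the last two rows (documents); to get the last two columns (classes) of every row you need [:, -2:]:
preds = text_clf_svm.predict_proba(docs)
# indices of the two most probable classes per document (ascending order)
preds_idx = np.argsort(preds, axis=1)[:, -2:]
for i, d in enumerate(docs):
    print(d, "=>")
    for p in preds_idx[i]:
        # class name and its probability for document i
        print(train.target_names[p], "(", preds[i][p], ")")
Just reformat the print to your style and you will have what you want :)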
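To see why the slice matters, here is a minimal sketch on a hand-made probability array. The values in `probs` are hypothetical stand-ins for what `predict_proba` might return (one row per document, one column per class); they are not taken from the actual model.

```python
import numpy as np

# Hypothetical probabilities: 3 documents (rows) x 3 classes (columns).
probs = np.array([
    [0.54, 0.43, 0.03],
    [0.48, 0.09, 0.43],
    [0.09, 0.48, 0.43],
])

# Per row, the column indices ordered from least to most probable.
order = np.argsort(probs, axis=1)

last_two_rows = order[-2:]     # slices ROWS: full orderings of the last two documents
top2_per_row = order[:, -2:]   # slices COLUMNS: two most probable classes of every document

print(last_two_rows.shape)
print(top2_per_row.shape)
print(top2_per_row)
```

The first slice yields a 2 x 3 array (which is the shape of the output in the question), while the second yields one pair of class indices per document.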
A quick add-on to @Imtinan's answer: that answer orders the labels from 2nd most probable to most probable (ascending order). If you want them in descending order instead, negate the probabilities before sorting and take the first two columns of every row:
preds_idx = np.argsort(-preds, axis=1)[:, :2]
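Putting the pieces together, the descending top-N lookup could be wrapped in a small helper. This is only a sketch: `top_k` and `class_names` are names made up for illustration, and the `probs` values are hypothetical placeholders rather than real classifier output. With the pipeline from the question you would presumably pass `text_clf_svm.predict_proba(docs)` and `train.target_names` instead.

```python
import numpy as np

def top_k(probs, class_names, k=2):
    """For each row of probs, return the k most probable (name, probability) pairs, best first."""
    probs = np.asarray(probs)
    # Negating sorts in descending order; keep the first k columns of every row.
    idx = np.argsort(-probs, axis=1)[:, :k]
    # Pick the probabilities that belong to those column indices.
    top = np.take_along_axis(probs, idx, axis=1)
    return [[(class_names[j], float(p)) for j, p in zip(row_idx, row_p)]
            for row_idx, row_p in zip(idx, top)]

# Hypothetical inputs, for illustration only.
class_names = ['apples', 'lemons', 'oranges']
docs = ['apple', 'orange', 'lemon']
probs = [[0.54, 0.43, 0.03],
         [0.48, 0.09, 0.43],
         [0.09, 0.48, 0.43]]

result = top_k(probs, class_names, k=2)
for doc, pairs in zip(docs, result):
    print('%r => %s' % (doc, ', '.join('%s (%.2f)' % pair for pair in pairs)))
```

Using `np.take_along_axis` keeps the probabilities aligned with the sorted indices without an explicit double loop; it requires NumPy 1.15 or newer.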