I'm doing multiclass text classification in Scikit-Learn. A Multinomial Naive Bayes classifier is being trained on a dataset with hundreds of labels. Here's an extr
Here is another function, metrics_report_to_df(), along with example input and output. Using precision_recall_fscore_support from sklearn.metrics should do:
# Generates classification metrics using precision_recall_fscore_support:
from sklearn import metrics
import pandas as pd
import numpy as np
# Simulating true and predicted labels as test dataset:
np.random.seed(10)
y_true = np.array([0]*300 + [1]*700)
y_pred = np.random.randint(2, size=1000)
# Here's the custom function returning classification report dataframe:
def metrics_report_to_df(ytrue, ypred):
    precision, recall, fscore, support = metrics.precision_recall_fscore_support(ytrue, ypred)
    classification_report = pd.concat(map(pd.DataFrame, [precision, recall, fscore, support]), axis=1)
    classification_report.columns = ["precision", "recall", "f1-score", "support"]
    # Add an "avg/Total" row holding the weighted averages
    classification_report.loc['avg/Total', :] = metrics.precision_recall_fscore_support(ytrue, ypred, average='weighted')
    classification_report.loc['avg/Total', 'support'] = classification_report['support'].sum()
    return classification_report
# Provide input as true_label and predicted label (from classifier)
classification_report = metrics_report_to_df(y_true, y_pred)
# Here's the output (metrics report transformed to dataframe )
In [1047]: classification_report
Out[1047]:
           precision    recall  f1-score  support
0           0.300578  0.520000  0.380952    300.0
1           0.700624  0.481429  0.570703    700.0
avg/Total   0.580610  0.493000  0.513778   1000.0
I have modified @kindjacket's answer. Try this:
import collections
import pandas as pd
def classification_report_df(report):
    report_data = []
    lines = report.split('\n')
    # Drop lines at fixed positions (presumably the blank separator lines);
    # this depends on the exact layout of the text report
    del lines[-5]
    del lines[-1]
    del lines[1]
    # Skip the header line and parse each remaining row
    for line in lines[1:]:
        row = collections.OrderedDict()
        row_data = line.split()
        row_data = list(filter(None, row_data))
        # Assumes every row label consists of two words
        row['class'] = row_data[0] + " " + row_data[1]
        row['precision'] = float(row_data[2])
        row['recall'] = float(row_data[3])
        row['f1_score'] = float(row_data[4])
        row['support'] = int(row_data[5])
        report_data.append(row)
    df = pd.DataFrame.from_dict(report_data)
    df.set_index('class', inplace=True)
    return df
You can then export that df to CSV using pandas.
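As a minimal sketch of that export step: the DataFrame below is a hypothetical stand-in for what classification_report_df() returns (its values are made up for illustration), and it is written to an in-memory buffer so the snippet runs without touching the disk.

```python
import io
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by classification_report_df();
# the class names and numbers are illustrative only.
df = pd.DataFrame(
    {
        "precision": [0.5, 0.5],
        "recall": [0.5, 0.5],
        "f1_score": [0.5, 0.5],
        "support": [2, 2],
    },
    index=pd.Index(["class 1", "class 2"], name="class"),
)

# Write the CSV into an in-memory buffer; passing a file path
# (for example "report.csv") to to_csv would write to disk instead.
buffer = io.StringIO()
df.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The index is written as the first CSV column, so the class names survive the round trip.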
While the previous answers probably all work, I found them a bit verbose. The following stores the individual class results as well as the summary line in a single dataframe. It is not very robust to changes in the report format, but it did the trick for me.
#init snippet and fake data
from io import StringIO
import re
import pandas as pd
from sklearn import metrics
true_label = [1,1,2,2,3,3]
pred_label = [1,2,2,3,3,1]
def report_to_df(report):
    report = re.sub(r" +", " ", report).replace("avg / total", "avg/total").replace("\n ", "\n")
    report_df = pd.read_csv(StringIO("Classes" + report), sep=' ', index_col=0)
    return report_df
#txt report to df
report = metrics.classification_report(true_label, pred_label)
report_df = report_to_df(report)
#store, print, copy...
print(report_df)
Which gives the desired output:
Classes    precision  recall  f1-score  support
1                0.5     0.5       0.5        2
2                0.5     0.5       0.5        2
3                0.5     0.5       0.5        2
avg/total        0.5     0.5       0.5        6
As of scikit-learn v0.20, the easiest way to convert a classification report to a pandas DataFrame is to have the report returned as a dict:
report = classification_report(y_test, y_pred, output_dict=True)
and then construct a Dataframe and transpose it:
df = pandas.DataFrame(report).transpose()
From here on, you are free to use the standard pandas methods to generate your desired output formats (CSV, HTML, LaTeX, ...).
See also the documentation at https://scikit-learn.org/0.20/modules/generated/sklearn.metrics.classification_report.html
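Putting those two steps together, here is a small self-contained sketch. The toy labels are borrowed from the earlier answer, and the in-memory buffer is only an assumption made to keep the example free of file writes. The exact summary rows in the result depend on the scikit-learn version.

```python
import io
import pandas as pd
from sklearn.metrics import classification_report

# Toy labels reused from the earlier snippet in this thread
true_label = [1, 1, 2, 2, 3, 3]
pred_label = [1, 2, 2, 3, 3, 1]

# output_dict=True (scikit-learn >= 0.20) returns nested dicts instead of text
report = classification_report(true_label, pred_label, output_dict=True)

# Transpose so that each class and each summary entry becomes a row
report_df = pd.DataFrame(report).transpose()
print(report_df)

# Export to CSV; an in-memory buffer is used here, a file path works the same way
buffer = io.StringIO()
report_df.to_csv(buffer)
```

Note that the class labels become strings in the index (for example "1" rather than 1).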
I had the same problem. What I did was paste the string output of metrics.classification_report into Google Sheets or Excel and split the text into columns using a custom separator of 5 whitespaces.