I am trying to add a sklearn prediction to a pandas dataframe, so that I can make a thorough evaluation of the prediction. The relevant piece of code is the following:
Your second line, df_total["pred_lin_regr"] = clf.predict(Xtest), is correct, and it's also the more efficient option.
There you're taking the output of clf.predict(), which is a numpy array, and adding it to a dataframe as a new column. The array holds one prediction per row of Xtest, in the same order as those rows, and assigning it to a dataframe column does not change that order. The assignment is positional, so the predictions line up correctly as long as df_total has the same rows in the same order as Xtest.
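To make the positional behaviour concrete, here is a minimal sketch. The frame, its index labels, and the `preds` array are hypothetical stand-ins (not taken from the question); `preds` plays the role of clf.predict(Xtest). It contrasts assigning a plain numpy array, which is matched by position, with assigning a pandas Series, which is matched by index label:

```python
import numpy as np
import pandas as pd

# Hypothetical test frame with a non-default index,
# e.g. what is left over after a shuffle or a split.
X_test = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=[7, 2, 5])

# Stand-in for clf.predict(X_test): one value per row, in row order.
preds = np.array([1.0, 2.0, 3.0])

# Assigning a numpy array is positional: row i receives preds[i].
by_position = X_test.copy()
by_position["pred"] = preds

# Assigning a Series aligns on index labels instead. The Series here has
# the default labels 0, 1, 2, and only label 2 exists in the frame.
by_label = X_test.copy()
by_label["pred"] = pd.Series(preds)

print(by_position)
print(by_label)
```

So the raw array from predict() is safe to assign directly. The thing to watch for is wrapping it in a Series with a different index first, which would silently misalign the rows.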
Here's a little proof, taking the following portion of this example:
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model
# Load the diabetes dataset
diabetes = datasets.load_diabetes()
# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]
# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
print(regr.predict(diabetes_X_test))
df = pd.DataFrame(regr.predict(diabetes_X_test))
print(df)
The first print() function will give us a numpy array, as expected:
[ 225.9732401 115.74763374 163.27610621 114.73638965 120.80385422
158.21988574 236.08568105 121.81509832 99.56772822 123.83758651
204.73711411 96.53399594 154.17490936 130.91629517 83.3878227
171.36605897 137.99500384 137.99500384 189.56845268 84.3990668 ]
That order is identical to the output of the second print() function, in which we add the results to a dataframe:
0
0 225.973240
1 115.747634
2 163.276106
3 114.736390
4 120.803854
5 158.219886
6 236.085681
7 121.815098
8 99.567728
9 123.837587
10 204.737114
11 96.533996
12 154.174909
13 130.916295
14 83.387823
15 171.366059
16 137.995004
17 137.995004
18 189.568453
19 84.399067
Rerunning the code on a portion of the test set gives the same results, in the same order:
print(regr.predict(diabetes_X_test[0:5]))
df = pd.DataFrame(regr.predict(diabetes_X_test[0:5]))
print(df)
[ 225.9732401 115.74763374 163.27610621 114.73638965 120.80385422]
0
0 225.973240
1 115.747634
2 163.276106
3 114.736390
4 120.803854
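Since the goal in the question is to evaluate the predictions, the same idea can be sketched as a small evaluation frame built on the diabetes example. The frame name `df_eval` and the column names `feature`, `actual` and `residual` are illustrative assumptions, not anything from the original code:

```python
import pandas as pd
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()

# Same single feature as in the example, kept 2-D for sklearn.
X = diabetes.data[:, [2]]
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = diabetes.target[:-20], diabetes.target[-20:]

regr = linear_model.LinearRegression().fit(X_train, y_train)

# Hypothetical evaluation frame: one row per test sample.
df_eval = pd.DataFrame({"feature": X_test[:, 0], "actual": y_test})

# Positional assignment: prediction i lands on row i.
df_eval["pred_lin_regr"] = regr.predict(X_test)
df_eval["residual"] = df_eval["actual"] - df_eval["pred_lin_regr"]

print(df_eval.head())
```

Because every column is built from arrays in the same row order, each prediction sits next to the target it belongs to, and per-row errors can be computed directly.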