Adding scikit-learn (sklearn) prediction to pandas data frame

我寻月下人不归 2021-02-04 14:33

I am trying to add a sklearn prediction to a pandas dataframe, so that I can make a thorough evaluation of the prediction. The relevant piece of code is the following:

1 answer

闹比i (original poster)

2021-02-04 14:58

Your second line, df_total["pred_lin_regr"] = clf.predict(Xtest), is correct, and it is also the more efficient option.

There you take the output of clf.predict(), which is a NumPy array, and add it to a dataframe. The array's values come back in the same order as the rows of Xtest, and adding the array to a dataframe does not change or alter that order, so each prediction lands on its matching row as long as df_total holds the same rows as Xtest in the same order.
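As a minimal sketch of why the order is preserved, the snippet below assigns predictions to a dataframe whose index is deliberately not 0..n-1. The names df_total, clf and pred_lin_regr come from the question; the toy data, the shuffled index and the pred_as_series column are assumptions made up for illustration. It shows that a NumPy array is written row by row (positionally), whereas a pandas Series would be aligned by index label instead.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data (y = 2*x + 1) with a non-default, shuffled index
df_total = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0], "y": [3.0, 5.0, 7.0, 9.0]},
    index=[10, 3, 7, 1],
)

clf = LinearRegression().fit(df_total[["x"]], df_total["y"])

# predict() returns a plain NumPy array: one value per input row, in row order
pred = clf.predict(df_total[["x"]])

# Assigning an array to a column is positional: row i receives pred[i]
df_total["pred_lin_regr"] = pred

# A Series carries its own index (0..3 here), so pandas aligns it by label.
# Only labels 3 and 1 exist in both indexes; the other rows become NaN.
df_total["pred_as_series"] = pd.Series(pred)

print(df_total)
```

In other words, the array assignment used in the question is the safe form here; the label alignment only comes into play if the predictions are wrapped in a Series or DataFrame with a different index first.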

Here's a little proof from this example:

Taking the following portion:

import numpy as np
import pandas as pd
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

print(regr.predict(diabetes_X_test))

df = pd.DataFrame(regr.predict(diabetes_X_test))

print(df)

The first print() function will give us a numpy array as expected:

[ 225.9732401   115.74763374  163.27610621  114.73638965  120.80385422
  158.21988574  236.08568105  121.81509832   99.56772822  123.83758651
  204.73711411   96.53399594  154.17490936  130.91629517   83.3878227
  171.36605897  137.99500384  137.99500384  189.56845268   84.3990668 ]

That order is identical to the output of the second print() function, in which we add the results to a dataframe:

             0
0   225.973240
1   115.747634
2   163.276106
3   114.736390
4   120.803854
5   158.219886
6   236.085681
7   121.815098
8    99.567728
9   123.837587
10  204.737114
11   96.533996
12  154.174909
13  130.916295
14   83.387823
15  171.366059
16  137.995004
17  137.995004
18  189.568453
19   84.399067

Rerunning the code for a portion of the test set gives the same results in the same order:

print(regr.predict(diabetes_X_test[0:5]))

df = pd.DataFrame(regr.predict(diabetes_X_test[0:5]))

print(df)

[ 225.9732401   115.74763374  163.27610621  114.73638965  120.80385422]
            0
0  225.973240
1  115.747634
2  163.276106
3  114.736390
4  120.803854
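Since the goal in the question is to evaluate the predictions, one possible next step is to put the actual and predicted values side by side in a single dataframe. The sketch below reuses the diabetes example from above; the results dataframe and its column names (feature, actual, predicted, residual) are hypothetical choices for illustration, not something from the original code.

```python
import numpy as np
import pandas as pd
from sklearn import datasets, linear_model

# Same setup as the example above: one feature, last 20 rows held out
diabetes = datasets.load_diabetes()
X = diabetes.data[:, np.newaxis, 2]
X_train, X_test = X[:-20], X[-20:]
y_train, y_test = diabetes.target[:-20], diabetes.target[-20:]

regr = linear_model.LinearRegression().fit(X_train, y_train)

# All three arrays have one entry per test row, in the same row order,
# so building the dataframe from them keeps each row consistent.
results = pd.DataFrame({
    "feature": X_test[:, 0],
    "actual": y_test,
    "predicted": regr.predict(X_test),
})

# Per-row error, handy for sorting or plotting during evaluation
results["residual"] = results["actual"] - results["predicted"]

print(results.head())
```

Because every column is built from arrays that share the test set's row order, each residual in this sketch refers to the same sample as the feature and target values on its row.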
