Adding scikit-learn (sklearn) prediction to pandas data frame

我寻月下人不归 2021-02-04 14:33

I am trying to add a sklearn prediction to a pandas dataframe, so that I can make a thorough evaluation of the prediction. The relevant piece of code is the following:



        
1 answer
  • 2021-02-04 14:58

    You're correct with your second line, df_total["pred_lin_regr"] = clf.predict(Xtest), and it's more efficient.

    In that line you're taking the output of clf.predict(), which is a numpy array, and adding it to a dataframe as a new column. The array is returned in the same order as the rows of Xtest, and adding it to a dataframe does not change that order: the values are assigned row by row, so the dataframe must have exactly as many rows as Xtest.
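
One caveat worth illustrating, as a minimal sketch rather than the asker's actual code: a plain numpy array is assigned by position, but a pandas Series is aligned on index labels. The frame and values below are hypothetical stand-ins for df_total and clf.predict(Xtest), with an index chosen to look like the result of a shuffled train/test split.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_total: the index is not 0..n-1,
# as is typical after a shuffled train/test split.
df_total = pd.DataFrame({"x": [10.0, 20.0, 30.0]}, index=[17, 12, 15])

# Hypothetical stand-in for clf.predict(Xtest): a plain numpy array.
pred = np.array([1.0, 2.0, 3.0])

# A numpy array has no index, so pandas assigns it row by row, by position.
df_total["pred_array"] = pred

# A Series carries its own index (0, 1, 2 here), so pandas aligns on labels.
# None of them match 17, 12, 15, so every value ends up as NaN.
df_total["pred_series"] = pd.Series(pred)

# Reusing the frame's index makes the Series line up with the rows again.
df_total["pred_aligned"] = pd.Series(pred, index=df_total.index)

print(df_total)
```

So assigning the raw output of predict() is safe as long as the dataframe rows are in the same order as the rows passed to predict().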

    Here's a little proof from this example:

    Taking the following portion:

    import numpy as np
    
    import pandas as pd
    from sklearn import datasets, linear_model
    
    # Load the diabetes dataset
    diabetes = datasets.load_diabetes()
    
    # Use only one feature
    diabetes_X = diabetes.data[:, np.newaxis, 2]
    
    # Split the data into training/testing sets
    diabetes_X_train = diabetes_X[:-20]
    diabetes_X_test = diabetes_X[-20:]
    
    # Split the targets into training/testing sets
    diabetes_y_train = diabetes.target[:-20]
    diabetes_y_test = diabetes.target[-20:]
    
    # Create linear regression object
    regr = linear_model.LinearRegression()
    
    # Train the model using the training sets
    regr.fit(diabetes_X_train, diabetes_y_train)
    
    print(regr.predict(diabetes_X_test))
    
    df = pd.DataFrame(regr.predict(diabetes_X_test))
    
    print(df)
    

    The first print() function will give us a numpy array as expected:

    [ 225.9732401   115.74763374  163.27610621  114.73638965  120.80385422
      158.21988574  236.08568105  121.81509832   99.56772822  123.83758651
      204.73711411   96.53399594  154.17490936  130.91629517   83.3878227
      171.36605897  137.99500384  137.99500384  189.56845268   84.3990668 ]
    

    The second print() shows the same values in the same order after the results have been put into a dataframe:

                 0
    0   225.973240
    1   115.747634
    2   163.276106
    3   114.736390
    4   120.803854
    5   158.219886
    6   236.085681
    7   121.815098
    8    99.567728
    9   123.837587
    10  204.737114
    11   96.533996
    12  154.174909
    13  130.916295
    14   83.387823
    15  171.366059
    16  137.995004
    17  137.995004
    18  189.568453
    19   84.399067
    

    Rerunning the code on only the first five rows of the test set gives the same values in the same order:

    print(regr.predict(diabetes_X_test[0:5]))
    
    df = pd.DataFrame(regr.predict(diabetes_X_test[0:5]))
    
    print(df)
    
    [ 225.9732401   115.74763374  163.27610621  114.73638965  120.80385422]
                0
    0  225.973240
    1  115.747634
    2  163.276106
    3  114.736390
    4  120.803854
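
For the evaluation mentioned in the question, the same idea can be extended by keeping the feature, the true target and the prediction side by side. This is a sketch that assumes the diabetes split used above; the column names "feature", "actual", "pred_lin_regr" and "residual" are chosen here for illustration and do not come from the original code.

```python
import pandas as pd
from sklearn import datasets, linear_model

# Same data and split as in the example above.
diabetes = datasets.load_diabetes()
X = diabetes.data[:, [2]]  # one feature, kept as a 2-D array
y = diabetes.target

X_train, X_test = X[:-20], X[-20:]
y_train, y_test = y[:-20], y[-20:]

regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

# One row per test sample: feature value and true target.
results = pd.DataFrame({"feature": X_test[:, 0], "actual": y_test})

# predict() returns one value per row of X_test, in the same order,
# so a positional assignment lines up with the rows of `results`.
results["pred_lin_regr"] = regr.predict(X_test)

# Per-row error, useful for inspecting where the model is off.
results["residual"] = results["actual"] - results["pred_lin_regr"]

print(results.head())
```

With everything in one frame, per-row errors can be sorted, filtered or summarised with the usual pandas methods such as results["residual"].abs().mean().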
    