I am trying to evaluate a multiple linear regression model. I have a data set like this:
You can turn the dataframe into a matrix by calling the as_matrix method directly on the dataframe object. You might need to specify the columns you are interested in, for example X = df[['x1','x2','x3']].as_matrix(), where the different x's are the column names.
For the y variable you can use y = df['ground_truth'].values to get an array.
Here is an example with some randomly generated data:
import numpy as np
import pandas as pd

# create a 5x5 dataframe of random integers from 0 to 100 (inclusive)
df = pd.DataFrame(np.random.randint(0, 101, (5, 5)), columns=['X1','X2','X3','X4','y'])
Calling as_matrix() on df returns a numpy.ndarray object:
X = df[['X1','X2','X3','X4']].as_matrix()
Calling values returns a numpy.ndarray from a pandas Series:
y = df['y'].values
Note: you might get a warning saying: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead. In pandas 1.0 and later, as_matrix has been removed entirely, so values is required there.
To fix it, use values instead of as_matrix, as shown below:
X = df[['X1','X2','X3','X4']].values
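As a quick sanity check of the conversion, here is a minimal sketch on a small hand-written dataframe. The column names and numbers are made up for illustration. It also shows to_numpy(), which pandas 0.24 and later offer as an alternative to values; for this all-integer data both should give the same array.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    'X1': [1, 2, 3],
    'X2': [4, 5, 6],
    'y': [7, 8, 9],
})

# Feature matrix: select the feature columns, then take the underlying array
X = df[['X1', 'X2']].values
# Target vector: a single column (a Series) gives a 1-D array
y = df['y'].values
# Alternative spelling available in pandas >= 0.24
X_alt = df[['X1', 'X2']].to_numpy()

print(type(X), X.shape)
print(y.shape)
print(np.array_equal(X, X_alt))
```

Here X is a two-dimensional array with one row per sample and one column per feature, and y is one-dimensional, which is the layout scikit-learn estimators expect.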
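from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

y = broken_df.ground_truth.values
X = broken_df.drop('ground_truth', axis=1).values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)
# score() returns the coefficient of determination (R^2) on the test set
print(linreg.score(X_test, y_test))
# classification_report only applies to classifiers; use a regression metric instead
print(mean_squared_error(y_test, y_pred))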
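For reference, here is a self-contained sketch of the whole evaluation that can be run without the original data. The dataframe demo_df and its columns x1 and x2 are synthetic stand-ins (an assumption, not the data from the question), and the choice of R^2, mean squared error and mean absolute error is just one common way to evaluate a regression model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption):
# the target depends linearly on x1 and x2, plus a little noise
rng = np.random.RandomState(0)
demo_df = pd.DataFrame({'x1': rng.rand(200), 'x2': rng.rand(200)})
demo_df['ground_truth'] = (
    3.0 * demo_df['x1'] - 2.0 * demo_df['x2'] + rng.normal(0, 0.1, 200)
)

# Same preprocessing as above: features as a 2-D array, target as a 1-D array
X = demo_df.drop('ground_truth', axis=1).values
y = demo_df['ground_truth'].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)

# Regression metrics computed on the held-out test set
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)

print('R^2:', r2)
print('MSE:', mse)
print('MAE:', mae)
```

r2_score(y_test, y_pred) gives the same number as linreg.score(X_test, y_test), so either call can be used for R^2.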