What is the difference between x_test, x_train, y_test, y_train in sklearn?

Posted by 微笑、不失礼 on 2020-07-20 06:34:55

Question


I'm learning sklearn, and I don't really understand the difference between the 4 outputs of the train_test_split function or why all 4 are needed.

I found some examples in the documentation, but they weren't enough to clear up my doubts.

Does the code use x_train to predict x_test, or does it use x_train to predict y_test?

What is the difference between train and test? Do I use train to predict test, or something similar?

I'm very confused about it. Below is the example provided in the documentation.

>>> import numpy as np  
>>> from sklearn.model_selection import train_test_split  
>>> X, y = np.arange(10).reshape((5, 2)), range(5)  
>>> X
array([[0, 1], 
       [2, 3],  
       [4, 5],  
       [6, 7],  
       [8, 9]])  
>>> list(y)  
[0, 1, 2, 3, 4] 
>>> X_train, X_test, y_train, y_test = train_test_split(  
...     X, y, test_size=0.33, random_state=42)  
...  
>>> X_train  
array([[4, 5], 
       [0, 1],  
       [6, 7]])  
>>> y_train  
[2, 0, 3]  
>>> X_test  
array([[2, 3], 
       [8, 9]])  
>>> y_test  
[1, 4]  
>>> train_test_split(y, shuffle=False)  
[[0, 1, 2], [3, 4]]

Answer 1:


Below is a dummy pandas.DataFrame to use as an example:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

df = pd.DataFrame({'X1': [100, 120, 140, 200, 230, 400, 500, 540, 600, 625],
                   'X2': [14, 15, 22, 24, 23, 31, 33, 35, 40, 40],
                   'Y': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})

Here we have 3 columns: X1, X2, and Y. Suppose X1 and X2 are your independent variables and the Y column is your dependent variable.

X = df[['X1','X2']]
y = df['Y']

With sklearn.model_selection.train_test_split you create 4 portions of the data, which are used for fitting the model and for predicting values.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

X_train, X_test, y_train, y_test

Now

1). X_train - This holds the independent variables for the training portion of the data and is used to train the model. Because we specified test_size = 0.4, 60% of the observations in the complete data are used to train/fit the model and the remaining 40% are used to test it.

2). X_test - This is the remaining 40% of the independent variables. It is not used in the training phase; it is used to make predictions in order to test the accuracy of the model.

3). y_train - This is the dependent variable that the model has to learn to predict. It holds the category labels for the rows in X_train, and it has to be passed in when training/fitting the model.

4). y_test - This holds the category labels for the rows in X_test. These labels are compared with the predicted categories to test the accuracy of the model.
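To make the four pieces concrete, here is a minimal sketch that rebuilds the dummy DataFrame from above and checks the size of each piece and how the rows line up. The names n_train and n_test are only illustrative helpers, not part of sklearn.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same dummy data as above
df = pd.DataFrame({'X1': [100, 120, 140, 200, 230, 400, 500, 540, 600, 625],
                   'X2': [14, 15, 22, 24, 23, 31, 33, 35, 40, 40],
                   'Y': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]})
X = df[['X1', 'X2']]
y = df['Y']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# 10 rows with test_size=0.4 -> 6 training rows and 4 test rows
n_train, n_test = len(X_train), len(X_test)
print(n_train, n_test)  # 6 4

# X and y are split with the same shuffle, so the row labels match pairwise
print(X_train.index.equals(y_train.index))
print(X_test.index.equals(y_test.index))

# No row ends up in both the training part and the test part
print(set(X_train.index).isdisjoint(X_test.index))
```

So x_train is never used to predict x_test. Each X piece is the input and the y piece with the same rows holds the matching answers.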

Now you can fit a model on this data. Let's fit sklearn.linear_model.LogisticRegression:

logreg = LogisticRegression()
logreg.fit(X_train, y_train)  # This is where the training takes place
y_pred_logreg = logreg.predict(X_test)  # Predict on the test data, which the model has not seen
print('Logistic Regression Train accuracy %s' % logreg.score(X_train, y_train))  # Train accuracy
#Logistic Regression Train accuracy 0.8333333333333334
# accuracy_score expects the true labels first, then the predictions
print('Logistic Regression Test accuracy %s' % accuracy_score(y_test, y_pred_logreg))  # Test accuracy
#Logistic Regression Test accuracy 0.5
print(confusion_matrix(y_test, y_pred_logreg))  # Confusion matrix
print(classification_report(y_test, y_pred_logreg))  # Classification report

You can read more about metrics here

Read more about data split here

Hope this helps:)




Answer 2:


You're supposed to train your classifier / regressor using your training set, and test / evaluate it using your testing set.

During training, your classifier / regressor takes x_train, produces predictions (y_pred), and uses the difference between y_pred and y_train (through a loss function) to learn. You then evaluate it by computing the loss between its predictions for x_test (which could also be named y_pred) and y_test.
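As a rough sketch of that train-then-evaluate loop, the example below fits a regressor on the training set only and then measures the loss on the test set. The toy data (y = 3x + 2) is made up for illustration, and mean squared error is just one possible choice of loss.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical toy data: y is an exact linear function of x (y = 3x + 2)
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3 * X.ravel() + 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)      # learning sees only the training set

y_pred = model.predict(X_test)   # predictions for inputs the model never saw
test_mse = mean_squared_error(y_test, y_pred)  # compare predictions with the true answers
print(test_mse)
```

Because the toy data is perfectly linear, the test error comes out at essentially zero. On real data it is normally larger than the training error, which is the reason for holding out a test set.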



Source: https://stackoverflow.com/questions/60636444/what-is-the-difference-between-x-test-x-train-y-test-y-train-in-sklearn
