Question
I am using a linear regression model to predict some values. I have already figured out the basic part, and my code now looks like this:
import time as ti
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import csv
from sklearn.datasets import load_boston
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from scipy.interpolate import *
import datetime
data = pd.read_csv(r"C:\Users\simon\Desktop\Datenbank\visualisierung\includes\csv.csv")
x = np.array(data["day"])
y = np.array(data["balance"])
reg = linear_model.LinearRegression()
X_train, X_test, y_train, y_test, i_train, i_test = train_test_split(x, y, data.index, test_size=0.2, random_state=4)
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
i_train = i_train.values.reshape(-1, 1)
i_test = i_test.values.reshape(-1, 1)
reg.fit(i_train, y_train)
print(reg.score(i_test, y_test))
252128,6/6/19
252899,7/6/19
253670,8/6/19
254441,9/6/19
I have 27 rows of those in total.
It doesn't work for some reason.
UndefinedMetricWarning: R^2 score is not well-defined with less than two samples.
The dtypes and shapes are:
X_train, X_test = object #dtype
X_train = (21,) #shape
X_test = (6,) #shape
y_train, y_test = int64 #dtype
y_train, y_test = (1, 21) #shape
i_train, i_test = int64 #dtype
i_train, i_test = (1, 21) #shape
X_train, X_test, y_train, y_test, i_train, i_test are all a:
<class 'numpy.ndarray'>
I could imagine that this is because I don't have enough examples.
Why does this happen and how can I prevent it?
Answer 1:
As suggested by sklearn documentation:
X : array-like or sparse matrix, shape (n_samples, n_features)
Training data
y : array_like, shape (n_samples, n_targets)
Target values. Will be cast to X’s dtype if necessary
Therefore, if your dataset consists of only 1 feature, you need to reshape your training and test sets using:
X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
y_train = y_train.reshape(-1, 1)
y_test = y_test.reshape(-1, 1)
and the rest of your code should work properly.
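To see what the reshape changes, here is a minimal sketch on synthetic data (the 27 evenly spaced values are an assumption for illustration, not the OP's file). It contrasts the (n_samples, 1) layout sklearn expects with a (1, n) layout, which sklearn reads as a single sample and which is the likely trigger for the R^2 warning:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the OP's data (assumed): 27 days, perfectly linear balance.
days = np.arange(27)
balance = 252128 + 771 * days

X = days.reshape(-1, 1)     # (27, 1): 27 samples, 1 feature
bad = days.reshape(1, -1)   # (1, 27): 1 sample, 27 features

# Fit on the first 21 rows, score on the remaining 6.
reg = LinearRegression().fit(X[:21], balance[:21])
r2 = reg.score(X[21:], balance[21:])
print(X.shape, bad.shape, r2)
```

Printing the shape of every array right before fit and score is usually the quickest way to spot a (1, n) array.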
Following the OP's clarification, the dataset appears to be a time series. Linear regression is not going to model such data properly, but as a toy example to have fun with, you can convert the dates to POSIX time, split the data chronologically, and test different algorithms.
Assuming your dataset is:
balance day
0 252128 6/6/19
1 252899 7/6/19
2 253670 8/6/19
3 254441 9/6/19
4 255944 10/6/19
5 256041 11/6/19
6 256670 12/6/19
7 257441 13/6/19
8 258128 14/6/19
9 258899 15/6/19
10 259670 16/6/19
11 260241 17/6/19
12 260444 18/6/19
13 260341 19/6/19
14 260670 20/6/19
15 261441 21/6/19
you can modify the code this way:
import pandas as pd
from sklearn import linear_model
data = pd.read_csv('csv.csv')
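# the dates are day/month/year, so state the format explicitly
X = pd.to_datetime(data['day'], format='%d/%m/%y')
# convert nanoseconds since the epoch to POSIX time (seconds) by dividing by 10**9
X = X.astype("int64").values.reshape(-1, 1) // 10**9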
y = data['balance']
# split the data
X_train = X[:12]
y_train = y[:12]
X_test = X[-4:]
y_test = y[-4:]
# create the model, then fit it on the training rows
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)
print(reg.score(X_test, y_test))
reg.predict(X_test)
What do you get? A very poor solution.
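For anyone who wants to try this without the CSV file, here is a self-contained sketch that embeds the 16 rows listed above. As a substitution of my own, it uses days elapsed since the first row as the feature instead of POSIX seconds; that is only a rescaling of the same time axis, and the variable names are illustrative:

```python
import io
import pandas as pd
from sklearn.linear_model import LinearRegression

# The 16 rows from the table above, embedded so no file is needed.
csv_text = """balance,day
252128,6/6/19
252899,7/6/19
253670,8/6/19
254441,9/6/19
255944,10/6/19
256041,11/6/19
256670,12/6/19
257441,13/6/19
258128,14/6/19
258899,15/6/19
259670,16/6/19
260241,17/6/19
260444,18/6/19
260341,19/6/19
260670,20/6/19
261441,21/6/19
"""
data = pd.read_csv(io.StringIO(csv_text))

# Dates are day/month/year, so the format is given explicitly.
days = pd.to_datetime(data["day"], format="%d/%m/%y")

# Feature: whole days since the first observation (assumed substitute for POSIX seconds).
X = (days - days.iloc[0]).dt.days.to_numpy().reshape(-1, 1)
y = data["balance"].to_numpy()

# Chronological split: first 12 rows for training, last 4 for testing.
reg = LinearRegression().fit(X[:12], y[:12])
r2 = reg.score(X[-4:], y[-4:])
print(r2)
```

The score on the last four rows comes out negative, meaning the fitted line predicts them worse than simply using their mean, because the balance flattens out after 17/6/19.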
Source: https://stackoverflow.com/questions/56726381/r2-score-is-not-well-defined-with-less-than-two-samples-python-sklearn