How to find the best degree of polynomials?

前端未结

关注

 3  637

I\'m new to Machine Learning and currently got stuck with this. First I use linear regression to fit the training set but get very large RMSE. Then I tried using polynomial regr

相关标签:

3条回答

后悔当初

2021-02-01 11:47

You should provide the data for X/Y next time, or something dummy, it'll be faster and provide you with a specific solution. For now I've created a dummy equation of the form y = X**4 + X**3 + X + 1.

There are many ways you can improve on this, but a quick iteration to find the best degree is to simply fit your data on each degree and pick the degree with the best performance (e.g., lowest RMSE).

You can also play with how you decide to hold out your train/test/validation data.

import numpy as np
import matplotlib.pyplot as plt 

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(100, 1)
y = X**4 + X**3 + X + 1

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rmses = []
degrees = np.arange(1, 10)
min_rmse, min_deg = 1e10, 0

for deg in degrees:

    # Train features
    poly_features = PolynomialFeatures(degree=deg, include_bias=False)
    x_poly_train = poly_features.fit_transform(x_train)

    # Linear regression
    poly_reg = LinearRegression()
    poly_reg.fit(x_poly_train, y_train)

    # Compare with test data
    x_poly_test = poly_features.fit_transform(x_test)
    poly_predict = poly_reg.predict(x_poly_test)
    poly_mse = mean_squared_error(y_test, poly_predict)
    poly_rmse = np.sqrt(poly_mse)
    rmses.append(poly_rmse)
    
    # Cross-validation of degree
    if min_rmse > poly_rmse:
        min_rmse = poly_rmse
        min_deg = deg

# Plot and present results
print('Best degree {} with RMSE {}'.format(min_deg, min_rmse))
        
fig = plt.figure()
ax = fig.add_subplot(111)
ax.plot(degrees, rmses)
ax.set_yscale('log')
ax.set_xlabel('Degree')
ax.set_ylabel('RMSE')

This will print:

Best degree 4 with RMSE 1.27689038706e-08

Alternatively, you could also build a new class that carries out Polynomial fitting, and pass that to GridSearchCV with a set of parameters.

0 讨论(0)

逝去的感伤

2021-02-01 12:03
This is where Bayesian model selection comes in really. This gives you the most likely model given both model complexity and data fit. I'm super tired so the quick answer is to use the BIC (Bayesian information criterion):
```
k = number of variables in the model
n = number of observations
sse = sum(residuals**2)
BIC = n*ln(sse/n) + k*ln(n) 
```
This BIC (or AIC etc) will give you the best model
0 讨论(0)
发布评论:

提交评论
- 加载中...

-上瘾入骨i

2021-02-01 12:10

In my opinion, the best way to find an optimal curve fitting degree or in general a fitting model is to use the GridSearchCV module from the scikit-learn library.

Here is an example how to use this library:

Firstly let us define a method to sample random data:

def make_data(N, err=1.0, rseed=1):

    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 1. / (X.ravel() + 0.3)
    if err > 0:
        y += err * rng.randn(N)
    return X, y

Build a pipeline:

def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))

Create a data and a vector(X_test) for testing and visualisation purposes:

X, y = make_data(200)
X_test = np.linspace(-0.1, 1.1, 200)[:, None]

Define the GridSearchCV parameters:

param_grid = {'polynomialfeatures__degree': np.arange(20),
'linearregression__fit_intercept': [True, False],
'linearregression__normalize': [True, False]}
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
grid.fit(X, y)

Get the best parameters from our model:

model = grid.best_estimator_
model

Pipeline(memory=None,
     steps=[('polynomialfeatures', PolynomialFeatures(degree=4, include_bias=True, interaction_only=False)), ('linearregression', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))])

Fit the model with the X and y data and use the vector to predict the values:

y_test = model.fit(X, y).predict(X_test)

Visualize the result:

plt.scatter(X, y)
plt.plot(X_test.ravel(), y_test, 'r')

The best fit result

The full code snippet:

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV

def make_data(N, err=1.0, rseed=1):

    rng = np.random.RandomState(rseed)
    X = rng.rand(N, 1) ** 2
    y = 1. / (X.ravel() + 0.3)
    if err > 0:
        y += err * rng.randn(N)
    return X, y

def PolynomialRegression(degree=2, **kwargs):
    return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))


X, y = make_data(200)
X_test = np.linspace(-0.1, 1.1, 200)[:, None]

param_grid = {'polynomialfeatures__degree': np.arange(20),
'linearregression__fit_intercept': [True, False],
'linearregression__normalize': [True, False]}
grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
grid.fit(X, y)

model = grid.best_estimator_

y_test = model.fit(X, y).predict(X_test)

plt.scatter(X, y)
plt.plot(X_test.ravel(), y_test, 'r')

0 讨论(0)