How to find the best degree of polynomials?

Asked by 天涯浪人 on 2021-02-01 11:45

I'm new to Machine Learning and currently stuck on this. First I used linear regression to fit the training set but got a very large RMSE. Then I tried polynomial regression, but I don't know how to choose the best degree.

3 Answers
  • 2021-02-01 11:47

You should provide your X/y data next time (or something dummy); it makes it faster to give you a specific solution. For now I've created dummy data from the equation y = X**4 + X**3 + X + 1.

    There are many ways you can improve on this, but a quick iteration to find the best degree is to simply fit your data on each degree and pick the degree with the best performance (e.g., lowest RMSE).

You can also play with how you decide to hold out your train/test/validation data; see the cross-validation sketch after the example output below.

    import numpy as np
    import matplotlib.pyplot as plt 
    
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split
    
    X = np.arange(100).reshape(100, 1)
    y = X**4 + X**3 + X + 1
    
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    
    rmses = []
    degrees = np.arange(1, 10)
    min_rmse, min_deg = 1e10, 0
    
    for deg in degrees:
    
        # Train features
        poly_features = PolynomialFeatures(degree=deg, include_bias=False)
        x_poly_train = poly_features.fit_transform(x_train)
    
        # Linear regression
        poly_reg = LinearRegression()
        poly_reg.fit(x_poly_train, y_train)
    
        # Compare with test data (transform only: the features were already fit on the training set)
        x_poly_test = poly_features.transform(x_test)
        poly_predict = poly_reg.predict(x_poly_test)
        poly_mse = mean_squared_error(y_test, poly_predict)
        poly_rmse = np.sqrt(poly_mse)
        rmses.append(poly_rmse)
        
        # Keep track of the degree with the lowest test RMSE so far
        if min_rmse > poly_rmse:
            min_rmse = poly_rmse
            min_deg = deg
    
    # Plot and present results
    print('Best degree {} with RMSE {}'.format(min_deg, min_rmse))
            
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(degrees, rmses)
    ax.set_yscale('log')
    ax.set_xlabel('Degree')
    ax.set_ylabel('RMSE')
    plt.show()
    

    This will print something like (the exact value depends on the random split):

    Best degree 4 with RMSE 1.27689038706e-08
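
    Since a single train/test split can be noisy, here is a variant of the same degree search that averages the error over several folds with cross_val_score. This is a minimal sketch of my own, reusing the X, y, and degrees defined above:

    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score
    
    best_deg, best_rmse = 0, np.inf
    for deg in degrees:
        model = make_pipeline(PolynomialFeatures(degree=deg, include_bias=False),
                              LinearRegression())
        # scoring returns negative MSE, so negate it before taking the root
        neg_mse = cross_val_score(model, X, y.ravel(), cv=5,
                                  scoring='neg_mean_squared_error')
        cv_rmse = np.sqrt(-neg_mse.mean())
        if cv_rmse < best_rmse:
            best_rmse, best_deg = cv_rmse, deg
    
    print('Best degree {} with CV RMSE {}'.format(best_deg, best_rmse))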

    Alternatively, you could build a class that carries out the polynomial fitting and pass it to GridSearchCV with a set of parameters; the last answer below shows exactly that approach.

  • 2021-02-01 12:03

    This is really where Bayesian model selection comes in: it picks the most likely model by balancing data fit against model complexity. The quick answer is to use the BIC (Bayesian information criterion):

    k   = number of estimated parameters in the model
    n   = number of observations
    sse = sum(residuals**2)
    BIC = n*ln(sse/n) + k*ln(n)
    

    The model with the lowest BIC (or AIC, etc.) is the one to prefer.
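
    A minimal sketch of applying this to the degree search. This is my own illustration using the Gaussian-error form of BIC above, with made-up data:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    
    rng = np.random.RandomState(0)
    X = np.linspace(0, 1, 50).reshape(-1, 1)
    y = 1 + X.ravel() + X.ravel()**3 + 0.05 * rng.randn(50)
    
    n = len(y)
    best_deg, best_bic = 0, np.inf
    for deg in range(1, 10):
        feats = PolynomialFeatures(degree=deg, include_bias=False)
        Xp = feats.fit_transform(X)
        reg = LinearRegression().fit(Xp, y)
        sse = np.sum((y - reg.predict(Xp)) ** 2)
        k = Xp.shape[1] + 1                        # coefficients + intercept
        bic = n * np.log(sse / n) + k * np.log(n)  # the formula above
        if bic < best_bic:
            best_bic, best_deg = bic, deg
    
    print('Lowest BIC at degree', best_deg)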

  • 2021-02-01 12:10

    In my opinion, the best way to find the optimal degree of a curve fit, or more generally the best fitting model, is to use GridSearchCV from the scikit-learn library.

    Here is an example of how to use it:

    First, let's define a function to sample random data:

    def make_data(N, err=1.0, rseed=1):
        rng = np.random.RandomState(rseed)
        X = rng.rand(N, 1) ** 2
        y = 1. / (X.ravel() + 0.3)
        if err > 0:
            y += err * rng.randn(N)
        return X, y
    

    Build a pipeline:

    def PolynomialRegression(degree=2, **kwargs):
        return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))
    

    Create the data and a vector (X_test) for testing and visualisation purposes:

    X, y = make_data(200)
    X_test = np.linspace(-0.1, 1.1, 200)[:, None]
    

    Define the GridSearchCV parameters:

    param_grid = {'polynomialfeatures__degree': np.arange(20),
                  'linearregression__fit_intercept': [True, False],
                  'linearregression__normalize': [True, False]}
    # Note: 'normalize' was removed from LinearRegression in scikit-learn 1.2;
    # on newer versions drop that entry (or scale the features separately).
    grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
    grid.fit(X, y)
    

    Get the best estimator found by the grid search:

    model = grid.best_estimator_
    model
    
    Pipeline(memory=None,
             steps=[('polynomialfeatures',
                     PolynomialFeatures(degree=4, include_bias=True,
                                        interaction_only=False)),
                    ('linearregression',
                     LinearRegression(copy_X=True, fit_intercept=True,
                                      n_jobs=1, normalize=False))])
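
    If you only need the winning hyper-parameters rather than the fitted pipeline, GridSearchCV also exposes best_params_; the values below are simply read off the repr above:

    print(grid.best_params_)
    # -> {'linearregression__fit_intercept': True,
    #     'linearregression__normalize': False,
    #     'polynomialfeatures__degree': 4}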
    

    Fit the model with the X and y data and use the vector to predict the values:

    y_test = model.fit(X, y).predict(X_test)
    

    Visualize the result:

    plt.scatter(X, y)
    plt.plot(X_test.ravel(), y_test, 'r')
    

    [Figure: the best-fit curve plotted over the scattered data]

    The full code snippet:

    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.model_selection import GridSearchCV
    
    def make_data(N, err=1.0, rseed=1):
        rng = np.random.RandomState(rseed)
        X = rng.rand(N, 1) ** 2
        y = 1. / (X.ravel() + 0.3)
        if err > 0:
            y += err * rng.randn(N)
        return X, y
    
    def PolynomialRegression(degree=2, **kwargs):
        return make_pipeline(PolynomialFeatures(degree), LinearRegression(**kwargs))
    
    
    X, y = make_data(200)
    X_test = np.linspace(-0.1, 1.1, 200)[:, None]
    
    param_grid = {'polynomialfeatures__degree': np.arange(20),
                  'linearregression__fit_intercept': [True, False],
                  'linearregression__normalize': [True, False]}
    # 'normalize' was removed from LinearRegression in scikit-learn 1.2;
    # on newer versions drop that entry (or scale the features separately).
    grid = GridSearchCV(PolynomialRegression(), param_grid, cv=7)
    grid.fit(X, y)
    
    model = grid.best_estimator_
    
    y_test = model.fit(X, y).predict(X_test)
    
    plt.scatter(X, y)
    plt.plot(X_test.ravel(), y_test, 'r')
    plt.show()
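
    One thing worth adding (my own note, not part of the original answer): since the question is about a large RMSE, you can also report the cross-validated error of the chosen model. grid.best_score_ holds the mean CV score (R^2 by default for regressors), and cross_val_score gives a CV RMSE:

    from sklearn.model_selection import cross_val_score
    
    print('Best CV score (R^2):', grid.best_score_)
    
    neg_mse = cross_val_score(model, X, y, cv=7,
                              scoring='neg_mean_squared_error')
    print('CV RMSE:', np.sqrt(-neg_mse.mean()))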
    