Using statsmodels estimations with scikit-learn cross-validation, is it possible?

無奈伤痛 2021-01-31 19:05

I posted this question to the Cross Validated forum and later realized that it might find a more appropriate audience on Stack Overflow instead.

I am looking for a way I can use the model estimations from statsmodels with scikit-learn's cross-validation utilities (e.g. cross_val_score). Is that possible?

4 Answers
  • 2021-01-31 19:17

    For reference purposes: if you use the statsmodels formula API and/or the fit_regularized method, you can modify @David Dale's wrapper class in this way.

    import pandas as pd
    from sklearn.base import BaseEstimator, RegressorMixin
    from statsmodels.formula.api import glm as glm_sm
    
    # This is an example wrapper for statsmodels GLM
    class SMWrapper(BaseEstimator, RegressorMixin):
        def __init__(self, family, formula, alpha, L1_wt):
            self.family = family
            self.formula = formula
            self.alpha = alpha
            self.L1_wt = L1_wt
            self.model = None
            self.result = None
        def fit(self, X, y):
            # X must be a DataFrame so the formula can refer to its columns
            data = pd.concat([pd.DataFrame(X), pd.Series(y)], axis=1)
            data.columns = X.columns.tolist() + ['y']
            self.model = glm_sm(self.formula, data, family=self.family)
            self.result = self.model.fit_regularized(alpha=self.alpha, L1_wt=self.L1_wt, refit=True)
            return self  # fit must return the estimator itself for sklearn compatibility
        def predict(self, X):
            return self.result.predict(X)
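
    For example, a hypothetical call could look like this (df, the column names x1/x2/y, and the alpha/L1_wt values are all made up for illustration; df is assumed to be a DataFrame holding the features and target):

    import statsmodels.api as sm
    from sklearn.model_selection import cross_val_score

    # The formula 'y ~ x1 + x2' and the hyperparameter values are placeholders
    wrapper = SMWrapper(family=sm.families.Gaussian(), formula='y ~ x1 + x2',
                        alpha=0.1, L1_wt=0.5)
    print(cross_val_score(wrapper, df[['x1', 'x2']], df['y'], cv=3))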
    
  • 2021-01-31 19:23

    Indeed, you cannot use cross_val_score directly on statsmodels objects, because they have a different interface: in statsmodels

    • training data is passed directly into the constructor
    • a separate object contains the result of model estimation

    However, you can write a simple wrapper to make statsmodels objects look like sklearn estimators:

    import statsmodels.api as sm
    from sklearn.base import BaseEstimator, RegressorMixin
    
    class SMWrapper(BaseEstimator, RegressorMixin):
        """ A universal sklearn-style wrapper for statsmodels regressors """
        def __init__(self, model_class, fit_intercept=True):
            self.model_class = model_class
            self.fit_intercept = fit_intercept
        def fit(self, X, y):
            if self.fit_intercept:
                X = sm.add_constant(X)
            self.model_ = self.model_class(y, X)
            self.results_ = self.model_.fit()
            return self  # fit must return the estimator for cross_val_score to work
        def predict(self, X):
            if self.fit_intercept:
                X = sm.add_constant(X)
            return self.results_.predict(X)
    

    This class implements the fit and predict methods expected by sklearn, so it can be cross-validated or included in a pipeline. Like here:

    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LinearRegression
    
    X, y = make_regression(random_state=1, n_samples=300, noise=100)
    
    print(cross_val_score(SMWrapper(sm.OLS), X, y, scoring='r2'))
    print(cross_val_score(LinearRegression(), X, y, scoring='r2'))
    

    You can see that the outputs of the two models are identical, because both are OLS models cross-validated in the same way.

    [0.28592315 0.37367557 0.47972639]
    [0.28592315 0.37367557 0.47972639]
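
    Because the wrapper follows the sklearn estimator API, it can also be placed inside a pipeline. A minimal sketch (the choice of StandardScaler is just for illustration):

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Scale the features, then fit the wrapped statsmodels OLS on each fold
    pipe = make_pipeline(StandardScaler(), SMWrapper(sm.OLS))
    print(cross_val_score(pipe, X, y, scoring='r2'))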
    
  • 2021-01-31 19:27

    Though this is not technically scikit-learn, there is the package pmdarima, which wraps statsmodels and provides a scikit-learn-like interface.
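
    A minimal sketch of what that interface looks like (the series is synthetic and the ARIMA order is an arbitrary illustrative choice):

    import numpy as np
    from pmdarima.arima import ARIMA

    # Synthetic random-walk series, just for demonstration
    y = np.random.RandomState(1).normal(size=100).cumsum()
    model = ARIMA(order=(1, 1, 0))  # sklearn-style estimator with fit/predict
    model.fit(y)
    print(model.predict(n_periods=5))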

  • 2021-01-31 19:28

    Following the suggestion of David (which gave me an error, complaining about the missing function get_parameters) and the scikit-learn documentation, I created the following wrapper for a linear regression. It has the same interface as sklearn.linear_model.LinearRegression, but in addition it also has a summary() method, which gives info about p-values, R2 and other statistics, as in statsmodels.OLS.

    import statsmodels.api as sm
    from sklearn.base import BaseEstimator, RegressorMixin
    import pandas as pd

    from sklearn.utils.validation import check_X_y, check_is_fitted, check_array


    class MyLinearRegression(BaseEstimator, RegressorMixin):
        def __init__(self, fit_intercept=True):
            self.fit_intercept = fit_intercept

        def fit(self, X, y, column_names=()):
            """
            Parameters
            ----------
            column_names: list
                    Optional names to associate with the columns of X.
                    Useful with the summary() method, so that it can show
                    the feature name next to each coefficient.
            """
            if self.fit_intercept:
                X = sm.add_constant(X)

            # Check that X and y have correct shape
            X, y = check_X_y(X, y)

            self.X_ = X
            self.y_ = y

            if len(column_names) != 0:
                cols = list(column_names)
                if self.fit_intercept:
                    cols.insert(0, 'intercept')
                X = pd.DataFrame(X, columns=cols)

            self.model_ = sm.OLS(y, X)
            self.results_ = self.model_.fit()
            return self

        def predict(self, X):
            # Check that fit has been called
            check_is_fitted(self, 'model_')

            # Input validation
            X = check_array(X)

            if self.fit_intercept:
                X = sm.add_constant(X)
            return self.results_.predict(X)

        def get_params(self, deep=False):
            return {'fit_intercept': self.fit_intercept}

        def summary(self):
            print(self.results_.summary())
    

    Example of use:

    cols = ['feature1','feature2']
    X_train = df_train[cols].values
    X_test = df_test[cols].values
    y_train = df_train['label']
    y_test = df_test['label']
    model = MyLinearRegression()
    model.fit(X_train, y_train)
    model.summary()
    model.predict(X_test)
    

    If you want the summary to show the column names, you can call

    model.fit(X_train, y_train, column_names=cols)
    

    To use it in cross-validation:

    from sklearn.model_selection import cross_val_score
    scores = cross_val_score(MyLinearRegression(), X_train, y_train, cv=10, scoring='neg_mean_squared_error')
    scores
    