Scikit-learn is returning coefficient of determination (R^2) values less than -1

陌清茗 2021-01-30 18:12

I'm doing a simple linear model. I have

    from sklearn import linear_model, cross_validation
    # Note: the cross_validation module was removed in scikit-learn 0.20;
    # on current versions, import cross_val_score from sklearn.model_selection.

    fire = load_data()
    regr = linear_model.LinearRegression()
    scores = cross_validation.cross_val_score(regr, fire.data, fire.target)

All of the returned scores are negative, and some are far below -1. I thought r^2 was supposed to lie between 0 and 1. How can it be less than -1?

4 Answers
  •  野的像风
    2021-01-30 18:36

    There is no reason r^2 can't be negative (despite the ^2 in its name). This is also stated in the docs. You can think of r^2 as a comparison of your model's fit (in the context of linear regression, a model of order 1, i.e. affine) against a model of order 0 (just fitting a constant), both fitted by minimizing squared loss. The constant that minimizes the squared error is the mean. Since you are doing cross validation with held-out data, it can happen that the mean of your test set is wildly different from the mean of your training set. This alone can make your model's squared prediction error much larger than that of simply predicting the mean of the test data, which yields a negative r^2 score.
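
    Concretely, this is how the score is defined (it matches scikit-learn's r2_score, with $y_i$ the true test targets, $\hat{y}_i$ the predictions, and $\bar{y}$ the mean of the test targets):

    $$
    R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
    $$

    The fraction is unbounded above, so r^2 has no lower limit: as soon as the model's squared error exceeds the total variance of the test targets, r^2 goes negative, and once it exceeds twice that variance, r^2 drops below -1.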

    In the worst case, if your features do not explain your target at all, these scores can become very strongly negative. Try

    import numpy as np
    from sklearn.linear_model import LinearRegression
    # sklearn.cross_validation was removed in scikit-learn 0.20;
    # on newer versions, import cross_val_score from sklearn.model_selection.
    from sklearn.cross_validation import cross_val_score

    rng = np.random.RandomState(42)
    X = rng.randn(100, 80)  # 100 samples, 80 pure-noise features
    y = rng.randn(100)      # y has nothing to do with X whatsoever

    scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
    

    This should result in strongly negative r^2 values: with 80 noise features and only 100 samples, the linear model badly overfits each training fold.

    In [23]: scores
    Out[23]: 
    array([-240.17927358,   -5.51819556,  -14.06815196,  -67.87003867,
            -64.14367035])
    

    The important question now is whether this is because linear models simply find nothing in your data, or because of something that can be fixed in preprocessing. Have you tried scaling your columns to have mean 0 and variance 1? You can do this with sklearn.preprocessing.StandardScaler. In fact, you should combine the StandardScaler and the LinearRegression into a single estimator using sklearn.pipeline.Pipeline, so that the scaling is fit on each training fold only and no information leaks from the test folds. After that, you may want to try Ridge regression, as sketched below.
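
    Here is a minimal sketch of that pipeline, assuming a recent scikit-learn (where cross_val_score lives in sklearn.model_selection) and an arbitrary default alpha=1.0 for Ridge; substitute your own fire.data / fire.target for the synthetic X, y:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    # Same pure-noise data as in the example above, just to be runnable.
    rng = np.random.RandomState(42)
    X = rng.randn(100, 80)
    y = rng.randn(100)

    model = Pipeline([
        ('scale', StandardScaler()),  # fit on each training fold only
        ('ridge', Ridge(alpha=1.0)),  # L2 penalty shrinks the coefficients
    ])
    scores = cross_val_score(model, X, y, cv=5, scoring='r2')

    On the pure-noise data the regularized pipeline will typically still score below zero, but far less catastrophically than the plain LinearRegression above, because the shrunken coefficients keep the predictions close to the training mean.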
