Linear regression with pandas dataframe

旧巷少年郎 2021-02-03 10:19

I have a pandas DataFrame that I'm using to produce a scatterplot, and I want to include a regression line in the plot. Right now I'm trying to do this with polyfit.

1 Answer
  • 2021-02-03 11:13

    Instead of replacing '#DIV/0!' by hand, force the data to be numeric. This does two things at once: it ensures that the result is numeric type (not str), and it substitutes NaN for any entries that cannot be parsed as a number. Example:

    In [5]: pd.to_numeric(pd.Series([1, 2, 'blah', '#DIV/0!']), errors='coerce')
    Out[5]: 
    0    1.0
    1    2.0
    2    NaN
    3    NaN
    dtype: float64
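
The same coercion can be applied to whole DataFrame columns before plotting or fitting. A minimal sketch, assuming a hypothetical frame `df` with columns `x` and `y` (the names and values are made up for illustration, not taken from the question):

```python
import pandas as pd

# Hypothetical data: numbers mixed with spreadsheet error strings.
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": ["2.5", "#DIV/0!", 7, "blah"],
})

# Coerce every column to a numeric dtype; unparseable entries become NaN.
clean = df.apply(pd.to_numeric, errors="coerce")

# Keep only the rows where both x and y are valid numbers.
clean = clean.dropna()
```

Dropping the NaN rows up front is optional if the fitting routine already skips them, but it keeps the scatterplot and the fit working from the same points.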
    

    This should fix your error. But, on the general subject of fitting a line to data, I keep handy two ways of doing this that I like better than polyfit. The second of the two is more robust (and can potentially return much more detailed information about the statistics) but it requires statsmodels.

    import pandas as pd
    from scipy.stats import linregress
    def fit_line1(x, y):
        """Return slope, intercept of best fit line."""
        # Remove rows where either x or y is NaN.
        clean_data = pd.concat([x, y], axis=1).dropna(axis=0)
        x, y = clean_data.iloc[:, 0], clean_data.iloc[:, 1]
        slope, intercept, r, p, stderr = linregress(x, y)
        return slope, intercept # could also return stderr
    
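    import numpy as np
    import statsmodels.api as sm
    def fit_line2(x, y):
        """Return slope, intercept of best fit line."""
        X = sm.add_constant(x) # prepends a column of ones for the intercept
        model = sm.OLS(y, X, missing='drop') # ignores entries where x or y is NaN
        fit = model.fit()
        # params is ordered (const, x); np.asarray allows positional unpacking
        # whether params is a plain array or a labeled Series.
        intercept, slope = np.asarray(fit.params)
        return slope, intercept # could also return stderr in each via fit.bse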
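
For reference, both helpers compute the ordinary least-squares line. A minimal NumPy-only sketch of that calculation, skipping any pair where x or y is NaN; the function name `fit_line_numpy` is made up for illustration and is not part of either library:

```python
import numpy as np

def fit_line_numpy(x, y):
    """Return slope, intercept of the least-squares line, skipping NaN pairs."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Keep only positions where both values are present.
    keep = ~(np.isnan(x) | np.isnan(y))
    x, y = x[keep], y[keep]
    # Closed-form solution: slope = cov(x, y) / var(x).
    dx = x - x.mean()
    slope = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
    intercept = y.mean() - slope * x.mean()
    return slope, intercept

# The three complete pairs (0, 1), (1, 3), (4, 9) lie on y = 2x + 1.
m, b = fit_line_numpy([0, 1, 2, np.nan, 4], [1, 3, np.nan, 7, 9])
```

This returns only the two coefficients; the standard errors and other statistics mentioned above still need linregress or statsmodels.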
    

    To plot it, do something like

    import numpy as np
    import matplotlib.pyplot as plt
    m, b = fit_line2(x, y)
    N = 100 # could be just 2 if you are only drawing a straight line...
    points = np.linspace(x.min(), x.max(), N)
    plt.plot(points, m*points + b)
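
Putting the pieces together, here is a rough end-to-end sketch of a scatterplot with its regression line. The data and column names are hypothetical, and `np.polyfit` on the already-cleaned data stands in for the helper functions only so that the example is self-contained:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line to get a window
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data with one unparseable entry.
df = pd.DataFrame({"x": [0, 1, 2, 3, 4], "y": [1.1, 2.9, "#DIV/0!", 7.2, 8.8]})
df = df.apply(pd.to_numeric, errors="coerce").dropna()

# Degree-1 polynomial fit returns (slope, intercept).
m, b = np.polyfit(df["x"], df["y"], 1)

fig, ax = plt.subplots()
ax.scatter(df["x"], df["y"], label="data")
# Two points are enough to draw a straight line.
points = np.linspace(df["x"].min(), df["x"].max(), 2)
(line,) = ax.plot(points, m * points + b, color="red", label="fit")
ax.legend()
```

With an interactive backend, calling `plt.show()` afterwards displays the figure.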
    