How to calculate the likelihood of curve-fitting in scipy?


Question


I have a nonlinear model fit that looks like this:

The dark solid line is the model fit, and the grey part is the raw data.

Short version of the question: how do I get the likelihood of this model fit, so that I can perform a log-likelihood ratio test? Assume that the residuals are normally distributed.

I am relatively new to statistics, and my current thoughts are:

  1. Get the residuals from the curve fit and calculate their variance;

  2. Use this equation (the log-likelihood of a Gaussian),

    $\ln L(\mu, \sigma^2) = \sum_i \ln\!\left(\frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right)$,

    and plug in the variance of the residuals for sigma-squared, the experimental values for x_i, and the model fit for mu;
  3. Calculate the log-likelihood ratio (a rough Python sketch of these steps follows the list).
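To make these steps concrete, here is a rough sketch of what I have in mind (y_obs, y_fit_full and y_fit_reduced are placeholders for my data and the two nested model fits, not real code from my project):

import numpy as np

def log_likelihood(y_obs, y_fit):
    resid = y_obs - y_fit                              # step 1: residuals of the fit
    sigma2 = resid.var()                               # variance of the residuals
    # step 2: sum of Gaussian log-densities with mu = model fit, sigma^2 = residual variance
    return np.sum(-0.5*np.log(2*np.pi*sigma2) - resid**2/(2*sigma2))

# step 3: log-likelihood ratio between the full and the reduced (nested) model
# llr = 2 * (log_likelihood(y_obs, y_fit_full) - log_likelihood(y_obs, y_fit_reduced))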

Could anyone help me with these two full-version questions?

  1. Is my method correct? (I think so, but it would be really great to make sure!)

  2. Are there any ready-made functions in python/scipy/statsmodels to do this for me?


Answer 1:


Your likelihood function,

$\ln L(\mu, \sigma^2) = \sum_i \ln\!\left(\frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right)$,

which is simply the sum of the logs of the Gaussian probability density function, is the likelihood of fitting a mu and a sigma to your residuals, not the likelihood of your model given your data. In short, your approach is wrong.

Since you are doing nonlinear least squares, following what @usethedeathstar already mentioned, you should go straight for the F-test. Consider the following example, modified from http://www.walkingrandomly.com/?p=5254, where the F-test is carried out in R; we will discuss how to translate it into Python at the end.

# construct the data vectors using c()
> xdata = c(-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9)
> ydata = c(0.699369,0.700462,0.695354,1.03905,1.97389,2.41143,1.91091,0.919576,-0.730975,-1.42001)
# some starting values
> p1 = 1
> p2 = 0.2
> p3 = 0.01

# do the fit
> fit1 = nls(ydata ~ p1*cos(p2*xdata) + p2*sin(p1*xdata), start=list(p1=p1,p2=p2))
> fit2 = nls(ydata ~ p1*cos(p2*xdata) + p2*sin(p1*xdata)+p3*xdata, start=list(p1=p1,p2=p2,p3=p3))

# summarise
> summary(fit1)

Formula: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata)

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
p1 1.881851   0.027430   68.61 2.27e-12 ***
p2 0.700230   0.009153   76.51 9.50e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08202 on 8 degrees of freedom

Number of iterations to convergence: 7 
Achieved convergence tolerance: 2.189e-06

> summary(fit2)

Formula: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata) + p3 * xdata

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
p1  1.90108    0.03520  54.002 1.96e-10 ***
p2  0.70657    0.01167  60.528 8.82e-11 ***
p3  0.02029    0.02166   0.937     0.38    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.08243 on 7 degrees of freedom

Number of iterations to convergence: 9 
Achieved convergence tolerance: 2.476e-06

> anova(fit2, fit1)
Analysis of Variance Table

Model 1: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata) + p3 * xdata
Model 2: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata)
  Res.Df Res.Sum Sq Df     Sum Sq F value Pr(>F)
1      7   0.047565                             
2      8   0.053813 -1 -0.0062473  0.9194 0.3696

Here we have two models: fit1 has 2 parameters, so its residuals have 8 degrees of freedom; fit2 has one additional parameter, so its residuals have 7 degrees of freedom. Is model 2 significantly better? No: the F value is 0.9194 on (1, 7) degrees of freedom, which is not significant.

To reproduce the ANOVA table: the residual degrees of freedom are easy. The residual sums of squares come from the reported residual standard errors: 0.08202*0.08202*8 ≈ 0.05381 for fit1 and 0.08243*0.08243*7 ≈ 0.04756 for fit2 (notice 'Residual standard error: 0.08243 on 7 degrees of freedom', etc.). In Python you can compute them as ((y_observed - y_fitted)**2).sum(), since scipy.optimize.curve_fit() does not return the residuals.

The F-ratio is (0.053813 - 0.047565) / (0.047565 / 7) ≈ 0.9194, and the p-value is 1 - scipy.stats.f.cdf(0.9194, 1, 7).

Putting it all together, the Python equivalent is:

In [1]:

import numpy as np
import scipy.optimize as so
import scipy.stats as ss

xdata = np.array([-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9])
ydata = np.array([0.699369,0.700462,0.695354,1.03905,1.97389,2.41143,1.91091,0.919576,-0.730975,-1.42001])

def model0(x, p1, p2):
    # restricted model (2 parameters)
    return p1*np.cos(p2*x) + p2*np.sin(p1*x)

def model1(x, p1, p2, p3):
    # full model (3 parameters): adds a linear term
    return p1*np.cos(p2*x) + p2*np.sin(p1*x) + p3*x

p1, p2, p3 = 1, 0.2, 0.01
fit0 = so.curve_fit(model0, xdata, ydata, p0=(p1, p2))[0]
fit1 = so.curve_fit(model1, xdata, ydata, p0=(p1, p2, p3))[0]
yfit0 = model0(xdata, *fit0)
yfit1 = model1(xdata, *fit1)

ssq0 = ((yfit0 - ydata)**2).sum()   # residual sum of squares, restricted model
ssq1 = ((yfit1 - ydata)**2).sum()   # residual sum of squares, full model
df = len(xdata) - 3                 # residual degrees of freedom of the full model
f_ratio = (ssq0 - ssq1) / (ssq1 / df)
p = 1 - ss.f.cdf(f_ratio, 1, df)
In [2]:

print(f_ratio, p)
0.919387419515 0.369574503394

As @usethedeathstar pointed out: when the residuals are normally distributed, nonlinear least squares IS maximum likelihood. Therefore the F-test and the likelihood-ratio test are equivalent, because the F-ratio is a monotone transformation of the likelihood ratio λ.

For a more descriptive explanation, see: http://www.stata.com/support/faqs/statistics/chi-squared-and-f-distributions/
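For completeness, here is a minimal sketch of the corresponding likelihood-ratio computation under the normality assumption, reusing xdata, ssq0 and ssq1 from the Python code above (the Gaussian MLE of the residual variance is the residual sum of squares divided by n):

import numpy as np
import scipy.stats as ss

n = len(xdata)

def max_loglik(ssq, n):
    # maximized Gaussian log-likelihood of a least-squares fit, written in terms of its RSS
    sigma2 = ssq / n                                   # MLE of the residual variance
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)

lr = 2 * (max_loglik(ssq1, n) - max_loglik(ssq0, n))   # likelihood-ratio statistic
p_lr = ss.chi2.sf(lr, 1)                               # asymptotic chi-squared p-value, 1 df
print(lr, p_lr)                                        # same qualitative conclusion as the F-test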




Answer 2:


Your formula looks correct to me. It should give you the same results as scipy.stats.norm.logpdf(x, loc=mu, scale=sigma).
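As a quick illustration (with made-up values for x, mu and sigma), the hand-written Gaussian log-density and scipy's agree:

import numpy as np
from scipy.stats import norm

x, mu, sigma = 1.3, 1.0, 0.5                           # arbitrary illustration values
by_hand = -0.5*np.log(2*np.pi*sigma**2) - (x - mu)**2/(2*sigma**2)
by_scipy = norm.logpdf(x, loc=mu, scale=sigma)
print(by_hand, by_scipy)                               # identical up to floating-point precision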

Since you already have your estimates of mu and sigma, I don't think there is a function for the likelihood ratio test where you can plug your results in.

If you have the estimates of two models, where one is nested in the other, then you can easily calculate it yourself.

http://en.wikipedia.org/wiki/Likelihood-ratio_test

Here is the part of a method in statsmodels that calculates the LR test for comparing two nested linear models: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/linear_model.py#L1531
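If your two nested models happen to be linear, a hedged sketch of using that machinery directly (assuming the compare_lr_test method on fitted OLS results, with made-up data for illustration) could look like this:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 1.0 + 2.0*x + rng.normal(scale=0.1, size=x.size)   # made-up data

X_small = sm.add_constant(x)                           # restricted model: intercept + x
X_big = np.column_stack([X_small, x**2])               # full model: adds a quadratic term

fit_small = sm.OLS(y, X_small).fit()
fit_big = sm.OLS(y, X_big).fit()

# LR test of the nested models: returns the statistic, its p-value and the df difference
lr_stat, p_value, df_diff = fit_big.compare_lr_test(fit_small)
print(lr_stat, p_value, df_diff)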



Source: https://stackoverflow.com/questions/23004374/how-to-calculate-the-likelihood-of-curve-fitting-in-scipy
