Question
I have a nonlinear model fit that looks like this:
The dark solid line is the model fit, and the grey part is the raw data.
Short version of the question: how do I get the likelihood of this model fit, so that I can perform a log-likelihood ratio test? Assume that the residuals are normally distributed.
I am relatively new to statistics, and my current thoughts are:
1. Get the residuals from the curve fit, and calculate the variance of the residuals;
2. Use this equation (the log of the Gaussian probability density, summed over the data points):
   ln L = -(n/2)·ln(2π·σ²) − (1/(2σ²))·Σᵢ (xᵢ − μᵢ)²
   and plug the variance of the residuals in for σ², the experimental values in for xᵢ, and the model fit in for μᵢ;
3. Calculate the log-likelihood ratio (a rough sketch of these steps follows below).
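For concreteness, here is a rough sketch of what I have in mind (the arrays are just placeholders; in my case y_observed would be the raw data and y_fitted the model prediction at the same x values):

import numpy as np
from scipy.stats import norm

# placeholder data; in reality these come from the experiment and the curve fit
y_observed = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
y_fitted = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# step 1: residuals and their variance
residuals = y_observed - y_fitted
sigma2 = residuals.var()

# step 2: Gaussian log-likelihood with mu = model fit and sigma^2 = residual variance
loglik = norm.logpdf(y_observed, loc=y_fitted, scale=np.sqrt(sigma2)).sum()
print(loglik)

Step 3 would repeat this for the second model and take twice the difference of the two log-likelihoods.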
Could anyone help me with these two full-version questions?
Is my method correct? (I think so, but it would be really great to make sure!)
Are there any ready-made functions in python/scipy/statsmodels to do this for me?
Answer 1:
Your likelihood function, which is simply the sum of the logs of the Gaussian probability density function, is the likelihood of fitting a mu and a sigma to your residuals, not the likelihood of your model given your data. In short, your approach is wrong.
Since you are doing nonlinear least squares, following what @usethedeathstar already mentioned, you should go straight for the F-test. Consider the following example, modified from http://www.walkingrandomly.com/?p=5254. We conduct the F-test using R, and discuss how to translate it into python at the end.
# construct the data vectors using c()
> xdata = c(-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9)
> ydata = c(0.699369,0.700462,0.695354,1.03905,1.97389,2.41143,1.91091,0.919576,-0.730975,-1.42001)
# some starting values
> p1 = 1
> p2 = 0.2
> p3 = 0.01
# do the fit
> fit1 = nls(ydata ~ p1*cos(p2*xdata) + p2*sin(p1*xdata), start=list(p1=p1,p2=p2))
> fit2 = nls(ydata ~ p1*cos(p2*xdata) + p2*sin(p1*xdata)+p3*xdata, start=list(p1=p1,p2=p2,p3=p3))
# summarise
> summary(fit1)
Formula: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata)
Parameters:
Estimate Std. Error t value Pr(>|t|)
p1 1.881851 0.027430 68.61 2.27e-12 ***
p2 0.700230 0.009153 76.51 9.50e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08202 on 8 degrees of freedom
Number of iterations to convergence: 7
Achieved convergence tolerance: 2.189e-06
> summary(fit2)
Formula: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata) + p3 * xdata
Parameters:
Estimate Std. Error t value Pr(>|t|)
p1 1.90108 0.03520 54.002 1.96e-10 ***
p2 0.70657 0.01167 60.528 8.82e-11 ***
p3 0.02029 0.02166 0.937 0.38
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.08243 on 7 degrees of freedom
Number of iterations to convergence: 9
Achieved convergence tolerance: 2.476e-06
> anova(fit2, fit1)
Analysis of Variance Table
Model 1: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata) + p3 * xdata
Model 2: ydata ~ p1 * cos(p2 * xdata) + p2 * sin(p1 * xdata)
Res.Df Res.Sum Sq Df Sum Sq F value Pr(>F)
1 7 0.047565
2 8 0.053813 -1 -0.0062473 0.9194 0.3696
Here we have two models: fit1 has 2 parameters, so its residuals have 8 degrees of freedom; fit2 has one additional parameter, so its residuals have 7 degrees of freedom. Is model 2 significantly better? No: the F value is 0.9194 on (1, 7) degrees of freedom, and it is not significant.
To reproduce the ANOVA table: the residual DF column is easy. The residual sums of squares are 0.08202*0.08202*8 = 0.05381 and 0.08243*0.08243*7 = 0.04756 (notice the 'Residual standard error: 0.08243 on 7 degrees of freedom' lines in the summaries above). In python, you can compute them as ((y_observed - y_fitted)**2).sum(), since scipy.optimize.curve_fit() doesn't return the residuals.
The F-ratio is 0.0062473 / (0.047565 / 7), and the P-value is 1 - scipy.stats.f.cdf(0.9194, 1, 7).
Putting them together, here is the python equivalent:
In [1]:
import numpy as np
import scipy.optimize as so
import scipy.stats as ss
xdata = np.array([-2,-1.64,-1.33,-0.7,0,0.45,1.2,1.64,2.32,2.9])
ydata = np.array([0.699369,0.700462,0.695354,1.03905,1.97389,2.41143,1.91091,0.919576,-0.730975,-1.42001])
def model0(x, p1, p2):
    return p1*np.cos(p2*x) + p2*np.sin(p1*x)

def model1(x, p1, p2, p3):
    return p1*np.cos(p2*x) + p2*np.sin(p1*x) + p3*x
p1, p2, p3 = 1, 0.2, 0.01
# fit the restricted (2-parameter) and full (3-parameter) models
fit0 = so.curve_fit(model0, xdata, ydata, p0=(p1, p2))[0]
fit1 = so.curve_fit(model1, xdata, ydata, p0=(p1, p2, p3))[0]
# fitted values and residual sums of squares for both models
yfit0 = model0(xdata, fit0[0], fit0[1])
yfit1 = model1(xdata, fit1[0], fit1[1], fit1[2])
ssq0 = ((yfit0 - ydata)**2).sum()
ssq1 = ((yfit1 - ydata)**2).sum()
# residual degrees of freedom of the larger model (10 points - 3 parameters)
df = len(xdata) - 3
# F statistic for adding one parameter, and its p-value
f_ratio = (ssq0 - ssq1) / (ssq1 / df)
p = 1 - ss.f.cdf(f_ratio, 1, df)
In [2]:
print(f_ratio, p)
0.919387419515 0.369574503394
As @usethedeathstar pointed out: when the residuals are normally distributed, nonlinear least squares IS maximum likelihood. Therefore the F-test and the likelihood ratio test are equivalent, because the F-ratio is a monotone transformation of the likelihood ratio λ.
Or in a descriptive way, see: http://www.stata.com/support/faqs/statistics/chi-squared-and-f-distributions/
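To see this numerically, here is a short sketch that continues from the python snippet above (it reuses xdata, ssq0 and ssq1, and computes the maximized Gaussian log-likelihood of each least-squares fit from its residual sum of squares):

# profile log-likelihood of a least-squares fit with ML variance estimate SSR/n:
# ln L = -n/2 * (ln(2*pi) + ln(SSR/n) + 1)
def loglik(ssq, n):
    return -0.5 * n * (np.log(2 * np.pi) + np.log(ssq / n) + 1)

n = len(xdata)
lr = 2 * (loglik(ssq1, n) - loglik(ssq0, n))  # likelihood ratio statistic, 1 df
p_lr = 1 - ss.chi2.cdf(lr, 1)
print(lr, p_lr)

The chi-squared p-value is asymptotic, so on only 10 data points it will not match the exact F-test p-value digit for digit, but both tests lead to the same conclusion here.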
Answer 2:
Your formula looks correct to me. It should give you the same results as scipy.stats.norm.logpdf(x, loc=mu, scale=sigma) summed over the data points.
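For example, a quick check with made-up numbers (x plays the role of the data, mu the fitted values, sigma the residual standard deviation):

import numpy as np
from scipy.stats import norm

x = np.array([0.2, 1.1, 2.3])
mu = np.array([0.0, 1.0, 2.5])
sigma = 0.3

# your formula: -n/2 * ln(2*pi*sigma^2) - sum((x - mu)^2) / (2*sigma^2)
by_formula = (-0.5 * len(x) * np.log(2 * np.pi * sigma**2)
              - ((x - mu)**2).sum() / (2 * sigma**2))
by_scipy = norm.logpdf(x, loc=mu, scale=sigma).sum()
print(np.allclose(by_formula, by_scipy))  # True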
Since you already have your estimates of mu and sigma, I don't think there is a function for the likelihood ratio test where you can plug your results in.
If you have the estimates of two models, where one is nested in the other, then you can easily calculate it yourself.
http://en.wikipedia.org/wiki/Likelihood-ratio_test
Here is the part of a method in statsmodels that calculates the LR-test for comparing two nested linear models https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/linear_model.py#L1531
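If you already have the two maximized log-likelihoods, the test itself is only a few lines. A minimal sketch (lr_test is just an illustrative helper name, not an existing statsmodels function):

from scipy.stats import chi2

def lr_test(llf_restricted, llf_full, df_diff):
    # likelihood ratio statistic for two nested models and its chi-squared p-value
    lr_stat = 2 * (llf_full - llf_restricted)
    p_value = chi2.sf(lr_stat, df_diff)
    return lr_stat, p_value

With statsmodels results objects you could, for example, pass the llf attributes of the restricted and full fits together with the difference in their number of parameters.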
Source: https://stackoverflow.com/questions/23004374/how-to-calculate-the-likelihood-of-curve-fitting-in-scipy