问题
The task
I have data that looks like this:
I want to fit a generalized linear model (glm) to this from a gamma family using statsmodels
. Using this model, for each of my observations I want to calculate the probability of observing a value that is smaller than (or equal to) that value. In other words I want to calculate:
P(y <= y_i | x_i)
My questions
How do I get the shape and scale parameters from the fitted glm in
statsmodels
? According to this question the scale parameter in statsmodels is not parameterized in the normal way. Can I use it directly as input to a gamma distribution inscipy
? Or do I need a transformation first?How do I use these parameters (shape and scale) to get the probabilities? Currently I'm using
scipy
to generate a distribution for eachx_i
and get the probability from that. See implementation below.
My current implementation
import scipy.stats as stat
import patsy
import statsmodels.api as sm
# Generate data in correct form
y, X = patsy.dmatrices('y ~ x', data=myData, return_type='dataframe')
# Fit model with gamma family and log link
mod = sm.GLM(y, X, family=sm.families.Gamma(sm.families.links.log())).fit()
# Predict mean
myData['mu'] = mod.predict(exog=X)
# Predict probabilities (note that for a gamma distribution mean = shape * scale)
probabilities = np.array(
[stat.gamma(m_i/mod.scale, scale=mod.scale).cdf(y_i) for m_i, y_i in zip(myData['mu'], myData['y'])]
)
However, when I perform this procedure I get the following result:
Currently the predicted probabilities all seem really high. The red line in the graph is the predicted mean. But even for points below this line the predicted cumulative probability is around 80%. This makes me wonder whether the scale parameter I used is indeed the correct one.
回答1:
In R, you can obtained as estimate of the shape using 1/dispersion (check this post).The naming of the dispersion estimate in statsmodels is a unfortunately scale
. So you did to take the reciprocal of this to get the shape estimate. I show it with an example below:
values = gamma.rvs(2,scale=5,size=500)
fit = sm.GLM(values, np.repeat(1,500), family=sm.families.Gamma(sm.families.links.log())).fit()
This is an intercept only model, and we check the intercept and dispersion (named scale):
[fit.params,fit.scale]
[array([2.27875973]), 0.563667465203953]
So the mean is exp(2.2599) = 9.582131
and if we use shape as 1/dispersion , shape = 1/0.563667465203953 = 1.774096
which is what we simulated.
If I use a simulated dataset, it works perfectly fine. This is what it looks like, with a shape of 10:
from scipy.stats import gamma
import numpy as np
import matplotlib.pyplot as plt
import patsy
import statsmodels.api as sm
import pandas as pd
_shape = 10
myData = pd.DataFrame({'x':np.random.uniform(0,10,size=500)})
myData['y'] = gamma.rvs(_shape,scale=np.exp(-myData['x']/3 + 0.5)/_shape,size=500)
myData.plot("x","y",kind="scatter")
Then we fit the model like you did:
y, X = patsy.dmatrices('y ~ x', data=myData, return_type='dataframe')
mod = sm.GLM(y, X, family=sm.families.Gamma(sm.families.links.log())).fit()
mu = mod.predict(exog=X)
shape_from_model = 1/mod.scale
probabilities = [gamma(shape_from_model, scale=m_i/shape_from_model).cdf(y_i) for m_i, y_i in zip(mu,myData['y'])]
And plot:
fig, ax = plt.subplots()
im = ax.scatter(myData["x"],myData["y"],c=probabilities)
im = ax.scatter(myData['x'],mu,c="r",s=1)
fig.colorbar(im, ax=ax)
来源:https://stackoverflow.com/questions/64174603/how-to-use-scale-and-shape-parameters-of-gamma-glm-in-statsmodels