UA MATH571A Multiple Linear Regression IV: Generalized Linear Models
Generalized Linear Models
Let $Y_1, Y_2, \dots, Y_N$ be response variables following some distribution in the exponential family, with $EY_i = \mu_i$, and suppose there exists a function $g$ such that $g(\mu_i)$ is linear in the explanatory variables:
$$g(\mu_i) = X_i \beta$$
Such a model is called a generalized linear model. Clearly, when $g(\mu_i) = \mu_i$ the model is multiple linear regression, and when $g$ is the inverse of the logistic function it is logistic regression.
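As a quick illustration, here is a minimal sketch in Python using statsmodels (the data are simulated and all variable names are made up): the same `GLM` interface covers both special cases by swapping the family, and with it the default link $g$.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))   # design matrix with intercept

# Gaussian family, identity link g(mu) = mu: ordinary multiple linear regression
y_lin = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=200)
print(sm.GLM(y_lin, X, family=sm.families.Gaussian()).fit().params)

# Binomial family, default logit link g(mu) = ln(mu / (1 - mu)): logistic regression
p = 1 / (1 + np.exp(-X @ np.array([0.5, 1.0, -1.0])))
y_bin = rng.binomial(1, p)
print(sm.GLM(y_bin, X, family=sm.families.Binomial()).fit().params)
```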
Binary Response Variables
In the regression model
$$Y_i = X_i \beta + \epsilon_i$$
the response sometimes takes only the values $Y_i = 0, 1$; such a variable is called a binary response, and the regression can be used for two-class classification. If $Y_i$ is viewed as a Bernoulli random variable, then
$$p_i = P(Y_i = 1) = E[Y_i] = X_i \beta$$
is the success probability. This model is straightforward, but it has quite a few problems.
The error term is not normal
Given the sample, the error can take only two values: $\epsilon_i = -X_i \beta$ when $Y_i = 0$, and $\epsilon_i = 1 - X_i \beta$ when $Y_i = 1$. Clearly this is not a normal distribution.
Homoscedasticity does not hold
The error has the same variance as $Y_i$: $\sigma^2(\epsilon_i) = X_i \beta (1 - X_i \beta)$, which clearly depends on $X_i$, so the constant-variance assumption fails.
The regression function is range-restricted
Since the fitted value is interpreted as a probability, we need $E[Y_i] = X_i \beta \in [0, 1]$; otherwise the fitted value is meaningless.
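A tiny simulation (illustrative only, with made-up data) makes the last problem concrete: fitting ordinary least squares to a 0/1 response easily produces fitted "probabilities" outside $[0, 1]$.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=100)
y = rng.binomial(1, 1 / (1 + np.exp(-2 * x)))    # true P(Y=1) rises steeply in x

fit = sm.OLS(y, sm.add_constant(x)).fit()        # linear probability model
yhat = fit.fittedvalues
print("fitted values outside [0, 1]:", int(np.sum((yhat < 0) | (yhat > 1))))
```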
Binary responses are therefore usually modeled as follows. Suppose $Y_i^c$ is a continuous latent variable that is observed only as 0 or 1, according to the rule
$$Y_i = \begin{cases} 1, & \text{if } Y_i^c \le Y_0 \\ 0, & \text{if } Y_i^c > Y_0 \end{cases}$$
Then
$$P(Y_i = 1) = P(Y_i^c \le Y_0)$$
Assume $Y_i^c$ satisfies the linear regression model
$$Y_i^c = X_i \beta + \epsilon_i,\quad \epsilon_i \sim N(0, \sigma_c^2)$$
Then
$$P(Y_i = 1) = P(Y_i^c \le Y_0) = P(X_i \beta + \epsilon_i \le Y_0) = P(\epsilon_i \le Y_0 - X_i \beta) = P\left(\frac{\epsilon_i}{\sigma_c} \le \frac{Y_0}{\sigma_c} - X_i \frac{\beta}{\sigma_c}\right)$$
The Probit Model
Let $\Phi$ denote the CDF of the standard normal distribution, and assume
$$P(Y_i = 1) = P\left(\frac{\epsilon_i}{\sigma_c} \le \frac{Y_0}{\sigma_c} - X_i \frac{\beta}{\sigma_c}\right) = \Phi(X_i \beta^*)$$
where
$$\beta_0^* = \frac{Y_0}{\sigma_c} - \frac{\beta_0}{\sigma_c},\qquad \beta_i^* = -\frac{\beta_i}{\sigma_c}$$
This model is called the probit model.
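As a sketch of how this is used in practice (simulated data; `beta_star` and the seed are arbitrary choices), statsmodels' `Probit` maximizes the likelihood implied by $\Phi(X_i \beta^*)$. Note that the fit recovers $\beta^*$ only; $Y_0$, $\beta$, and $\sigma_c$ are not separately identified.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(500, 1)))
beta_star = np.array([0.3, -0.8])
y = rng.binomial(1, stats.norm.cdf(X @ beta_star))   # P(Y_i = 1) = Phi(X_i beta*)

# the estimates target beta*, not Y_0, beta and sigma_c separately
print(sm.Probit(y, X).fit(disp=0).params)
```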
The Logit Model
The logit model is also known as logistic regression. Starting from
$$P(Y_i = 1) = P\left(\frac{\epsilon_i}{\sigma_c} \le \frac{Y_0}{\sigma_c} - X_i \frac{\beta}{\sigma_c}\right) = P\left(\frac{\epsilon_i}{\sigma_c} \le X_i \beta^*\right)$$
define
$$\epsilon_L = \frac{\pi}{\sqrt{3}} \frac{\epsilon_i}{\sigma_c},\qquad \beta = \frac{\pi}{\sqrt{3}} \beta^*$$
The factor $\pi/\sqrt{3}$ is the standard deviation of the standard logistic distribution (whose variance is $\pi^2/3$), so it scales $\epsilon_L$ onto the logistic scale. Assume $\epsilon_L$ has the logistic CDF
$$F(\epsilon_L) = \frac{\exp(\epsilon_L)}{1 + \exp(\epsilon_L)}$$
Then
$$P(Y_i = 1) = P\left(\frac{\epsilon_i}{\sigma_c} \le X_i \beta^*\right) = P(\epsilon_L \le X_i \beta) = \frac{\exp(X_i \beta)}{1 + \exp(X_i \beta)} = \frac{1}{1 + \exp(-X_i \beta)}$$
This is the logit model. Compared with the probit model, the logit coefficients are easier to interpret:
$$\exp(X_i \beta) = \frac{P(Y_i = 1)}{P(Y_i = 0)} \Longleftrightarrow X_i \beta = \ln\left(\frac{P(Y_i = 1)}{P(Y_i = 0)}\right) = \ln(odds_i)$$
The ratio $\frac{P(Y_i = 1)}{P(Y_i = 0)}$ is called the odds, so each coefficient is the effect of a unit change in the corresponding explanatory variable on the log odds:
$$\beta = \frac{\partial \ln(odds_i)}{\partial X_i}$$
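A short sketch of this interpretation (simulated data, arbitrary coefficients): after a logit fit, exponentiating a slope gives the multiplicative effect of a one-unit change in that variable on the odds.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = sm.add_constant(rng.normal(size=(500, 1)))
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([0.2, 0.7]))))

fit = sm.Logit(y, X).fit(disp=0)
# exp(beta_1): factor by which a one-unit increase in x multiplies the odds
print(np.exp(fit.params[1]))
```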
Maximum Likelihood Estimation of the Coefficients
Define $p_i = P(Y_i = 1)$. Clearly $Y_i$ follows the Bernoulli distribution $Ber(p_i)$, so
$$f(Y_i) = p_i^{Y_i} (1 - p_i)^{1 - Y_i}$$
The log-likelihood of the sample is
$$l(\beta) = \sum_{i=1}^N \left[ Y_i \ln p_i + (1 - Y_i) \ln(1 - p_i) \right] = \sum_{i=1}^N Y_i \ln\left(\frac{p_i}{1 - p_i}\right) + \sum_{i=1}^N \ln(1 - p_i)$$
where $p_i$ is a function of $\beta$:
$$p_i = \frac{1}{1 + \exp(-X_i \beta)},\qquad 1 - p_i = \frac{1}{1 + \exp(X_i \beta)},\qquad \ln\left(\frac{p_i}{1 - p_i}\right) = X_i \beta$$
Therefore
$$l(\beta) = \sum_{i=1}^N Y_i X_i \beta - \sum_{i=1}^N \ln(1 + \exp(X_i \beta))$$
Maximizing this log-likelihood has no closed-form solution, so numerical methods are used; a common choice is the iteratively reweighted least squares (IRLS) algorithm mentioned in the previous post.
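Below is a bare-bones IRLS loop for the logit model, a sketch rather than production code (no safeguards against separation or probabilities hitting 0 or 1): each iteration solves a weighted least squares problem with weights $\hat{p}_i(1 - \hat{p}_i)$ and a working response.

```python
import numpy as np

def irls_logit(X, y, n_iter=50, tol=1e-10):
    """Fit the logit model by iteratively reweighted least squares (Newton-Raphson)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))       # current fitted probabilities
        W = p * (1 - p)                       # Bernoulli variances = IRLS weights
        z = X @ beta + (y - p) / W            # working response
        XtW = (X * W[:, None]).T              # X'W
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# quick check on simulated data
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(300), rng.normal(size=300)])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ np.array([0.5, -1.0]))))
print(irls_logit(X, y))
```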
Inference on the Coefficients
Define the Hessian matrix of the log-likelihood as
$$G = \frac{\partial^2 l(\beta)}{\partial \beta^2}$$
Then the estimated variance of the maximum likelihood estimator is
$$s^2(\hat{\beta}) = (-G(\hat{\beta}))^{-1}$$
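Continuing the sketch above: for the logit model the Hessian has the closed form $G(\beta) = -X^\top W X$ with $W = \mathrm{diag}(p_i(1 - p_i))$, so the standard errors can be computed directly (the function below assumes a `beta_hat` such as the one returned by `irls_logit`).

```python
import numpy as np

def logit_se(X, beta_hat):
    """Standard errors s(beta_hat) from the observed information -G(beta_hat)."""
    p = 1 / (1 + np.exp(-X @ beta_hat))
    W = p * (1 - p)
    info = (X * W[:, None]).T @ X        # for the logit model, -G(beta) = X'WX
    return np.sqrt(np.diag(np.linalg.inv(info)))
```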
The Wald Test
In logistic regression, a single coefficient is tested with the Wald test:
$$H_0: \beta_1 = 0 \\ H_a: \beta_1 \ne 0$$
When the sample size is large enough,
$$Z = \frac{\hat{\beta}_1 - \beta_1}{s(\hat{\beta}_1)} \sim N(0, 1)$$
So the Wald test is essentially a Z test.
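As a small sketch (the argument names `beta_hat` and `se` are placeholders for output like that of the functions above):

```python
import numpy as np
from scipy import stats

def wald_test(beta_hat, se, j=1):
    """Two-sided Wald (Z) test of H0: beta_j = 0."""
    z = beta_hat[j] / se[j]
    return z, 2 * stats.norm.sf(np.abs(z))   # test statistic and p-value
```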
The Likelihood Ratio Test
The likelihood ratio test can test some or all of the coefficients:
$$H_0: \beta_q = \beta_{q+1} = \dots = \beta_{p-1} = 0 \\ H_a: \text{not all coefficients in } H_0 \text{ equal zero}$$
Let $L(R)$ be the likelihood of the reduced model and $L(F)$ the likelihood of the full model. When the sample size is large enough,
$$G^2 = -2\ln\left(\frac{L(R)}{L(F)}\right) \sim \chi^2(p - q)$$
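A minimal sketch of the computation (the log-likelihood inputs could come from, e.g., statsmodels' `fit.llf`; `df` is $p - q$):

```python
from scipy import stats

def lr_test(llf_reduced, llf_full, df):
    """Likelihood ratio test: G^2 = -2 ln(L(R)/L(F)) ~ chi2(df) under H0."""
    g2 = -2 * (llf_reduced - llf_full)
    return g2, stats.chi2.sf(g2, df)
```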
Binomial Regression
Suppose that at each value $X_j,\ j = 1, 2, \dots, c$ of the explanatory variables we make $n_j$ repeated observations, obtaining responses $Y_{ij},\ i = 1, 2, \dots, n_j$. Then, when the explanatory variables equal $X_j$, the number of observations with response 1 is
$$Y_{.j} = \sum_{i=1}^{n_j} Y_{ij} \sim Binom(n_j, p_j),\qquad p_j = \frac{1}{1 + \exp(-X_j \beta)}$$
This model is called binomial regression.
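A short sketch with simulated replicated data (the covariate values, $n_j = 50$, and the true coefficients are all made up): statsmodels' `GLM` accepts a (successes, failures) pair as a binomial response.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
xj = np.linspace(-2, 2, 10)                    # c = 10 distinct covariate values
nj = np.full(10, 50)                           # n_j = 50 replicates at each X_j
pj = 1 / (1 + np.exp(-(0.4 + 1.2 * xj)))
succ = rng.binomial(nj, pj)                    # Y_.j ~ Binom(n_j, p_j)

# (successes, failures) as the binomial response
endog = np.column_stack([succ, nj - succ])
fit = sm.GLM(endog, sm.add_constant(xj), family=sm.families.Binomial()).fit()
print(fit.params)
```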
Goodness-of-Fit Tests
A goodness-of-fit test assesses the overall quality of the fitted model; it is not very sensitive to small regions of poor fit. If the sample data contain replication, the Pearson chi-square or deviance goodness-of-fit test can be used; if not, the Hosmer-Lemeshow test can be used. For logistic regression the hypotheses are
$$H_0: P(Y_i = 1) = \frac{1}{1 + \exp(-X_i \beta)} \\ H_a: P(Y_i = 1) \ne \frac{1}{1 + \exp(-X_i \beta)}$$
Pearson Chi-Square Goodness-of-Fit Test
Using the replication setup from binomial regression, the observed counts are
$$O_{1j} = Y_{.j},\qquad O_{0j} = n_j - Y_{.j}$$
Under the null hypothesis, the fitted probabilities are
$$\hat{p}_j = \frac{1}{1 + \exp(-X_j \hat{\beta})}$$
so the expected numbers of responses equal to 1 and 0 are
$$E_{1j} = n_j \hat{p}_j,\qquad E_{0j} = n_j (1 - \hat{p}_j)$$
When $p < c$ and the replication counts $n_j$ are large enough,
$$\chi^2 = \sum_{j=1}^c \left[ \frac{(O_{0j} - E_{0j})^2}{E_{0j}} + \frac{(O_{1j} - E_{1j})^2}{E_{1j}} \right] \sim \chi^2(c - p)$$
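A sketch of this statistic (argument names are hypothetical: `succ` holds the counts $Y_{.j}$, `p_hat` the fitted $\hat{p}_j$, and `n_params` the number of coefficients $p$); statsmodels' GLM results also expose a `pearson_chi2` attribute.

```python
import numpy as np
from scipy import stats

def pearson_gof(succ, nj, p_hat, n_params):
    """Pearson chi-square GOF test for a logistic fit with replication."""
    O1, O0 = succ, nj - succ                 # observed counts of 1s and 0s
    E1, E0 = nj * p_hat, nj * (1 - p_hat)    # expected counts under H0
    chi2 = np.sum((O0 - E0) ** 2 / E0 + (O1 - E1) ** 2 / E1)
    df = len(nj) - n_params                  # c - p
    return chi2, stats.chi2.sf(chi2, df)
```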
Deviance Goodness-of-Fit Test
The deviance goodness-of-fit test uses the likelihood-ratio idea. The reduced model is the logistic regression; the full model is
$$E(Y_{ij}) = p_j$$
The likelihood ratio statistic in this setting is called the deviance, written $Dev(X)$; it measures the deviation between the two models.
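A sketch of the deviance computed directly from the grouped counts (same hypothetical argument names as above; the `np.where` guards handle groups where a count is zero, since $0 \ln 0 = 0$):

```python
import numpy as np
from scipy import stats

def deviance_gof(succ, nj, p_hat, n_params):
    """Deviance GOF test: LR test of the logistic fit against the saturated model."""
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(succ > 0,
                      succ * np.log(succ / (nj * p_hat)), 0.0)
        t0 = np.where(nj - succ > 0,
                      (nj - succ) * np.log((nj - succ) / (nj * (1 - p_hat))), 0.0)
    dev = 2 * np.sum(t1 + t0)                 # Dev(X)
    return dev, stats.chi2.sf(dev, len(nj) - n_params)
```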
Hosmer-Lemeshow Goodness-of-Fit Test
The Hosmer-Lemeshow goodness-of-fit test is a direct generalization of the chi-square goodness-of-fit test to data without replication ($n_j = 1$ means no replication, $n_j > 1$ means replication): it groups observations by their fitted probabilities and applies a Pearson-type statistic to the grouped counts.
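A sketch of the Hosmer-Lemeshow statistic under these conventions (ungrouped 0/1 responses `y` and fitted probabilities `p_hat`; the choice of $g = 10$ groups and $g - 2$ degrees of freedom follows common practice):

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow(y, p_hat, g=10):
    """Hosmer-Lemeshow test: bin by fitted probability, then a Pearson-type statistic."""
    order = np.argsort(p_hat)
    chi2 = 0.0
    for idx in np.array_split(order, g):      # g groups, roughly equal size
        n, o1, e1 = len(idx), y[idx].sum(), p_hat[idx].sum()
        chi2 += (o1 - e1) ** 2 / e1 + ((n - o1) - (n - e1)) ** 2 / (n - e1)
    return chi2, stats.chi2.sf(chi2, g - 2)   # conventional df = g - 2
```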
Multinomial Response Variables
Suppose $Y_i$ has $m$ possible values; without loss of generality, take them to be $1, 2, \dots, m$. Assume $Y_i$ follows a Boltzmann distribution:
$$P(Y_i = j) = \frac{1}{1 + \sum_{k=2}^{m} \exp(-X_i \beta_k)},\quad j = 1 \\ P(Y_i = j) = \frac{\exp(-X_i \beta_j)}{1 + \sum_{k=2}^{m} \exp(-X_i \beta_k)},\quad j > 1$$
This model is called the multinomial logit model (MLogit). The Boltzmann distribution describes the number of particles at each energy level: if there are $m$ levels in total, the fraction of the system's particles at level $j$ is
$$p_j = \frac{\exp(-\epsilon_j / kT)}{\sum_{l=1}^m \exp(-\epsilon_l / kT)} = \frac{\exp((\epsilon_1 - \epsilon_j) / kT)}{\sum_{l=1}^m \exp((\epsilon_1 - \epsilon_l) / kT)}$$
Using the Boltzmann distribution to describe multi-class problems is fairly natural. The linear predictor $X_i \beta_k$ plays the role of $(\epsilon_k - \epsilon_1)/kT$, the energy gap between level $k$ and level 1. The higher a level's energy, the fewer particles it holds, so when the explanatory variables take positive values, a larger $\beta_k$ makes $Y_i = k$ less likely. The fraction of particles at level 1 is normalized to
$$\frac{1}{1 + \sum_{l=2}^{m} \exp((\epsilon_1 - \epsilon_l)/kT)},$$
which corresponds to $P(Y_i = 1)$ in the MLogit model.
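To make the convention concrete, here is a sketch that computes the class probabilities exactly as written above, with category 1 as the baseline (the inputs are made up; note that common software such as statsmodels' `MNLogit` parameterizes the numerator as $\exp(+X_i \beta_j)$ instead, so its signs are flipped relative to this convention):

```python
import numpy as np

def mlogit_probs(X, betas):
    """Class probabilities under the sign convention above:
    category 1 is the baseline, and a larger X_i beta_k lowers P(Y_i = k)."""
    eta = np.column_stack([np.zeros(len(X))] + [-X @ b for b in betas])
    eta -= eta.max(axis=1, keepdims=True)     # stabilize exp() numerically
    w = np.exp(eta)
    return w / w.sum(axis=1, keepdims=True)

X = np.array([[1.0, 0.5], [1.0, -1.0]])                 # two observations, with intercept
betas = [np.array([0.2, 1.0]), np.array([-0.3, 0.5])]   # beta_2, beta_3 for m = 3
print(mlogit_probs(X, betas))                            # rows sum to 1
```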