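These are notes I made a while ago. I am now optimizing a GS model and considering large-effect QTL and the combination of GWAS with GS, so I am reviewing the notes and summarizing how to implement this.
Editor's note:
ASReml is very powerful software, and because it is so powerful, many people do not know how to use it. Genomic selection in breeding builds on the conventional pedigree-based animal model, and animal models can themselves be very complex. A look at the ASReml manual makes this clear: it runs to more than 300 pages. As far as I can tell, its thickness can be expressed by this formula:
This shows that Professor Arthur Gilmour (the author of ASReml) is a very patient and very capable statistician. He spent much of his life building his work into this software, and I admire him for it.
This tutorial covers the use of ASReml in genomic selection and molecular breeding. What follows are my reading notes.
A friend once said that our community is already small, and if we still do not learn to share and communicate, what will become of the discipline? That is why I cannot stop. Nietzsche said: an excess of strength is the proof of strength. He made neglecting one's proper job sound so justified that I am even more determined to stay a slash-career person to the end. Enough chatter; here is the table of contents.
Contents:
Introduction
The main goal of this document is to describe how genomic analyses are implemented in ASReml. It assumes the reader has some statistical background, so statistics and models are not explained in detail.
1, Single-marker analysis
Example data: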
ID,effect,SNP_1,SNP_100,SNP_1000,SNP_101,SNP_102,SNP_103,SNP_104,SNP_105,SNP_106,SNP_107,SNP_108,SNP_109,SNP_11,SNP_110,SNP_111,SNP_112,SNP_113,SNP_114
ID_1,-0.259731957336183,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_10,0.117554666740654,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
ID_100,0.00357380737732867,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_101,0.344906212015101,0,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0
ID_102,0.376403712779367,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
ID_103,0.131676984710817,0,0,0,0,1,1,0,1,1,1,0,0,0,0,0,0,0,0
ID_104,0.41299708896122,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
ID_105,0.353890056009646,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_106,0.237438809186312,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_107,-0.316455302927825,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0
ID_108,-0.235784805404543,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_109,0.0783501427411017,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
ID_11,0.0919863476998604,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
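The first column is ID, the second column is the observation (effect), and the third and later columns are SNPs.
Fit each marker as a fixed factor, running in a loop: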
!cycle SNP_1 SNP_100 SNP_1000 SNP_101 SNP_102 SNP_103 SNP_104 SNP_105 SNP_106 SNP_107 SNP_108 SNP_109 SNP_11 SNP_110 SNP_111 SNP_112 SNP_113 SNP_114
dd.csv !SKIP 1
effect ~ mu $I
The significance of each SNP can be read from the .asr file; this is a single-marker analysis of variance.
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
21 mu 1 651.0 0.83 0.363
14 SNP_109 2 651.0 5.20 0.006
Finished: 19 Oct 2018 17:04:23.666 LogL Converged
Folder: D:\spline\snp-asreml
Cycle 13 value is SNP_11
Reading dd.csv FREE FORMAT skipping 1 lines
Univariate analysis of effect
Summary of 654 records retained of 654 read
Warning: Fewer levels found in SNP_1 than specified
Warning: Fewer levels found in SNP_101 than specified
Warning: Fewer levels found in SNP_104 than specified
Warning: Fewer levels found in SNP_11 than specified
Warning: Fewer levels found in SNP_112 than specified
Forming 3 equations: 3 dense.
Initial updates will be shrunk by factor 0.316
Notice: 1 singularities detected in design matrix.
1 LogL= 603.924 S2= 0.56887E-01 652 df
2 LogL= 603.924 S2= 0.56887E-01 652 df
- - - Results from analysis of effect - - -
LogL: 603.92 0.568871E-01 652 2 SNP_11 "LogL Converged"
Akaike Information Criterion -1205.85 (assuming 1 parameters).
Bayesian Information Criterion -1201.37
Model_Term Gamma Sigma Sigma/SE % C
Residual SCA_V 654 1.00000 0.568871E-01 18.06 0 P
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
21 mu 1 652.0 0.82 0.366
15 SNP_11 1 652.0 1.25 0.264
Finished: 19 Oct 2018 17:04:24.058 LogL Converged
Folder: D:\spline\snp-asreml
Cycle 14 value is SNP_110
Reading dd.csv FREE FORMAT skipping 1 lines
Univariate analysis of effect
Summary of 654 records retained of 654 read
Warning: Fewer levels found in SNP_1 than specified
Warning: Fewer levels found in SNP_101 than specified
Warning: Fewer levels found in SNP_104 than specified
Warning: Fewer levels found in SNP_11 than specified
Warning: Fewer levels found in SNP_112 than specified
Forming 3 equations: 3 dense.
Initial updates will be shrunk by factor 0.316
1 LogL= 601.263 S2= 0.56936E-01 651 df
2 LogL= 601.263 S2= 0.56936E-01 651 df
- - - Results from analysis of effect - - -
LogL: 601.26 0.569356E-01 651 2 SNP_110 "LogL Converged"
Akaike Information Criterion -1200.53 (assuming 1 parameters).
Bayesian Information Criterion -1196.05
Model_Term Gamma Sigma Sigma/SE % C
Residual SCA_V 654 1.00000 0.569356E-01 18.04 0 P
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
21 mu 1 651.0 0.82 0.366
16 SNP_110 2 651.0 0.85 0.429
Finished: 19 Oct 2018 17:04:24.499 LogL Converged
Folder: D:\spline\snp-asreml
Cycle 15 value is SNP_111
Reading dd.csv FREE FORMAT skipping 1 lines
Univariate analysis of effect
Summary of 654 records retained of 654 read
Warning: Fewer levels found in SNP_1 than specified
Warning: Fewer levels found in SNP_101 than specified
Warning: Fewer levels found in SNP_104 than specified
Warning: Fewer levels found in SNP_11 than specified
Warning: Fewer levels found in SNP_112 than specified
Forming 3 equations: 3 dense.
Initial updates will be shrunk by factor 0.316
1 LogL= 600.791 S2= 0.57054E-01 651 df
2 LogL= 600.791 S2= 0.57054E-01 651 df
- - - Results from analysis of effect - - -
LogL: 600.79 0.570539E-01 651 2 SNP_111 "LogL Converged"
Local CYCLE LogL Peak at CYCLE: 12 SNP_109 LogL: 605.70 Deviance: 12.35
Akaike Information Criterion -1199.58 (assuming 1 parameters).
Bayesian Information Criterion -1195.10
Model_Term Gamma Sigma Sigma/SE % C
Residual SCA_V 654 1.00000 0.570539E-01 18.04 0 P
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
21 mu 1 651.0 0.81 0.367
17 SNP_111 2 651.0 0.17 0.843
Finished: 19 Oct 2018 17:04:24.962 LogL Converged
Folder: D:\spline\snp-asreml
Cycle 16 value is SNP_112
Reading dd.csv FREE FORMAT skipping 1 lines
Univariate analysis of effect
Summary of 654 records retained of 654 read
Warning: Fewer levels found in SNP_1 than specified
Warning: Fewer levels found in SNP_101 than specified
Warning: Fewer levels found in SNP_104 than specified
Warning: Fewer levels found in SNP_11 than specified
Warning: Fewer levels found in SNP_112 than specified
Forming 3 equations: 3 dense.
Initial updates will be shrunk by factor 0.316
Notice: 1 singularities detected in design matrix.
1 LogL= 602.714 S2= 0.56989E-01 652 df
2 LogL= 602.714 S2= 0.56989E-01 652 df
- - - Results from analysis of effect - - -
LogL: 602.71 0.569893E-01 652 2 SNP_112 "LogL Converged"
Akaike Information Criterion -1203.43 (assuming 1 parameters).
Bayesian Information Criterion -1198.95
Model_Term Gamma Sigma Sigma/SE % C
Residual SCA_V 654 1.00000 0.569893E-01 18.06 0 P
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
21 mu 1 652.0 0.82 0.367
18 SNP_112 1 652.0 0.08 0.776
Finished: 19 Oct 2018 17:04:25.435 LogL Converged
Folder: D:\spline\snp-asreml
Cycle 17 value is SNP_113
Reading dd.csv FREE FORMAT skipping 1 lines
Univariate analysis of effect
Summary of 654 records retained of 654 read
Warning: Fewer levels found in SNP_1 than specified
Warning: Fewer levels found in SNP_101 than specified
Warning: Fewer levels found in SNP_104 than specified
Warning: Fewer levels found in SNP_11 than specified
Warning: Fewer levels found in SNP_112 than specified
Forming 3 equations: 3 dense.
Initial updates will be shrunk by factor 0.316
1 LogL= 601.723 S2= 0.57001E-01 651 df
2 LogL= 601.723 S2= 0.57001E-01 651 df
- - - Results from analysis of effect - - -
LogL: 601.72 0.570011E-01 651 2 SNP_113 "LogL Converged"
Akaike Information Criterion -1201.45 (assuming 1 parameters).
Bayesian Information Criterion -1196.97
Model_Term Gamma Sigma Sigma/SE % C
Residual SCA_V 654 1.00000 0.570011E-01 18.04 0 P
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
21 mu 1 651.0 0.82 0.367
19 SNP_113 2 651.0 0.47 0.623
Finished: 19 Oct 2018 17:04:25.904 LogL Converged
Folder: D:\spline\snp-asreml
Cycle 18 value is SNP_114
Reading dd.csv FREE FORMAT skipping 1 lines
Univariate analysis of effect
Summary of 654 records retained of 654 read
Warning: Fewer levels found in SNP_1 than specified
Warning: Fewer levels found in SNP_101 than specified
Warning: Fewer levels found in SNP_104 than specified
Warning: Fewer levels found in SNP_11 than specified
Warning: Fewer levels found in SNP_112 than specified
Forming 3 equations: 3 dense.
Initial updates will be shrunk by factor 0.316
1 LogL= 606.497 S2= 0.56038E-01 651 df
2 LogL= 606.497 S2= 0.56038E-01 651 df
- - - Results from analysis of effect - - -
LogL: 606.50 0.560380E-01 651 2 SNP_114 "LogL Converged"
Local CYCLE LogL Peak at CYCLE: 18 SNP_114 LogL: 606.50 Deviance: 13.94
Akaike Information Criterion -1210.99 (assuming 1 parameters).
Bayesian Information Criterion -1206.51
Model_Term Gamma Sigma Sigma/SE % C
Residual SCA_V 654 1.00000 0.560380E-01 18.04 0 P
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
21 mu 1 651.0 0.83 0.363
20 SNP_114 2 651.0 6.08 0.002
Best LogL 606.50 0.560380E-01 651 2 SNP_114 LogL Converged
Finished: 19 Oct 2018 17:04:26.403 LogL Converged
The results show that SNP_114 (term 20 in the Wald table, P = 0.002) is the most significant, and SNP_109 (term 14, P = 0.006) is also significant.
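The per-marker test can also be sketched outside ASReml. With only mu and one marker factor in the model, the incremental Wald F should correspond to the ordinary one-way ANOVA F; treat that equivalence, the function name one_way_f, and the rounding of the sample values as assumptions of this sketch rather than as ASReml's own computation. The data are the 13 sample rows shown earlier (effect rounded to three decimals, SNP_1000 column).

```python
def one_way_f(values, genotypes):
    """Return (F, df_between, df_within) for a one-way ANOVA of values by genotype class."""
    if len(values) != len(genotypes):
        raise ValueError("values and genotypes must have the same length")
    groups = {}
    for y, g in zip(values, genotypes):
        groups.setdefault(g, []).append(y)
    n, k = len(values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two genotype classes and n > number of classes")
    grand_mean = sum(values) / n
    ss_between = 0.0
    ss_within = 0.0
    for ys in groups.values():
        mean = sum(ys) / len(ys)
        ss_between += len(ys) * (mean - grand_mean) ** 2
        ss_within += sum((y - mean) ** 2 for y in ys)
    if ss_within == 0.0:
        raise ValueError("no within-class variation; F is undefined")
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# First 13 sample rows: effect (rounded) and the SNP_1000 column.
effect = [-0.260, 0.118, 0.004, 0.345, 0.376, 0.132, 0.413,
          0.354, 0.237, -0.316, -0.236, 0.078, 0.092]
snp_1000 = [1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1]
print(one_way_f(effect, snp_1000))
```

A marker with three observed genotype classes gives 2 numerator degrees of freedom and n - 3 denominator degrees of freedom, which matches the NumDF 2 and DenDF 651 reported for SNP_109 with 654 records.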
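We can also fit the marker as a random factor and use the log-likelihood to evaluate the model. If it is better than the null model (by a likelihood ratio test, LRT), the marker effect is significant.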
!cycle SNP_1 SNP_100 SNP_1000 SNP_101 SNP_102 SNP_103 SNP_104 SNP_105 SNP_106 SNP_107 SNP_108 SNP_109 SNP_11 SNP_110 SNP_111 SNP_112 SNP_113 SNP_114
dd.csv !SKIP 1
effect ~ mu !r $I
Results:
LogL: LogL Residual NEDF NIT Cycle Text
LogL: 607.75 0.564653E-01 653 6 SNP_1 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_100 "LogL Converged"
LogL: 606.11 0.569091E-01 653 7 SNP_1000 "LogL Converged"
LogL: 606.37 0.567870E-01 653 4 SNP_101 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_102 "LogL Converged"
LogL: 606.21 0.568392E-01 653 5 SNP_103 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_104 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_105 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_106 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_107 "LogL Converged"
LogL: 606.57 0.567311E-01 653 4 SNP_108 "LogL Converged"
LogL: 609.22 0.561598E-01 653 3 SNP_109 "LogL Converged"
LogL: 606.12 0.568872E-01 653 5 SNP_11 "LogL Converged"
Local CYCLE LogL Peak at CYCLE: 12 SNP_109 LogL: 609.22 Deviance: 6.22
LogL: 606.16 0.568635E-01 653 4 SNP_110 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_111 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_112 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 SNP_113 "LogL Converged"
LogL: 608.14 0.560577E-01 653 8 SNP_114 "LogL Converged"
Local CYCLE LogL Peak at CYCLE: 18 SNP_114 LogL: 608.14 Deviance: 4.06
The result is the same: the Local CYCLE peaks are at SNP_109 (deviance 6.22) and SNP_114 (deviance 4.06), indicating that these two SNPs reach significance.
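The LRT mentioned above can be worked through by hand. The sketch below assumes the null LogL of 606.105 noted in the comment of the multi-marker job file, and it assumes the usual convention for testing a single variance component on the boundary, where the p-value is half the upper tail of a chi-square with 1 degree of freedom. The function name is hypothetical, and this is not something ASReml prints directly.

```python
import math


def lrt_variance_component(logl_full, logl_null):
    """LRT statistic and p-value for one variance component (0.5*chi2_0 + 0.5*chi2_1 mixture)."""
    stat = max(0.0, 2.0 * (logl_full - logl_null))
    if stat == 0.0:
        return stat, 1.0
    # For 1 df, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_chi2_1 = math.erfc(math.sqrt(stat / 2.0))
    return stat, 0.5 * p_chi2_1


NULL_LOGL = 606.105  # LogL of "effect ~ mu", taken from the job-file comment
for name, logl in [("SNP_109", 609.22), ("SNP_114", 608.14)]:
    stat, p = lrt_variance_component(logl, NULL_LOGL)
    print(name, round(stat, 2), round(p, 4))
```

Both REML log-likelihoods come from models with the same fixed part (mu only), which is what makes the comparison valid.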
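Another way of writing this handles the case of many markers: instead of naming every marker in !cycle, use !G N, where N is the number of markers. The drawback is that the SNP names are lost.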
ID !A # ID_101
effect # 0.344906212015101
Marks !G 18
# !cycle SNP_1 SNP_100 SNP_1000 SNP_101 SNP_102 SNP_103 SNP_104 SNP_105 SNP_106 SNP_107 SNP_108 SNP_109 SNP_11 SNP_110 SNP_111 SNP_112 SNP_113 SNP_114
dd.csv !SKIP 1
!cycle 1:18
effect ~ mu !r Marks[$I]
Results:
LogL: LogL Residual NEDF NIT Cycle Text
LogL: 607.75 0.564653E-01 653 6 1 "LogL Converged"
LogL: 606.10 0.569091E-01 653 6 2 "LogL Converged"
LogL: 606.11 0.569091E-01 653 7 3 "LogL Converged"
LogL: 606.37 0.567870E-01 653 4 4 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 5 "LogL Converged"
LogL: 606.39 0.567814E-01 653 4 6 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 7 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 8 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 9 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 10 "LogL Converged"
LogL: 606.53 0.567416E-01 653 4 11 "LogL Converged"
LogL: 607.88 0.564391E-01 653 5 12 "LogL Converged"
LogL: 606.12 0.568872E-01 653 5 13 "LogL Converged"
Local CYCLE LogL Peak at CYCLE: 12 12 LogL: 607.88 Deviance: 3.55
LogL: 606.11 0.569077E-01 653 5 14 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 15 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 16 "LogL Converged"
LogL: 606.11 0.569091E-01 653 6 17 "LogL Converged"
LogL: 607.67 0.564839E-01 653 5 18 "LogL Converged"
Local CYCLE LogL Peak at CYCLE: 18 18 LogL: 607.67 Deviance: 3.12
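Check the BLUP values in the .sln file and sort them in Excel; two markers have noticeably larger effects:
If the map position of each marker is available, the results can be plotted.
2, Multi-marker analysis
As the name suggests, all markers (Marks) are fitted together in one analysis.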
ID !A # ID_101
effect # 0.344906212015101
Marks !G 18
# !cycle SNP_1 SNP_100 SNP_1000 SNP_101 SNP_102 SNP_103 SNP_104 SNP_105 SNP_106 SNP_107 SNP_108 SNP_109 SNP_11 SNP_110 SNP_111 SNP_112 SNP_113 SNP_114
dd.csv !SKIP 1
# !cycle 1:18
# effect ~ mu !r Marks[$I]
# effect ~ mu # LogL= 606.105
effect ~ mu !r Marks
Results:
8 LogL= 607.362 S2= 0.55772E-01 653 df 0.1377E-01
Final parameter values 0.1378E-01
- - - Results from analysis of effect - - -
Akaike Information Criterion -1210.72 (assuming 2 parameters).
Bayesian Information Criterion -1201.76
Approximate stratum variance decomposition
Stratum Degrees-Freedom Variance Component Coefficients
Marks 17.30 0.965402E-01 53.0 1.0
Residual Variance 635.70 0.557723E-01 0.0 1.0
Model_Term Gamma Sigma Sigma/SE % C
Marks IDV_V 18 0.137842E-01 0.768776E-03 1.24 0 P
Residual SCA_V 654 1.00000 0.557723E-01 17.83 0 P
Wald F statistics
Source of Variation NumDF DenDF F-inc P-inc
4 mu 1 73.6 1.52 0.222
Notice: The DenDF values are calculated ignoring fixed/boundary/singular
variance parameters using algebraic derivatives.
Solution Standard Error T-value T-prev
4 mu
1 -0.167809E-01 0.136271E-01 -1.23
3 Marks 18 effects fitted
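The LogL of the null model is 606.105 and that of the Marks model is 607.362, a slight improvement (2 × (607.362 - 606.105) ≈ 2.51).
Check the BLUP values in the .sln file
3, Genomic selection
Theory
GBLUP is based on the formula:
M is an n × m matrix, where n is the number of individuals and m is the number of markers, and g holds the BLUP of each marker. As the number of markers grows, m >> n arises and the algorithm has to be adjusted. The form in common use now is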
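As a sketch of the two model forms described here, written in standard mixed-model notation: only M and g come from the text, while y, X, \beta, e, u, the variance symbols and the scaling constant s are assumptions of this sketch.

```latex
% Marker-effect form: one random effect per marker (m effects)
y = X\beta + Mg + e, \qquad g \sim N(0,\ \sigma_g^2 I_m), \qquad e \sim N(0,\ \sigma_e^2 I_n)

% Equivalent individual-effect form: one random effect per individual (n effects),
% obtained by setting u = Mg
y = X\beta + u + e, \qquad u \sim N(0,\ \sigma_u^2 G), \qquad
G = \frac{MM^{\top}}{s}, \qquad \sigma_u^2 = s\,\sigma_g^2
```

The second form follows from the first because Var(Mg) = \sigma_g^2 MM^{\top}; it needs only an n × n matrix, which is why it is preferred when m >> n.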
If the G matrix has already been computed, GBLUP can be estimated in ASReml with the following code:
!work 12 !ARG 1
QTL ANALYSIS
id !P
SEX !A
AGE !A
HEIGHT !M -9999
idbgrm.ped !mark !alpha
ibdgrm.grm !ND !dense
ibdgrm.dat
HEIGHT ~ mu SEX !R nrm(id) grm1(id)
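The grm file holds the lower triangle of a dense matrix
The fixed factors are age and sex (the model line as written fits only SEX)
The random factors are the additive effect and the genomic random effect
When fitting GBLUP, ASReml also reports the marker effects; they are written to the .mef file.
For a related R package, see wgaim
The next section introduces an extension of GS: Fast Bayes A.
4, Other methods for genomic selection
An EM BayesA-like method, developed following Sun et al. (2012).
Marker matrices are usually coded as 0, 1, 2:
0: homozygous for the major allele, e.g. AA
1: heterozygous, e.g. Aa
2: homozygous for the minor allele, e.g. aa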
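The 0/1/2 coding above amounts to counting copies of the minor allele. A minimal sketch, assuming biallelic calls given as two-character strings such as "AA" or "AG" with no missing calls (the function name and input format are hypothetical):

```python
from collections import Counter


def encode_marker(calls):
    """Code biallelic genotype calls as minor-allele counts: 0, 1 or 2."""
    if any(len(call) != 2 for call in calls):
        raise ValueError("each call must have exactly two alleles, e.g. 'AG'")
    allele_counts = Counter(allele for call in calls for allele in call)
    if len(allele_counts) > 2:
        raise ValueError("marker is not biallelic")
    # The most frequent allele is taken as the major allele
    # (on a tie, the allele seen first wins).
    major = allele_counts.most_common(1)[0][0]
    return [sum(1 for allele in call if allele != major) for call in calls]


print(encode_marker(["AA", "AG", "GG", "AA"]))
```

Here A is the major allele (5 copies against 3), so AA is coded 0, AG is coded 1 and GG is coded 2.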
The matrix is constructed with the formula:
Parameters:
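One widely used way to build a genomic relationship matrix from 0/1/2 codes is VanRaden's first method: centre each marker column by twice its allele frequency and divide ZZ' by 2Σp(1-p). Whether the tutorial uses exactly this centring and scaling is an assumption; the sketch below only illustrates that construction in plain Python on a toy matrix.

```python
def vanraden_g(genotypes):
    """Genomic relationship matrix G = ZZ' / (2 * sum(p * (1 - p))) from 0/1/2 codes.

    genotypes: list of rows, one row per individual, one column per marker.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    if any(len(row) != m for row in genotypes):
        raise ValueError("all individuals must have the same number of markers")
    # Frequency of the counted allele at each marker.
    freqs = [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(m)]
    scale = 2.0 * sum(p * (1.0 - p) for p in freqs)
    if scale == 0.0:
        raise ValueError("all markers are monomorphic; G is undefined")
    # Centre each marker column by twice its allele frequency.
    z = [[row[j] - 2.0 * freqs[j] for j in range(m)] for row in genotypes]
    return [[sum(z[i][k] * z[l][k] for k in range(m)) / scale for l in range(n)]
            for i in range(n)]


toy = [[0, 1, 2],
       [1, 1, 0],
       [2, 0, 1]]
for row in vanraden_g(toy):
    print([round(v, 3) for v in row])
```

Because the columns of Z are centred on the same sample, every row of G sums to zero, which is a quick sanity check on the construction.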
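Bayes A assumes that the trait is controlled by major-effect QTL, with a few QTL explaining most of the variation, rather than assuming, as GBLUP does, that every marker effect comes from a normal distribution with the same variance.
Fast Bayes A:
Implementing the Bayes B approach in ASReml:
Marker file format:
The file is named *.mkr
The first column holds the genotype IDs
The first row holds the SNP IDs
The .mkr file must not contain missing values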
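Since the marker file must be complete, it can help to check it before running ASReml. The sketch below is a hypothetical helper, not an ASReml tool. It assumes a whitespace- or comma-separated file whose header row starts with a label for the ID column, and it assumes the only valid codes are 0, 1 and 2; adjust header_has_id_label and allowed if the file differs.

```python
def check_marker_lines(lines, allowed=("0", "1", "2"), header_has_id_label=True):
    """Return a list of problems found in marker-file lines (header + one row per genotype)."""
    rows = [line.replace(",", " ").split() for line in lines if line.strip()]
    if not rows:
        return ["file is empty"]
    n_snps = len(rows[0]) - (1 if header_has_id_label else 0)
    problems = []
    for row in rows[1:]:
        geno_id, codes = row[0], row[1:]
        # An empty field collapses when splitting, so it shows up as a short row.
        if len(codes) != n_snps:
            problems.append(f"{geno_id}: expected {n_snps} marker values, found {len(codes)}")
        bad = sorted({c for c in codes if c not in allowed})
        if bad:
            problems.append(f"{geno_id}: unexpected codes {bad}")
    return problems


example = [
    "ID SNP_1 SNP_2",
    "ID_1 0 1",
    "ID_2 2 NA",   # missing value written as NA
    "ID_3 1",      # one value missing entirely
]
for problem in check_marker_lines(example):
    print(problem)
```

To check a real file, pass an open file object, for example `check_marker_lines(open("genotype.mkr"))`, where the file name is only a placeholder.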
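Qualifiers for the marker file; they take effect only when placed on the same line as the marker file name
filename.mkr
!markers m # number of markers (optional)
!IDS n # number of individuals (optional)
!FBA k # controls whether ASReml uses GBLUP (when omitted: GBLUP, equal marker variances, k=0). In Fast BayesA, k is the parameter of the inverse chi-square distribution assumed for the marker variances. If !FBA is used, the default is k=4. In general k should be greater than 3 and less than 20. When !FBA is present, ASReml by default uses !EXTRA 5 to read the .mef file as starting values.
!FBB p # p is a percentage: the proportion of markers whose variance component is set to 0 (so their effects are also 0); this defines BayesB
!HEADER 0 # the marker file has no header row
!SKIP c # number of rows to skip
!CSKIP # number of columns to skip; !SKIP -1 means the first column is a SNP rather than an ID
The following qualifiers are rarely used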
!OFFSET o
!CENTER
!SAVEGIV g
!PENALTY d
!DFOFFSET t
!MSCALE s
!PEV
Weighted G matrix
Standard GBLUP commands
!workspace 1
title: standard GBLUP model
ID *
phenotype
genotype.mrk !markers 10031 !IDS 3226 # 10031 SNPs and 3225 IDs in the marker file (note that !IDS is given as 3226)
phenotype.txt !skip 1 !maxit 50 !gdense # use a dense matrix
phenotype ~ mu !r grm1(ID)
residual units
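Notes on the results
The GBLUP values of the genotyped individuals are in the .sln file
If 1000 individuals have phenotypes and the marker file has 1500 IDs, the .sln file will also have 1500 entries; the other 500 are GBLUP predictions (these individuals have no phenotype and are predicted from genotype alone)
Marker effects are in the .mef file; if !PEV follows the marker file name, the .mef file also contains standard errors
Fast Bayes A commands
Often we are interested in markers with large effects, such as QTL. But GBLUP estimates are shrunken estimators: the effect of a QTL is absorbed by the surrounding markers, which makes large-effect markers hard to detect.
The Bayes A model can identify a small number of large-effect markers, and the Fast Bayes-A like method here behaves similarly. For some traits, Fast Bayes-A predicts better than GBLUP.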
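The shrinkage mentioned above can be seen in the simplest case of a single marker fitted as a random regression. With lam = residual variance / marker variance, the BLUP of the marker effect is the least-squares slope with lam added to the denominator. This sketch shows only that shrinkage, not the sharing of a QTL effect among neighbouring markers and not the Fast BayesA reweighting itself; the function name and toy data are hypothetical.

```python
def ols_and_shrunk_effect(x, y, lam):
    """Least-squares slope and its shrunken (ridge/BLUP-style) counterpart for one marker.

    lam is the ratio residual variance / marker-effect variance; lam = 0 gives no shrinkage.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if lam < 0:
        raise ValueError("lam must be non-negative")
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Centring x and y absorbs the overall mean (mu) from the model.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0.0:
        raise ValueError("marker is monomorphic")
    return sxy / sxx, sxy / (sxx + lam)


codes = [0, 1, 2, 0, 1, 2]
phenotype = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
for lam in (0.0, 4.0, 40.0):
    print(lam, ols_and_shrunk_effect(codes, phenotype, lam))
```

The larger lam is, that is, the smaller the variance allowed for the marker, the more the estimate is pulled toward zero. Giving large-effect markers their own larger variance, as the Bayes A family does, reduces lam for those markers and so shrinks them less.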
Adjusting the diagonal D
Standard Fast-BayesA commands
!workspace 1
title: Fast-BayesA model
ID !A
phenotype
genotype.mrk !markers 10031 !IDS 3226 !FBA 4.2 # 10031 SNPs and 3225 IDs in the marker file (note that !IDS is given as 3226); !FBA is set to 4.2
phenotype.txt !skip 1 !maxit 50
phenotype ~ mu !r grm1(ID) 0.808 !GF # the gamma for Vg is set to 0.808 and the variance component is held fixed
residual units
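Notes on the results
.mef contains the marker effects and the weights
.res contains the significant SNPs
Comparison of different k values, with Vg fixed versus estimated
Conclusions:
A k value of about 4 works well
Whether Vg is fixed or estimated makes little difference; by default it is estimated
5, Notes on using ASReml
Only one GRM file can be used; if there are several, convert them to giv files
In Fast Bayes models too, only one GRM can be used; use giv for any others
The order of the IDs must match the ID order of G; it is advisable to extract the IDs of G separately and define them with !L
!PEV gives standard errors for the markers, but the results are not reliable
GBLUP values of genotyped individuals are in .sln; marker effects and marker weights are in .mef; large-effect markers are in the .res file.
6, Accounting for significant GWAS and QTL loci in ASReml genomic selection
If large-effect SNPs have already been identified, they can be added to the model so that it uses the GWAS and QTL information and improves prediction accuracy.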
snp(ID, 954) snp(ID,4480)
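They can be fitted as either fixed or random factors.
Postscript
In GS, multi-trait models outperform single-trait models. ASReml has many powerful functions to draw on, and the future looks promising.
This article is shared from the WeChat official account 育种数据分析之放飞自我 (R-breeding).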
Source: oschina
Link: https://my.oschina.net/u/4592498/blog/4463801