The NHGRI-EBI Catalog of published genome-wide association studies
A database of published GWAS studies, maintained by EBI together with NHGRI.
Catalog stats
- Last data release on 2019-09-24
- 4220 publications
- 107486 SNPs
- 157336 associations
- Genome assembly GRCh38.p12
- dbSNP Build 151
- Ensembl Build 96
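Basic search methods
Search by trait: for example, breast carcinoma returns well-standardized trait information expressed as EFO (Experimental Factor Ontology) terms. Like GO, EFO is a controlled vocabulary, in this case for classifying traits. The results also list the genes associated with the trait.
Search by SNP: for example, rs7329174 returns detailed information about the variant and the genes it maps to.
Search by author name: for example, Yao returns the matching publications.
Search by chromosomal location: for example, 2q37.1, a cytogenetic region.
Search by gene: for example, HBS1L.
Search by region: for example, 6:16000000-25000000.
Although it is called a database, it is essentially a single table that can be downloaded, and the file is only about 100 MB.
The table contains the following fields: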
DATE ADDED TO CATALOG* +: Date a study is published in the catalog
PUBMEDID* +: PubMed identification number
FIRST AUTHOR* +: Last name and initials of first author
DATE* +: Publication date (online (epub) date if available)
JOURNAL* +: Abbreviated journal name
LINK* +: PubMed URL
STUDY* +: Title of paper
DISEASE/TRAIT* +: Disease or trait examined in study
INITIAL SAMPLE DESCRIPTION* +: Sample size and ancestry description for stage 1 of GWAS (summing across multiple Stage 1 populations, if applicable)
REPLICATION SAMPLE DESCRIPTION* +: Sample size and ancestry description for subsequent replication(s) (summing across multiple populations, if applicable)
REGION*: Cytogenetic region associated with rs number
CHR_ID*: Chromosome number associated with rs number
CHR_POS*: Chromosomal position associated with rs number
REPORTED GENE(S)*: Gene(s) reported by author
MAPPED GENE(S)*: Gene(s) mapped to the strongest SNP. If the SNP is located within a gene, that gene is listed. If the SNP is intergenic, the upstream and downstream genes are listed, separated by a hyphen.
UPSTREAM_GENE_ID*: Entrez Gene ID for nearest upstream gene to rs number, if not within gene
DOWNSTREAM_GENE_ID*: Entrez Gene ID for nearest downstream gene to rs number, if not within gene
SNP_GENE_IDS*: Entrez Gene ID, if rs number within gene; multiple genes denotes overlapping transcripts
UPSTREAM_GENE_DISTANCE*: distance in kb for nearest upstream gene to rs number, if not within gene
DOWNSTREAM_GENE_DISTANCE*: distance in kb for nearest downstream gene to rs number, if not within gene
STRONGEST SNP-RISK ALLELE*: SNP(s) most strongly associated with trait + risk allele (? for unknown risk allele). May also refer to a haplotype.
SNPS*: Strongest SNP; if a haplotype it may include more than one rs number (multiple SNPs comprising the haplotype)
MERGED*: denotes whether the SNP has been merged into a subsequent rs record (0 = no; 1 = yes)
SNP_ID_CURRENT*: current rs number (will differ from strongest SNP when merged = 1)
CONTEXT*: SNP functional class
INTERGENIC*: denotes whether SNP is in intergenic region (0 = no; 1 = yes)
RISK ALLELE FREQUENCY*: Reported risk/effect allele frequency associated with strongest SNP in controls (if not available among all controls, among the control group with the largest sample size). If the associated locus is a haplotype the haplotype frequency will be extracted.
P-VALUE*: Reported p-value for strongest SNP risk allele (linked to dbGaP Association Browser). Note that p-values are rounded to 1 significant digit (for example, a published p-value of 4.8 x 10^-7 is rounded to 5 x 10^-7).
PVALUE_MLOG*: -log(p-value)
P-VALUE (TEXT)*: Information describing context of p-value (e.g. females, smokers).
OR or BETA*: Reported odds ratio or beta-coefficient associated with strongest SNP risk allele. Note that if an OR <1 is reported this is inverted, along with the reported allele, so that all ORs included in the Catalog are >1. Appropriate unit and increase/decrease are included for beta coefficients.
95% CI (TEXT)*: Reported 95% confidence interval associated with strongest SNP risk allele, along with unit in the case of beta-coefficients. If 95% CIs are not published, we estimate these using the standard error, where available.
PLATFORM (SNPS PASSING QC)*: Genotyping platform manufacturer used in Stage 1; also includes notation of pooled DNA study design or imputation of SNPs, where applicable
CNV*: Study of copy number variation (yes/no)
ASSOCIATION COUNT+: Number of associations identified for this study
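Because the download is a plain tab-separated table, the region search shown earlier can be reproduced offline. The following is a minimal sketch, not the Catalog's own tooling: the sample rows, rs numbers, and helper names (`load_associations`, `in_region`, `mlog_p`) are made up for illustration, and it assumes that PVALUE_MLOG is a base-10 logarithm and that CHR_POS may be empty or non-numeric for some rows.

```python
import csv
import io
import math

# Made-up sample rows using a few of the column names listed above.
# The rs numbers, traits, and p-values are illustrative, not real Catalog data.
SAMPLE_TSV = (
    "DISEASE/TRAIT\tCHR_ID\tCHR_POS\tSNPS\tP-VALUE\n"
    "trait A\t6\t20000000\trs0000001\t5E-9\n"
    "trait B\t6\t30000000\trs0000002\t3E-12\n"
    "trait C\t2\t17000000\trs0000003\t2E-6\n"
    "trait D\t6\t\trs0000004\t1E-10\n"
)


def load_associations(handle):
    """Read a tab-separated associations table into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def in_region(row, chrom, start, end):
    """True if the row's SNP lies on `chrom` between `start` and `end` (inclusive)."""
    try:
        pos = int(row["CHR_POS"])
    except (TypeError, ValueError):
        # Assumption: rows without a single numeric position are skipped.
        return False
    return row["CHR_ID"] == chrom and start <= pos <= end


def mlog_p(row):
    """-log10 of the reported p-value (assumed definition of PVALUE_MLOG)."""
    return -math.log10(float(row["P-VALUE"]))


rows = load_associations(io.StringIO(SAMPLE_TSV))
# Same query as the region search example: 6:16000000-25000000,
# additionally restricted to the conventional genome-wide threshold 5e-8.
hits = [
    r for r in rows
    if in_region(r, "6", 16_000_000, 25_000_000) and float(r["P-VALUE"]) < 5e-8
]
for r in hits:
    print(r["SNPS"], r["DISEASE/TRAIT"], round(mlog_p(r), 2))
```

For the real file, the `io.StringIO` handle would be replaced by an opened file object. Only the standard library is used, so the sketch has no dependencies.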
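Some questions:
What is genotyping technology?
What is an Experimental Factor Ontology trait?
What is a cytogenetic region? See karyotype.
What does "trait + risk allele" mean? The key is to distinguish a SNP from an allele: the SNP is the site, while an allele is the base observed at that site. Keep in mind the two DNA strands and ploidy.
What is the risk/effect allele frequency?
What does the odds ratio measure in a GWAS? From Wikipedia: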
The odds ratio is the ratio of two odds, which in the context of GWA studies are the odds of case for individuals having a specific allele and the odds of case for individuals who do not have that same allele.
As an example, suppose that there are two alleles, T and C. The number of individuals in the case group having allele T is represented by 'A' and the number of individuals in the control group having allele T is represented by 'B'. Similarly, the number of individuals in the case group having allele C is represented by 'X' and the number of individuals in the control group having allele C is represented by 'Y'. In this case the odds ratio for allele T is A:B (meaning 'A to B', in standard odds terminology) divided by X:Y, which in mathematical notation is simply (A/B)/(X/Y).
When the allele frequency in the case group is much higher than in the control group, the odds ratio is higher than 1, and vice versa for lower allele frequency. Additionally, a P-value for the significance of the odds ratio is typically calculated using a simple chi-squared test. Finding odds ratios that are significantly different from 1 is the objective of the GWA study because this shows that a SNP is associated with disease.[18]
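The definition above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the allele counts are invented, the function names are hypothetical, and the p-value uses the plain Pearson chi-squared test on the 2x2 allele table with one degree of freedom, which is one common reading of "a simple chi-squared test".

```python
import math


def allelic_odds_ratio(a, b, x, y):
    """Odds ratio for allele T: (A/B) / (X/Y).

    a, b: allele T counts in cases and controls
    x, y: allele C counts in cases and controls
    """
    return (a / b) / (x / y)


def chi2_test_2x2(a, b, x, y):
    """Pearson chi-squared statistic and p-value (1 degree of freedom)."""
    n = a + b + x + y
    chi2 = n * (a * y - b * x) ** 2 / ((a + b) * (x + y) * (a + x) * (b + y))
    # For 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def catalog_style_or(odds_ratio):
    """Invert an OR below 1 so it is reported as > 1, as the OR or BETA field describes.

    The reported allele would have to be swapped along with it.
    """
    return 1 / odds_ratio if odds_ratio < 1 else odds_ratio


# Invented counts: allele T seen 60 times in cases and 40 times in controls,
# allele C seen 40 times in cases and 60 times in controls.
odds_ratio = allelic_odds_ratio(60, 40, 40, 60)
chi2, p_value = chi2_test_2x2(60, 40, 40, 60)
print(odds_ratio, chi2, p_value)
```

With these counts allele T is enriched in cases, so the odds ratio is above 1. Swapping the two alleles gives the reciprocal, which is why the inversion rule loses no information as long as the allele is swapped too.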
What is MAF? The frequency of the minor allele.
What annotations can GWAS data carry? Phenotype annotation, plus population and linkage disequilibrium (LD) information.
What are CP loci? An effective region associated with at least two phenotypes.
What is genotype calling?
What are the most basic QC steps for GWAS?
Quality Control Procedures for Genome Wide Association Studies
Data quality control in genetic case-control association studies
- minor allele frequency (MAF) > 0.01; statistical power is extremely low for rare SNPs. This is intuitive: a very rare SNP needs a very large sample size to reach adequate power.
- Hardy-Weinberg equilibrium (HWE) test p-value > 5E-05;
- missing genotypes rate < 10%; Genotypes are classified as missing if the genotype-calling algorithm cannot infer the genotype with sufficient confidence. Can be calculated across each individual and/or SNP.
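The three filters above can be sketched for a single SNP as follows. This is an illustration only: genotypes are assumed to be coded as the count of one allele (0, 1, 2) with `None` for a missing call, all function names are hypothetical, and the HWE p-value uses a simple chi-squared goodness-of-fit test with one degree of freedom, whereas real pipelines often use an exact test instead.

```python
import math


def missing_rate(genotypes):
    """Fraction of samples with no genotype call (None)."""
    return sum(g is None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Frequency of the less common allele among called genotypes."""
    called = [g for g in genotypes if g is not None]
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


def hwe_chi2_pvalue(genotypes):
    """Chi-squared HWE p-value (1 df); a simplification of the exact test."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    p = sum(called) / (2 * n)  # frequency of the counted allele
    q = 1 - p
    expected = {0: n * q * q, 1: 2 * n * p * q, 2: n * p * p}
    chi2 = 0.0
    for genotype, exp in expected.items():
        if exp > 0:  # skip empty classes (monomorphic SNP)
            chi2 += (called.count(genotype) - exp) ** 2 / exp
    return math.erfc(math.sqrt(chi2 / 2))


def passes_snp_qc(genotypes, min_maf=0.01, min_hwe_p=5e-5, max_missing=0.10):
    """Apply the three thresholds from the list above to one SNP."""
    if all(g is None for g in genotypes):
        return False
    return (
        missing_rate(genotypes) < max_missing
        and minor_allele_frequency(genotypes) > min_maf
        and hwe_chi2_pvalue(genotypes) > min_hwe_p
    )


# Invented genotype vectors for three SNPs.
good_snp = [0] * 49 + [1] * 42 + [2] * 9 + [None] * 5   # in HWE, MAF 0.3
rare_snp = [0] * 199 + [1]                              # MAF 0.0025
no_hets_snp = [0] * 50 + [2] * 50                       # no heterozygotes at all
for name, snp in [("good", good_snp), ("rare", rare_snp), ("no_hets", no_hets_snp)]:
    print(name, passes_snp_qc(snp))
```

The same missingness calculation can be run per individual instead of per SNP by passing one sample's genotypes across all SNPs.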
What is the Experimental Factor Ontology?
What is LD information (r² and D′ values)?
Mathematical properties of the r² measure of linkage disequilibrium
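As a partial answer to the LD question, both measures can be computed from the four haplotype counts at two biallelic sites. This sketch uses the standard textbook definitions, with D = p_AB - p_A * p_B. The counts and the function name are invented for illustration, and D′ is returned as an absolute value, which is one common reporting convention.

```python
def ld_measures(n_ab_upper, n_a_upper_b_lower, n_a_lower_b_upper, n_ab_lower):
    """Return (D, |D'|, r^2) from haplotype counts AB, Ab, aB, ab."""
    total = n_ab_upper + n_a_upper_b_lower + n_a_lower_b_upper + n_ab_lower
    p_ab = n_ab_upper / total                       # frequency of haplotype AB
    p_a = (n_ab_upper + n_a_upper_b_lower) / total  # frequency of allele A
    p_b = (n_ab_upper + n_a_lower_b_upper) / total  # frequency of allele B
    if p_a in (0.0, 1.0) or p_b in (0.0, 1.0):
        raise ValueError("LD is undefined when either site is monomorphic")

    d = p_ab - p_a * p_b
    # D' rescales D by its maximum possible magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    # r^2 is the squared correlation between the alleles at the two sites.
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Invented counts: haplotypes AB=40, Ab=10, aB=10, ab=40.
d, d_prime, r2 = ld_measures(40, 10, 10, 40)
print(d, d_prime, r2)
```

When the two sites have different allele frequencies, r² cannot reach 1 even if D′ does, which is one reason the two values are usually reported together.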
To be continued.