- Open Access
A genome-wide association scan for rheumatoid arthritis data by Hotelling's T2 tests
BMC Proceedings volume 3, Article number: S6 (2009)
We performed a genome-wide association scan on the North American Rheumatoid Arthritis Consortium (NARAC) data using Hotelling's T2 tests, i.e., T H based on allele coding and T G based on genotype coding. The objective was to identify associations between single-nucleotide polymorphisms (SNPs) or markers and rheumatoid arthritis. In specific candidate gene regions, we evaluated the performance of Hotelling's T2 tests. Then Hotelling's T2 tests were used as a tool to identify new regions that contain SNPs showing strong associations with disease. As expected, the strongest association evidence was found in the region of the HLA-DRB1 locus on chromosome 6. In the region of the TRAF1-C5 genes, we identified two SNPs, rs2900180 and rs3761847, with the largest and the second largest T H and T G scores among all SNPs on chromosome 9. We also identified one SNP, rs2476601, in the region of the PTPN22 gene that had the largest T H score and the second largest T G score among all SNPs on chromosome 1. In addition, SNPs with the largest T H score on each chromosome were identified. These SNPs may be located in the regions of genes that have modest effects on rheumatoid arthritis. These regions deserve further investigation.
Rheumatoid arthritis (RA) is the most common inflammatory joint disease and has an autoimmune etiology. The exact cause of RA is still unknown, but it is well known that RA has a strong genetic component [1]. The HLA-DRB1 locus has been clearly demonstrated to be associated with RA [2–4]. Other candidate genes, such as PTPN22 and TRAF1-C5, which confer a modest level of risk of RA, have also been identified recently [5, 6]. We conducted a genome-wide association analysis of the data of the North American Rheumatoid Arthritis Consortium (NARAC). The objective of this analysis was to identify associations between single-nucleotide polymorphisms (SNPs) or markers and RA. In specific candidate gene regions, we evaluated the performance of Hotelling's T2 tests on known associations. We then used Hotelling's T2 tests to identify additional SNPs that showed strong association with RA. These SNPs are located in regions that are very likely related to the disease and deserve further investigation.
We used the Hotelling's T2 test developed by Fan and Knapp [7] and Xiong et al. [8] to analyze the NARAC data. Consider a case-control design with N cases from an affected population and M controls from an unaffected population. When analyzing SNPs, we study bi-allelic markers with two alleles, denoted by 1 and 2, which can form three genotypes: 1/1, 1/2, and 2/2. A coding vector can then be defined for each case or control by either i) genotype coding or ii) allele coding. Let X i and Y j denote the coding vector for the ith case and the jth control, respectively. In our study, X i = (1,0)τ for genotype 1/1, X i = (0,1)τ for genotype 1/2, and X i = (0,0)τ for genotype 2/2 were used in the genotype coding, whereas the allele coding simply counts the number of copies of allele 1 in a genotype. If multiple markers are available, the coding vectors of each case or control can be combined. For instance, the allele coding vector of a case or control for n SNPs is an n-dimensional vector, and the genotype coding vector for n SNPs is 2n-dimensional. For multi-allelic markers, the coding method is described by Fan and Knapp [7]. Let us define a pooled-sample variance-covariance matrix by

$$S = \frac{1}{N+M-2}\left[\sum_{i=1}^{N}(X_i-\bar{X})(X_i-\bar{X})^{\tau} + \sum_{j=1}^{M}(Y_j-\bar{Y})(Y_j-\bar{Y})^{\tau}\right],$$
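where $\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$ and $\bar{Y} = \frac{1}{M}\sum_{j=1}^{M} Y_j$ are the mean vectors of cases and controls, respectively. With S denoting the pooled-sample variance-covariance matrix, the Hotelling's T2 test statistic [9] is defined as

$$T^2 = \frac{NM}{N+M}\,(\bar{X}-\bar{Y})^{\tau}\, S^{-1}\, (\bar{X}-\bar{Y}).$$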
In the following, we will denote the Hotelling's T2 for allele coding as T H and the Hotelling's T2 for genotype coding as T G . Assume the sample sizes N and M are large enough so that the large sample theory applies. Under the null hypothesis of no association, the statistic T H (or T G ) is asymptotically distributed as a central chi-square χ2 statistic with n (or 2n) degree(s) of freedom if n SNPs are used in the analysis. Under the alternative hypothesis of association, T H (or T G ) is asymptotically distributed as a non-central chi-square χ2 statistic [7, 8, 10].
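For a single SNP (n = 1), both statistics can be computed by hand from the standard two-sample form of Hotelling's T2 with a pooled covariance matrix. The sketch below is only an illustration and is not the authors' SAS macro: the function names `t_allele` and `t_genotype`, the genotype strings, and the toy counts are assumptions. The genotype coding uses (1,0) for 1/1, (0,1) for 1/2, and (0,0) for 2/2; the allele coding counts copies of allele 1.

```python
# Illustrative single-SNP Hotelling's T2 for a case-control sample.
# Names and toy data are hypothetical; this is not hotel_cc.sas.

ALLELE_CODE = {"1/1": 2, "1/2": 1, "2/2": 0}  # copies of allele 1
GENOTYPE_CODE = {"1/1": (1, 0), "1/2": (0, 1), "2/2": (0, 0)}


def t_allele(cases, controls):
    """T_H for one SNP: scalar two-sample statistic with pooled variance."""
    x = [ALLELE_CODE[g] for g in cases]
    y = [ALLELE_CODE[g] for g in controls]
    n, m = len(x), len(y)
    xbar, ybar = sum(x) / n, sum(y) / m
    pooled = (sum((v - xbar) ** 2 for v in x)
              + sum((v - ybar) ** 2 for v in y)) / (n + m - 2)
    return n * m / (n + m) * (xbar - ybar) ** 2 / pooled


def t_genotype(cases, controls):
    """T_G for one SNP: 2-dimensional statistic with a 2x2 pooled covariance."""
    x = [GENOTYPE_CODE[g] for g in cases]
    y = [GENOTYPE_CODE[g] for g in controls]
    n, m = len(x), len(y)
    xbar = [sum(v[k] for v in x) / n for k in range(2)]
    ybar = [sum(v[k] for v in y) / m for k in range(2)]

    def cross(a, b):
        # Pooled cross-product term of the covariance for coordinates a, b.
        sx = sum((v[a] - xbar[a]) * (v[b] - xbar[b]) for v in x)
        sy = sum((v[a] - ybar[a]) * (v[b] - ybar[b]) for v in y)
        return (sx + sy) / (n + m - 2)

    s11, s12, s22 = cross(0, 0), cross(0, 1), cross(1, 1)
    d1, d2 = xbar[0] - ybar[0], xbar[1] - ybar[1]
    det = s11 * s22 - s12 ** 2
    # Quadratic form d' S^{-1} d written out for a symmetric 2x2 matrix.
    quad = (s22 * d1 ** 2 - 2 * s12 * d1 * d2 + s11 * d2 ** 2) / det
    return n * m / (n + m) * quad


cases = ["1/1"] * 30 + ["1/2"] * 50 + ["2/2"] * 20
controls = ["1/1"] * 15 + ["1/2"] * 45 + ["2/2"] * 40
print(t_allele(cases, controls), t_genotype(cases, controls))
```

Because the allele count is a linear function of the genotype indicators, T G is never smaller than T H on the same data, but it is referred to a chi-square distribution with 2 rather than 1 degree of freedom. The sketch does not guard against a singular covariance matrix, which occurs when a genotype class is absent from both groups.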
We have developed a SAS macro (hotel_cc.sas) that implements the Hotelling's T2 test statistics; it is available online [11].
First, we applied the Hotelling's test statistics in a genome-wide scan of the NARAC data, analyzing one SNP at a time. The NARAC data contained a total of 2062 individuals (868 cases and 1194 controls). The RA data of Genetic Analysis Workshop (GAW) 16 included 545,080 SNP-genotype fields from an Illumina 550k chip (22 autosomes, sex chromosomes, and mitochondria). Our analysis used data from the 22 autosomes only: we dropped all SNPs that were not on the autosomes, that had low call rates (less than 95%), or that were not in Hardy-Weinberg equilibrium in the controls (p-value < 10⁻⁵). After this filtering, 490,613 SNPs on 22 autosomes were used in our analysis. The strongest signal was found in the region of the HLA-DRB1 gene on chromosome 6 at location 32,654,524-32,686,031 bp. In Figure 1, Graphs I and II show the Hotelling's test scores for chromosome 6. Both T H and T G scores reached their highest values around 32.5 Mb, in the region of HLA-DRB1. Graphs III and IV show the results in the region of the HLA-DRB1 gene (the legend indicates the location of the gene). Most of the test scores in the region were highly significant.
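The per-SNP filtering rule can be sketched as follows. The thresholds (call rate of at least 95%, Hardy-Weinberg p-value of at least 10⁻⁵ in controls) come from the text; the use of a 1-df chi-square goodness-of-fit test for Hardy-Weinberg equilibrium, the function names, and the genotype-count inputs are assumptions, since the article does not state which test was applied.

```python
import math


def hwe_pvalue(n11, n12, n22):
    """Chi-square (1 df) goodness-of-fit p-value for Hardy-Weinberg equilibrium."""
    n = n11 + n12 + n22
    p = (2 * n11 + n12) / (2 * n)  # frequency of allele 1
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n11, n12, n22)
    if min(expected) == 0:  # monomorphic SNP: nothing to test
        return 1.0
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df


def passes_qc(n_called, n_total, control_counts,
              min_call_rate=0.95, hwe_alpha=1e-5):
    """Keep a SNP if its call rate is adequate and controls are not far from HWE."""
    call_rate = n_called / n_total
    return call_rate >= min_call_rate and hwe_pvalue(*control_counts) >= hwe_alpha


# Hypothetical SNP: 2000 of 2062 genotypes called, control counts (1/1, 1/2, 2/2).
print(passes_qc(2000, 2062, (300, 600, 294)))
```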
We present the six SNPs on chromosome 6 with the highest test scores in the left-hand part of Table 1. The most significant result was found at SNP rs2395175 (p-value = 9.25 × 10⁻¹⁴⁴). These SNPs are all located around the HLA-DRB1 gene. Both T H and T G reached their four highest scores at the same SNPs (rs2395175, rs660895, rs6910071, and rs2395163). T H reached its 5th highest score at SNP rs3763309 and its 6th highest at SNP rs3763312, whereas for T G the order of these two SNPs was reversed. The two SNPs with the 7th and 8th highest scores likewise switched order between T H and T G , and both tests reached their 9th to 13th highest scores at the same SNPs (data not shown). Thus, the region of the HLA-DRB1 gene contains multiple SNPs that are highly associated with RA. In addition, the p-values of the test T G were generally smaller than those of T H , i.e., the genotype coding test T G leads to more significant results than the allele coding test T H . This observation is consistent with the evidence for non-additivity of DRB1 effects [12].
It is well known that the HLA-DRB1 alleles are associated with RA [1, 2]. We performed an analysis in which the HLA-DRB1 alleles *0101, *0102, *0401, *0404, *0405, *0408, and *1001, which are components of the shared epitope, were treated as risk alleles, and the other alleles were collapsed into one class. Here we used the multi-allelic version of the Hotelling's T2 tests [7]. The test score for allele coding was T H = 650.81 with 7 degrees of freedom (p-value = 2.76 × 10⁻¹³⁶), and the test score for genotype coding was T G = 694.82 with 35 degrees of freedom (p-value = 1.36 × 10⁻¹²³). The results were consistent with those using individual SNPs above. On the basis of the individual SNP analysis, we performed a forward analysis of multiple SNPs. Using the most significant SNP, rs2395175, as baseline, we added one SNP at a time for an analysis of two SNPs. We found that each of three SNPs, rs660895, rs6910071, and rs3763312, contributed significant association in addition to the contribution of the base SNP rs2395175 (p-value < 0.01). The most significant two-SNP result was from rs2395175 and rs660895. Then, we added one SNP at a time to these two most significant SNPs; we found that each of the two SNPs rs6910071 and rs3763312 contributed significant additional association (p-value < 0.01). Finally, the four SNPs together were found to be significantly associated with RA (rs2395175, rs660895, rs6910071, and rs3763312; p-value < 0.01).
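The degrees of freedom quoted for the multi-allelic test are consistent with a marker of k = 8 allele classes (seven shared-epitope alleles plus one collapsed class). A short check, assuming that each coding drops one reference category as in the bi-allelic case, so that allele coding has k − 1 coordinates and genotype coding has k(k + 1)/2 − 1; the helper names are hypothetical:

```python
def allele_coding_df(k):
    """Coordinates of the allele coding vector for a marker with k alleles."""
    return k - 1


def genotype_coding_df(k):
    """Coordinates of the genotype coding vector: unordered genotypes minus one."""
    return k * (k + 1) // 2 - 1


print(allele_coding_df(8), genotype_coding_df(8))  # 7 35
print(allele_coding_df(2), genotype_coding_df(2))  # 1 2 (one SNP)
```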
Graphs V-VIII of Figure 1 show the results for chromosome 9 (the legend indicates the location of the TRAF1-C5 genes). Plenge et al. [6] found SNP rs3761847 at position 120,769,793 bp and SNP rs2900180 at position 120,785,936 bp to be significantly associated with RA in the region of the TRAF1-C5 genes. Our results are consistent with theirs: T H = 34.21 of SNP rs2900180 was the largest among all SNPs on chromosome 9 (p-value = 4.95 × 10⁻⁹), and T H = 32.17 of SNP rs3761847 was the second largest (p-value = 1.41 × 10⁻⁸). Other SNPs on chromosome 9 with the highest scores are reported on the right-hand side of Table 1. Interestingly, the SNPs identified via T H were the same as the ones identified via T G (the right-hand side of Table 1). As for the HLA-DRB1 region on chromosome 6, we performed a forward analysis of multiple SNPs. Using rs2900180 as baseline, we found no other SNP that contributed significant additional association (p-value > 0.05). Thus, the association signal in the region of the TRAF1-C5 genes appears to be captured by SNP rs2900180 alone.
In the region of the PTPN22 gene on chromosome 1, we identified one SNP (rs2476601) that was reported to be associated with RA by Begovich et al. [5]. The SNP is located at position 114,089,610 bp on the left-hand side of the PTPN22 gene. The T H = 48.88 of rs2476601 was the largest T H score among all SNPs on chromosome 1 (p-value = 2.72 × 10⁻¹²), and the T G = 49.99 of rs2476601 was the second largest T G score (p-value = 1.4 × 10⁻¹¹, data not shown). In this region, only SNP rs2476601 stood out; the other SNPs with the top 20 test scores are not located in the region. Hence, we did not analyze multiple SNPs.
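The single-SNP p-values can be checked against the chi-square null distribution, whose upper tail has a closed form for 1 degree of freedom (allele coding) and 2 degrees of freedom (genotype coding). A minimal sketch using only the Python standard library; the function names are illustrative:

```python
import math


def pvalue_1df(t):
    """Upper tail of a chi-square with 1 df: P(Z^2 > t) = erfc(sqrt(t / 2))."""
    return math.erfc(math.sqrt(t / 2))


def pvalue_2df(t):
    """Upper tail of a chi-square with 2 df: exp(-t / 2)."""
    return math.exp(-t / 2)


print(f"{pvalue_1df(48.88):.2e}")  # T H of rs2476601
print(f"{pvalue_2df(49.99):.2e}")  # T G of rs2476601
```

The two printed values should agree with the p-values quoted for rs2476601 to the precision given, which illustrates why a slightly larger T G score can still yield a larger p-value than T H : the extra degree of freedom shifts the reference distribution.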
From the results in the candidate regions on chromosomes 6, 9, and 1, we noticed that the highest test scores of T H and T G were from SNPs located very close to the candidate genes HLA-DRB1, TRAF1-C5, and PTPN22, respectively. Therefore, the SNPs with high test scores are of interest for further investigation to identify genes that have modest effect on RA. In Table 2, we presented the SNPs that showed the highest T H scores among all SNPs of each chromosome. We chose to present the results based on the test statistic T H , since it is more robust than T G in terms of more stable type I error rates . To make a comparison, we presented the most significant results from PLINK in Table 2. The SNPs identified by statistic T H are the same as those identified by PLINK, except rank switches on chromosomes 11 and 16. It is possible that other SNPs that have high test scores are worthy of further study. Due to the limited length of this article, we could not present detailed genome-wide test data here but we will provide detailed information on request.
The results of our genome-wide scan provided a large number of SNPs that have high test scores. One reason for this is the large sample size of NARAC data. For further study, one may start with the regions that contain the SNPs that have highest test scores, i.e., the regions with strongest signals. The Hotelling's T2 tests do not adjust for population substructures. Thus, some of the strong signals could be due to false positives. Further study is necessary to clarify these issues.
The Hotelling's T2 test does not include a multiplicity adjustment. However, we can perform a very conservative Bonferroni analysis (assuming independence of the tests) as follows. In the RA study, we analyzed 490,163 SNPs in total across the whole human genome. Therefore, there are 490,163 T H (or T G ) tests. For the most significant SNP (rs2900180), with the highest test scores on chromosome 9 on the right-hand side of Table 1, the p-value of T H = 34.21 is 4.95 × 10⁻⁹. After adjusting for multiple tests, the probability of obtaining such a result by chance is 4.95 × 10⁻⁹ × 490,163 = 0.0024. Hence, the result is still very significant. For the least significant SNP (rs10985073), the p-value of T H = 28.16 is 1.12 × 10⁻⁷. After adjusting for multiple tests, the probability of obtaining such a result by chance is 1.12 × 10⁻⁷ × 490,163 = 0.055, which is just above the 0.05 significance level. The rest of the results in Tables 1 and 2 can be analyzed similarly.
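The Bonferroni arithmetic above can be reproduced in a few lines. The function name is illustrative, and the number of tests is the figure quoted in this paragraph:

```python
def bonferroni(p, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_tests)


n_tests = 490_163
print(round(bonferroni(4.95e-9, n_tests), 4))  # rs2900180 -> 0.0024
print(round(bonferroni(1.12e-7, n_tests), 3))  # rs10985073 -> 0.055
print(0.05 / n_tests)                          # equivalent per-test threshold
```

The last line gives the per-test threshold implied by a genome-wide level of 0.05, roughly 1.02 × 10⁻⁷, which is why a raw p-value of 1.12 × 10⁻⁷ falls just short of significance after adjustment.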
We compared our results with those in the literature [2–6] and found them to be consistent. In addition, we analyzed the data using PLINK and found results similar to those of Table 1 and Table 2; partial results are presented in Table 2. Hence, our analyses of candidate regions and of the genome-wide scan showed that the Hotelling's tests performed well. Furthermore, we could jointly use multiple SNPs in the analysis, as we did for the data of chromosomes 6 and 9.
We performed a genome-wide association scan for RA data by applying Hotelling's T2 tests. In the candidate regions of the HLA-DRB1, TRAF1-C5, and PTPN22 genes, we identified SNPs that have the highest test scores across chromosomes 6, 9, and 1, respectively. Given the encouraging results in the candidate gene regions, the regions containing SNPs with high test scores are of interest for further investigation to map genes which have modest effects on RA. We provided the SNPs and their positions that had the largest scores for each chromosome. The regions of these SNPs deserve more investigation to map RA genes.
List of abbreviations used
GAW: Genetic Analysis Workshop; NARAC: North American Rheumatoid Arthritis Consortium; RA: Rheumatoid arthritis; SNP: Single-nucleotide polymorphism
Newton JL, Harney SM, Wordsworth BP, Brown MA: A review of the MHC genetics of rheumatoid arthritis. Genes Immun. 2004, 5: 151-157. 10.1038/sj.gene.6364045.
du Montcel ST, Michou L, Petit-Teixeira E, Osorio J, Lemaire I, Lasbleiz S, Pierlot C, Quillet P, Bardin P, Prum B, Cornelis F, Clerget-Darpoux F: New classification of HLA-DRB1 alleles supports the shared epitope hypothesis of rheumatoid arthritis susceptibility. Arthritis Rheum. 2005, 52: 1063-1068. 10.1002/art.20989.
Gregersen PK, Silver J, Winchester RJ: The shared epitope hypothesis. An approach to understanding the molecular genetics of susceptibility to rheumatoid arthritis. Arthritis Rheum. 1987, 30: 1205-1213. 10.1002/art.1780301102.
Huizinga TW, Amos CI, van der Helm-van Mil AH, Chen W, van Gaalen FA, Jawaheer D, Schreuder GM, Wener M, Breedveld FC, Ahmad N, Lum RF, de Vries RR, Gregersen PK, Toes RE, Criswell LA: Refining the complex rheumatoid arthritis phenotype based on specificity of the HLA-DRB1 shared epitope for antibodies to citrullinated proteins. Arthritis Rheum. 2005, 52: 3433-3438. 10.1002/art.21385.
Begovich AB, Carlton VEH, Honigberg LA, Schrodi SJ, Chokkalingam AP, Alexander HC, Ardlie KG, Huang Q, Smith AM, Spoerke JM, Conn MT, Chang M, Chang S-YP, Saiki RK, Catanese JJ, Leong DU, Garcia VE, McAllister LB, Jeffery DA, Lee AT, Batliwalla F, Remmers E, Criswell LA, Seldin MF, Kastner DL, Amos CI, Sninsky JJ, Gregersen PK: A missense SNP in the protein tyrosine phosphatase PTPN22 is associated with rheumatoid arthritis. Am J Hum Genet. 2004, 75: 330-337. 10.1086/422827.
Plenge RM, Seielstad M, Padyukov L, Lee AT, Remmers EF, Ding B, Liew A, Khalili H, Chandrasekaran A, Davies LRL, Li W, Tan AKS, Bonnard C, Ong RTH, Thalamuthu A, Pettersson S, Liu C, Tian C, Chen WV, Carulli JP, Beckman EM, Altshuler D, Alfredsson L, Criswell LA, Amos CI, Seldin MF, Kastner DL, Klareskog L, Gregersen PK: TRAF1-C5 as a risk locus for rheumatoid arthritis: a genomewide study. N Engl J Med. 2007, 357: 1199-1209. 10.1056/NEJMoa073491.
Fan RZ, Knapp M: Genome association studies of complex diseases by case-control designs. Am J Hum Genet. 2003, 72: 850-868. 10.1086/373966.
Xiong MM, Zhao J, Boerwinkle E: Generalized T2 test for genome association studies. Am J Hum Genet. 2002, 70: 1257-1268. 10.1086/340392.
Hotelling H: The generalization of Student's ratio. Ann Math Stat. 1931, 2: 360-378. 10.1214/aoms/1177732979.
Chapman NH, Wijsman EM: Genome screens using linkage disequilibrium tests: optimal marker characteristics and feasibility. Am J Hum Genet. 1998, 63: 1872-1885. 10.1086/302139.
Software for Qualitative Traits. [http://stat.tamu.edu/~rfan/software.html/]
Morgan AW, Haroon-Rashid L, Martin SG, Gooi HC, Worthington J, Thomson W, Barrett JH, Emery P: The shared epitope hypothesis in rheumatoid arthritis: evaluation of alternative classification criteria in a large UK Caucasian cohort. Arthritis Rheum. 2008, 58: 1275-1283. 10.1002/art.23432.
The Genetic Analysis Workshops are supported by NIH grant R01 GM031575 from the National Institute of General Medical Sciences. The research is supported by NIH grant AR 44422 (WVC, CIA, and RF) and CA133996. We thank Dr. M Knapp for the consultation on Hotelling's T2 tests and data quality.
This article has been published as part of BMC Proceedings Volume 3 Supplement 7, 2009: Genetic Analysis Workshop 16. The full contents of the supplement are available online at http://0-www.biomedcentral.com.brum.beds.ac.uk/1753-6561/3?issue=S7.
The authors declare that they have no competing interests.
CIA and RF conceived the main idea of the study. LC, MZ, and WVC performed the statistical analysis under the direction of CIA and RF. LC, CIA, and RF wrote the manuscript. MZ and WVC provided comments to improve the writing of the manuscript. All authors read and approved the final manuscript.
Chen, L., Zhong, M., Chen, W.V. et al. A genome-wide association scan for rheumatoid arthritis data by Hotelling's T2 tests. BMC Proc 3, S6 (2009) doi:10.1186/1753-6561-3-S7-S6