- Open Access
A two-step multiple-marker strategy for genome-wide association studies
© Aschard et al; licensee BioMed Central Ltd. 2007
- Published: 18 December 2007
Genome-wide association studies raise study-design and analytical issues that are still being debated. One of them is how to reduce the number of markers to be genotyped without losing efficiency in identifying trait loci, which can lower the cost of studies and minimize the multiple testing problem. With this aim, we propose a two-step strategy based on two analytical methods suited to examining sets of markers rather than single markers: the local score, which screens the genome to select candidate regions in Step 1, and FBAT-LC, a multiple-marker family-based association test used to obtain significance levels of regions in Step 2. The performance of this strategy was evaluated on all replicates of the Genetic Analysis Workshop 15 Problem 3 simulated data, using the answers to that problem. Overall, seven of the nine generated trait loci were detected in at least 87% of the replicates using a framework designed to handle either association with the disease or association with the severity of disease. This multiple-marker strategy was compared to the single-marker approach. By considering regions instead of single markers, this strategy minimizes the multiple testing problem and the number of false-positive results.
- Candidate Region
- Single Marker
- Family Data
- Genetic Analysis Workshop
- Local Score
Genome-wide association studies with hundreds of thousands of markers (SNPs), made possible by new high-throughput genotyping technologies, raise many study-design and analytical issues, among which the multiple testing problem is central. Several strategies have been proposed to confront this problem, including one-stage and multiple-stage study designs, analytical approaches in one or multiple steps, and use of one or multiple data sets [1, 2]. The two-stage study design, which consists of genotyping many markers in an initial sample at a first stage and a subset of selected SNPs in another sample at a second stage, has often been chosen for cost reasons. However, it has been shown to have other advantages because it allows the number of analyzed markers to be decreased, thus minimizing the multiple testing problem, while maintaining adequate power. To further minimize the multiple testing problem, a number of methods have been proposed for the joint analysis of neighboring marker loci, including haplotype analysis and multiple regression-based methods. As a result, it appears relevant to select whole genomic regions rather than single markers at the first stage of genome-wide association studies but, to our knowledge, this has scarcely been considered until now. We propose a two-step strategy based on two new methods that can each examine sets of markers rather than single markers: the local score statistic, which can be used to select genomic regions based on a sequence of association signals at a first stage, and FBAT-LC (linear combination of family-based association tests), which allows testing for association with sets of markers in the selected regions at a second stage. Using sums, the local score statistic identifies accumulations of high statistics in a sequence.
In molecular biology, this method has been applied to the localization of hydrophobic domains in proteins and the identification of similar regions among two or more sequences. It was recently applied to association studies for the detection of significant local high-scoring segments from case-control data. The second method, FBAT-LC, is a new extension of FBAT for the joint analysis of multiple markers in family data that does not require haplotype reconstruction. Our goal was to assess the statistical performance of the proposed two-step multiple-marker strategy by analyzing the rheumatoid arthritis (RA) case-control and affected sib-pair (ASP) simulated data (Problem 3 of Genetic Analysis Workshop 15), using the set of 9187 SNPs distributed across the genome. Our aim was also to compare this multiple-marker strategy to the single-marker approach.
Two-step multiple-marker strategy
We propose a flexible multiple-marker analytical approach for genome-wide association studies made up of two steps. In the first step, the local score method is applied to case-control data in order to detect and rank candidate regions across the genome. It serves as a screening tool. In the second step, these candidate regions are tested for association with the studied phenotype in a sample of family data using FBAT-LC, and the p-values obtained are then corrected for multiple testing. The two steps are independent of each other and can be modified according to the type of data collected. We chose to make full use of Problem 3 data, which included both case-control and family data, thus guiding the choice of the test statistics suited to these data.
Step 1: Detecting candidate regions
The local score method used the Pearson chi-square statistic applied to the case-control genotypic contingency table for each marker to produce a sequence of scores $X = (X_i)_{1 \le i \le M}$, one per marker. The quantity

$$H = \max_{1 \le a \le b \le M} \sum_{i=a}^{b} X_i$$

defines the local score assigned to $X$. In practice, it corresponds to the value of the region with the maximal sum of scores $X_i$. Consequently, the variables $X_i$ must be negative on average; otherwise the best region would easily span the entire sequence. This definition is restricted to the highest-scoring region. The next high-scoring ones are potentially interesting as well because the data set may contain more than one trait locus (TL). We define the kth best region as the local score of the initial sequence disjoint from the preceding k - 1 best regions. In this case H(1) > ... > H(k) are the scores of the k first and distinct highest-scoring genomic regions. Advantages over single-marker strategies arise from the ability of this statistic to identify a set of candidate genomic regions that may contain genes involved in the disease.
The algorithm of the local score approach includes the following three procedures: i) producing the initial sequence $X$: we assign to each marker a statistic of association ($X_i$) corresponding, in our case, to the Pearson chi-square test of case-control marker genotype frequencies. A constraint of this strategy is that $X$ must be negative on average; this does not happen with positive statistics such as the Pearson chi-square, so a constant δ must be subtracted from the whole signal $X$. In this study, δ corresponds to the value of the statistic $X_i$ at the classical 5% level and we let $X' = X - \delta$; ii) identifying the highest-scoring region: a simple approach to get the local score from $X'$ consists of comparing the value of $\sum_{i=a}^{b} X'_i$ for all possible regions [a; b], excluding those regions spanning different chromosomes; iii) identifying the next high-scoring regions by using an iterative algorithm: find the highest-scoring region, remove it from $X'$, and apply the algorithm again until there are no more positive local scores in the sequence. At the end, the number of tests has been reduced from M markers to N candidate genomic regions ranked according to their local scores.
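The three procedures above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`best_segment`, `local_score_regions`) are hypothetical, the exhaustive comparison of all regions [a; b] is replaced by a linear-time maximal-sum scan (Kadane's algorithm), and "removing" a region is assumed to mean that later regions may not overlap or span it.

```python
from typing import List, Tuple

Region = Tuple[int, int, float]  # (first marker index, last marker index, local score)


def best_segment(scores: List[float], lo: int, hi: int) -> Region:
    """Maximal-sum contiguous segment of scores[lo:hi] (Kadane's scan)."""
    best = (lo, lo, float("-inf"))
    run_sum, run_start = 0.0, lo
    for i in range(lo, hi):
        if run_sum <= 0.0:  # a non-positive running sum never helps: restart here
            run_sum, run_start = 0.0, i
        run_sum += scores[i]
        if run_sum > best[2]:
            best = (run_start, i, run_sum)
    return best


def local_score_regions(chi2: List[float], chrom: List[int], delta: float) -> List[Region]:
    """Disjoint regions with positive local score, ranked best first."""
    shifted = [x - delta for x in chi2]  # X' = X - delta
    # One initial block per chromosome, so no region spans two chromosomes
    # (assumes markers are ordered by chromosome and position).
    blocks, start = [], 0
    for i in range(1, len(chrom) + 1):
        if i == len(chrom) or chrom[i] != chrom[start]:
            blocks.append((start, i))
            start = i
    regions: List[Region] = []
    while blocks:
        lo, hi = blocks.pop()
        a, b, h = best_segment(shifted, lo, hi)
        if h <= 0.0:
            continue  # no positive local score left in this block
        regions.append((a, b, h))
        # Remove [a, b] and keep searching on each side of it.
        if a > lo:
            blocks.append((lo, a))
        if b + 1 < hi:
            blocks.append((b + 1, hi))
    return sorted(regions, key=lambda r: -r[2])
```

Splitting each block around its best segment yields the same set of regions as repeatedly taking the global maximum over what remains, because the best segment of one block does not depend on the other blocks; sorting by score then gives the ranking used to pick candidate regions.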
Step 2: Testing candidate regions for association
The new FBAT extension proposed by Xu et al. was used to analyze the regions selected in Step 1 in the family data. This method allows multiple markers to be tested simultaneously without haplotype reconstruction, and provides significance levels. In brief, the FBAT-LC test is based on a linear combination of single-marker FBAT test statistics using data-driven weights, where marker weight derivation is based on the "conditional mean model". The FBAT test for each bi-allelic marker is carried out for only one allele. Under an additive model, this test does not depend on the selected allele. Finally, different corrections for multiple testing were compared for the p-values obtained for all candidate regions: no correction, Benjamini and Hochberg correction, and Bonferroni correction. A region was considered significant if the corrected p-value was less than 5%.
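As an illustration of the two corrections compared here, the following Python sketch computes Bonferroni and Benjamini and Hochberg adjusted p-values for a list of region-level p-values. The function names are hypothetical and the code does not reproduce the FBAT-LC test itself; it only assumes that one p-value per candidate region is already available.

```python
from typing import List


def bonferroni(pvals: List[float]) -> List[float]:
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals: List[float]) -> List[float]:
    """Benjamini and Hochberg step-up adjusted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by increasing p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant(adjusted: List[float], alpha: float = 0.05) -> List[bool]:
    """A region is declared significant if its corrected p-value is below alpha."""
    return [p < alpha for p in adjusted]
```

With four regions and p-values 0.01, 0.04, 0.03 and 0.005, for example, Bonferroni keeps two regions below the 5% threshold whereas Benjamini and Hochberg keeps all four, which illustrates why the two corrections can lead to different sensitivity and false-discovery rates.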
Performance of the multiple-marker two-step strategy and comparison with the single-marker approach
We assessed the ability of our strategy to reveal regions containing the trait loci by comparing the results obtained from the analysis of all Problem 3 case-control and family data replicates with the answers that were provided. Because the local score was applied to case-control data in Step 1 and FBAT-LC to ASP data in Step 2, we formed 50 replicates of association-study data sets, each set being made of two independent samples: one replicate of case-control data and one replicate of family data. Each case-control data replicate included 1500 cases (one case drawn at random from each ASP) and 2000 controls genotyped for the 9187 SNPs. Each family data replicate included 1500 ASPs genotyped for all SNPs belonging to the candidate regions selected in Step 1.
To evaluate the performance of our strategy, we first identified, in each replicate, the true positive and the true negative regions among those selected in Step 1, a region being defined as positive if it contained at least one of the two flanking markers of any hidden trait locus. We then derived the three following quantities: 1) sensitivity, which is the proportion of true-positive regions that were correctly identified by FBAT-LC test; 2) specificity, which is the proportion of true-negative regions that were correctly identified by FBAT-LC test; 3) the false-discovery rate (FDR), which is the proportion of false positives among the declared significant results. An average estimate and standard deviation of each of these three quantities were computed over the 50 replicates of family data. The estimates of these quantities were compared according to the correction applied to the FBAT-LC p-values. The average proportion of trait loci detected by our two-step approach over the 50 replicates of association-study data sets was also derived.
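The three quantities can be made concrete with a short Python sketch for a single replicate. The function name `performance` is hypothetical, and the handling of empty denominators (NaN for sensitivity and specificity, 0 for the FDR when nothing is declared significant) is an assumption, not something stated in the text.

```python
from typing import List, Tuple


def performance(is_true_region: List[bool],
                is_significant: List[bool]) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, FDR) for one replicate.

    is_true_region[i]  -- region i contains a marker flanking a trait locus
    is_significant[i]  -- region i was declared significant in Step 2
    """
    pairs = list(zip(is_true_region, is_significant))
    tp = sum(1 for t, s in pairs if t and s)          # true regions detected
    fn = sum(1 for t, s in pairs if t and not s)      # true regions missed
    fp = sum(1 for t, s in pairs if not t and s)      # false regions declared
    tn = sum(1 for t, s in pairs if not t and not s)  # false regions rejected

    nan = float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else nan
    specificity = tn / (tn + fp) if tn + fp else nan
    # Assumption: FDR is taken as 0 when no region is declared significant.
    fdr = fp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity, fdr
```

Averaging the returned values over replicates (skipping NaN entries) would give estimates of the kind reported here.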
We then conducted a two-stage single-marker analysis to be compared with our multiple-marker strategy. All 9187 genotyped SNPs were ranked according to the p-values associated with the Pearson chi-square test applied to the case-control genotypic contingency table. The M markers with the smallest p-values were selected for analysis in Step 2, where M was equal to the average number of markers belonging to the regions selected by the local score method over the 50 replicates. In Step 2, a single-marker FBAT was applied to each of the M selected markers and p-values were either left uncorrected or corrected using the Benjamini and Hochberg or Bonferroni corrections. To be comparable with the above definition of a true-positive region, true-positive markers among the M selected markers were those flanking each trait locus. Estimates of the same performance indicators, as defined above, were derived over the 50 replicates of association-study data sets.
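The Step 1 ranking of the single-marker analysis can be sketched in Python as follows. The function names are hypothetical; the p-value uses the closed form exp(-x/2) of the chi-square survival function with 2 degrees of freedom, which assumes that all three genotypes are observed at the marker and that both groups are non-empty.

```python
import math
from typing import List, Sequence


def genotype_chi2(cases: Sequence[int], controls: Sequence[int]) -> float:
    """Pearson chi-square for a 2 x 3 table of genotype counts.

    cases / controls hold the counts of the three genotypes in each group.
    """
    n_cases, n_controls = sum(cases), sum(controls)
    total = n_cases + n_controls
    stat = 0.0
    for g_case, g_ctrl in zip(cases, controls):
        column = g_case + g_ctrl
        if column == 0:
            continue  # unobserved genotype: contributes nothing (df would drop)
        for observed, row in ((g_case, n_cases), (g_ctrl, n_controls)):
            expected = row * column / total
            stat += (observed - expected) ** 2 / expected
    return stat


def chi2_pvalue_2df(stat: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


def top_markers(pvals: List[float], m: int) -> List[int]:
    """Indices of the m markers with the smallest p-values."""
    return sorted(range(len(pvals)), key=lambda i: pvals[i])[:m]
```

The indices returned by `top_markers` would then be the markers passed to the single-marker FBAT in Step 2.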
In Step 1, the local score method revealed an average of 381 regions (standard deviation (sd) = 7.79) with positive scores. These regions contained 472 SNPs on average (sd = 3.30). The distribution of the number of SNPs per region showed that Region 1 contained 38 markers on average, Regions 2 to 6 had more than 2 SNPs and up to 4 SNPs on average, the next 18 regions contained 2 SNPs, and the remaining ones had only 1 SNP.
[Table: Results for the first 10 regions in the first replicates. Columns: IDs of the 2 extreme markers bounding a region; Bonferroni corrected p-value; Trait loci in the region (e.g., DR, C, D). Table body not recovered.]
[Table: Comparison of multiple marker and single marker strategies. Columns: Correction for multiple testing (including Benjamini & Hochberg); % of all TL detected. Table body not recovered.]
[Table: Performance of the multiple-marker strategy in disease severity analysis. Columns: Correction for multiple testing (including Benjamini & Hochberg); % of severity loci detected. Table body not recovered.]
Overall, our results show that the present two-step strategy based on sets of markers provides significant evidence for all four loci affecting RA risk (DR, C, D, E), one QTL for IgM (F) and two loci influencing RA severity in almost all replicates, provided appropriate test statistics are used in Step 1 to compute the local scores. These regions were always detected at the first step for the five former loci and in 87% of replicates for the two severity loci. All of these regions were confirmed at least 94% of the time in Step 2. This shows the efficiency and flexibility of this overall strategy, which can use different test statistics within the same framework. However, loci involved in more complex interactions (A, B) were difficult to identify, which may be partly due to the relatively small importance of these interactions and/or weak linkage disequilibrium of these loci with the analyzed markers.
When the proposed multiple-marker strategy was compared to the single-marker approach, the two strategies showed similar power to detect RA loci and both had high sensitivity and specificity. However, while the FDR associated with the multiple-marker FBAT-LC decreased significantly when p-values were corrected for multiple testing, the FDR associated with the single-marker FBAT remained high. Previous simulations had shown that the local score statistic was more powerful than the single-marker approach in case-control data. The present findings may be partly due to the generated model in which several loci, especially those on chromosome 6, played an important role in the disease and were thus likely to be always detected. Thus, further comparisons of multiple- and single-marker-based methods in other data sets generated under different models appear warranted.
The results presented here were obtained for the first 50 selected regions with highest local scores in Step 1, which were followed up in Step 2. However, varying the number of selected regions from 10 regions to all regions with positive local scores had a small impact on sensitivity, specificity, and FDR as well as on proportion of TLs detected. This shows that the performance of the proposed strategy was already satisfactory even for a small number of selected regions. However, this may be at least partly due to the strong effect of most TLs on RA. Nevertheless, selecting a small number of regions (50, or as few as 10 regions) in Step 1 might be an appropriate strategy that can minimize the multiple testing problem, although disease models other than the one simulated here need to be explored before drawing a definite conclusion.
We used a two-stage analytical approach in which two different statistical methods were applied to two independent data sets. However, in the context of a two-stage design for genome-wide association studies, Skol et al. have shown that the joint analysis of the two steps was more efficient than the independent analysis of each step, this analysis being based on single-marker tests of marker allele frequencies in case-control data. Comparison of this latter strategy to the one proposed here would be worth conducting, but the framework of this comparison needs to be further defined.
Here we used the local score method as a simple screening tool in a two-stage design. However, this approach can also stand on its own in genome-wide association studies. The significance of local scores can be determined via extreme value theory. Indeed, under the null hypothesis (H0), the local score is known to follow the Gumbel distribution asymptotically. However, this asymptotic approximation is only valid under linkage equilibrium, which generally does not hold. A Monte-Carlo simulation-based version taking these dependencies into account has been implemented (available at http://stat.genopole.cnrs.fr/software/lhisa), but simulations notably increase the execution time. The simple use of the local score method to rank regions to be further tested in another data set, as proposed here, was fast to run: the overall two-step strategy took less than 10 minutes to analyze one sample of 1500 cases/2000 controls in Step 1 and one sample of 1500 affected sib pairs in Step 2. Finally, the proposed strategy is also flexible because it allows different types of data and different test statistics to be considered at each step. Use of a case-control sample in Step 1 might be preferred because it requires less cost and less time to collect data, and using a family-based method in Step 2 protects against population stratification.
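For orientation, the asymptotic Gumbel result mentioned above is usually written in the following form for a sequence of n independent, identically distributed scores $X'_i$ with negative mean and a positive probability of being positive. This is the classical textbook statement, given here as a sketch; the symbols $n$, $\lambda$ and $K$ are not defined in the present article, and the independence assumption is exactly what fails under linkage disequilibrium.

```latex
% Assumption: i.i.d. non-lattice scores X'_i with E[X'_i] < 0 and P(X'_i > 0) > 0.
% H_n is the local score of a sequence of length n.
\[
  P\!\left( H_n \le \frac{\ln n}{\lambda} + x \right)
  \;\xrightarrow[n \to \infty]{}\;
  \exp\!\left( -K e^{-\lambda x} \right),
\]
% where \lambda > 0 is the unique positive solution of
\[
  E\!\left[ e^{\lambda X'_i} \right] = 1 ,
\]
% and K > 0 is a constant that depends on the distribution of the scores.
```

Under this approximation, an observed local score $h$ would receive the approximate p-value $1 - \exp(-K n e^{-\lambda h})$.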
This work was supported by grants from Ministry of Research (ACI-IMPbio-03-2-621), Agence Nationale pour la Recherche (ANR 05-SEST-020-02/05-9-97), Institut National du Cancer (INCa-PL 016), the EU Framework Programme for Research (contract FP6-LSH-2004-5-018996/GABRIEL project), and Serono. We thank Grégory Nuel (statistique et génome) and Jérôme Wojcik. We also thank the scientific committee of GAW15 for having selected this paper for the Novel Methods Session as well as the persons responsible for GAW15 simulated data.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
- Skol AD, Scott LJ, Abecasis GR, Boehnke M: Joint analysis is more efficient than replication-based analysis for two-stage genome-wide association studies. Nat Genet. 2006, 38: 209-213. 10.1038/ng1706.
- Van Steen K, McQueen MB, Herbert A, Raby B, Lyon H, DeMeo DL, Murphy A, Su J, Datta S, Rosenow C, Christman M, Silverman EK, Laird NM, Weiss ST, Lange C: Genomic screening and replication using the same data set in family-based association testing. Nat Genet. 2005, 37: 683-691. 10.1038/ng1582.
- Wang H, Thomas DC, Pe'er I, Stram DO: Optimal two-stage genotyping designs for genome-wide association scans. Genet Epidemiol. 2006, 30: 356-368. 10.1002/gepi.20150.
- Cordell HJ, Clayton DG: Genetic association studies. Lancet. 2005, 366: 1121-1131. 10.1016/S0140-6736(05)67424-7.
- Xu X, Rakovski C, Xu X, Laird N: An efficient family-based association test using multiple markers. Genet Epidemiol. 2006, 30: 620-626. 10.1002/gepi.20174.
- Karlin S: Statistical signals in bioinformatics. Proc Natl Acad Sci USA. 2005, 102: 13355-13362. 10.1073/pnas.0501804102.
- Guedj M, Robelin D, Hoebeke M, Lamarine M, Wojcik J, Nuel G: Detecting local-high scoring segments: a first-stage approach for genome-wide association studies. Stat Appl Genet Mol Biol. 2006, 5: Article 22.
- Lange C, DeMeo D, Silverman EK, Weiss ST, Laird NM: Using the noninformative families in family-based association tests: a powerful new testing strategy. Am J Hum Genet. 2003, 73: 801-811. 10.1086/378591.
- McGinnis R, Shifman S, Darvasi A: Power and efficiency of the TDT and case-control design for association scans. Behav Genet. 2002, 32: 135-144. 10.1023/A:1015205924326.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.