The multiplicity problem in linkage analysis of gene expression data – the power of differentiating cis- and trans-acting regulators

Huang, Baisong; Rangrej, Jagadish; Paterson, Andrew D; Sun, Lei

doi:10.1186/1753-6561-1-S1-S142

Volume 1 Supplement 1

Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci

Proceedings
Open access
Published: 18 December 2007

The multiplicity problem in linkage analysis of gene expression data – the power of differentiating cis- and trans-acting regulators

Baisong Huang^1,2,
Jagadish Rangrej¹,
Andrew D Paterson^1,2 &
…
Lei Sun^1,2,3

BMC Proceedings volume 1, Article number: S142 (2007) Cite this article

854 Accesses
7 Citations
Metrics details

Abstract

In this report, we focused on the multiplicity issue in Problem 1 of Genetic Analysis Workshop 15. We investigated and compared the performance of the stratified false-discovery rate control method with the traditional aggregated approach, in an application to genome-wide linkage analyses of single-nucleotide polymorphism-to-gene expression data. We showed the importance of utilizing the available map information and demonstrated the power gained by conducting false-discovery rate control separately for cis and trans regulators under three different frameworks: fixed rejection region, fixed false-discovery rate, and fixed number of rejections.

Background

The routine use of multiple-hypothesis testing in inference for large-scale genetic and genomic data has generated controversies and discussions about appropriate ways to adjust for the multiplicities. The conventional control of family-wise error rate (FWER) strictly regulates the probability of type I error but with a considerable loss of power. The recent methodology based on false-discovery rate (FDR) control [1] is an alternative that provides better power yet controls the occurrence of false positives, and its use has become common in the analyses of microarray gene expression data.

Of particular interest here is the use of auxiliary or prior information in the FDR setting. The stratified FDR method [2] was chosen for this analysis because the prior information required is rather general and does not need any distribution assumptions. Sun et al. [2] investigated the performance of the method under two frameworks and showed that under the fixed rejection region framework, the aggregated FDR is a weighted average of the stratum-specific FDRs; under the fixed FDR framework, the stratified approach identifies more true positives. Recently, Greenwood et al. [3] considered the stratification principle under a third framework: fixed number of rejections. They demonstrated that the stratified approach provides an FDR control at a lower rate.

In this report, we focused on the stratified FDR method [2] and applied it to Problem 1: Genetics of Gene Expression in Humans in Genetic Analysis Workshop 15. We considered the two-stage design of Morley et al. [4] in which genome-wide linkage analyses of single-nucleotide polymorphism (SNP)-to-gene expression data were first carried out and significant results were then selected for further studies. We were particularly interested in the selection procedure and comparisons of the performance between the stratified FDR method and the traditional aggregated FDR control. The prior information considered here is the map distance between a SNP and a gene whose expression is the trait.

Methods

We first performed expression quantitative trait linkage (eQTL) analysis using MERLIN regress v 1.0.1 [5, 6]. We zeroed out the genotypes of the child as well as those of the grandparents when a Mendelian inconsistency was detected. All other genotypes were retained. Marker allele frequencies were estimated from the data and single-point linkage analyses of gene expression data were performed for all pairs of 3554 genes and 2819 SNPs, excluding the ones (genes or SNPs or both) on the sex chromosomes. We then stratified the linkage results by the map distance between each SNP and the gene. Gene positions were obtained from Build 36.1 of the UCSC Genome Browser [7]. SNP locations were obtained from Build 126 of dbSNP [8] (on Build 36.1 of the human genome). We used the definition of Morley et al. [4] in which cis regulators are the SNPs within 5 Mb of genes and trans regulators are the remaining SNPs. Stratum 1 contained cis SNPs and stratum 2 contained trans SNPs, and the aggregated group included all SNPs ignoring stratification. We used m to denote the total number of hypotheses among which m₀ were true nulls, R to denote the number of rejections, and p_(i), i = 1,...,m to denote the linkage p-values while p₍₁₎ ≤ p₍₂₎ ≤ ... ≤ p_(m)were the ordered p-values. We used superscript (k), k = 1, 2 as the stratum indicator. Finally, we applied the stratification principle [2] under the following three frameworks.

Framework I: fixed rejection region

This framework chooses the rejection region in advance, i.e., it rejects all hypotheses with unadjusted p-values less than a pre-determined α value, e.g., α = 0.01%. The corresponding FDR level among the R positives can then be estimated by $F \hat{D} R (α) = (m {\hat{π}}_{0} α) / R$ , where R = {p_(i)≤ α}, ${\hat{π}}_{0}$ is an estimate of the proportion of null hypotheses, π₀ = m₀/m, e.g., ${\hat{π}}_{0}$ (λ) = #{p_i> λ}/(m(1 - λ)) with λ = 0.5. The rejection procedure remains the same for the stratified method using the same α level (R⁽¹⁾ + R⁽²⁾ = R). However, the estimates of FDR among R⁽¹⁾ and R⁽²⁾ can be considerably different from the aggregated FDR, with one stratum estimate closer to 0 and the other closer to 1. For both cases, one thus obtains more information on the specificity of the results.

Framework II: fixed FDR

Under this framework, the targeted FDR level is pre-chosen at a γ level, e.g., γ = 5%. Storey [9] showed that controlling FDR at the γ level is equivalent to rejecting all tests with q-values ≤ γ, and the q-values can be estimated by ${\hat{q}}_{(i)} = \min {{\hat{π}}_{0} m p_{(i)} / i, {\hat{q}}_{(i + 1)}}$ , and ${\hat{q}}_{(m)} = {\hat{π}}_{0} p_{(m)}$ . This method is equivalent to the FDR adjusted p-value method [10, 11]. To fairly compare the performance of the stratified FDR method with the aggregated one, we controlled the FDR at the same level for both Strata 1 and 2 and the aggregated group using the above q-value method. The objective was to show that the total number of rejections R⁽¹⁾ + R⁽²⁾ under stratification is greater than R under aggregation.

Framework III: fixed number of rejections

In this case, the total number of significant results R that merits further study is pre-determined based on, for example, the budget and capacity of a particular chip platform. Without stratification, the choice of R hypotheses is straightforward, i.e., the R tests with the smallest p-values: p₍₁₎,...,p_(R). The corresponding FDR level can be estimated by $F \hat{D} R = (m {\hat{π}}_{0} p_{(R)}) / R$ . Under stratification, one needs to find the optimal configuration of R⁽¹⁾ and R⁽²⁾ such that R⁽¹⁾ + R⁽²⁾ = R and the overall FDR is minimized, where $F \hat{D} R_{s t r a} = (m^{(1)} {\hat{π}}_{0}^{(1)} p_{(R^{(1)})}^{(1)} + m^{(2)} {\hat{π}}_{0}^{(2)} p_{(R^{(2)})}^{(2)}) / R$ . The goal is to show that stratification leads to a smaller FDR rate given the same number of positives allowed. More importantly, the configuration of R⁽¹⁾ and R⁽²⁾ obtained using stratification can differ markedly from the aggregation case. Without stratification, the distribution of R rejections between the two strata is roughly proportional to the number of hypotheses in each stratum; while with stratification, the stratum with smaller π₀ (less noise) and higher power to detect true signals proportionally rejects more hypotheses.

Results

Among the 9,069,390 tests with valid results, more than half (5,000,428) had p-values equal to 1. We eliminated the tests with H² = 0, where H² is the estimated locus-specific heritability, because we observed a significant association between p-value = 1 and H² = 0 (p < 0.0001). Among the remaining 3,855,428 tests, 16,018 were in the cis stratum and 3,839,410 in the trans stratum.

The distributions of the aggregated p-values and partitioned p-values in cis and trans strata were roughly uniform. (Figures illustrating these distributions are available at http://www.utstat.toronto.edu/sun/publications.) As expected, the cis stratum contained a relatively higher proportion of true signals than the trans stratum, which was further confirmed by the smaller estimates of π₀ shown in Table 1 (88.18% vs. 95.06%) and the higher density of small p-values close to zero. Results in Table 1 clearly demonstrated that the stratified method outperformed the aggregated approach under all three frameworks. For example, under the fixed rejection region approach, among the 8043 rejections with aggregated FDR of 4.56%, 129 belonged to the cis stratum with a much lower stratum-specific FDR of 1.09%. In fact, among the expected 366 false positives, only 1 was expected to be from the cis group, a clear gain of information by the use of stratification. Under the fixed FDR framework, the stratified method provided 57 more true positives while controlling the FDR at the same rate as the aggregated approach. Under the fixed number of rejections framework, e.g., R = 2000 in Table 1, the aggregated FDR method rejected the tests with the smallest p-values regardless of the map information. Among the 2000 rejections, 55 belonged to cis stratum and 1945 to trans stratum corresponding to an overall $F \hat{D} R$ = 2.05%. In contrast, the stratified FDR approach allowed different rejection configurations in the two strata, with R⁽¹⁾ = 133 and R⁽²⁾ = 1867 being the optimal one corresponding to an overall $F \hat{D} R$ = 2.00%. Besides the obvious advantage of achieving a smaller FDR, the stratified method allocated proportionally more rejections out of the 2000 to the cis stratum (133 vs. 55), a preferable result given the different characteristics of the cis and trans regulators. Other choices of the total number of rejections, i.e., R = 500, R = 1000 and R = 8000 gave similar results. (Figures illustrating these results are available at http://www.utstat.toronto.edu/sun/publications.)

Table 1 Summary statistics and results under the three frameworks

Full size table

Discussion

The stratified FDR method of Sun et al. [2] provides a simple way of incorporating the available auxiliary or prior information to improve power in the context of multiple hypothesis testing. We applied their stratification principle to the linkage analyses of gene expression data of Problem 1 under three different frameworks. Framework II represents the traditional view of type I error control in which a desirable error rate is pre-determined. However, the nominal level could be too optimistic for a given data set, leading to no rejections, or too liberal, resulting in too many significant results for follow up studies. Therefore, in many applications Frameworks I and III are more applicable and meaningful. Results of our analyses demonstrated clearly that it is advantageous to utilize the available map information under any of the three frameworks considered. However, the stratified method did not outperform the aggregated approach by as large a margin as we had expected given the known characteristic difference between cis and trans regulators. One possible explanation is that the large number of trans regulators overwhelms the potential gain using the current measure of efficiency.

To demonstrate that the smaller FDR estimate in the cis stratum was not an artifact of sampling variation or the highly skewed subset size of the tests but the true biological difference in cis- and trans-acting loci, we randomly sampled a set of 16,018 p-values as the "cis" stratum and performed the corresponding Framework I analyses, and this procedure was repeated 20 times independently. The means of the estimated π₀ and FDR were 94.76% (SE = 0.67%) and 4.73% (SE = 0.9%), respectively. Accounting for the sampling variation, these values differed significantly from those obtained using the available map information ( ${\hat{π}}_{0}$ = 88.18%, $F \hat{D} R$ = 1.09%), while they were essentially the same as those under aggregation ( ${\hat{π}}_{0}$ = 95.03%, $F \hat{D} R$ = 4.56%). In addition, the average number of rejections (i.e., tests with p-values < 0.0001) over the 20 random samples is 33.25 (SE = 6.6), which is close to the expected number of rejections (8043*16018/3855428 = 33.42) in a random sample of 16018 p-values; while the original cis stratum, defined based on the map distance, had 129 rejections (i.e., higher density of small p-values than a random sample).

In our analyses, we used the criterion of Morley et al. [4] to define the cis and trans regulators. Other definitions are possible and will change the FDR results quantitatively but not qualitatively. This is clearly demonstrated by the results shown in Table 2 for which we redefined cis regulators as the SNPs within 3, 10, or 20 Mb of genes. The more interesting and challenging question is the identification of the optimal stratum indicator. Searching through a list of candidates is an obvious but biased approach because it added another level of multiplicity [2].

Table 2 Results of using different distance criterion to define cis regulators

Full size table

It is also possible to exploit the available map information in other ways. For example, one could apply the weighted p-value method [12] by using the map distance as the weighting factor. That is, perform the FDR control on weighted p-value p_w= p/w, where p is the original linkage p-value between a pair of SNP and gene, and w is the corresponding weight inversely proportional to the map distance. Similar to the stratification case above, the choice of a specific weighting scheme is not unique and the identification of the optimal one remains an open problem. In addition, the comparison and connection between the weighted p-value method and the stratified approach is of interest.

References

Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc Ser B. 1995, 57: 289-300.
Google Scholar
Sun L, Craiu RV, Paterson AD, Bull SB: Stratified false discovery control for large-scale hypothesis testing with application to genome-wide association studies. Genet Epidemiol. 2006, 30: 519-530. 10.1002/gepi.20164.
Article PubMed Google Scholar
Greenwood CMT, Rangrej J, Sun L: Optimal selection of markers for validation from genome-wide association studies. Genet Epidemiol. 2007, 31: 396-407. 10.1002/gepi.20220.
Article PubMed Google Scholar
Morley M, Molony CM, Weber T, Devlin JL, Ewens KG, Spielman RS, Cheung VG: Genetic analysis of genome-wide variation in human gene expression. Nature. 2004, 430: 743-747. 10.1038/nature02797.
Article PubMed Central PubMed CAS Google Scholar
Abecasis GR, Cherny SS, Cookson WO, Cardon LR: Merlin-rapid analysis of dense genetic maps using sparse gene flow trees. Nat Genet. 2002, 30: 97-101. 10.1038/ng786.
Article PubMed CAS Google Scholar
Sham PC, Purcell S, Cherny SS, Abecasis GR: Powerful regression-based quantitative-trait linkage analysis of general pedigrees. Am J Hum Genet. 2002, 71: 238-53. 10.1086/341560.
Article PubMed Central PubMed CAS Google Scholar
Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A, Lu YT, Roskin KM, Schwartz M, Sugnet CW, Thomas DJ, Weber RJ, Haussler D, Kent WJ, University of California Santa Cruz: The UCSC Genome Browser Database. Nucl Acids Res. 2003, 31: 51-54. 10.1093/nar/gkg129.
Article PubMed Central PubMed CAS Google Scholar
dbSNP. [http://0-www-ncbi-nlm-nih-gov.brum.beds.ac.uk/projects/SNP]
Storey JD: A direct approach to false discovery rates. J Roy Stat Soc Ser B. 2002, 64: 479-498. 10.1111/1467-9868.00346.
Article Google Scholar
Yekutieli D, Benjamini Y: Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics. J Stat Plan Inference. 1999, 82: 171-196. 10.1016/S0378-3758(99)00041-5.
Article Google Scholar
Craiu RV, Sun L: Choosing the lesser evil: trade-off between false discovery rate and non-discovery rate. Statistica Sinica.
Roeder K, Bacanu SA, Wasserman L, Devlin B: Using linkage genome scans to improve power of association genome scans. Am J Hum Genet. 2006, 78: 243-252. 10.1086/500026.
Article PubMed Central PubMed CAS Google Scholar

Download references

Acknowledgements

This work was supported in part by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC) to LS. BH is supported by Genome Canada through the Ontario Genomics Institute as per research agreement 2004-OGI-3-05. JR is supported by The Centre for Applied Genomic, Hospital for Sick Children and Genome Canada. ADP holds a Canada Research Chair in Genetics of Complex Diseases.

This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://0-www-biomedcentral-com.brum.beds.ac.uk/1753-6561/1?issue=S1.

Author information

Authors and Affiliations

Research Institute, Hospital for Sick Children, 555 University Avenue, Toronto, Ontario, M5G 1X8, Canada
Baisong Huang, Jagadish Rangrej, Andrew D Paterson & Lei Sun
Department of Public Health Sciences, University of Toronto, 155 College Street, Toronto, Ontario, M5T 3M7, Canada
Baisong Huang, Andrew D Paterson & Lei Sun
Department of Statistics, University of Toronto, 100 St. George Street, Toronto, Ontario, M53 3G3, Canada
Lei Sun

Authors

Baisong Huang
View author publications
You can also search for this author in PubMed Google Scholar
Jagadish Rangrej
View author publications
You can also search for this author in PubMed Google Scholar
Andrew D Paterson
View author publications
You can also search for this author in PubMed Google Scholar
Lei Sun
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Baisong Huang.

Additional information

Competing interests

The author(s) declare that they have no competing interests.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Huang, B., Rangrej, J., Paterson, A.D. et al. The multiplicity problem in linkage analysis of gene expression data – the power of differentiating cis- and trans-acting regulators. BMC Proc 1 (Suppl 1), S142 (2007). https://0-doi-org.brum.beds.ac.uk/10.1186/1753-6561-1-S1-S142

Download citation

Published: 18 December 2007
DOI: https://0-doi-org.brum.beds.ac.uk/10.1186/1753-6561-1-S1-S142

Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci

The multiplicity problem in linkage analysis of gene expression data – the power of differentiating cis- and trans-acting regulators

Abstract

Background

Methods

Framework I: fixed rejection region

Framework II: fixed FDR

Framework III: fixed number of rejections

Results

Discussion

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Rights and permissions

About this article

Cite this article

Keywords

BMC Proceedings

Contact us

Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci

The multiplicity problem in linkage analysis of gene expression data – the power of differentiating cis- and trans-acting regulators

Abstract

Background

Methods

Framework I: fixed rejection region

Framework II: fixed FDR

Framework III: fixed number of rejections

Results

Discussion

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Competing interests

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Proceedings

Contact us