- Open Access
Seeking gene relationships in gene expression data using support vector machine regression
© Yu et al; licensee BioMed Central Ltd. 2007
- Published: 18 December 2007
Several genetic determinants responsible for individual variation in gene expression have been located using linkage and association analyses. These analyses have revealed regulatory relationships between genes. The heritability of expression variation as a quantitative phenotype reflects its underlying genetic architecture. Using support vector machine regression (SVMR) and gene ontological information, we proposed an approach to identify gene relationships in expression data provided by Genetic Analysis Workshop 15 that would facilitate subsequent genetic analyses. A group of genes related by a shared biological theme was selected, and SVMR was trained on the expression data of these training genes to form a regression model. The model was subsequently used to search for and capture similarly related genes. SVMR shows promising capability in modeling and seeking gene relationships through expression data.
- Quantitative Trait Locus
- Ribosomal Protein Gene
- Linkage Result
- Support Vector Machine Regression
- Biological Family
In genome-wide linkage and association analyses using gene expression data from individuals in 14 CEPH (Centre d'Etude du Polymorphisme Humain) Utah families, Cheung and colleagues [1–3] found that variation in the expression level of the gene chitinase-3-like 2 (CHI3L2) was associated with a single-nucleotide polymorphism (SNP) marker, rs755467, in its promoter region. Other studies have also suggested that variation in a regulatory region of a gene is probably the main mediator of phenotypic divergence in evolution [4, 5]. The expression variation pattern correlates with the genes' genetic architecture. The characteristics of expression variation patterns of genes in a biologically defined group may also describe the landscape of the genetic architectures of the genes. We proposed an approach to determine gene relationships on the basis of expression data so that the underlying genetic and biological classification can be established. Our approach will be useful in expression studies, which usually deal with thousands of genes.
where ξ_i and ξ_i^* are slack variables that define the "soft margin" to measure the deviation of training samples outside the ε-insensitive zone, and C is the regularization parameter that determines the trade-off between model complexity (flatness) and the degree to which deviations larger than ε are tolerated in the optimization formulation.
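For reference, the standard ε-insensitive SVR primal problem from the general SVM literature is written out here. It is assumed, not confirmed by the text, that this is the formulation the sentence above describes. The symbols $w$, $b$, $\phi$, $x_i$, $y_i$, and $n$ are notation introduced for this sketch.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^*}\quad
  & \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^*\right) \\
\text{subject to}\quad
  & y_i - \langle w, \phi(x_i)\rangle - b \le \varepsilon + \xi_i, \\
  & \langle w, \phi(x_i)\rangle + b - y_i \le \varepsilon + \xi_i^*, \\
  & \xi_i,\ \xi_i^* \ge 0, \qquad i = 1,\dots,n.
\end{aligned}
```

In this form, a sample whose prediction error is within ε has both slack variables equal to zero, so only deviations beyond the ε-insensitive zone are penalized, each weighted by C.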
We selected a certain number of genes in a defined relationship, e.g., sharing the same biological functions or protein family, as a training sample for SVMR with defined parameters (kernel functions, ε, and C), to learn their expression patterns. The learned SVMR was then used to "recruit" new expression data of another gene from outside the training set. With predefined criteria, SVMR judged whether the new gene belongs to the same group. The newly recruited genes then grew into a category of a relationship that is expected to show similarity with the defined relationship in the training set. Comparison of linkage analysis results, ontology information, and/or regulation pathways of the new genes with the existing ones will further evaluate the search results.
Table 1. SVMR 4-level search strategy and results
Search level: (1) genes that contained highly correlated genes; (2) from the same biological family (55 RP, 49 ZFP); (3) across biological families; (4) random walk (all genes)
Sample selection criteria: (1) a total of 1000 genes that contained 100 highly correlated genes; (2) all in RP family, all in ZFP family; (3) RP, ZFP, and DEAD; (4) the full data set of all 3554 genes
Training size: (1) 2 genes per training, 3 trainings; (2) 2 to 10 genes; (3) 3 genes per training; (4) 3 to 20 genes
Training selection criteria: (1) Corr > 0.85, p < 0.001; (2) randomly from 55 RP genes or from 49 ZFP genes; (3) only from RP family; (4) randomly from entire sample
Best training size
Example of training genes (level 1):
1. 200088_x_at and 200809_x_at (two different probes for RPL12) (Pearson corr > 0.92 and Spearman corr > 0.90, p < 0.0001)
2. RPL32 and RPS18 (Pearson corr > 0.94, p < 0.0001)
3. DDX3Y and EIF1AY (Pearson corr > 0.9875, p < 0.0001)
Example of captured genes (level 1):
1. 200088_x_at and 200809_x_at
2. RPL32, RPS15, RPS18, RPS3A, and RPS28
3. DDX3Y and EIF1AY
Example of captured genes (level 2):
1. RPL27, RPS3A (2000099_s_at), RPS3A (201257_x_at), RPS29, RPS28
2. RPS15A, RPS18, RPS12, RPS19
3. Similar results were seen among genes in the ZFP family
Study subjects were 194 individuals from 14 CEPH Utah families with 2819 genotyped SNPs across 22 autosomal chromosomes provided by Genetic Analysis Workshop 15. Expression data using 3554 gene probes in lymphoblastoid cells of the above subjects were obtained using Affymetrix Human Focus Arrays. Gene annotation and ontology information were available on 8793 genes, including the 3554 genes probed.
The gene expression data were tested for normality using the Shapiro-Wilk and Anderson-Darling tests. Pair-wise correlations were tested using Pearson's correlation test for normally distributed expressions and Kendall's and Spearman's tests for non-normally distributed expressions. These tests were performed for all phenotypes, stratified by generation, to guide a better comparison in later relationship searches.
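The choice between a parametric and a rank-based correlation can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the normality verdict is assumed to come from a separate test and is passed in as the hypothetical flag `both_normal`, Kendall's test is omitted for brevity, and all function names are introduced here.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def pairwise_correlation(x, y, both_normal):
    """Pearson when both expression vectors passed a normality test
    (assumed to be decided elsewhere), Spearman otherwise."""
    return pearson(x, y) if both_normal else spearman(x, y)
```

For a monotone but non-linear pair such as `[1, 2, 3, 4, 5]` and `[1, 4, 9, 16, 25]`, the rank-based coefficient is 1 while the Pearson coefficient is slightly lower, which is why the distribution check matters before choosing a test.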
Nonparametric quantitative trait locus (QTL) linkage analyses were carried out using Merlin 1.0.1 with the options -qtl and -npl over the 2819 autosomal SNPs. This QTL NPL approach in Merlin provides a nonparametric LOD score for a quantitative trait, based on a general framework defined in the program's documentation [8]. We used the "1-Mb-to-1-cM" rule to convert the physical map into a genetic one. As a supplemental analysis, QTL regression analysis using Merlin-Regress was performed.
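The "1-Mb-to-1-cM" rule amounts to a simple rescaling of base-pair positions. A minimal sketch, with hypothetical SNP names and positions that are not taken from the workshop data:

```python
def physical_to_genetic_cm(position_bp, cm_per_mb=1.0):
    """Convert a base-pair position to centimorgans at a fixed cM-per-Mb rate."""
    return position_bp / 1_000_000 * cm_per_mb

# Hypothetical SNP positions on one chromosome, in base pairs.
snp_positions_bp = {"snp_a": 1_500_000, "snp_b": 23_750_000}

genetic_map_cm = {
    name: physical_to_genetic_cm(bp) for name, bp in snp_positions_bp.items()
}
```

Under this rule a SNP at 1.5 Mb is placed at 1.5 cM. The approximation ignores local variation in recombination rate, which is worth keeping in mind when reading the positions of linkage peaks.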
We broke down the given gene ontology information into minimum meaningful phrases and loaded them into a MySQL 4.1 database for easy querying. Genes of a biologically related group were selected by searching the database with defining terms, e.g., "ribosomal proteins" and "DNA repairing".
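The phrase-level lookup can be illustrated with a small sketch. It uses Python's built-in `sqlite3` in place of MySQL so that it runs without a server; the table layout, the `;` delimiter, and the annotation records are all hypothetical and are not taken from the workshop data.

```python
import sqlite3

# Hypothetical annotation records: (probe_id, gene_symbol, ontology description).
annotations = [
    ("probe_1", "GENE_A", "structural constituent of ribosome; ribosomal protein"),
    ("probe_2", "GENE_B", "zinc finger protein; transcription factor activity"),
    ("probe_3", "GENE_C", "ribosomal protein; protein biosynthesis"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_phrase (probe_id TEXT, symbol TEXT, phrase TEXT)")

# Break each description into its smallest phrases, one row per phrase.
for probe_id, symbol, description in annotations:
    for phrase in description.split(";"):
        conn.execute(
            "INSERT INTO gene_phrase VALUES (?, ?, ?)",
            (probe_id, symbol, phrase.strip().lower()),
        )

def genes_matching(conn, theme):
    """Return the sorted gene symbols that have a phrase containing `theme`."""
    rows = conn.execute(
        "SELECT DISTINCT symbol FROM gene_phrase WHERE phrase LIKE ? ORDER BY symbol",
        (f"%{theme.lower()}%",),
    )
    return [row[0] for row in rows]

ribosomal_genes = genes_matching(conn, "ribosomal protein")
```

Storing one phrase per row keeps the query to a single `LIKE` match and makes it easy to define a training group by theme rather than by a hand-curated gene list.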
We used mySVM [6] for SVM regression. Once the training data were formed, the target data were assigned either randomly or in a predetermined manner, depending on the search scenario used. The predicted values were then compared with the observed values of the target gene, and the mean and standard deviation of the differences were calculated. The final rank of the results was based on both the mean and the standard deviation: the lowest values ranked the highest, and usually the top 0.6–1.2% of genes were selected as captured targets for further study. Two types of kernel functions were used: dot and polynomial, with degree 1 through degree 4.
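The ranking step can be sketched as below. The text does not say how the mean and the standard deviation were combined, so this illustration makes two loud assumptions: the absolute value of the mean difference is used, and the two criteria are merged by summing their ranks. The function names and the toy expression values are hypothetical.

```python
import statistics

def score_candidates(predicted, observed_by_gene):
    """For each candidate gene, return (|mean|, SD) of predicted minus observed."""
    scores = {}
    for gene, observed in observed_by_gene.items():
        diffs = [p - o for p, o in zip(predicted, observed)]
        scores[gene] = (abs(statistics.mean(diffs)), statistics.stdev(diffs))
    return scores

def capture_top(scores, fraction):
    """Rank genes by the sum of their |mean| rank and SD rank (an assumed
    combination rule) and keep the top `fraction`, at least one gene."""
    genes = list(scores)
    mean_rank = {g: i for i, g in enumerate(sorted(genes, key=lambda g: scores[g][0]))}
    sd_rank = {g: i for i, g in enumerate(sorted(genes, key=lambda g: scores[g][1]))}
    ranked = sorted(genes, key=lambda g: (mean_rank[g] + sd_rank[g], g))
    keep = max(1, round(len(genes) * fraction))
    return ranked[:keep]

# Toy data: model predictions across four individuals and three candidate genes.
predicted = [10.0, 12.0, 14.0, 16.0]
observed_by_gene = {
    "gene_close": [10.5, 11.5, 14.5, 15.5],
    "gene_shifted": [13.0, 15.0, 17.0, 19.0],
    "gene_noisy": [6.0, 17.0, 9.0, 22.0],
}

scores = score_candidates(predicted, observed_by_gene)
captured = capture_top(scores, fraction=0.01)
```

Using both statistics guards against two failure modes: a gene with a constant offset has a small standard deviation but a large mean difference, and a gene with large scattered errors can still average out to a mean near zero.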
The biological relationship of the targeted genes to the training genes was inspected, comparing both genome-wide linkage results and ontological description and/or regulation pathways using PathwayStudio with ResNet 3.0 database (Ariadne Genomics, Inc.).
Both non-parametric QTL linkage analysis and QTL regression linkage analysis were performed for selected genes on 22 autosomal chromosomes. Our results showed the NPL LOD scores ranged between 1 and 4.92, and the regression LOD scores fluctuated more dramatically, e.g., some LOD scores were >20.
Searching among linearly correlated genes
Searching among selected gene groups
We selected a group of genes from the same biological family, a set of 55 ribosomal protein genes. We found this group of genes closely shared similar biological functions, but not all were correlated in their expression data.
We observed increasing specificity as the training set size grew, which seemed to taper off at a training set size of around seven genes. The sensitivity fluctuated slightly when the training set size was two or three but remained at over 96% when the size grew to four and above. A larger training set (with more than seven genes) may overfit the feature, causing difficulty in finding similar targets in the limited gene pool. A similar result was obtained in a zinc finger protein (ZFP) family (Table 1).
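For a single search, sensitivity and specificity can be computed from the captured set, the true family membership, and the pool of target genes. The sketch below uses the standard definitions, which the text does not spell out, so treating them as the authors' definitions is an assumption. The gene names are hypothetical.

```python
def sensitivity_specificity(captured, family, pool):
    """Standard sensitivity and specificity of a capture over a target pool.

    captured: genes the model selected
    family:   genes in the pool that truly belong to the group
    pool:     all target genes that were searched
    """
    captured, family, pool = set(captured), set(family), set(pool)
    tp = len(captured & family)          # family members that were captured
    fn = len(family - captured)          # family members that were missed
    fp = len(captured - family)          # non-members that were captured
    tn = len(pool - captured - family)   # non-members correctly left out
    return tp / (tp + fn), tn / (tn + fp)

pool = [f"g{i}" for i in range(1, 11)]
family = ["g1", "g2", "g3", "g4"]
captured = ["g1", "g2", "g3", "g7"]

sens, spec = sensitivity_specificity(captured, family, pool)
```

Repeating this calculation for each training set size gives the kind of curve described above, where specificity rises with the number of training genes and then levels off.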
Searching across gene groups
Searching using the random walk method
We wanted to be able to discover a relationship directly, using gene expressions to form a training set and subsequently capturing a similar relationship. A random walk was designed to determine the size and makeup of a training set randomly and then to search a full set of gene expression samples randomly. This obviously required very heavy computational support. Therefore, we ran a short version of the plan and had a brief view of the random search outcome.
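One iteration of such a random search can be sketched as follows. The size range of 3 to 20 genes and the pool of 3554 genes follow Table 1; the function name, the fixed seed, and the placeholder gene identifiers are assumptions made for this illustration, and the SVMR training and scoring steps are left out.

```python
import random

def random_training_sets(gene_ids, n_draws, min_size=3, max_size=20, seed=0):
    """Draw random training sets and randomly ordered target lists.

    Each draw picks a training set size uniformly in [min_size, max_size],
    samples that many genes without replacement, and shuffles the rest
    of the pool to serve as search targets.
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        size = rng.randint(min_size, max_size)
        training = rng.sample(gene_ids, size)
        chosen = set(training)
        targets = [g for g in gene_ids if g not in chosen]
        rng.shuffle(targets)
        draws.append((training, targets))
    return draws

gene_ids = [f"gene_{i}" for i in range(3554)]  # placeholder identifiers
draws = random_training_sets(gene_ids, n_draws=5)
```

Even this small sketch hints at the computational cost: every draw implies one model fit plus a scan over several thousand targets, and the number of possible training sets grows combinatorially with the set size.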
The pattern in gene expression variation does contain information that reflects the underlying genetic architecture. Statistical learning machines such as SVM can model more complex relationships than regular statistical models such as regression can readily handle. In our exploration at four different searching levels, we noticed that the selection of genes for the training set, i.e., the definition of a biological relationship, influences the search results considerably. Meanwhile, the SNP composition and density, the heritability of expression data as a quantitative trait, and its distribution mode are major factors affecting both linkage results and SVMR learning quality.
We suggest that carefully processing expression data may help manage the data complexity, for example, by distinguishing heritability levels, checking the normality of the phenotypic distribution, stratifying by age, or partitioning data using a defined theme to reduce the noise level. In addition, adding one or more dimensions of biological relationship information into the SVM learning process may increase the searching power by improving its specificity and sensitivity.
Our brief attempt at using the random walk method sheds light on the difficulty of discovering gene relationships directly via expression data. Genes in the same regulatory pathways share patterns of expression. Therefore, instead of searching an entire sample space, we plan to focus future research on adopting more effective search strategies such as those using genetic algorithms or other heuristic search approaches.
This article has been published as part of BMC Proceedings Volume 1 Supplement 1, 2007: Genetic Analysis Workshop 15: Gene Expression Analysis and Approaches to Detecting Multiple Functional Loci. The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/1?issue=S1.
1. Cheung VG, Conlin LK, Weber TM, Arcaro M, Jen KY, Morley M, Spielman RS: Natural variation in human gene expression assessed in lymphoblastoid cells. Nat Genet. 2003, 33: 422-425. 10.1038/ng1094.
2. Morley M, Molony C, Weber T, Devlin J, Ewens K, Spielman R, Cheung V: Genetic analysis of genome-wide variation in human gene expression. Nature. 2004, 430: 743-747. 10.1038/nature02797.
3. Cheung VG, Spielman R, Ewens K, Weber T, Morley M, Burdick J: Mapping determinants of human gene expression by regional and whole genome association. Nature. 2005, 437: 1365-1369. 10.1038/nature04244.
4. King MC, Wilson AC: Evolution at two levels in humans and chimpanzees. Science. 1975, 188: 107-116. 10.1126/science.1090005.
5. Wray GA, Hahn MW, Abouheif E, Balhoff JP, Pizer M, Rockman MV, Romano LA: The evolution of transcriptional regulation in eukaryotes. Mol Biol Evol. 2003, 20: 1377-1419. 10.1093/molbev/msg140.
6. Rüping S: mySVM-Manual. 2000, Dortmund: Lehrstuhl für Informatik 8, University of Dortmund, [http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM/]
7. Burges C: A tutorial on support vector machines for pattern recognition. Data Mining Knowledge Discovery. 1998, 2: 121-167. 10.1023/A:1009715923555.
8. Merlin Documentation. [http://www.sph.umich.edu/csg/abecasis/Merlin/reference/qtl.html]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.