Region-Based Tests for Association Analysis of Rare Variants

Badri Padhukasahasram*

Center for Health Policy and Health Services Research, Henry Ford Health System, USA

*Corresponding author: Badri Padhukasahasram, Center for Health Policy and Health Services Research, Henry Ford Health System, 1 Ford Place 3A, Detroit, Michigan, USA.

Received: January 05, 2015; Accepted: February 18, 2015; Published: March 02, 2015

Abstract

Despite numerous discoveries based on genome-wide association studies of common variants, the heritability of most complex traits remains largely unexplained. Rare variants may play a significant role in disease risk and phenotypic variation. Such variants are known to be associated with Mendelian disorders and rare forms of common diseases, and also with complex diseases. Dramatic advances in DNA sequencing technologies have enabled a more comprehensive evaluation of the full spectrum of genetic variation, including the role of low-frequency and rare variation in complex traits. In this review, I provide an overview of the methods available for testing the simultaneous association of multiple rare variants with disease or other phenotypes in the context of sequencing-based association studies. The tests focus on rare variation from a particular genomic region such as a gene and its surrounding regions. I discuss the basic ideas underlying many currently available approaches for region-based association testing of rare variants as well as their advantages and limitations.

Keywords: Rare variants; Burden tests; Variance-component tests; Omnibus tests; Exponential combination tests; Power; Regression; Gene-based association; Region-based association

Introduction

The Common Disease Common Variant (CDCV) hypothesis has been a main driver of numerous Genome Wide Association Studies (GWASs) in the last decade [1]. The CDCV hypothesis asserts that common diseases are caused by common variants (frequency > 5%) with low to modest effects [2-5]. Studies of such variants have led to a large number of discoveries [6] and have yielded valuable insights into the genetic basis of complex phenotypes [7-12]. Despite these discoveries, for most complex traits, a large fraction of the genetic contribution expected from heritability estimates (e.g. from twin studies) remains unexplained. For example, for Type 2 Diabetes and Crohn’s disease, even with association-study sample sizes exceeding 100,000, all current discoveries taken together explain only ~11% and ~23% of the heritability, respectively. This so-called ‘missing heritability’ problem has received a great deal of attention in recent times, and several explanations [13,14] have been formulated to account for the rest of the genetic contribution to disease and complex traits. If the available heritability estimates are accurate, then the missing genetic contribution could be in the form of variation that has not been as extensively investigated as common variation. Because of the CDCV hypothesis, GWASs have focused on the identification of common variants with Minor Allele Frequency (MAF) larger than 5%; however, the rest of the frequency spectrum may contain additional trait-associated variation (e.g. low-frequency variants with MAF in the range 1-5% and rare variants with MAF < 1%).

In particular, rare variants can play a significant role in disease risk and phenotypic variation. Such variants are known to be associated with Mendelian disorders and rare forms of common diseases [15]. There is also a growing body of evidence that rare variants are associated with complex phenotypes [16-22]. Dramatic advances in DNA sequencing technologies have now enabled us to evaluate the role of low-frequency and rare variation in complex traits [23-25]. High-throughput sequencing technologies can generate billions of short reads across the genome at a reasonable cost and have made whole-exome and whole-genome sequencing studies feasible. Improved sequencing technologies as well as rare-variant genotyping chips [26] have led to genome-wide scans for detecting rare variant associations. These are also referred to as Rare Variant Association Studies (RVAS). In [27], whole exomes were sequenced in 3,734 individuals to test for associations with plasma triglyceride levels. Carriers of rare loss-of-function mutations in the APOC3 gene were found to have 39 percent lower triglyceride levels than noncarriers, as well as better cholesterol levels. In [28], analysis of rare coding variation in 3,871 autism cases and 9,937 ancestry-matched or parental controls implicated 22 autosomal genes in autism. In [29], researchers sequenced the exomes of 2,536 cases with schizophrenia and 2,543 unrelated controls. Schizophrenia cases had a significantly higher rate of rare disruptive mutations in protein-coding schizophrenia candidate genes.

In contrast to common variants, the detection of rare variants and their subsequent association testing present many challenges. Firstly, large sample sizes are needed simply to observe a rare variant in the sample. Secondly, the standard single-variant association tests designed for common variants are underpowered when used for finding rare variant associations. Because deep whole-genome sequencing of large samples is currently cost-prohibitive, the first issue can be addressed by alternative strategies such as targeted sequencing [30], exome sequencing [31], extreme-phenotype sampling [32-35] and low-coverage sequencing [36,37]. To address the power issue, numerous region-based multi-marker tests have been proposed in the last several years [38,39]. In this review, I provide an overview of the methods available for testing the simultaneous association of multiple rare variants with disease or other phenotypes in the context of sequencing-based association studies. The tests focus on rare variation from a particular genomic region such as a gene and its surrounding regions. I discuss the basic ideas underlying many currently available approaches for region-based association testing of rare variants as well as their advantages and limitations.
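As a rough illustration of the first challenge, the sketch below computes the chance that a variant appears at least once in a sample. It assumes that the 2N chromosomes of N diploid individuals are drawn independently with a fixed minor allele frequency. The function names and the example frequency of 0.1% are illustrative choices, not values from the studies cited here.

```python
import math

def prob_observed(maf: float, n_individuals: int) -> float:
    """Probability that at least one copy of the minor allele is sampled."""
    n_chromosomes = 2 * n_individuals  # diploid individuals, independent draws assumed
    return 1.0 - (1.0 - maf) ** n_chromosomes

def min_sample_size(maf: float, target_prob: float) -> int:
    """Smallest number of individuals giving at least target_prob of seeing the variant."""
    return math.ceil(math.log(1.0 - target_prob) / (2.0 * math.log(1.0 - maf)))

# Hypothetical variant with MAF = 0.1% at increasing sample sizes.
for n in (100, 1_000, 10_000):
    print(n, round(prob_observed(0.001, n), 3))

needed = min_sample_size(0.001, 0.99)
print("individuals needed for a 99% chance of observing the variant:", needed)
```

Observing a variant is only a precondition for testing it. The sample size required for adequate power to detect association is considerably larger.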

Methods for association analysis of rare variants

In classical single-variant association testing, linear or logistic regression is used and a genome-wide p value threshold of 5 × 10^-8 is applied to correct for multiple testing (approximately 1 million independent tests) [40]. Regression-based approaches allow us to easily adjust for covariates. For the same effect size, the power to detect association with a rare variant is expected to be smaller than for a common variant [39]. The sample size needed to achieve over 80% power with rare variants is at least an order of magnitude higher than for common variants. Furthermore, because the total number of rare variants across the genome is also larger than the number of common variants, correction for multiple testing further reduces power. Region-based tests of association seek to aggregate the cumulative effects of multiple genetic variants in a gene or region instead of testing each variant individually. When many variants from a relevant gene or genomic region are associated with a complex trait, such tests may increase the power to detect the association. Instead of testing millions of rare variants, we can test ~20,000 or so gene regions, which reduces the multiple testing burden. Methods for rare variant association analysis can be classified into 4 major categories: burden tests, variance-component tests, combined burden and variance-component (omnibus) tests and the exponential-combination test.
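The effect of moving from single variants to regions on the multiple-testing burden can be sketched with a simple Bonferroni calculation. The family-wise error rate of 0.05 is an assumption implied by the 5 × 10^-8 threshold and 1 million tests quoted above. The helper name is illustrative.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests

# Assumed family-wise error rate of 0.05 in both settings.
single_variant = bonferroni_threshold(0.05, 1_000_000)  # ~1 million independent tests
gene_based = bonferroni_threshold(0.05, 20_000)         # ~20,000 gene regions

print(f"single-variant threshold: {single_variant:.1e}")
print(f"gene-based threshold:     {gene_based:.1e}")
```

Under these assumptions the per-test threshold for gene-based testing is 50 times less stringent than the single-variant threshold.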

Burden tests

The main idea behind burden tests is to collapse information for multiple genetic variants into a single variable and test for association between this variable and disease status [41-46]. There are many ways to combine the information from multiple genetic variants into a single score, such as counting the number of minor alleles across all variants and weighting them to get a composite score. The weights can be based on minor allele frequency as well as on functional information reflecting where a particular variant is located in the genome. These different methods are based on different assumptions about disease mechanism. In general, burden tests make the strong assumptions that all the variants in a set are causal and have the same direction and size of effect. When a large proportion of variants are indeed causal and have the same direction of effect, such tests can be powerful. Violation of these assumptions can lead to loss of power [47-49].
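The collapsing step can be sketched as follows. This is a minimal illustration and not the procedure of any specific method cited above. The weight 1 / sqrt(MAF × (1 − MAF)) is one assumed frequency-based choice. The permutation test on the case-control difference in mean score stands in for the regression that a real analysis would typically use. The toy genotypes and allele frequencies are hypothetical.

```python
import math
import random
from statistics import mean

def burden_scores(genotypes, mafs, weight=None):
    """Collapse each individual's minor-allele counts into one weighted score."""
    if weight is None:
        # Assumed weighting: up-weight rarer variants by 1 / sqrt(maf * (1 - maf)).
        weight = lambda maf: 1.0 / math.sqrt(maf * (1.0 - maf))
    weights = [weight(m) for m in mafs]
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

def mean_difference(scores, is_case):
    """Absolute difference in mean burden score between cases and controls."""
    cases = [s for s, c in zip(scores, is_case) if c]
    controls = [s for s, c in zip(scores, is_case) if not c]
    return abs(mean(cases) - mean(controls))

def permutation_p_value(scores, is_case, n_perm=2000, seed=0):
    """Two-sided permutation p value obtained by shuffling case-control labels."""
    rng = random.Random(seed)
    observed = mean_difference(scores, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if mean_difference(scores, labels) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy data: rows are individuals, columns are rare variants (0/1/2 copies).
genotypes = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1],
             [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
is_case = [1, 1, 1, 1, 0, 0, 0, 0]
mafs = [0.01, 0.005, 0.002]  # assumed population frequencies

scores = burden_scores(genotypes, mafs)
p_value = permutation_p_value(scores, is_case)
print("burden scores:", [round(s, 2) for s in scores])
print("permutation p value:", round(p_value, 4))
```

Because every variant contributes to the score with a positive weight, a protective variant in the set would pull the score of its carriers in the wrong direction, which is the source of the power loss described above.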

Adaptive burden tests [50-55] are refinements of the original burden-test idea that allow variants to have effects in both directions. They are more robust than the original burden tests because they make fewer assumptions about the underlying genetic model at each locus. At the same time, adaptive tests based on regression are often unstable for rare variants, and those that make use of permutation are computationally intensive. Han et al. [50] developed a data-adaptive sum test that first estimates the direction of effect for each variant and then uses the estimated directions to conduct a burden test. The step-up test [51] refines the procedure with a model-selection framework that assigns zero weight to a variant that is unlikely to be associated.
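The idea of estimating directions first and then collapsing can be sketched as below. This is a simplified illustration and not the published procedure of [50]. The direction is taken as the sign of the case-control difference in mean allele count, whereas the original method is based on marginal regression. Both the sign estimation and the collapsing are repeated inside each permutation so that the data-driven sign choice is accounted for. All names and the toy data are hypothetical.

```python
import random

def mean_differences(genotypes, is_case):
    """Per-variant difference in mean minor-allele count, cases minus controls."""
    n_case = sum(is_case)
    n_ctrl = len(is_case) - n_case
    diffs = []
    for j in range(len(genotypes[0])):
        case_sum = sum(row[j] for row, c in zip(genotypes, is_case) if c)
        ctrl_sum = sum(row[j] for row, c in zip(genotypes, is_case) if not c)
        diffs.append(case_sum / n_case - ctrl_sum / n_ctrl)
    return diffs

def plain_burden_stat(genotypes, is_case):
    """Unsigned burden: opposite effects cancel in the sum."""
    return abs(sum(mean_differences(genotypes, is_case)))

def adaptive_burden_stat(genotypes, is_case):
    """Flip variants with an estimated protective effect, then sum."""
    diffs = mean_differences(genotypes, is_case)
    signs = [-1.0 if d < 0 else 1.0 for d in diffs]
    return sum(s * d for s, d in zip(signs, diffs))

def permutation_p_value(stat_fn, genotypes, is_case, n_perm=2000, seed=0):
    """Recompute the whole statistic, including sign estimation, on shuffled labels."""
    rng = random.Random(seed)
    observed = stat_fn(genotypes, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if stat_fn(genotypes, labels) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy data: variant 0 is seen only in cases, variant 1 only in controls.
genotypes = [[1, 0], [1, 0], [1, 0], [0, 0], [0, 0], [0, 0],
             [0, 1], [0, 1], [0, 1], [0, 0], [0, 0], [0, 0]]
is_case = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

p_plain = permutation_p_value(plain_burden_stat, genotypes, is_case)
p_adaptive = permutation_p_value(adaptive_burden_stat, genotypes, is_case)
print("plain burden p value:   ", round(p_plain, 4))
print("adaptive burden p value:", round(p_adaptive, 4))
```

In this toy example the two variants cancel exactly in the plain burden statistic, while the adaptive statistic retains the signal at the cost of a permutation loop.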

Variance-component tests

These types of tests use a random-effects model and construct a variance-component test that evaluates the distribution of genetic effects for a set of variants. Instead of aggregating variants, these tests evaluate the distribution of the aggregated score test statistics. The Sequence Kernel Association Test (SKAT) [56-59], the sum of squared score test [57] and the C-alpha test [58] are all based on this principle. SKAT allows for both covariate adjustment and modeling of interactions between variants. The test statistic is a weighted sum of squares of individual score statistics and asymptotically follows a mixture chi-square distribution. The p value can be computed rapidly using analytic formulas [60,61]. Variance-component tests are powerful in the presence of both phenotype-increasing and phenotype-decreasing variants as well as in cases where only a small proportion of the variants are causal. However, they are less powerful than burden tests when most variants are causal and have effects in the same direction.
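The contrast between a burden statistic and a variance-component statistic can be made concrete with per-variant score statistics. The sketch below assumes no covariates, so the null model is just the phenotype mean. It computes only the statistics and omits the mixture chi-square p value. It is an illustration of the principle and not the implementation of SKAT or any other cited method.

```python
def score_statistics(genotypes, phenotype):
    """Per-variant score statistics S_j = sum_i g_ij * (y_i - mean(y))."""
    y_bar = sum(phenotype) / len(phenotype)
    residuals = [y - y_bar for y in phenotype]
    n_variants = len(genotypes[0])
    return [sum(row[j] * r for row, r in zip(genotypes, residuals))
            for j in range(n_variants)]

def burden_statistic(scores, weights):
    """Square of the weighted sum: signs are combined before squaring."""
    return sum(w * s for w, s in zip(weights, scores)) ** 2

def variance_component_statistic(scores, weights):
    """Weighted sum of squares: each variant is squared before summing."""
    return sum((w * s) ** 2 for w, s in zip(weights, scores))

# Hypothetical toy data: variant 0 is seen only in cases, variant 1 only in controls.
genotypes = [[1, 0], [1, 0], [1, 0], [0, 0], [0, 0], [0, 0],
             [0, 1], [0, 1], [0, 1], [0, 0], [0, 0], [0, 0]]
phenotype = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
weights = [1.0, 1.0]  # equal weights, chosen only for illustration

scores = score_statistics(genotypes, phenotype)
q_burden = burden_statistic(scores, weights)
q_vc = variance_component_statistic(scores, weights)
print("score statistics:", scores)
print("burden statistic:", q_burden)
print("variance-component statistic:", q_vc)
```

Squaring each score statistic before summing is what keeps effects of opposite sign from cancelling, whereas the burden form loses the signal entirely in this example.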

Omnibus tests

Because burden and variance-component tests are complementary in terms of the scenarios in which they attain high power, it is desirable to combine the two approaches. Derkach et al. [62] use Fisher’s method [63] to combine the p values of the two tests and use permutation to evaluate the significance of the combined test. Another approach is to use the data to adaptively combine the SKAT and burden test statistics. Lee et al. [64] propose a linear combination of the SKAT and burden test statistics. An adaptive procedure is used to find the optimal combination, and p values are calculated through one-dimensional numerical integration. Combined tests are attractive in practice because they do not assume a particular genetic architecture, and in most situations we do not have strong priors for the underlying genetic model. However, such tests can be slightly less powerful than burden or variance-component tests when the assumptions underlying those tests are satisfied.
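Fisher's combination of two p values can be sketched as follows. The closed-form chi-square tail used here is valid only for independent p values. Burden and variance-component p values computed on the same data are correlated, which is why a permutation-based calibration is also shown. The function names and the input p values are hypothetical.

```python
import math

def fisher_statistic(p_values):
    """Fisher's combined statistic: -2 * sum of log p values."""
    return -2.0 * sum(math.log(p) for p in p_values)

def chi2_sf_even_df(x, k):
    """Upper tail of a chi-square distribution with 2k degrees of freedom (closed form)."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))

def fisher_p_independent(p_values):
    """Combined p value, valid only if the input p values are independent."""
    return chi2_sf_even_df(fisher_statistic(p_values), len(p_values))

def permutation_calibrated_p(observed_stat, permuted_stats):
    """Calibrate the combined statistic against values recomputed on permuted data."""
    hits = sum(1 for s in permuted_stats if s >= observed_stat)
    return (hits + 1) / (len(permuted_stats) + 1)

# Hypothetical p values from a burden test and a variance-component test.
p_burden, p_vc = 0.05, 0.05
stat = fisher_statistic([p_burden, p_vc])
p_nominal = fisher_p_independent([p_burden, p_vc])
print("Fisher statistic:", round(stat, 3))
print("nominal combined p value (independence assumed):", round(p_nominal, 4))
```

In practice the list passed to `permutation_calibrated_p` would hold the Fisher statistic recomputed from both tests on each phenotype permutation, so that the dependence between the two tests is reflected in the reference distribution.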

Exponential combination tests

In contrast to burden and variance-component tests, which use a linear or quadratic combination of score statistics, this test makes use of an exponential sum of the score statistics [65]. The test statistic is developed under a Bayesian framework with a sparse alternative prior, under the assumption that only one variant in a genomic region is causal. Because the exponential function increases rapidly, the exponential-combination test can have higher power when only a small proportion of the variants are causal, but it becomes less powerful when a moderate or large proportion of variants are causal. Because the null distribution of the test statistic is unknown, permutations are used to obtain p values, making the test computationally intensive.
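An exponential sum of score statistics with a permutation p value can be sketched as below. The form sum over variants of exp(z_j^2 / 2) is an assumed instance of an exponential combination and may differ in detail from the statistic of [65]. The standardization of the score assumes a binary phenotype and no covariates. The toy data are hypothetical.

```python
import math
import random

def ec_statistic(genotypes, phenotype):
    """Sum over variants of exp(z_j^2 / 2), with z_j a standardized score statistic."""
    n = len(phenotype)
    y_bar = sum(phenotype) / n
    total = 0.0
    for j in range(len(genotypes[0])):
        column = [row[j] for row in genotypes]
        g_bar = sum(column) / n
        score = sum(g * (y - y_bar) for g, y in zip(column, phenotype))
        # Null variance of the score for a binary phenotype (assumed model).
        variance = y_bar * (1.0 - y_bar) * sum((g - g_bar) ** 2 for g in column)
        z_squared = score * score / variance if variance > 0 else 0.0
        total += math.exp(z_squared / 2.0)
    return total

def permutation_p_value(genotypes, phenotype, n_perm=2000, seed=0):
    """Permutation p value, needed because the null distribution is not known."""
    rng = random.Random(seed)
    observed = ec_statistic(genotypes, phenotype)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if ec_statistic(genotypes, shuffled) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical toy data: only variant 0 differs clearly between cases and controls.
genotypes = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0],
             [0, 1, 0], [0, 0, 1], [0, 0, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]
phenotype = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

stat = ec_statistic(genotypes, phenotype)
p_value = permutation_p_value(genotypes, phenotype)
print("exponential-combination statistic:", round(stat, 3))
print("permutation p value:", round(p_value, 4))
```

Each variant contributes at least 1 to the sum, and a single large standardized score dominates the total. This matches the sparse setting in which the test is expected to perform well.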

Relative performance: power and type 1 error rates

Although numerous rare-variant association methods have been proposed, a comprehensive comparison of their performance in terms of power and false positive rates had been lacking until recently. Dering et al. [66] compared 15 conceptually different rare-variant association methods using simulated data from Genetic Analysis Workshop 17 [67] as well as empirical data on methotrexate clearance in children with Acute Lymphoblastic Leukemia (ALL) [68]. The results of testing these 15 approaches [42-48,50,53,54,56,69-71] indicated that, unexpectedly, many of the proposed rare-variant association testing approaches have substantially inflated Type 1 error rates. Specifically, only the methods proposed in [47,53,54,70] had valid Type 1 error rates in all the simulation scenarios considered in that study. Among the tests with a valid false positive rate, the method proposed in [47] had the largest power in the 4 scenarios investigated in the simulations. Findings from the empirical dataset were consistent with the comparisons from the simulation study. Both the simulations and the analysis of real data showed that the power of collapsing-based methods relies heavily on the proportion of causal variants in the region of interest [72]. Furthermore, not all of these approaches allow for covariate adjustment, and methods assuming only a genetic effect may be at a disadvantage when the phenotype is influenced by covariates.

In conclusion, the study of the association of rare variants or groups of rare variants is likely to be a major focus of future genetic association studies as we try to better understand the genetic basis of complex traits. The function of a gene can be altered by mutations at many different positions, and all of these can influence the phenotype. Genes rarely work in isolation, and multiple rare variants occurring in different genes that are part of a biological pathway can together affect phenotype expression. This motivates the development of valid tests that can look at the collective association of rare (and possibly common) variation in genes and biological pathways; such tests can also enhance power as compared to single-variant analyses. In general, the analysis of rare variants is complicated by low power, lack of knowledge of the underlying genetic model and the difficulty of calling rare genotypes. Prior information about the functional importance of a variant site, as derived from computational prediction tools and biological knowledge, can guide the choice of regions of interest for detecting true rare variant associations. Irrespective of the method used, studies with small to moderate sample sizes are likely to suffer from lack of power. Even when sufficiently large sample sizes are available, rare-variant association testing methods that rely on permutations require huge computational effort, making them less appealing in practice than other valid asymptotic methods.