Penncnv-Exomeseq: Genotype Improves Copy Number Variant Detection in Exome Sequencing

Research Article

Austin J Proteomics Bioinform & Genomics. 2015;2(1): 1009.

Penncnv-Exomeseq: Genotype Improves Copy Number Variant Detection in Exome Sequencing

Joseph T Glessner¹*, Jin Li¹, Yichuan Liu¹, Lifeng Tian¹, Kelly A Thomas¹, Ryan Golhar¹, Akshatha Desai¹, Bao-Li Chang¹, Xiao Chang¹ and Hakon Hakonarson¹,²

1Center for Applied Genomics,The Children’s Hospital of Philadelphia, Philadelphia, USA

2University of Pennsylvania, Perleman School of Medicine, Philadelphia, USA

*Corresponding author: Joseph Glessner, Bioinformatics Specialist, Center for Applied Genomics, The Children’s Hospital of Philadelphia, 3615 Civic Center Boulevard,Philadelphia, USA

Received: January 21, 2015; Accepted: March 11, 2015; Published: March 13, 2015

Abstract

Background: Sequencing has become a popular method for the generation of large-scale genomic data and with the inundation of such data source comes the necessity for accurate genotype calling of nucleotide bases (A/T/C/G) and copy number (0/1/2/3/4) variants (CNV). The use of SNP arrays as a point of reference for widely used assays for genomic variants, including the variability of different centers and algorithms impacting quality may bear fruit. PennCNV is the most popular method for CNV detection from SNP arrays. Therefore, we observe the unique features that set it apart: namely using both intensity and genotype in tandem to infer CNV states using an HMM and trio based recalling of CNVs to bring de novo rates to an acceptably low level.

Results: Sequencing offers features to assess CNVs intensity which has been leveraged by a number of algorithms, including XHMM, but the valuable feature of genotype for call accuracy has not been incorporated. Here we show derivation of genotype frequency from exome sequencing as a robust data to supplement intensity data in CNV detection.We detect more CNVs at a higher true positive rate than existing methods.

Conclusion: This application of BAF furthermore allows an arsenal of tools to be utilized including PennCNV and ParseCNV for sequencing data. PennCNVExomeSeq is freely available at https://penncnvexomeseq.sourceforge.net/.

Keywords: Copy number variant; Whole exome sequencing; Detection; Association

Abbreviations

ZPCARD: z-Score of Principal Components Analysis Normalized Read Depth; WES: Whole Exome Sequencing;BAF: B Allele Frequency; LRR: Log R Ratio; XHMM: Exome Hidden Markov Model; GATK: Genome Analysis Toolkit

Background

Motivated by the shift from SNP arrays to exome sequencing as the most commonly used genomic data, variant calling must be optimized for exome sequencing [1] and whole genome sequencing data sets [2, 3]. Multiple algorithms exists, including Conifer[4], ExomeCNV [5], ExomeCopy [6], ExomeDepth [7], Contra [8], XHMM [9], Excavator [10], Control-FREEC [11], and VarScan2 [12] that were designed for calling CNVs using exome sequencing data. Recently, XHMM has been published which uses exon based intensities normalized by PCA for exome capture biases. A HMM is then applied to segment genomic regions into deletion, diploid, and duplication states. While this has been informative, it’s important to further differentiate copy number states into homozygous and hemizygous deletions and duplications, as well as copy neutral LOHs, a six state model. Accurately sizing large CNVs is possible through merging adjacent CNV call fragments into a single CNV call. Importantly, genotypes need to be assessed in tandem with relative intensity to boost CNV calling sensitivity and specificity. Genotype homozygosity is an important observation to correspond to the drop in intensity mode observed in deletions. Furthermore, genotype banding at 1/3 and 2/3 strongly indicates duplication and supports marginal intensity gain signals. Adding genotype to the CNV HMM algorithm improves confidence of CNV calls. Here, we leverage the most powerful softwarefor integration of their best features. XHMM [9], GATK [13], PennCNV [14]and ParseCNV [15] are the major components of thealgorithm cross-talk advanced here. XHMM provides zPCARD for each exon for each sample which is equivalent to Log R Ratio (LRR). GATK provides reference and alternative allele depths and total depth which can be divided to determine B allele frequency (BAF). Then point-wise BAF values are looked up in exomic segments of LRR for the corresponding sample creating the signal intensity file input for PennCNV.

Methods

Using the Genome Analysis Toolkit (GATK)variant call format (VCF) is a convenient procedure since these files are commonly generated for single nucleotide variant (SNV) detection and genotyping. Other tools require a separate software samtoolsrun with the option pileup which requires accessing the much larger binary alignment map (.bam)file and additional computational time.

The GATK VCF provides allele depth for the reference allele (ADRef) and allele depth for the alternate allele (ADAlt). To meet minimal quality control, (ADRef + ADAlt) must be greater than 0 and FILTER=”.”. Then genotype (GT)= heterozygous (0/1) and GT= homozygous alternate allele (1/1) are placed into 2 different files. B allele frequency (BAF)= ADAlt / (ADRef + ADAlt).

Single sample VCFs are used so no homozygous reference(0/0) genotypes are present. The single sample BAFs are concatenated and sorted by position and chromosome (stable sort). The average BAF in the population of was calculated along with standard deviation and count samples contributing to the average. Assuming there are position specific biases in allele depths, we shifted each SNP population distribution mean to the expected values of 0.5 and 1.0 for heterozygotes and homozygotes, respectively.

We assessed the frequency of data at specific nucleotides across our population of samples.

250,533 SNPs >100 Count Occurrences AB only were found in a population of 2,190 VCFs.

184, 037 SNPs >100 Count Occurrences BB only were found in a population of 2,190 VCFs.

A total of 319,672 SNPs >100 count occurrence combined were therefore considered usable for PennCNV calling to make samples comparable in terms of genome resolution and were included in the PennCNV population frequency of b-allele (.pfb) file which sets the probeset to be used across the population of samples. Chromosome “GL” entries were removed.

We then applied clustering to center the expected value of heterozygotes to 0.5 and homozygotes to 1 in order to improve separation of distributions.

Heterozygous 0/1 genotype samples:

CenteredSampleBAF = sampleBAF – (averageBAF – 0.5)

Homomzygous 1/1 genotype samples:

CenteredSampleBAF = sampleBAF – (averageBAF – 1)

Then 0/1 and 1/1 normalized BAFs are concatenated for each sample.All samples were then concatenated again and sorted to calculate the PFB from the centered sample BAF values. Compile_ pfb.pl failed due to different SNP sets and orders from GATK VCFs requiring a custom procedure and script.

BAFs are used as scan_region.pl query and LRRs are used as scan_region.pl definition to match single base position BAFs to exon spanning LRRs (XHMM zPCARD). Some XHMM zPCARD entries with 2 values comma delimited where one value was expected were excluded.

Results

Instead of using samtools mpileup which requires another lengthy iteration through all. Bam files, we use the GATK VCF, the most commonly generated file from whole exome sequencing studies. Unfiltered allele depth is provided for reference and alternative alleles at each heterozygous or homozygous alternative allele base position exome-wide to eliminate reporting of the frequent homozygous reference allele. The total depth is approximate and filtered for reads with root mean square of the mapping quality of the reads across all samples <255 or with bad mate pairs. The principal equation is quite simple and constitutes: BAF=ADAlt/(ADRef+ADAlt). We considered only those variants passing the quality filter of GATK. To segment the analysis, we separated the BAF values from AB (0/1) and BB (1/1) genotyped base positions. We investigated the effect of read depth on the clustering of BAF on the expected values 0.5 and 1 for AB and BB genotypes, respectively (Figure 1). We observed no strong bias of BAF at low values of depth other than effects of the fractional values constrained to be low values. We also observe lower read depth in the alternative allele than the reference allele at many AB positions. The BAF values exome-wide for a given sample were more widely distributed around 0.5 than those of a SNP array (Figure 2). Running PennCNV resulted in hundreds of erroneous calls and a high BAF standard deviation quality metric centered around 0.11 compared to aSNP array centered around 0.03.