Distribution of Unique Sequences in the Human Genome

Research Article

Austin J Comput Biol Bioinform. 2015;2(1): 1010.

Distribution of Unique Sequences in the Human Genome

Kazuharu Misawa*

Advanced Center for Computation and Communication, The Institute of Physical and Chemical Research, Japan

*Corresponding author: Kazuharu Misawa, Advanced Center for Computation and Communication, Researcher, RIKEN (The Institute of Physical and Chemical Research) Hirosawa 2-1, Wako, Saitama, 351-0198, Japan

Received: December 19, 2014; Accepted: January 20, 2015; Published: February 16, 2015

Abstract

Programmable sequence-specific endonucleases are powerful tools for genome alteration with high precision. For example, the CRISPR system is an efficient tool for genome engineering in eukaryotic cells by simply specifying a 20-bp targeting sequence within its guide RNA. When studying large genomes, however, the design of target sequences is complicated by the redundancy of sequences. The distribution of unique sequences in the genome is of interest. In this paper, I describe the development of a novel method, UF, for detecting unique 20-bp sequences in entire genomes. UF stands for "Unique Finder". By using UF, the distribution of unique sequences in the human genome was investigated. It was found that 60% of the human genome is unique on average. However, non-unique regions of human genome are concentrated on centromeres and terminal regions of the chromosomes. The proportions of unique sequences are about 80% in the rest part of the genome. The program for obtaining unique sequences is available at https://sourceforge.jp/projects/ parallelgwas/releases/

Keywords: Unique sequences; human genome; hash code; centromere; chromosome terminal region

Abbreviations

BP: Base Pairs; MB: Mega Bases1

Introduction

Targeted nucleases are useful tools for genome alteration with high precision. The RNA-guided Cas9 nuclease from the microbial Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR) adaptive immune system can be used as an efficient genome engineering tool in eukaryotic cells including human cells [1]. CRISPR requires a 20-bp targeting sequence [2]. When studying large genomes, however, the design of oligonucleotides for CRISPR is complicated by the redundancy of sequences. In humans, various types of repetitive sequences account for approximately 50% of the genome [3]. Such non-unique sequences cannot be CRISPR targets, because they cause off-target reaction. In this paper, I developed a new program, UF, for detecting 20-bp sequences that are unique within a genome. To detect sequences that are unique within a genome, a new program, UF, was developed. A word is some defined number of letters. In this study, the word size set to be 20-bp. UF is based on the hash code which was used as described below. The probability of collisions to evaluate how many hash codes was estimated by using various prime numbers. By using UF, distribution of unique sequences in the human genome was investigated in this study.

Materials and Method

Overview of the algorithm of UF

The basic idea of UF is similar to the seeding algorithm of BLAST [4]. The list consists of all words (w-mers). The bases in the sequences are labelled 1, 2, 3, and 4 for T, C, A, and G, respectively. Other characters, such as N, are labelled 0. Then, an index, x, of a word can be calculated as

where ck is the label of k-th nucleotide of the word. Using this formula, instances of the word can be counted if there is sufficiently large memory. Counting the occurrences of words takes up a lot of memory. In this study, the word size, w, was set to be 20-bp. Thus, a word can be used as an index into an array of the size 520 = 95,367,431,640,625. The memory size for allocating such an array is much larger than what is available at present.

Hash table

In this paper, one hash code, h, is assigned to x by the modular arithmetic algorithm. The modular arithmetic algorithm is used for DNA pooling in an experimental study [5]. Suppose x is an index of a word and p is a prime number. Using modular arithmetics, we can obtain a hash code, h, as follows:

h = x (mod p) (2)

To count the occurrences of h, an array with its size p is required. Note that equation (2) yields the same hash code for words whose indexes are h, h + p, h + 2p, . . ., where 0 <h<p. Complementary words on the reverse strand were also analysed simultaneously. If the number of occurrences of h is 1, then the word whose index is x is unique.

Prime numbers

It must be noted that two or more distinct words have the same hash value. In computer science, this situation is called a collision. Because of collisions, having a unique hash code is a sufficient condition for being a unique word, but it is not a necessary condition. The proportion of collision is expected to be reduced by using sufficiently large number of hash codes. Prime numbers used in this study are listed in (Supplementry Table 1). These prime numbers were obtained using the method widely known as "the sieve of Eratosthenes"[6].