Hanif MK; Zimmermann KH

Research Article

Austin J Comput Biol Bioinform. 2014;1(2): 6.

Graphics Card Processing: Acceleration of Multiple Sequence Alignment

Hanif MK* and Zimmermann KH

Institute of Computer Technology, Hamburg University of Technology, Germany

*Corresponding author: Muhammad Kashif Hanif, Institute of Computer Technology, Hamburg University of Technology, 21071 Hamburg, Germany

Received: September 16, 2014; Accepted: December 01, 2014; Published: December 05, 2014

Abstract

ClustalW is the most widely used heuristic method for multiple sequence alignment. It consists of three stages: distance matrix calculation, guide tree construction, and progressive alignment in a greedy fashion. The high computational complexity demands methods to accelerate the algorithm. This work studies the efficient mapping of the progressive alignment stage onto a graphics processing unit using a combination of wavefront and matrix-matrix product techniques. The experimental results exhibit a speed-up of one order of magnitude over the serial version.

Keywords: Alignment; Progressive alignment; Graphics processor card; ClustalW; Performance

Introduction

Sequence alignment is the fundamental technique in molecular biology for comparing sequences and identifying regions of similarity that may be consequences of structural, functional, or evolutionary relationships [1-4]. Sequence alignment is performed for all kinds of biological sequences, such as DNA, RNA, or protein sequences. Multiple sequence alignment is the technique of aligning three or more sequences simultaneously. The aligned sequences are obtained by inserting gaps and have equal length. However, multiple sequence alignment is very time-consuming. For instance, optimal dynamic programming methods require O(2^k n^k) steps to simultaneously align k sequences of length O(n) [4]. A variety of heuristic methods have been developed to cope with multiple sequence alignment problems. The most widely accepted heuristic method for aligning multiple sequences is progressive alignment [5,6]. This method aligns the more closely related sequences first and then gradually adds more divergent sequences [7]. The alignment accuracy can be improved by weighting the sequences according to their relatedness. A progressive alignment algorithm can handle a larger number of sequences in practical time scales. The most widely used progressive alignment programs are ClustalW [5,8,9], T-Coffee [6,10], MAFFT [11,12], and MUSCLE [13,14].

ClustalW is a typical progressive alignment algorithm making use of the policy "once a gap, always a gap", i.e., gaps introduced earlier in the alignment remain in place as new sequences are added [9,15]. It works in three stages (Figure 1). In the first stage, the distances between all pairs of sequences are calculated by pairwise sequence alignment. Pairwise sequence alignment can be calculated by the dynamic programming method of Needleman-Wunsch [16], by one of its variants such as Smith-Waterman [17], or by a fast heuristic method [9,18-20]. The scores of the resulting pairwise alignments are converted into distances, which are the input for the subsequent stage [9].
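To make the first stage more concrete, the following is a minimal Python sketch of a Needleman-Wunsch style global alignment score with a linear gap penalty. The function name and the match, mismatch, and gap values are illustrative assumptions; ClustalW itself uses substitution matrices and separate gap opening and extension penalties, and the conversion of scores into distances is not shown.

```python
def needleman_wunsch_score(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment score of x and y (toy scoring scheme, assumed values)."""
    n, m = len(x), len(y)
    # prev holds row i-1 of the dynamic programming table.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * gap] + [0] * m
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + s,   # align x_i with y_j
                          prev[j] + gap,     # align x_i with a gap
                          curr[j - 1] + gap) # align y_j with a gap
        prev = curr
    return prev[m]
```

Only two rows of the table are kept, which suffices when the score but not the alignment itself is needed.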

Figure 1: Stages of the ClustalW algorithm. The first stage computes the pairwise distances between the sequences. The guide tree is built in stage two using the distances. In stage three, the sequences are progressively aligned.




In the second stage, the distance matrix calculated in the first stage is used to build the guide tree which serves as a guide for the calculation of the overall multiple sequence alignment. This tree can be constructed by a heuristic phylogenetic method, like Neighbour joining [21] or unweighted pair group method with arithmetic mean (UPGMA) [22].
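As an illustration of the second stage, the sketch below builds a guide tree with a simple UPGMA-style clustering in Python. It is a toy version under stated assumptions: the tree is returned as nested tuples without branch lengths, ties are broken by insertion order, and the function name `upgma` is hypothetical rather than taken from ClustalW.

```python
def upgma(names, dist):
    """Build a guide tree (nested tuples) from a symmetric distance matrix."""
    k = len(names)
    clusters = {i: (names[i], 1) for i in range(k)}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(k) for j in range(i + 1, k)}
    next_id = k
    while len(clusters) > 1:
        a, b = min(d, key=d.get)          # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted average keeps the arithmetic mean over all leaves.
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        # Drop distances involving the merged clusters, add the new ones.
        d = {key: val for key, val in d.items() if a not in key and b not in key}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree
```

For four sequences in which A and B are close, C and D are close, and all other pairs are far apart, the function returns `(("A", "B"), ("C", "D"))`.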

In the final stage, the sequences are progressively aligned using the guide tree. For this, the sequences correspond one-to-one with the leaves of the tree. Three cases can occur:

An inner node (cherry) whose descendants are both leaves is associated with the pairwise alignment of the sequences corresponding to these leaves.
An inner node whose descendants are a leaf and an inner node is associated with the alignment of the sequence and the multiple alignment. This can be achieved by profile-sequence alignment, where the given multiple alignment is represented by a statistical representative called a profile.
An inner node whose descendants are two inner nodes is associated with the alignment of the corresponding multiple alignments. This can be achieved by profile-profile alignment, where both multiple alignments are represented by their profiles.

The root of the tree corresponds to the overall multiple sequence alignment. The basic algorithm uses one weight matrix and fixed gap opening and extension penalties.
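The three cases can be sketched as a post-order traversal of the guide tree. The Python fragment below only records which kind of alignment each inner node calls for and does not perform any alignment. The nested-tuple tree representation and the name `alignment_plan` are assumptions made for illustration.

```python
def alignment_plan(tree):
    """List the alignment steps of a guide tree in post-order.

    Leaves are sequence names (str); inner nodes are 2-tuples.
    """
    plan = []

    def visit(node):
        if isinstance(node, str):        # leaf: a single sequence
            return [node]
        left, right = node
        left_leaves, right_leaves = visit(left), visit(right)
        # Number of children that are leaves decides the case (2, 1, or 0).
        n_leaf_children = isinstance(left, str) + isinstance(right, str)
        kind = ("profile-profile", "profile-sequence", "pairwise")[n_leaf_children]
        plan.append((kind, left_leaves + right_leaves))
        return left_leaves + right_leaves

    visit(tree)
    return plan
```

The last entry of the plan belongs to the root and covers all sequences, matching the statement that the root corresponds to the overall multiple sequence alignment.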

This approach, however, is not suitable for more divergent sequences. In this case, sequence weights are calculated from the guide tree. Closely related sequences have lower weights while the divergent ones have higher weights. Moreover, different substitution matrices are used at different alignment stages. New penalties are calculated based on the length and similarity of sequences, weight matrix, and gap positions [4,9]. An example using the tat and vpu proteins from HIV 1 (Human Immunodeficiency Virus) is shown in Figure 2. The complexity of the ClustalW algorithm is shown in Table 1 where n is the number of sequences and l is the average sequence length [20].

Figure 2: ClustalW based sequence alignment between the tat and vpu proteins from HIV 1 calculated from EMBL-EBI using the BLOSUM substitution matrix. The gap opening and the gap extension penalties for pairwise alignments are 10 and 0.1, respectively, and the initial gap penalty and the gap extension penalty for multiple alignments are 25 and 0.2, respectively.




Table 1: Complexity of the ClustalW algorithm by stage [20].

Stage                    Time complexity
Distance matrix          O(n²l²)
Guide tree               O(n³)
Progressive alignment    O(nl² + n²l)
Total                    O(n²l² + n³)

Many efforts have been made to accelerate the ClustalW algorithm. ClustalW-MPI [23], Ebedes et al. [24], and pCLUSTAL [25] use MPI to parallelize ClustalW on a cluster. ClustalW-MPI parallelized all three stages and achieved a speed-up of approximately 4.3 using 16 processors. Ebedes et al. demonstrated a speed-up of 5.5 by parallelizing stages one and three. Similarly, Tan et al. [26] use MPI/OpenMP on symmetric multiprocessors to parallelize stages one and three. Mikhailov et al. [27] show a 10-fold speed-up by parallelizing all three stages with OpenMP on a shared-memory SGI machine. Aung et al. [28] employed a Field-Programmable Gate Array (FPGA) to accelerate stage one. Oliver et al. [29] mapped stage one onto an FPGA and attained a speed-up between 45 and 50. MT-ClustalW [30] utilized pthreads to parallelize all three stages. GPU-ClustalW [31] parallelized the first stage on a GPU with OpenGL and obtained a speed-up of approximately 7. MSA-CUDA [19] exploited the parallel architecture of the GPU by implementing all three stages and achieved a maximum average speed-up of approximately 37 for a small number of long sequences. Bassoy et al. [32] formulated a matrix-matrix product algorithm by separating the profile-sequence alignment algorithm into a data dependent and a data independent part, attaining an order of magnitude speed-up on a GPU. However, they ignored the time taken to execute the data dependent part on the CPU, which accounts for the large speed-up they report. Recently, Hanif and Zimmermann [33] described parallel algorithms for profile-profile alignment using the matrix-matrix product and the wavefront approach, attaining a 20-fold average speed-up for the wavefront approach. The results have shown that the matrix-matrix product and the wavefront methods are the most promising for profile-sequence alignment and profile-profile alignment, respectively.

A Graphics Processing Unit (GPU) is a highly parallel many-core streaming architecture which can execute hundreds of threads concurrently. The data parallel architecture of a GPU is particularly suitable for computation intensive tasks. For data-parallel workloads, GPUs can offer considerably more computation power than CPUs, and they are becoming increasingly popular for general purpose computations that aim at high speed-ups. A large set of problems in molecular dynamics, physics simulations, and scientific computing [34] have been tackled by mapping them onto a GPU. NVIDIA has introduced a GPU programming model called Compute Unified Device Architecture (CUDA), which enables the programmer to write C-like functions called kernels, with some extensions that allow programmers to use the graphics hardware efficiently. Each kernel is executed by a batch of parallel threads. CUDA provides three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization [34].

In this paper, a combination of matrix-matrix product and wavefront methods is used to parallelize the progressive alignment stage of ClustalW. The paper is organized as follows: Section 2 provides a method to accelerate the progressive alignment stage of ClustalW, and Section 3 evaluates its performance and compares it to the CPU implementation.

Materials and Methods

The performance of the ClustalW algorithm can be improved using the parallel architecture of the GPU. This particularly holds for the third stage, the alignment of the sequences using a guide tree.

First, a matrix-matrix product based approach to profile-sequence alignment is introduced [35]. The technique of Bassoy et al. [32] is similar, but requires additional memory and clock cycles.

Given a multiple alignment of length m, represented by its profile P over the extended alphabet Σ' = Σ ∪ {-}, and a sequence x of length n, the score between a column p of the profile and a character a ∈ Σ' is [4]

\sigma(p, a) = \sum_{b \in \Sigma'} \sigma(a, b) \cdot p_{b}. (1)
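Equation (1) is a frequency-weighted average of substitution scores. A small Python sketch may help; the three-letter alphabet, the toy scoring function, and the example column are assumptions chosen only to keep the arithmetic easy to follow.

```python
ALPHABET = ["A", "C", "-"]  # toy extended alphabet (assumption)

def sigma(a, b):
    """Toy substitution score, not a real substitution matrix."""
    if a == "-" and b == "-":
        return 0.0
    if a == "-" or b == "-":
        return -2.0
    return 2.0 if a == b else -1.0

def column_score(p, a):
    """Score of profile column p (dict: character -> relative frequency)
    against character a, following equation (1)."""
    return sum(sigma(a, b) * p[b] for b in ALPHABET)

# A column with 50% A, 25% C, and 25% blanks scored against the character A.
example = column_score({"A": 0.5, "C": 0.25, "-": 0.25}, "A")
```

Here `example` evaluates to 2·0.5 + (-1)·0.25 + (-2)·0.25 = 0.25.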

The profile-sequence alignment algorithm can be converted into a matrix-matrix product by separating its data independent and data dependent parts. First, the data independent part calculates three kinds of scalar products. The diagonal entries of the forward table are stored in the m × n matrix D. Two vectors v and h of lengths m and n store the vertical and horizontal entries of the forward table, respectively. Second, these values are used in the data dependent part to calculate the forward table entries.

The profile column \bar{p} = (0, \ldots, 0, 1)^{T} represents a column consisting only of blanks, i.e., the relative frequency of the blank is 1. Take the extended alphabet Σ' = {a_1, ..., a_l}, where a_l is the blank, and assign

w_{a} = (\sigma(a, a_{1}), \ldots, \sigma(a, a_{l}))^{T}, a \in \Sigma'. (2)

The vectors h and v can be computed as

h_{j} = \sigma(\bar{p}, x_{j}) = \sum_{b \in \Sigma'} \sigma(x_{j}, b) \cdot \bar{p}_{b} = \bar{p}^{T} \cdot w_{x_{j}}, (3)

v_{i} = \sigma(p_{i}, -) = \sum_{b \in \Sigma'} \sigma(-, b) \cdot p_{i, b} = p_{i}^{T} \cdot w_{-}. (4)

The matrix D of size m x n is calculated as

D_{i, j} = \sigma(p_{i}, x_{j}) = \sum_{b \in \Sigma'} \sigma(x_{j}, b) \cdot p_{i, b} = p_{i}^{T} \cdot w_{x_{j}}. (5)

The calculations of h, v, and D can be written as matrix-vector products. For this, take the l × n matrix W

W = (w_{x_{1}}, \ldots, w_{x_{n}}).

The values of the vectors h and v and the matrix D are determined as

v = P^{T} \cdot w_{-}, (6)

h = \bar{p}^{T} \cdot W, (7)

D_{i} = p_{i}^{T} \cdot W. (8)
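The data independent part can be illustrated with plain Python lists standing in for the BLAS calls. This is a sketch under assumptions: the five-letter alphabet, the toy scoring function, and the function name `data_independent_part` are not from the source, and the profile is stored as a list of its m columns.

```python
ALPHABET = ["A", "C", "G", "T", "-"]  # extended alphabet; the blank is last

def sigma(a, b):
    """Toy substitution score (an assumption, not a real substitution matrix)."""
    if a == "-" and b == "-":
        return 0.0
    if a == "-" or b == "-":
        return -2.0
    return 2.0 if a == b else -1.0

def w(a):
    """Vector w_a = (sigma(a, a_1), ..., sigma(a, a_l))^T."""
    return [sigma(a, b) for b in ALPHABET]

def dot(u, t):
    return sum(ui * ti for ui, ti in zip(u, t))

def data_independent_part(P, x):
    """P is a list of m profile columns (each of length l); x is a string."""
    p_bar = [0.0] * (len(ALPHABET) - 1) + [1.0]        # all-blank column
    W = [w(c) for c in x]                              # W stored column by column
    v = [dot(p_i, w("-")) for p_i in P]                # v = P^T w_-
    h = [dot(p_bar, w_xj) for w_xj in W]               # h = p_bar^T W
    D = [[dot(p_i, w_xj) for w_xj in W] for p_i in P]  # row i is D_i = p_i^T W
    return v, h, D

P = [[1.0, 0.0, 0.0, 0.0, 0.0],   # column 1: only A
     [0.0, 0.5, 0.5, 0.0, 0.0]]   # column 2: half C, half G
v, h, D = data_independent_part(P, "AC")
```

None of these products depends on previously computed table entries, which is what makes this part suitable for matrix-vector or matrix-matrix routines.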

To calculate the first column of the forward table, take the lower triangular m × m matrix B_m whose entries on and below the diagonal are 1:

B_{m} \cdot v = \begin{pmatrix} 1 & & & \\ 1 & 1 & & \\ \vdots & \vdots & \ddots & \\ 1 & 1 & \cdots & 1 \end{pmatrix} \left[ \begin{pmatrix} p_{1}^{T} \\ \vdots \\ p_{m}^{T} \end{pmatrix} \cdot w_{-} \right] = \begin{pmatrix} p_{1}^{T} \cdot w_{-} \\ p_{1}^{T} \cdot w_{-} + p_{2}^{T} \cdot w_{-} \\ \vdots \\ p_{1}^{T} \cdot w_{-} + p_{2}^{T} \cdot w_{-} + \cdots + p_{m}^{T} \cdot w_{-} \end{pmatrix}. (9)

Similarly, the first row is calculated using the n × n upper triangular matrix B_n whose entries on and above the diagonal are 1:

h \cdot B_{n} = \left[ \bar{p}^{T} \cdot \begin{pmatrix} w_{x_{1}} & w_{x_{2}} & \cdots & w_{x_{n}} \end{pmatrix} \right] \begin{pmatrix} 1 & 1 & \cdots & 1 \\ & 1 & \cdots & 1 \\ & & \ddots & \vdots \\ & & & 1 \end{pmatrix} = \begin{pmatrix} \bar{p}^{T} \cdot w_{x_{1}} & \cdots & \bar{p}^{T} \cdot w_{x_{1}} + \cdots + \bar{p}^{T} \cdot w_{x_{n}} \end{pmatrix}. (10)
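Equations (9) and (10) state that multiplying by a triangular matrix of ones yields running sums. The following Python check illustrates this with small, made-up vectors; the helper names are hypothetical.

```python
from itertools import accumulate

def lower_triangular_ones(m):
    """m x m matrix with ones on and below the diagonal (B_m)."""
    return [[1 if j <= i else 0 for j in range(m)] for i in range(m)]

def upper_triangular_ones(n):
    """n x n matrix with ones on and above the diagonal (B_n)."""
    return [[1 if i <= j else 0 for j in range(n)] for i in range(n)]

def mat_vec(B, v):
    """Matrix times column vector."""
    return [sum(b * x for b, x in zip(row, v)) for row in B]

def vec_mat(h, B):
    """Row vector times matrix."""
    n = len(h)
    return [sum(h[i] * B[i][j] for i in range(n)) for j in range(n)]

v = [-2.0, -1.5, -2.0]   # made-up vertical entries
h = [-2.0, -2.0, -1.0]   # made-up horizontal entries
first_column = mat_vec(lower_triangular_ones(3), v)  # S_{1,0}, ..., S_{m,0}
first_row = vec_mat(h, upper_triangular_ones(3))     # S_{0,1}, ..., S_{0,n}
```

Both results agree with `list(accumulate(v))` and `list(accumulate(h))`. A plain prefix sum needs fewer operations on a CPU; the triangular form has the advantage that it can be handed to the same matrix routines as the other products.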

This gives the algorithm PROSEQALIGNMATVECPRODV2.

Algorithm 1: PROSEQALIGNMATVECPRODV2(x, P).

Require: sequence x = x_1 ... x_n and profile P = p_1 ... p_m

S_{0,0} ← 0 {initialization}
v ← P^T · w_-
h ← \bar{p}^T · W
S_{*,0} ← B_m v
S_{0,*} ← h B_n
for i ← 1 to m do {calculation}
    D_i ← p_i^T · W
end for
for i ← 1 to m do {maximization}
    for j ← 1 to n do
        S_{i,j} ← max{S_{i-1,j} + v_i, S_{i,j-1} + h_j, S_{i-1,j-1} + D_{i,j}}
    end for
end for
return S
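The data dependent part of Algorithm 1 can be written as a short serial Python function. It is a sketch that takes v, h, and D as given (for example, computed by matrix products beforehand) and uses running sums in place of the products with B_m and B_n; the function name and the sample values are assumptions.

```python
from itertools import accumulate

def forward_table(v, h, D):
    """Fill the (m+1) x (n+1) forward table S from v, h, and D."""
    m, n = len(v), len(h)
    S = [[0.0] * (n + 1) for _ in range(m + 1)]   # S[0][0] = 0
    for i, s in enumerate(accumulate(v), start=1):
        S[i][0] = s                               # first column, as B_m v
    for j, s in enumerate(accumulate(h), start=1):
        S[0][j] = s                               # first row, as h B_n
    for i in range(1, m + 1):                     # maximization
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j] + v[i - 1],
                          S[i][j - 1] + h[j - 1],
                          S[i - 1][j - 1] + D[i - 1][j - 1])
    return S

S = forward_table([-2.0, -2.0], [-2.0, -2.0], [[2.0, -1.0], [-1.0, 0.5]])
```

With these sample inputs the table is `[[0, -2, -4], [-2, 2, 0], [-4, 0, 2.5]]`. Each entry depends on its upper, left, and upper-left neighbours, which is why this part cannot be expressed as a plain matrix product.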

The matrix D can be computed by a single matrix-matrix multiplication as

D = P^{T} \cdot W. (11)

The resulting algorithm is PROSEQALIGNMATPRODV2.

Algorithm 2: PROSEQALIGNMATPRODV2(x, P).

Require: sequence x = x_1 ... x_n and profile P = p_1 ... p_m

S_{0,0} ← 0 {initialization}
v ← P^T · w_-
h ← \bar{p}^T · W
S_{*,0} ← B_m v
S_{0,*} ← h B_n
D ← P^T · W
for i ← 1 to m do {maximization}
    for j ← 1 to n do
        S_{i,j} ← max{S_{i-1,j} + v_i, S_{i,j-1} + h_j, S_{i-1,j-1} + D_{i,j}}
    end for
end for
return S

Five versions of the profile-sequence alignment algorithm on the GPU have been considered:

MatVecProd V1: Matrix-vector product implementation using cublasSgemv [32].
MatVecProd V2: Matrix-vector product implementation using cublasSgemv.
MatProd V1: Matrix-matrix product implementation using cublasSgemm [32].
MatProd V2: Matrix-matrix product implementation using cublasSgemm.
SMWavefront 256: Wavefront approach using shared memory with a block size of 256 [35].
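The wavefront approach is only named in this section. As a rough illustration of the general idea, and not of the shared-memory kernel from [35], the Python sketch below fills the forward table one anti-diagonal at a time. All cells with the same i + j depend only on earlier anti-diagonals, so on a GPU they could be assigned to separate threads; here the inner loop simply runs serially. The function name and sample values are assumptions.

```python
from itertools import accumulate

def wavefront_forward_table(v, h, D):
    """Fill the forward table S anti-diagonal by anti-diagonal."""
    m, n = len(v), len(h)
    S = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i, s in enumerate(accumulate(v), start=1):
        S[i][0] = s                                # first column
    for j, s in enumerate(accumulate(h), start=1):
        S[0][j] = s                                # first row
    for d in range(2, m + n + 1):                  # anti-diagonal i + j = d
        # Cells on one anti-diagonal are mutually independent.
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            S[i][j] = max(S[i - 1][j] + v[i - 1],
                          S[i][j - 1] + h[j - 1],
                          S[i - 1][j - 1] + D[i - 1][j - 1])
    return S

S_wave = wavefront_forward_table([-2.0, -2.0], [-2.0, -2.0],
                                 [[2.0, -1.0], [-1.0, 0.5]])
```

The result is identical to a row-by-row fill; only the order of evaluation changes.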

For MatVecProd V1 and MatProd V1, results are reported only up to a sequence length of 6,000. A sequence length of 10,000 requires approximately 1145 MB to store the matrices, which exceeds the available global memory (1024 MB). A maximum speed-up factor of approximately 28 is attained when compared with an optimized Intel CPU implementation (Figure 3). Parallel versions of profile-profile alignment are given in [33]. The results show that a mixture of wavefront and matrix-matrix product methods can be useful for the parallelization of the progressive alignment stage.
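The source does not state how the memory figure of approximately 1145 MB is derived. One reading that reproduces it, offered purely as an assumption, is three single-precision matrices of size 10,000 × 10,000. The Python arithmetic below checks that reading; the function name and its default parameters are hypothetical.

```python
def matrices_mib(length, n_matrices=3, bytes_per_entry=4):
    """Memory in MiB for n_matrices square float matrices (assumed layout)."""
    return n_matrices * length * length * bytes_per_entry / 2**20

need_10000 = matrices_mib(10_000)  # about 1144.4 MiB, above 1024 MiB
need_6000 = matrices_mib(6_000)    # about 412 MiB, fits into 1024 MiB
```

Under this assumption a length of 10,000 does not fit into 1024 MB of global memory, while a length of 6,000 does.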


Citation: Hanif MK and Zimmermann KH. Graphics Card Processing: Acceleration of Multiple Sequence Alignment. Austin J Comput Biol Bioinform. 2014;1(2): 6. ISSN: 2379-7967
