The improvement of algorithm using cache technique allows
us to reduce computational cost because the results calculated
in previous times not to be calculated in the progressive
process. Besides, Grid computing is also an effective
environment to deploy the Multiple Sequence Alignment
program. The combination of two techniques brings us many
benefitsto increase program’s execution speed. However, the
parallel cache requires large amount of shared storage space,
and network bandwidth is also a factor affecting the
processing speed of the program. Our future work is studying
suitable cache techniques to solve next-generation sequencing
problems. In addition, we continue investigating the
usefulness of parallel and distributed systems to boost the
performance of the bioinformatics programs.
6 trang |
Chia sẻ: huongthu9 | Lượt xem: 435 | Lượt tải: 0
Bạn đang xem nội dung tài liệu Multiple Sequence Alignment on the Grid Computing using Cache Technique ISSN 2047-3338, để tải tài liệu về máy bạn click vào nút DOWNLOAD ở trên
International Journal of Computer Science and Telecommunications [Volume 3, Issue 7, July 2012] 46
Journal Homepage: www.ijcst.org
Abstract—Multiple sequence alignment is an important
problem and popular in the molecular biology. This is a basic
problem that its solution could be used to proof and discover the
similarity of the new sequence with other exist sequences; to
define the evolution process of the family’s sequences; as well as
to support the protein structure prediction, etc. In this paper, we
consider and improve the global progressing algorithm by using
the cache storage technique to make the best of the previous
alignment results. This algorithm and cache technique were been
developed on the distributed system and Grid computing
environment in order to decrease the algorithms execution time,
as well as increase the quantity and size of input sequences.
Index Terms—Biological Sequences, DNA, Protein, Grid
Computing and Distributed Computing
I. INTRODUCTION
ULTIPLE Sequence Alignment (MSA) is a sequence
alignment of three or more biology sequences such as
DNA, RNA, or protein. The result of the task can be
used to infer sequence homology and conduct phylogenetic
analysis to assess the sequences shared evolutionary origins
[2], [3]. The accuracy and execution time are major factors
requiring the attention of researchers. Some popular
approaches used for MSA algorithms are exact solution,
progressive methods, iterative methods, or methods based on
Hidden Markov Models. Each method has its advantages and
disadvantages. Biologists are the persons who decided suitable
method to process their biological data.
The multiple sequence alignment is the problem with
exponential complexity. Over the years, researcher efforts in
finding different algorithms or mathematical models that
require low computational cost as well as ensure accuracy. A
lot of popular algorithms were proposed such as ClustalW,
T-Coffee. However, when the number of sequences increases,
the software’s processing speed becomes slow because the
limitation of a single computer. Thus, parallel and distributed
systems are used to gain the efficiency for the programs.
1HCM City University of Technical Education, Vietnam Ministry of
Education and Training, vinhlv@fit.hcmute.edu.vn
2 Institute of Applied Mechanics and Informatics, Vietnam Academy of
Science and Technology, {tvlang,ntthudu}@vast-hcm.ac.vn
3 Lac Hong University, Vietnam Ministry of Education and Training
In the field of applying parallel, Grid or Cloud computing to
solve the MSA problem, last decade witnessed many
contributions. ClustalW-MPI [14] is the parallel
implementation of ClustalW software, using MPI (Message
Passing Interface) library. It can be deployed on workstation
clusters with distributed memory architecture. MAFFT [15] is
also modified to parallelized version by using the POSIX
Threads library [16]. Other researchers aimed to get the
usefulness of Grid or Cloud computing for their improvement
[17], [18].
Cache technique is another strategy applied in MSA
algorithms. It helps programs to reuse the results of previous
works so that they could decrease execution time. Zola [5]is
the first person use the technique to improve MSA algorithms.
He proposed an MSA program and developed it with CaLi
cache library [7]. Xun Tu et al. [19] also proposed a new
caching technique with two novel cache replacement policies
and did some experiments in progressive algorithms.
However, their approach was not deployed on distributed
system.
This paper presents the research results on applying cache
technique for the global progressing algorithm to reduce
computational cost. In addition, it also describes the algorithm
and cache technique on Grid computing environment to
decrease execution time.
The paper’s content includes 4 sections. In the first and
second section, the introduction and related works were
described; the Pair-wise Sequence Alignment (PSA) and
Multiple Sequence Alignment algorithms were discussed more
detail. The main content of paper is the section 3, where the
reasons and works were done to improve the global
progressing algorithms was presented. The last sections show
the experimental results and the discussions.
II. MULTIPLE SEQUENCE ALIGNMENT ALGORITHM
The sequence alignment algorithms can be classified into
two types: the Pairwise Alignment and the Multiple
Alignment. The Pairwise Alignment algorithm is commonly
used as the first step in MSA algorithm.
A. Pairwise Sequence Alignment
In biology, the evolution of organisms is caused by many
different factors. However, the major factor is the change in
the structure of DNA/RNA including insertion or deletion
Le Van Vinh
1
, Tran Van Lang
2
, Nguyen Thi Thu Du
2
and Vo Hong Bao Chau
3
M
Multiple Sequence Alignment on the Grid Computing
using Cache Technique
ISSN 2047-3338
Le Van Vinh et al. 47
mutation, leading to protein structural change. The Pairwise
Sequence Alignment algorithms can help biologists discover
the changes. Thus, they can find out the relationship between
two individual organisms. This problem is fundamental for
many other problems in bioinformatics such as finding
homologous sequences, motif finding or MSA.
For example, there are two sequences with different length:
CGACTAAATCAAGCCGCCTTTATTGCCTCACGCCCAGGGGTCTTTTCG
CGACTTCATCAAACCATTTATTGCTACTCGTCCGGGAGTTTTTACG
The one possible alignment can be seen as follows:
CGACTAAATCAAGCCGCCTTTATTGCCTCACGCCCAGGGGTCTTTT-CG
CGACTTCATCAAAC--CATTTATTGCTACTCGTCCGGGAGT-TTTTACG
The Pairwise Sequence Alignment algorithm arranges and
finds similar regions between two sequences. In some regions,
the special characters, called gap “-“, are added. There are
two groups of those algorithms such as the global alignment
methods (Fig. 1) and the local alignment methods (Fig. 2).
Fig. 1. Global Alignment Algorithm
Fig. 2. Local Alignment Algorithm
The global alignment algorithms aim to align two sequences
in their entirety how to get the largest number of matched
elements and the least number of gaps. Some of the popular
global sequence alignment algorithms are Needleman-Wunch
[8], GLASS [12]. In contrast, the local alignment algorithms
[1] attempt to find the best similar subsequence within two
sequences. Some the local alignment algorithms are well-
know such Smith-Waterman, FASTA, and BLAST.
B. Multiple Sequence Alignment
The problem of Multiple Sequence Alignment are defined
as follows:
Given n sequences S1, S2, , Sn, multiple sequence
alignment is obtained by inserting gap characters (“-“) in
each sequence to make them all the same length l.
Thus, the sequences can be stored to an array with n row
and l column. Each item of array is the character standing for
the four nucleic acid bases (A, C, G, T) or gap character
(“-“). There exist several strategies are used in MSA
algorithms:
C. Exact algorithms
The algorithm is based on the Needleman-Wunsch
algorithm [8]. It always produces the optimal alignment by
using a dynamic backtracking algorithm. However, the
weakness of the algorithms is that the amount of time and
memory requiring tends to grow exponentially with the
number of sequences. There are some MSA algorithms uses
this strategy including MSA, DCA.
D. Progressive algorithms
The progressive algorithm (such as ClustalW) is known as
the simplest and most effective strategy in case of aligning
sequences in little time and with little memory[3], [6]. The
algorithm obtains to find the sequences closely related by
using guide tree. It was initially proposed by Hogeweg [10]
and later improved by Feng-Dolittle [9] and Taylor [11]. The
algorithm’s idea is as follows: The first step is to perform
pairwise sequence aligning for all pairs of sequences. In the
next step, guide tree is constructed and used in the processing
step. In this step, all sequences are aligned based on guide
tree. This algorithm is used in our research to illustrate the
effectiveness of the use of the cache and Grid computing
technique.
E. Iterative algorithms
The algorithms work similarly to progressive strategy but
repeatedly align the initial sequences and adding new
sequence to grow the number of sequence in alignment
process. It aims to reduce the inherent errors occurring in
progressive method. This method is commonly used for local
alignment algorithms. MUSCLE is known as the popular
multiple sequence alignment using iterative method.
F. HMM-based algorithms
Hidden Markov Model is method using probabilistic model
can produces both global and local alignment. The algorithms
offer significant improvements in computational cost. Several
HMM-based MSA software are popular such as POA, SAM.
III. CACHE TECHNIQUE AND GRID COMPUTING FOR
PROGRESSIVE ALGORITHMS
In this paper, the progressive algorithm was improved by
appling cache technique in order to reduce computational cost.
A. Progressive Algorithm
There are three basic steps as follows in this algorithm:
• Step 1: Pairwise sequence alignment
This step aims to align all pairs of sequences and determine
the similarity between those sequences. Some pairwise
sequence alignment algorithms can be used in this step, for
instance, Sum of pairs, Linear – Sankoff. For each pair of
sequences, the algorithm produces a score value that shows
the distance between two sequences. A distance matrix n n
(n is the number of sequences) is used to store all values.
• Step 2: Construct a guide tree
Create guide tree by using a bottom-up clustering method
NJ (Neighbor-Joining) proposed by Saitou and Nei. This
×
International Journal of Computer Science and Telecommunications [Volume 3, Issue 7, July 2012] 48
Fig. 3. Progressing
phonetic tree shows the relationship between all given
sequences.
• Step 3: Progressing
Using guide tree, group the sequences closely relative
together first. Then, combine the group closely relative
together. The process repeated until all sequences are grouped.
We have three kinds of grouping (Fig. 3): Sequence to
sequence, sequence to profile, and profile to profile. (Profile is
a group of sequences closely relative).
B. Cache Technique
The cache is a collection of copies of objects accessed
frequently by users. The objects are memory pages or the site,
the records of the database or files. Users may be operating
systems or computing program. One common application of
this technique is web cache. By using proxy servers or
browser cache, the documents (or data) located temporal
locality or spatial locality are brought closer to users. Thus, it
reduces response time required by users significantly.
However, one of the major factors that affecting the system’s
performance is cache replacement policy when the cache is
full. The Cache replacement policy aims to get optimal
parameters such as ratio of document found, latency time, total
cost.
The other important feature of the cache system is data
synchronization so that the users always access the latest
documents (or data). This technique is similar to the one used
to synchronize files or data of distributed database systems.
The main feature is concerned in data synchronization process
is data consistency. It is categorized into two types: weak
consistency and strong consistency [4].
The weakness of cache technique is that it uses large
amount of storage space. So depending on the system’s
capabilities and users’ needs, they will decide whether to use
the technique or not.
In this modified algorithm, the progressing step (step 3)
reuses the result of the Pairwise Sequence Alignment step
(step 1). For instance, in the first test there are five sequences
1, 2, 3, 4, 5. In step 1, the following pairs of sequence are
aligned: (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4),
(3, 5), (4, 5). Then the guide tree is constructed (Fig. 3). In
step 3, two pairs of sequence (1, 5), (2, 3) are realigned.
In addition, to align six sequences 1, 2, 3, 4, 5, 6 (sequences
from 1 to 5 have aligned in the first test) there are following
steps: Step 1 needs to align 15 pairs of sequences: (1,2), (1,3),
(1,4), (1,5), (2,3), (2,4), (2,5), (3,4), (3,5), (4,5), (1,6), (2,6),
(3,6), (4,6), (5,6). In this case, the 10 pairs of sequences have
to be realigned: (1,2), (1,3), (1,4), (1,5), (2,3), (2,4), (2,5),
(3,4), (3,5), (4,5).
Fig. 4. The Cache system [7]
Two mentioned above examples give a motivation to use
cache technique so that it reuses the already aligned pairs in
the previous steps or executed application.
To demonstrate above ideas, the algorithms were modified
from available source code, using CaLi library [7] and MSA
program using the cache technique proposed by Zola [5].
Moreover, we developed for global progressing algorithm, and
deployed them the grid computing system.
The cache system architecture is described in Fig. 4. Each
worker has a CaLi cache manager that manages its cache and
communicates with other cache manager through CaLi cache
communicator. The CaLi cache is deployed using MPI-2
(Message Passing Interface standard 2) library, so it could be
run on both the parallel system and the Grid computing
system. The CaLi cache uses hash table structure to store and
retrieve data. There are two level of cache: L1 cache and L2
cache. Whereas the L1 cache allows program to store and
retrieve data on internal memory, the L2 cache allows
program to do on external memory. Depending on the
system’s capability, users will choose to use L1 cache, L2
cache or both of them.
C. Grid computing
Grid computing is known as a distributed system that
connects many computer systems having different hardware
and platforms (operating system, system software). It allows
applications to run in parallel on multiple machines, clusters,
or systems (VO - Virtual Organization). The system is suitable
for solving the problems that require a large amount of
computation as well as storage capacity. In our research, the
Grid system is used for execution bioinformatics problems.
IV. EPERIMENTAL RESULTS
In this paper, the multiple sequence alignment using cache
technique on IOIT-HCM Grid System. This system is
connected to PRAGMA system (Fig. 5). There are two
clusters with two head nodes: venus.ioit-hcm.ac.vnand
moon.ioit-hcm.ac.vn. Each cluster includes 5 nodes (1 head
node and 4 worker nodes).
In these experiments, the data from the Data Bank NCBI
(National Center for Biotechnology Information) were used.
The system provides resources for researcher in the field of
Bioinformatics and computational biology.
A. First data
In this test pattern, the program for 60 sequences was
executed. Each sequence contains 800 characters. The result
shows as follows (Fig. 6a, 6b):
Le Van Vinh et al. 49
Fig. 5. IOIT-HCM Grid System
Using parallel algorithm without cache, the time is 344,42
sec ≈ 5,74 min.
Using parallel algorithm with cache, the time is 270,17 sec
≈ 4,5 min.
B. Second Data
In the second test data, the program for 90 sequences with
800-character length was executed. 60 sequences in this test
have used in the first data.
Fig. 6a. Parallel execution
Fig. 6b. Parallel execution with cache
Fig. 7a. Parallel execution
International Journal of Computer Science and Telecommunications [Volume 3, Issue 7, July 2012] 50
Fig. 7b. Parallel execution with cache
Fig. 7c. Parallel execution with cache
The execution time decreases significantly in the second test
as follows:
• Using parallel algorithm without cache: 727,78 sec≈
12,13 min
• Using parallel algorithm with cache: 534,62 sec≈ 8,91
min
• Using parallel algorithm with cache the second time (for
the same data): 19,46 sec≈ 0,32 min.
The output of program is guide tree and list of aligned
sequences:
C. Guide Tree
D. List of aligned sequences
V. CONCLUSIONS AND FUTURE WORKS
The improvement of algorithm using cache technique allows
us to reduce computational cost because the results calculated
in previous times not to be calculated in the progressive
process. Besides, Grid computing is also an effective
environment to deploy the Multiple Sequence Alignment
program. The combination of two techniques brings us many
benefitsto increase program’s execution speed. However, the
parallel cache requires large amount of shared storage space,
and network bandwidth is also a factor affecting the
processing speed of the program. Our future work is studying
suitable cache techniques to solve next-generation sequencing
problems. In addition, we continue investigating the
usefulness of parallel and distributed systems to boost the
performance of the bioinformatics programs.
Le Van Vinh et al. 51
ACKNOWLEDGMENT
Thank to the research group of Czestochowa University of
Technology Poland and Zola. They have developed CaLi
library [7] and MSA program using the cache technique with
their proposed methods [5].We used their library and source
codes for our modified algorithms.
REFERENCES
[1] Waqar Haque, Alex Aravind, Bharath Reddy. Pairwise
Sequence Alignment Algorithms – A Survey. ISTA’09, ACM
978, 2009.
[2] Yang, Hong Zhao, Ding-Yuan. Research on Multiple
Sequences Alignment Algorithm. IEEE ICCIS, 2011.
[3] Cedric Notredame. Recent revolutions of Multiple Sequence
Alignment Algorithms: A survey. PLoS Computational
Biology. Volume 3, 2007.
[4] Balamash A., Krunz M.. An overview of web caching
replacement algorithms. IEEE Comm. Surv. & Tut., 6(2):44–
56, 2004.
[5] Jaroslaw Zola, Parallel Server for Multiple Sequence
Alignment, L’Institut National Polytechnique de Grenoble,
France, 12/2005.
[6] Grasso C, Lee C. Combining partial order alignment and
progressive multiple sequence alignment increases alignment
speed and scalability to very large alignment problems.
Bioinformatics 20, 2004.
[7] Cali library:
[8] NeedlemanS. B. and WunschC. D., A general method
applicable to the search for similarities in the aminoacid
sequence of two proteins. Journal of Molecular Biology, vol.
48, pp. 443-453, 1970.
[9] Feng D.F., Doolittle R.F.,Progressive sequence alignment as
a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25,
351-360, 1987.
[10] Hogeweg P, Hesper B.,The alignment of sets of sequences and
the construction of phylogenetic trees. An integrated method.
J. Mol. Evol. 20, 175-186, 1984.
[11] Taylor W.R.,A flexible method to align large numbers of
biological sequences. J. Mol. Evol. 28, 161-169, 1988.
[12] L. Batzoglou, J. Pachter, B. Mesirov, B. Berger, and E. S.
Lander, Human and mouse gene structure: comparative
analysis and application to exon prediction. RECOMB 00:
Proc of the 4th Int'l Conference on Computational Molecular
Biology, 2000, pp. 46-53.
[13] The library for parallel programming MPI:
forum.org/docs/docs.html
[14] Kuo-Ben Li. ClustalW-MPI: ClustalW analysis using
distributed and parallel computing. Bioinformatics, 2003.
[15] Katoh, Ketal. MAFFT: a novel method for rapid multiple
sequence alignment based on fast Fourier transform.
NucleicAcidsRes., 2003.
[16] Kazutaka Katoh, Hiroyuki Toh. Parallelization of the MAFFT
multiple sequence alignment program. Bioinformatics, 2010.
[17] G. Sudha Sadasivam, G. Baktavatchalam. A novel approach
to Multiple Sequence Alignment using Hadoop Data Grids.
MDAC’10, 2010.
[18] K. Arumugam, Y.S. Tan, B.S. Lee, R. Kanagasabai, Cloud-
enabling Sequence Alignment with Hadoop MapReduce: A
Performance Analysis. IPCBEE, 2012.
[19] Xun Tu, Kajal T. Claypool and Cindy X.Chen. ACache:
Using Caching to Improve the Performance of Multiple
Sequence Alignments. IEEE SSDBM’06, 2006.
Các file đính kèm theo tài liệu này:
- multiple_sequence_alignment_on_the_grid_computing_using_cach.pdf