Multiple Sequence Alignment on the Grid Computing using Cache Technique ISSN 2047-3338

The improvement of algorithm using cache technique allows us to reduce computational cost because the results calculated in previous times not to be calculated in the progressive process. Besides, Grid computing is also an effective environment to deploy the Multiple Sequence Alignment program. The combination of two techniques brings us many benefitsto increase program’s execution speed. However, the parallel cache requires large amount of shared storage space, and network bandwidth is also a factor affecting the processing speed of the program. Our future work is studying suitable cache techniques to solve next-generation sequencing problems. In addition, we continue investigating the usefulness of parallel and distributed systems to boost the performance of the bioinformatics programs.

6 trang | Chia sẻ: huongthu9 | Lượt xem: 345 | Lượt tải: 0

Bạn đang xem nội dung tài liệu Multiple Sequence Alignment on the Grid Computing using Cache Technique ISSN 2047-3338, để tải tài liệu về máy bạn click vào nút DOWNLOAD ở trên

International Journal of Computer Science and Telecommunications [Volume 3, Issue 7, July 2012] 46 Journal Homepage: www.ijcst.org Abstract—Multiple sequence alignment is an important problem and popular in the molecular biology. This is a basic problem that its solution could be used to proof and discover the similarity of the new sequence with other exist sequences; to define the evolution process of the family’s sequences; as well as to support the protein structure prediction, etc. In this paper, we consider and improve the global progressing algorithm by using the cache storage technique to make the best of the previous alignment results. This algorithm and cache technique were been developed on the distributed system and Grid computing environment in order to decrease the algorithms execution time, as well as increase the quantity and size of input sequences. Index Terms—Biological Sequences, DNA, Protein, Grid Computing and Distributed Computing I. INTRODUCTION ULTIPLE Sequence Alignment (MSA) is a sequence alignment of three or more biology sequences such as DNA, RNA, or protein. The result of the task can be used to infer sequence homology and conduct phylogenetic analysis to assess the sequences shared evolutionary origins [2], [3]. The accuracy and execution time are major factors requiring the attention of researchers. Some popular approaches used for MSA algorithms are exact solution, progressive methods, iterative methods, or methods based on Hidden Markov Models. Each method has its advantages and disadvantages. Biologists are the persons who decided suitable method to process their biological data. The multiple sequence alignment is the problem with exponential complexity. Over the years, researcher efforts in finding different algorithms or mathematical models that require low computational cost as well as ensure accuracy. A lot of popular algorithms were proposed such as ClustalW, T-Coffee. However, when the number of sequences increases, the software’s processing speed becomes slow because the limitation of a single computer. Thus, parallel and distributed systems are used to gain the efficiency for the programs. 1HCM City University of Technical Education, Vietnam Ministry of Education and Training, vinhlv@fit.hcmute.edu.vn 2 Institute of Applied Mechanics and Informatics, Vietnam Academy of Science and Technology, {tvlang,ntthudu}@vast-hcm.ac.vn 3 Lac Hong University, Vietnam Ministry of Education and Training In the field of applying parallel, Grid or Cloud computing to solve the MSA problem, last decade witnessed many contributions. ClustalW-MPI [14] is the parallel implementation of ClustalW software, using MPI (Message Passing Interface) library. It can be deployed on workstation clusters with distributed memory architecture. MAFFT [15] is also modified to parallelized version by using the POSIX Threads library [16]. Other researchers aimed to get the usefulness of Grid or Cloud computing for their improvement [17], [18]. Cache technique is another strategy applied in MSA algorithms. It helps programs to reuse the results of previous works so that they could decrease execution time. Zola [5]is the first person use the technique to improve MSA algorithms. He proposed an MSA program and developed it with CaLi cache library [7]. Xun Tu et al. [19] also proposed a new caching technique with two novel cache replacement policies and did some experiments in progressive algorithms. However, their approach was not deployed on distributed system. This paper presents the research results on applying cache technique for the global progressing algorithm to reduce computational cost. In addition, it also describes the algorithm and cache technique on Grid computing environment to decrease execution time. The paper’s content includes 4 sections. In the first and second section, the introduction and related works were described; the Pair-wise Sequence Alignment (PSA) and Multiple Sequence Alignment algorithms were discussed more detail. The main content of paper is the section 3, where the reasons and works were done to improve the global progressing algorithms was presented. The last sections show the experimental results and the discussions. II. MULTIPLE SEQUENCE ALIGNMENT ALGORITHM The sequence alignment algorithms can be classified into two types: the Pairwise Alignment and the Multiple Alignment. The Pairwise Alignment algorithm is commonly used as the first step in MSA algorithm. A. Pairwise Sequence Alignment In biology, the evolution of organisms is caused by many different factors. However, the major factor is the change in the structure of DNA/RNA including insertion or deletion Le Van Vinh 1 , Tran Van Lang 2 , Nguyen Thi Thu Du 2 and Vo Hong Bao Chau 3 M Multiple Sequence Alignment on the Grid Computing using Cache Technique ISSN 2047-3338 Le Van Vinh et al. 47 mutation, leading to protein structural change. The Pairwise Sequence Alignment algorithms can help biologists discover the changes. Thus, they can find out the relationship between two individual organisms. This problem is fundamental for many other problems in bioinformatics such as finding homologous sequences, motif finding or MSA. For example, there are two sequences with different length: CGACTAAATCAAGCCGCCTTTATTGCCTCACGCCCAGGGGTCTTTTCG CGACTTCATCAAACCATTTATTGCTACTCGTCCGGGAGTTTTTACG The one possible alignment can be seen as follows: CGACTAAATCAAGCCGCCTTTATTGCCTCACGCCCAGGGGTCTTTT-CG CGACTTCATCAAAC--CATTTATTGCTACTCGTCCGGGAGT-TTTTACG The Pairwise Sequence Alignment algorithm arranges and finds similar regions between two sequences. In some regions, the special characters, called gap “-“, are added. There are two groups of those algorithms such as the global alignment methods (Fig. 1) and the local alignment methods (Fig. 2). Fig. 1. Global Alignment Algorithm Fig. 2. Local Alignment Algorithm The global alignment algorithms aim to align two sequences in their entirety how to get the largest number of matched elements and the least number of gaps. Some of the popular global sequence alignment algorithms are Needleman-Wunch [8], GLASS [12]. In contrast, the local alignment algorithms [1] attempt to find the best similar subsequence within two sequences. Some the local alignment algorithms are well- know such Smith-Waterman, FASTA, and BLAST. B. Multiple Sequence Alignment The problem of Multiple Sequence Alignment are defined as follows: Given n sequences S1, S2, , Sn, multiple sequence alignment is obtained by inserting gap characters (“-“) in each sequence to make them all the same length l. Thus, the sequences can be stored to an array with n row and l column. Each item of array is the character standing for the four nucleic acid bases (A, C, G, T) or gap character (“-“). There exist several strategies are used in MSA algorithms: C. Exact algorithms The algorithm is based on the Needleman-Wunsch algorithm [8]. It always produces the optimal alignment by using a dynamic backtracking algorithm. However, the weakness of the algorithms is that the amount of time and memory requiring tends to grow exponentially with the number of sequences. There are some MSA algorithms uses this strategy including MSA, DCA. D. Progressive algorithms The progressive algorithm (such as ClustalW) is known as the simplest and most effective strategy in case of aligning sequences in little time and with little memory[3], [6]. The algorithm obtains to find the sequences closely related by using guide tree. It was initially proposed by Hogeweg [10] and later improved by Feng-Dolittle [9] and Taylor [11]. The algorithm’s idea is as follows: The first step is to perform pairwise sequence aligning for all pairs of sequences. In the next step, guide tree is constructed and used in the processing step. In this step, all sequences are aligned based on guide tree. This algorithm is used in our research to illustrate the effectiveness of the use of the cache and Grid computing technique. E. Iterative algorithms The algorithms work similarly to progressive strategy but repeatedly align the initial sequences and adding new sequence to grow the number of sequence in alignment process. It aims to reduce the inherent errors occurring in progressive method. This method is commonly used for local alignment algorithms. MUSCLE is known as the popular multiple sequence alignment using iterative method. F. HMM-based algorithms Hidden Markov Model is method using probabilistic model can produces both global and local alignment. The algorithms offer significant improvements in computational cost. Several HMM-based MSA software are popular such as POA, SAM. III. CACHE TECHNIQUE AND GRID COMPUTING FOR PROGRESSIVE ALGORITHMS In this paper, the progressive algorithm was improved by appling cache technique in order to reduce computational cost. A. Progressive Algorithm There are three basic steps as follows in this algorithm: • Step 1: Pairwise sequence alignment This step aims to align all pairs of sequences and determine the similarity between those sequences. Some pairwise sequence alignment algorithms can be used in this step, for instance, Sum of pairs, Linear – Sankoff. For each pair of sequences, the algorithm produces a score value that shows the distance between two sequences. A distance matrix n n (n is the number of sequences) is used to store all values. • Step 2: Construct a guide tree Create guide tree by using a bottom-up clustering method NJ (Neighbor-Joining) proposed by Saitou and Nei. This × International Journal of Computer Science and Telecommunications [Volume 3, Issue 7, July 2012] 48 Fig. 3. Progressing phonetic tree shows the relationship between all given sequences. • Step 3: Progressing Using guide tree, group the sequences closely relative together first. Then, combine the group closely relative together. The process repeated until all sequences are grouped. We have three kinds of grouping (Fig. 3): Sequence to sequence, sequence to profile, and profile to profile. (Profile is a group of sequences closely relative). B. Cache Technique The cache is a collection of copies of objects accessed frequently by users. The objects are memory pages or the site, the records of the database or files. Users may be operating systems or computing program. One common application of this technique is web cache. By using proxy servers or browser cache, the documents (or data) located temporal locality or spatial locality are brought closer to users. Thus, it reduces response time required by users significantly. However, one of the major factors that affecting the system’s performance is cache replacement policy when the cache is full. The Cache replacement policy aims to get optimal parameters such as ratio of document found, latency time, total cost. The other important feature of the cache system is data synchronization so that the users always access the latest documents (or data). This technique is similar to the one used to synchronize files or data of distributed database systems. The main feature is concerned in data synchronization process is data consistency. It is categorized into two types: weak consistency and strong consistency [4]. The weakness of cache technique is that it uses large amount of storage space. So depending on the system’s capabilities and users’ needs, they will decide whether to use the technique or not. In this modified algorithm, the progressing step (step 3) reuses the result of the Pairwise Sequence Alignment step (step 1). For instance, in the first test there are five sequences 1, 2, 3, 4, 5. In step 1, the following pairs of sequence are aligned: (1, 2), (1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (3, 5), (4, 5). Then the guide tree is constructed (Fig. 3). In step 3, two pairs of sequence (1, 5), (2, 3) are realigned. In addition, to align six sequences 1, 2, 3, 4, 5, 6 (sequences from 1 to 5 have aligned in the first test) there are following steps: Step 1 needs to align 15 pairs of sequences: (1,2), (1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (3,4), (3,5), (4,5), (1,6), (2,6), (3,6), (4,6), (5,6). In this case, the 10 pairs of sequences have to be realigned: (1,2), (1,3), (1,4), (1,5), (2,3), (2,4), (2,5), (3,4), (3,5), (4,5). Fig. 4. The Cache system [7] Two mentioned above examples give a motivation to use cache technique so that it reuses the already aligned pairs in the previous steps or executed application. To demonstrate above ideas, the algorithms were modified from available source code, using CaLi library [7] and MSA program using the cache technique proposed by Zola [5]. Moreover, we developed for global progressing algorithm, and deployed them the grid computing system. The cache system architecture is described in Fig. 4. Each worker has a CaLi cache manager that manages its cache and communicates with other cache manager through CaLi cache communicator. The CaLi cache is deployed using MPI-2 (Message Passing Interface standard 2) library, so it could be run on both the parallel system and the Grid computing system. The CaLi cache uses hash table structure to store and retrieve data. There are two level of cache: L1 cache and L2 cache. Whereas the L1 cache allows program to store and retrieve data on internal memory, the L2 cache allows program to do on external memory. Depending on the system’s capability, users will choose to use L1 cache, L2 cache or both of them. C. Grid computing Grid computing is known as a distributed system that connects many computer systems having different hardware and platforms (operating system, system software). It allows applications to run in parallel on multiple machines, clusters, or systems (VO - Virtual Organization). The system is suitable for solving the problems that require a large amount of computation as well as storage capacity. In our research, the Grid system is used for execution bioinformatics problems. IV. EPERIMENTAL RESULTS In this paper, the multiple sequence alignment using cache technique on IOIT-HCM Grid System. This system is connected to PRAGMA system (Fig. 5). There are two clusters with two head nodes: venus.ioit-hcm.ac.vnand moon.ioit-hcm.ac.vn. Each cluster includes 5 nodes (1 head node and 4 worker nodes). In these experiments, the data from the Data Bank NCBI (National Center for Biotechnology Information) were used. The system provides resources for researcher in the field of Bioinformatics and computational biology. A. First data In this test pattern, the program for 60 sequences was executed. Each sequence contains 800 characters. The result shows as follows (Fig. 6a, 6b): Le Van Vinh et al. 49 Fig. 5. IOIT-HCM Grid System Using parallel algorithm without cache, the time is 344,42 sec ≈ 5,74 min. Using parallel algorithm with cache, the time is 270,17 sec ≈ 4,5 min. B. Second Data In the second test data, the program for 90 sequences with 800-character length was executed. 60 sequences in this test have used in the first data. Fig. 6a. Parallel execution Fig. 6b. Parallel execution with cache Fig. 7a. Parallel execution International Journal of Computer Science and Telecommunications [Volume 3, Issue 7, July 2012] 50 Fig. 7b. Parallel execution with cache Fig. 7c. Parallel execution with cache The execution time decreases significantly in the second test as follows: • Using parallel algorithm without cache: 727,78 sec≈ 12,13 min • Using parallel algorithm with cache: 534,62 sec≈ 8,91 min • Using parallel algorithm with cache the second time (for the same data): 19,46 sec≈ 0,32 min. The output of program is guide tree and list of aligned sequences: C. Guide Tree D. List of aligned sequences V. CONCLUSIONS AND FUTURE WORKS The improvement of algorithm using cache technique allows us to reduce computational cost because the results calculated in previous times not to be calculated in the progressive process. Besides, Grid computing is also an effective environment to deploy the Multiple Sequence Alignment program. The combination of two techniques brings us many benefitsto increase program’s execution speed. However, the parallel cache requires large amount of shared storage space, and network bandwidth is also a factor affecting the processing speed of the program. Our future work is studying suitable cache techniques to solve next-generation sequencing problems. In addition, we continue investigating the usefulness of parallel and distributed systems to boost the performance of the bioinformatics programs. Le Van Vinh et al. 51 ACKNOWLEDGMENT Thank to the research group of Czestochowa University of Technology Poland and Zola. They have developed CaLi library [7] and MSA program using the cache technique with their proposed methods [5].We used their library and source codes for our modified algorithms. REFERENCES [1] Waqar Haque, Alex Aravind, Bharath Reddy. Pairwise Sequence Alignment Algorithms – A Survey. ISTA’09, ACM 978, 2009. [2] Yang, Hong Zhao, Ding-Yuan. Research on Multiple Sequences Alignment Algorithm. IEEE ICCIS, 2011. [3] Cedric Notredame. Recent revolutions of Multiple Sequence Alignment Algorithms: A survey. PLoS Computational Biology. Volume 3, 2007. [4] Balamash A., Krunz M.. An overview of web caching replacement algorithms. IEEE Comm. Surv. & Tut., 6(2):44– 56, 2004. [5] Jaroslaw Zola, Parallel Server for Multiple Sequence Alignment, L’Institut National Polytechnique de Grenoble, France, 12/2005. [6] Grasso C, Lee C. Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics 20, 2004. [7] Cali library: [8] NeedlemanS. B. and WunschC. D., A general method applicable to the search for similarities in the aminoacid sequence of two proteins. Journal of Molecular Biology, vol. 48, pp. 443-453, 1970. [9] Feng D.F., Doolittle R.F.,Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25, 351-360, 1987. [10] Hogeweg P, Hesper B.,The alignment of sets of sequences and the construction of phylogenetic trees. An integrated method. J. Mol. Evol. 20, 175-186, 1984. [11] Taylor W.R.,A flexible method to align large numbers of biological sequences. J. Mol. Evol. 28, 161-169, 1988. [12] L. Batzoglou, J. Pachter, B. Mesirov, B. Berger, and E. S. Lander, Human and mouse gene structure: comparative analysis and application to exon prediction. RECOMB 00: Proc of the 4th Int'l Conference on Computational Molecular Biology, 2000, pp. 46-53. [13] The library for parallel programming MPI: forum.org/docs/docs.html [14] Kuo-Ben Li. ClustalW-MPI: ClustalW analysis using distributed and parallel computing. Bioinformatics, 2003. [15] Katoh, Ketal. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. NucleicAcidsRes., 2003. [16] Kazutaka Katoh, Hiroyuki Toh. Parallelization of the MAFFT multiple sequence alignment program. Bioinformatics, 2010. [17] G. Sudha Sadasivam, G. Baktavatchalam. A novel approach to Multiple Sequence Alignment using Hadoop Data Grids. MDAC’10, 2010. [18] K. Arumugam, Y.S. Tan, B.S. Lee, R. Kanagasabai, Cloud- enabling Sequence Alignment with Hadoop MapReduce: A Performance Analysis. IPCBEE, 2012. [19] Xun Tu, Kajal T. Claypool and Cindy X.Chen. ACache: Using Caching to Improve the Performance of Multiple Sequence Alignments. IEEE SSDBM’06, 2006.

Các file đính kèm theo tài liệu này:

multiple_sequence_alignment_on_the_grid_computing_using_cach.pdf