Taxonomic assignment for large-scale metagenomic data on high-performance systems


Journal of Computer Science and Cybernetics, V.33, N.2 (2017), 119–130
DOI 10.15625/1813-9663/33/2/10753

TAXONOMIC ASSIGNMENT FOR LARGE-SCALE METAGENOMIC DATA ON HIGH-PERFORMANCE SYSTEMS

LE VAN VINH1, TRAN VAN HOAI2, DUONG NGOC HIEU2, BUI XUAN GIANG2, TRAN VAN LANG3,4

1Faculty of Information Technology, HCMC University of Technology and Education
2Faculty of Computer Science and Engineering, Bach Khoa University
3Institute of Applied Mechanics and Informatics, VAST
4Lac Hong University
vinhlv@fit.hcmute.edu.vn

Abstract. Metagenomics is a powerful approach to studying environmental samples that does not require the isolation and cultivation of individual organisms. One of the essential tasks in a metagenomic project is to identify the origin of reads, referred to as taxonomic assignment. Because each metagenomic project has to analyze large-scale datasets, the assignment of metagenomic reads is computationally intensive. This study proposes a parallel algorithm for the taxonomic assignment problem, called SeMetaPL, which aims to address this computational challenge. The proposed algorithm is evaluated with both simulated and real datasets on a high-performance computing system. Experimental results demonstrate that the algorithm achieves good performance and utilizes the resources of the system efficiently. The software implementing the algorithm and all test datasets can be downloaded at

Keywords. DNA sequences, homology search, metagenomics, parallel algorithm, taxonomic assignment

1. INTRODUCTION

Metagenomics is the study of genomic content derived directly from complex microbial environments instead of from laboratory cultures. The discipline offers opportunities to discover microbial communities, and thus brings benefits to many fields, e.g., biotechnology, agriculture, and earth sciences [5].
© 2017 Vietnam Academy of Science & Technology

Earlier metagenomic projects were costly because obtaining genomic information directly from microbial samples was limited by traditional sequencing technologies (e.g., Sanger sequencing). Fortunately, next-generation sequencing (NGS) techniques, e.g., 454 pyrosequencing, Illumina Genome Analyzer, and AB SOLiD [13], are able to process a large amount of biological data at low cost, which makes metagenomic projects feasible. However, they also pose computational challenges for the analysis of metagenomic reads [9, 15].

Taxonomic assignment is an important task in a metagenomic project. The task aims to group reads into bins and to determine phylogenetic relationships between the reads and known taxa. Taxonomic assignment algorithms can be roughly classified into composition-based methods and homology-based methods. Composition-based methods (e.g., TACOA [3], AKE [8]) classify reads by extracting genomic signatures (e.g., oligonucleotide frequencies, GC-content) from the reads themselves. Although these methods are fast, they have difficulty analyzing short reads [10]. Recent taxonomic assignment methods (MEGAN [7], CARMA3 [4], MetaCluster-TA [18]) are mainly based on homology. Blast [1] is one of the most commonly used tools for extracting homology information between sequences. These algorithms have been demonstrated to work well with both short and long reads. However, a remaining challenge is that they are computationally expensive [9].

In a previous work, we proposed a semi-supervised taxonomic assignment method for metagenomic reads, called SeMeta [17]. It consists of two steps and utilizes both composition and homology features. In the first step, the method clusters the reads and chooses representatives of the clusters. The second step performs a homology search with the Blast algorithm to find the relation between the representatives and known species in reference databases.
SeMeta requires much less computational time than homology-only algorithms. However, it is still time-consuming. For instance, SeMeta spends 187.67 hours analyzing a dataset of 428674 reads belonging to 10 genomes [17] from the NCBI (National Center for Biotechnology Information) database. This raises the need for high-performance computing techniques to boost classification performance.

Some metagenomic applications based on high-performance computing techniques have been proposed in the literature. MrMC-MinH [12] is a map-reduce framework which aims to cluster metagenomic reads. Another taxonomic clustering method for 16S environmental datasets, proposed by Yang et al. [19], also achieves a cloud-based implementation by using the map-reduce framework. Parallel-META [14] is a high-performance computational pipeline for analyzing metagenomic data. It uses GPU and multi-core CPU technology to parallelize the homology search process and speed up computation. Besides, mpiBlast is a parallel version of the Blast tool. It separates a database into different parts and uses MPI (Message Passing Interface) technology to perform the homology search in a distributed manner.

This work proposes a parallel taxonomic assignment algorithm for metagenomic sequences using MPI, called SeMetaPL. The proposed method is an improvement of SeMeta in which the taxonomic assignment step is parallelized to reduce computational time. The algorithm is evaluated on a cloud-based high-performance computing system with both simulated and real datasets. Three aspects of the virtualized resources of the system are considered: memory size, number of CPUs, and number of virtual machines.

In the rest of the paper, Section 2 presents the details of the proposed algorithm. Section 3 provides experimental results. Some conclusions are presented in the final section.

2. METHODS

2.1. Classification of metagenomic reads with SeMeta

SeMeta [17] is a semi-supervised taxonomic assignment method for the classification of metagenomic reads. It uses both composition and homology features of sequences in the classification process, and works well with short reads of sufficient mutual coverage. The algorithm consists of two major steps (Figure 1) as follows.

- Step 1: Clustering
This step separates reads into clusters of closely related organisms based on composition features (l-mer frequency) and sequence overlapping information. The algorithm then selects a representative, called a core, of each cluster. The size of a core is usually smaller than that of the corresponding cluster. Some reads of extremely low-abundance genomes are not clustered in this step, but are still considered as a cluster.

- Step 2: Taxonomic assignment
This step first performs the homology search between reads in the cores of clusters and reference databases using the Blast tool. The algorithm measures homology locally instead of attempting to align two sequences over their entire lengths. It first detects locations of similarity between the sequences, and then extends them without gaps. Finally, a substitution matrix is used to compute the degree of similarity between the sequences. After the homology search is performed, the core of each cluster is assigned to a taxon in the phylogenetic tree. Each cluster is labeled with the taxon assigned to its core. In the post-processing task, clusters having the same label are merged into a larger cluster. Reads that do not match the reference database, or that are assigned at the highest level of the phylogenetic tree, are regarded as unassigned reads. Experimental results in [17] show that this step is the bottleneck of SeMeta because it requires much computation time.
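The post-processing task described in Step 2 can be illustrated with a minimal sketch. It is not the SeMeta implementation: the function name `merge_clusters_by_label`, the `ROOT` constant, and the dictionary layout of the inputs are assumptions made only for this illustration. It merges clusters whose cores received the same taxon and collects reads with no match, or with a label at the highest level of the tree, as unassigned.

```python
from collections import defaultdict

# Hypothetical marker for a label at the highest level of the phylogenetic tree.
ROOT = "root"


def merge_clusters_by_label(clusters, core_labels):
    """Merge clusters sharing a label and collect unassigned reads.

    clusters:    dict mapping cluster id -> list of read ids (assumed layout)
    core_labels: dict mapping cluster id -> taxon assigned to the cluster core,
                 or None when the core matched nothing in the reference database
    """
    merged = defaultdict(list)
    unassigned = []
    for cluster_id, reads in clusters.items():
        label = core_labels.get(cluster_id)
        if label is None or label == ROOT:
            # No match, or only assignable at the top of the tree.
            unassigned.extend(reads)
        else:
            # Every read of the cluster inherits the label of its core.
            merged[label].extend(reads)
    return dict(merged), unassigned


# Toy input: clusters 0 and 2 share a label, cluster 3 has no match.
clusters = {0: ["r1", "r2"], 1: ["r3"], 2: ["r4", "r5"], 3: ["r6"]}
core_labels = {0: "taxon A", 1: "taxon B", 2: "taxon A", 3: None}
merged, unassigned = merge_clusters_by_label(clusters, core_labels)
```

In this toy run, clusters 0 and 2 end up in one merged cluster for "taxon A", and the single read of cluster 3 is reported as unassigned. Labeling whole clusters rather than individual reads is what keeps the number of homology searches small, at the price that one wrong core label affects every read in that cluster.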
Figure 1. Process of SeMeta using the Blast algorithm. Step 1 separates reads into clusters and builds cluster cores. Step 2 performs a homology search between the cores and reference sequences, then labels each cluster [17].

2.2. Proposed algorithm

Because SeMeta is limited when processing large-scale datasets, this work proposes a parallelized algorithm, SeMetaPL, which reduces computational time and utilizes the resources of high-performance systems efficiently. The method consists of the following steps (Figure 2).

- Step 1: Clustering in single mode
This step is performed at the server node in single mode, in the same way as the clustering step of SeMeta. The lists of reads in the cores of clusters are selected from the input file and delivered to all computer nodes (or placed in shared storage).

- Step 2: Taxonomic assignment in parallel mode

+ Homology searching with mpiBlast
This task uses the mpiBlast algorithm [2] to determine the degree of similarity between reads in the cores of clusters and a reference database. mpiBlast is a parallelization of Blast using MPI (Message Passing Interface). The algorithm speeds up the homology search between sequences and a reference database by segmenting the database. mpiBlast lets each node in the computing system search only a portion of the database, and thus reduces disk input/output significantly. Furthermore, the segmentation of the database does not generate heavy intercommunication between nodes. Let n be the number of computer nodes. The reference database is divided into at least n fragments and stored on shared disks. There are two scenarios for using the fragments. In the first scenario, the database fragments are always kept in shared storage and computer nodes access them remotely at runtime.
In the second scenario, database fragments are distributed to the local storage of each computer node and accessed locally.

+ Labeling cores of clusters
Let k be the number of clusters generated by step 1. Each worker node labels k/(n − 1) clusters. If k < n − 1, only k nodes are used to label clusters. The remaining node is used to label unclustered reads. Algorithm 1 shows the activities of the master node. It computes the ranges of clusters and sends them to the workers that have to process them. The master then determines labels for unclustered reads. Finally, it receives the labeling results of the clusters from the worker nodes. Each worker receives a range of clusters from the master and labels the clusters (Algorithm 2). It then sends the cluster labels to the master.

+ Post processing
This task is done at the master node to merge clusters having the same label and to determine unassigned reads.

2.3. Performance metrics

Two metrics, sensitivity and precision, are used to evaluate the proposed method. They are defined as follows (in the same way as in [11, 17]). Let N be the number of reads, and C be the number of reads assigned by the classification algorithm. At taxonomic level i, let Xi be the number of reads which are assigned to the correct taxa exactly at or under that level. The two metrics can be calculated by the following formulations.
\[ \text{sensitivity (at level } i) = \frac{X_i}{N}, \qquad \text{precision (at level } i) = \frac{X_i}{C}. \]

For example, given a read originating from Bordetella avium, at genus level a labeling of the read as Bordetella, Bordetella bronchiseptica or Bordetella pertussis would increase Xi. The metrics are computed at four taxonomic levels: species, genus, family, and order.

Algorithm 1 Cluster labeling - master
Input: A list of clusters, a list of workers
Output: Labels of clusters
for Worker i do
    Compute range of clusters xi to yi for worker i
    for Cluster z, xi ≤ z ≤ yi do
        Send z to worker i
    end for
end for
Determine labels of unclustered reads
for Worker i do
    for Cluster z, xi ≤ z ≤ yi do
        Receive labels of z from worker i
    end for
end for

Algorithm 2 Cluster labeling - worker
Receive range of cluster x to y from master
for Cluster z, x ≤ z ≤ y do
    Determine label of cluster z
    Send label of z to master
end for

Figure 2. Process of SeMetaPL, using mpiBlast.

2.4. Datasets and reference databases

In order to generate datasets, we downloaded real bacterial genomes from the NCBI (National Center for Biotechnology Information) database. Three simulated datasets were created with the ART tool [6] following whole-genome shotgun sequencing techniques. The datasets, presented in Table 1, contain single-end reads with a length of 150bp and follow the Illumina error profile. SeMetaPL is also used to classify the Acid Mine Drainage (AMD) dataset [16], a real metagenome. It consists of 180,713 sequences, downloaded from the NCBI trace archive.

Table 1. Simulated datasets

Dataset  Species/Strain                      Coverage  No. of reads
ds1      Borrelia burgdorferi JD1            15        450
         Methylobacterium extorquens DM4     20        600
ds2      Marinomonas mediterranea MMB1       10        270
         Mycobacterium liflandii 128FXT      15        405
         Nitrosopumilus maritimus SCM1       15        405
ds3      Bordetella avium 197N               10        250
         Burkholderia xenovorans LB400       10        240
         Methanosarcina mazei Go1            15        375
         Neisseria meningitidis Z2491        15        375

The reference database used for analyzing the real metagenome is the entire bacterial RefSeq database (release 69, downloaded from the NCBI database), which takes approximately 24 GB after being formatted by mpiBlast. For the simulated datasets, because we need to run many scenarios, it is better to analyze them with a smaller database. Thus, a part of the bacterial RefSeq database of approximately 5.3 GB (after formatting) is used. All species in the tested datasets are contained in this database.

3. EXPERIMENTAL RESULTS

3.1. Experimental setup

Experiments on the simulated datasets are conducted on a virtualized system hosted on two physical machines. Each machine has 12 CPUs, 120 GB RAM, and 100 GB disk storage. The performance of SeMetaPL is evaluated with respect to different aspects of the virtual resources (memory size, number of virtual machines, number of processors). The performance of SeMetaPL is compared with that of SeMeta when using similar virtual resources. Besides, the classification quality of the two algorithms is also considered. SeMeta uses the Blast tool (version 2.4), downloaded from the NCBI website, for the homology search task. SeMetaPL performs the search task with the latest version of mpiBlast (version 1.6.0). This version of mpiBlast uses Blast 2.2.20.

For the real metagenome, a system with more computing resources is used. It consists of 9 virtual machines with 200 GB RAM and 5 TB shared disk storage.

3.2. Results

3.2.1. Effects of the numbers of processors on running time

In order to measure the performance of SeMetaPL on multiple-processor machines, this work generates 7 virtual machines with numbers of processors of 1, 3, 5, 7, 9, and 11, respectively. Other resources of the machines are similar. The number of processes running concurrently on each machine is set to 15. Note that the memory of each machine is sufficient for running all processes at the same time. These machines are also used to run the SeMeta algorithm on the same datasets (ds1, ds2, and ds3).

The line chart in Figure 3 presents the average running time of SeMetaPL and SeMeta for the three datasets. It can be seen that using multiple processors boosts the performance of SeMetaPL. For instance, the case of using five processors is approximately six times faster than the case of using one processor. When the number of processors used increases from 5 to 11, running time of SeMeta slightly decreases. SeMeta runs in single mode, and thus does not utilize the advantages of multiple-processor machines. The performance of SeMeta remains stable as the number of processors increases.

When the number of processors is higher than 3, SeMetaPL achieves much better performance than SeMeta. With 1 or 3 processors, SeMetaPL requires similar or higher running time than SeMeta. This is because SeMetaPL spends time scheduling tasks and exchanging jobs between processes. Besides, because many unshared-memory processes run concurrently, SeMetaPL consumes a larger amount of memory than SeMeta.

Figure 3. The performance of SeMetaPL and SeMeta with different numbers of processors (x-axis: number of processors; y-axis: running time in minutes).

3.2.2. Effects of the number of virtual machines and memory sizes on running time

This experiment considers the strength of SeMetaPL when running on a cluster of machines.
Twenty virtual machines are used in the experiment. Each machine has one processor. Two cases of memory size are considered. The first case tests 10 machines with a memory size of 3GB, while the second one tests 10 machines with a memory size of 6GB.

Figure 4 shows the results of the experiment. The line chart in the figure shows that the performance of SeMetaPL improves as the number of virtual machines increases. Increasing the number of machines from 2 to 5 reduces running time significantly. When the number of machines increases from 5 to 10, the running time of SeMetaPL decreases only moderately. This can be explained by the fact that disk input and output costs rise when the number of machines increases, which reduces the performance of the application. The results also demonstrate that memory size affects the performance of SeMetaPL. In the first case, machines have less memory, and thus require more running time than those of the second case for all tests.

Figure 4. The performance of SeMetaPL with different numbers of virtual machines, with cases of using 3GB RAM and 6GB RAM (x-axis: number of virtual machines; y-axis: running time in seconds).

3.2.3. Classification quality

The classification quality of SeMetaPL and SeMeta is also computed for the three datasets ds1, ds2, and ds3. Table 2 presents the precision and sensitivity of the two methods. It can be seen from the table that SeMetaPL and SeMeta return the same results for most of the test cases. This is expected because the classification technique in SeMetaPL is the same as the one in SeMeta. There are some different results at species and genus levels. The difference arises because the mpiBlast algorithm used in SeMetaPL is derived from a Blast version different from the one used in SeMeta.
Because the Blast tool used in SeMeta determines degrees of similarity between sequences better than the one used in SeMetaPL (according to the BLAST+ Release Notes on the NCBI website), the proposed algorithm returns lower sensitivity and precision values than SeMeta at species level. In addition, both methods identify labels for clusters instead of individual reads. Thus, if one of them fails to predict the label of a cluster at a specific level, its precision and sensitivity values will be much lower than those of the other one. For instance, SeMetaPL gets 56.95% higher sensitivity and 57.12% higher precision than SeMeta for dataset ds1 at genus level. Conversely, SeMeta achieves 23.06% higher sensitivity and 50.64% higher precision than SeMetaPL at species level for dataset ds3. At higher levels (family and order), the two algorithms achieve the same sensitivity and precision values in all cases.

Table 2. The classification quality of SeMetaPL and SeMeta on the datasets at different taxonomic levels

Dataset  Method    Metric  Species level  Genus level  Family level  Order level
ds1      SeMeta    Sen.    42.76%         42.76%       99.71%        99.71%
                   Pre.    42.88%         42.88%       100%          100%
         SeMetaPL  Sen.    N/A            99.71%       99.71%        99.71%
                   Pre.    N/A            100%         100%          100%
ds2      SeMeta    Sen.    24.72%         30.24%       61.94%        61.94%
                   Pre.    39.91%         30.34%       100%          100%
         SeMetaPL  Sen.    24.72%         30.24%       61.94%        61.94%
                   Pre.    39.91%         30.34%       100%          100%
ds3      SeMeta    Sen.    46.69%         64.84%       64.84%        64.84%
                   Pre.    67.09%         93.16%       93.16%        93.16%
         SeMetaPL  Sen.    23.64%         64.84%       64.84%        64.84%
                   Pre.    16.45%         93.16%       93.16%        93.16%

N/A = Not Available. The bold values indicate the best results among the algorithms in the aspect of sensitivity (Sen.) or precision (Pre.).

3.2.4. Results on AMD dataset

A previous study [16] revealed that the AMD dataset contains several dominant species. Among these species, Leptospirillum sp. Group II and Leptospirillum sp. Group III belong to bacteria, and three other species belong to archaea. It takes approximately 606 hours to analyze the dataset. Approximately 67.32% of the AMD sequences are assigned by SeMetaPL. The results of the experiment, presented in Table 3, support the previous study. Our algorithm detected the genus Leptospirillum, which accounts for 52.48% of the assigned sequences, and other bacterial organisms (47.52%). Although the reference database does not contain the two species Leptospirillum sp. Group III and Leptospirillum sp. Group II, SeMetaPL identified their genus due to the presence of other species belonging to that taxon in the database. Besides, because the experiment uses the bacterial RefSeq database, SeMetaPL could not detect the existence of archaeal organisms.

Table 3. Results of SeMetaPL on the AMD dataset using the bacterial RefSeq database

Detected organisms  Number of sequences  Ratio
Leptospirillum      63846                52.48%
Other organisms     57815                47.52%

4. CONCLUSIONS

Most current taxonomic assignment algorithms were developed for use on a single computer and are difficult to adapt to the growing volume of metagenomic data. In this work, we present a parallel taxonomic assignment algorithm to speed up the processing of large-scale metagenomic sequences. The main idea of SeMetaPL is to reduce the cost of the homology search and labeling tasks. Compared with a single-mode algorithm, SeMetaPL reduces computational time considerably while still obtaining similar accuracy. Besides, our algorithm has been shown to work well with a large-scale metagenome, and promises to be a useful tool for real metagenomic projects.

The proposed algorithm could be improved in several ways. Firstly, the implementation of SeMetaPL is based on mpiBlast, an available parallel algorithm, which currently does not receive the regular updates of the Blast tool. Thus, applying other tools or developing a parallelized homology search tool should be considered. Secondly, SeMetaPL still does not take advantage of the multi-core technology supported by most high-performance systems. This motivates us to improve the performance of SeMetaPL in future research. Finally, enhancing SeMetaPL to improve classification quality by utilizing the resources of high-performance systems will also be of concern to us.

Acknowledgment

This research was funded by HCMC University of Technology and Education, under contract number T2017-26TD. The authors would like to thank the Faculty of Computer Science and Engineering, Bach Khoa University for providing facilities for this study. The applications presented in this paper were tested on the High Performance Computing Center (HPCC) of the faculty.

REFERENCES

[1] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of molecular biology, vol. 215, no. 3, pp. 403–410, 1990.
[2] A. E. Darling, L. Carey, and W. C. Feng, “The design, implementation, and evaluation of mpiblast,” Los Alamos National Laboratory, Tech. Rep., 2003.
[3] N. N. Diaz, L. Krause, A. Goesmann, K. Niehaus, and T. W. Nattkemper, “Tacoa–taxonomic classification of environmental genomic fragments using a kernelized nearest neighbor approach,” BMC bioinformatics, vol. 10, no. 1, p. 56, 2009.
[4] W. Gerlach and J. Stoye, “Taxonomic classification of metagenomic shotgun sequences with carma3,” Nucleic acids research, vol. 39, no. 14, pp. e91–e91, 2011.
[5] J. Handelsman, The new science of metagenomics: Revealing the secrets of our microbial planet. The National Academies Press, 2007.
[6] W. Huang, L. Li, J. R. Myers, and G. T. Marth, “Art: a next-generation sequencing read simulator,” Bioinformatics, vol. 28, no. 4, pp. 593–594, 2011.
[7] D. H. Huson, S. Mitra, H. J. Ruscheweyh, N. Weber, and S. C. Schuster, “Integrative analysis of environmental sequences using megan4,” Genome research, vol. 21, no. 9, pp. 1552–1560, 2011.
[8] D. Langenkämper, A. Goesmann, and T. W. Nattkemper, “Ake-the accelerated k-mer exploration web-tool for rapid taxonomic classification and visualization,” BMC bioinformatics, vol. 15.
[9] S. S. Mande, M. H. Mohammed, and T. S. Ghosh, “Classification of metagenomic sequences: methods and challenges,” Briefings in bioinformatics, vol. 13, no. 6, pp. 669–681, 2012.
[10] M. H. Mohammed, T. S. Ghosh, N. K. Singh, and S. S. Mande, “Sphinx - an algorithm for taxonomic binning of metagenomic sequences,” Bioinformatics, vol. 27, no. 1, pp. 22–30, January 2011.
[11] R. Ounit, S. Wanamaker, T. J. Close, and S. Lonardi, “Clark: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers,” BMC genomics, vol. 16, no. 1, p. 236, 2015.
[12] Z. Rasheed and H. Rangwala, “A map-reduce framework for clustering metagenomes,” in Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International. IEEE, 2013, pp. 549–558.
[13] J. Shendure and H. Ji, “Next-generation dna sequencing,” Nature biotechnology, vol. 26, no. 10, pp. 1135–1145, 2008.
[14] X. Su, J. Xu, and K. Ning, “Parallel-meta: efficient metagenomic data analysis based on high-performance computation,” BMC Systems Biology, vol. 6, no. 1, p. S16, 2012.
[15] H. Teeling and F. O. Glöckner, “Current opportunities and challenges in microbial metagenome analysis - a bioinformatic perspective,” Briefings in bioinformatics, vol. 13, no. 6, pp. 728–742, 2012.
[16] G. W. Tyson, J. Chapman, P. Hugenholtz, E. E. Allen, R. J. Ram, P. M. Richardson, V. V. Solovyev, E. M. Rubin, D. S. Rokhsar, and J. F. Banfield, “Community structure and metabolism through reconstruction of microbial genomes from the environment,” Nature, vol. 428, no. 6978, pp. 37–43, 2004.
[17] V. Van Le, L. Van Tran, and H. Van Tran, “A novel semi-supervised algorithm for the taxonomic assignment of metagenomic reads,” BMC bioinformatics, vol. 17, no. 22, 2016.
[18] Y. Wang, H. C. M. Leung, S. M. Yiu, and F. Y. L. Chin, “Metacluster-ta: taxonomic annotation for metagenomic data based on assembly-assisted binning,” BMC Genomics, vol. 15, 2014.
[19] X. Yang, J. Zola, and S. Aluru, “Large-scale metagenomic sequence clustering on map-reduce clusters,” Journal of bioinformatics and computational biology, vol. 11, no. 01, p. 1340001, 2013.

Received on September 24, 2017
Revised on December 07, 2017
