TẠP CHÍ KHOA HỌC TRƯỜNG ĐẠI HỌC SƯ PHẠM TP HỒ CHÍ MINH
Tập 17, Số 6 (2020): 950-960
HO CHI MINH CITY UNIVERSITY OF EDUCATION JOURNAL OF SCIENCE
Vol. 17, No. 6 (2020): 950-960
ISSN: 1859-3100

Research Article
DISCRIMINATIVE MOTIF FINDING
TO PREDICT HCV TREATMENT OUTCOMES
WITH A SEMI-SUPERVISED FEATURE SELECTION METHOD
Nguyen Thi Tuong Vy1, Le Thi Nhan2*
1University of Lyon 1, Lyon, France
2University of Science, Vietnam National University Ho Chi Minh City, Vietnam
*Corresponding author: Le Thi Nhan – Email: ltnhan@fit.hcmus.edu.vn
Received: January 08, 2020; Revised: February 27, 2020; Accepted: June 02, 2020
ABSTRACT
Hepatitis C treatment is currently facing many challenges, such as high costs of medicines,
side effects in patients, and low success rates with Hepatitis C Virus genotype 1b (HCV-1b). In order
to identify what characteristics of HCV-1b cause drug resistance, many sequence analysis methods
are conducted, and bio-markers helping to predict failure rates are also proposed. However, the
results may be imprecise when these methods work with a dataset having a small number of labeled
sequences that are also short in length. In this paper, we aim to predict outcomes of the HCV-1b
treatment and characterize the properties of HCV-1b by using the combination of feature
selection and semi-supervised learning. Our proposed framework improves the prediction accuracy by about 5%
to 8% in comparison with previous methods. In addition, we obtain a set of good discriminative
subsequences that could be considered as biological signals for predicting a response or resistance
to HCV-1b therapy.
Keywords: discriminative motif; hepatitis C virus; sequential forward floating selection;
semi-supervised feature selection
1. Introduction
Hepatitis C is an infectious disease primarily caused by the Hepatitis C
Virus (HCV). This virus affects the liver, and after many years, it could lead to liver cirrhosis,
or more serious problems, including liver failure or liver cancer. According to the World
Health Organization (WHO), 71 million people worldwide are chronically infected with
HCV, and nearly 399,000 people die each year from cirrhosis and liver cancer. Antiviral
medicines for chronic HCV are combinations of pegylated interferon (PegIFN)-alpha and
ribavirin (RBV) (Manns et al., 2001). In recent years, this therapy has also been combined
with a new class of drugs, such as sofosbuvir (SOF), simeprevir (SIM), and daclatasvir (DCV), to reduce
side effects and shorten the duration of treatment. However, treatment often fails in almost
half of the cases, especially with HCV genotype 1b (HCV-1b) (Gao et al., 2010).
Therefore, knowing the sign of response or resistance, also known as sustained viral response
(SVR) or non-sustained viral response (non-SVR), to the above drugs before the treatment
is very important to alleviate distressing symptoms and expenses for patients.
Several methods for characterizing sequences and discovering motifs have already been
developed, such as position weight matrix (PWM) (Kim, & Choi, 2011; Bailey, Boden,
Whitington, & Machanick, 2010), hidden Markov model (HMM) (Lin, Murphy, & Bar-
Joseph, 2011), or association mining with domain knowledge (Vens, Rosso, & Danchin,
2011). Designed for general-purpose pattern searching, these methods are ineffective in the
case of short input sequences. Consequently, it is very difficult to estimate reliable pattern
probabilities when the dataset contains a small number of short and highly similar
sequences.
In this paper, we approach the characterization and prediction problems by using a
semi-supervised feature selection method. We develop a framework that uses both labeled
and unlabeled data to select effective feature subsets. Our proposed framework achieves
around 56% accuracy on average, compared with 50% for MEME (Bailey et al., 2010).
The results of sequence characterization are promising discriminative motifs which provide
physicians hints to understand thoroughly the resistance to IFN/RBV therapy of HCV-1b, as
well as to get a better treatment for patients.
2. Background
2.1. Sequence characterization
Data characterization is a summarization of common features of objects in a target
class of data in order to know properties of that class (Han, & Kamber, 2006). In the case of
sequence characterization, we summarize sequences to find patterns or motifs that can
represent a class of sequence data. A discriminative motif, in contrast, is defined as one that
occurs frequently in one class of sequences and hardly occurs in the other classes of
sequences. Therefore, these discriminative motifs help us to describe characteristics of a
class and then classify a sequence into a certain class.
Motif discovery methods often use a string-based model or probabilistic-based model
to represent discriminative motifs. In a string-based model, a motif is a short sequence of
letters which are nucleic acids in DNA/RNA sequences or amino acids in protein sequences.
Moreover, some letters may be wildcard characters that increase the variability of the motif.
A method that uses this model is MERCI (Motif EmeRging and with Classes Identification)
(Vens et al., 2011). It uses the idea of the Apriori algorithm to generate candidate motifs
during sequential pattern mining. This method accepts or eliminates motifs based on two
parameters: the minimal frequency in one dataset and the maximal frequency in another dataset.
In a probabilistic-based model, a motif is represented by Position Weight Matrix
(PWM) or Hidden Markov Model (HMM). PWM considers a motif as a matrix in which each
element is the probability of a given acid at a specified position, under an independence
assumption among positions. HMM describes a motif as a Markov process of hidden states,
where the probability of the current state of a letter depends only on its previous state, so
that these states are not necessarily independent (Wu, & Xie, 2010). A very popular tool to
find discriminative motifs nowadays is MEME (Multiple EM for Motif
Elicitation) (Bailey et al., 2010). MEME represents a motif as a PWM and assumes that each
sequence has zero or one motif. To discriminate motifs, MEME calculates a “position-
specific prior” (PSP) of each position in a sequence in order to measure the likelihood that a
motif starts at each position of a sequence. PSP plays the role of additional information to
assist the search by increasing the probability of start positions containing subsequences
that are commonly found in sequences of interest, as well as decreasing the probability of
start positions characterizing sequences that do not contain features of interest. In the work
by Lin et al. (2011), a motif is represented by a profile HMM. Because this model allows
insertions and deletions at positions of a sequence, and finding motifs amounts to finding
the hidden states of the sequences, it is more flexible than MEME. The parameters of
HMM were estimated by the maximum mutual information estimate technique to obtain the
optimum discriminative motifs.
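To make the PWM representation concrete, here is a minimal sketch: under the independence assumption, the probability of a window is the product of the per-position probabilities. The alphabet and the matrix values below are invented for illustration only.

import numpy as np

# Toy PWM over a three-letter alphabet for a motif of length 3 (invented numbers).
alphabet = {"A": 0, "C": 1, "T": 2}
pwm = np.array([[0.8, 0.1, 0.1],   # position 1
                [0.1, 0.7, 0.2],   # position 2
                [0.2, 0.2, 0.6]])  # position 3

def pwm_probability(window):
    """P(window | motif) = product of the position-wise probabilities."""
    return float(np.prod([pwm[i, alphabet[ch]] for i, ch in enumerate(window)]))

print(pwm_probability("ACT"))  # 0.8 * 0.7 * 0.6 = 0.336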
In brief, these methods do not necessarily converge to the global optimal solution
because they use expectation-maximization (EM) or Gibbs sampling to optimize the
likelihood of the PWM or HMM. Furthermore, these methods need a large enough set of
training sequences in order to produce precise motifs. If the learning process works with a
small number of short sequences, the PWM or HMM will not represent good discriminative
motifs due to a lack of information. As regards string-based methods, exhaustive search is
applied so that the global solution can be achieved. However, some disadvantages remain:
a large amount of data makes the search time-consuming, and finding long motifs leads to
a high computational complexity.
2.2. Semi-supervised feature selection
Feature selection is a significant step in machine learning that aims to improve the
learning performance by removing irrelevant features from the training dataset.
In the supervised learning, feature selection methods work on labeled data to find the most
useful feature subsets that help to increase the prediction accuracy or shorten the training
time of classifiers on high dimensional datasets. However, the size of labeled data is often
very limited because labeling requires much human annotation effort, including time,
expense, and expert-level knowledge. The use of a small labeled dataset together with
a large unlabeled dataset to identify relevant features was first introduced by Zhao and Liu
(2007). Conducting feature selection from mixed labeled and unlabeled data is thus the
definition of semi-supervised feature selection.
The survey presented by Sheikhpour, Sarram, Gharaghani, and Chahooki (2017)
provides two taxonomies of semi-supervised feature selection, which combine the basic
taxonomies of feature selection and semi-supervised learning. In the first
taxonomy, methods are classified into groups based on the feature selection such as filter,
wrapper, and embedded methods. Each group is then divided into smaller groups based on
how to use the unlabeled data to learn the feature subsets. In contrast to the first taxonomy,
methods in the second taxonomy are divided into five groups based on semi-supervised
learning, such as graph-based, self-training, co-training, support vector machine based
(SVM-based), and others. These groups are also divided into smaller groups based on the
procedure of a feature selection. Overall, the first taxonomy is the most mentioned one in
many studies (Chin, Mirzal, Haron, & Hamed, 2016; Xu, King, Lyu, & Jin, 2010; Chen, Nie,
Yuan, & Huang, 2017).
3. Methodology
To characterize sequences and predict treatment outcomes by using a semi-supervised
feature selection, we develop a framework (Figure 1) consisting of four main steps: data
vectorization, feature selection, semi-supervised learning, and comparative analysis.
Figure 1. The proposed framework for sequence characterization and prediction
3.1. Data vectorization
In this step, we extract subsequences or motifs from a sequence dataset by using a
sliding window technique and consider a subsequence as a feature in the feature selection
problem. This means a sequence will be represented by many features, and the value of each
feature is the occurrence frequency of that feature in a sequence. Then, we continue to
eliminate subsequences which have a low frequency of occurrence in the dataset. Figure 2
demonstrates a sequence vectorized by the occurrence frequency of subsequences in that
sequence.
Figure 2. Data vectorization. A sequence S(i) is represented by many subsequences SS(j).
The value Xij is the frequency of occurrence of a subsequence SS(j) in the sequence S(i).
With this vectorization, extracted subsequences are of different lengths because we
want to keep as much information of the short sequences as possible. We consider
subsequences as motifs in biology and call them features in machine learning. Therefore,
the problem of discriminative motif finding reduces to feature selection, so that a
classification algorithm running on the set of selected features obtains the highest possible
accuracy.
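As an illustration, the following sketch shows one way to implement this vectorization in Python. The window lengths, the frequency threshold, and the function names are our own assumptions for illustration; the paper does not specify them.

from collections import Counter

def extract_subsequences(sequence, min_len=3, max_len=10):
    """Slide windows of every length in [min_len, max_len] over a sequence."""
    subs = Counter()
    for k in range(min_len, max_len + 1):
        for i in range(len(sequence) - k + 1):
            subs[sequence[i:i + k]] += 1
    return subs

def vectorize(sequences, min_count=2):
    """Represent each sequence by the occurrence frequencies of subsequences,
    eliminating subsequences that are rare in the whole dataset."""
    counts = [extract_subsequences(s) for s in sequences]
    total = Counter()
    for c in counts:
        total.update(c)
    features = sorted(f for f, n in total.items() if n >= min_count)
    matrix = [[c.get(f, 0) for f in features] for c in counts]
    return features, matrix

# Toy usage with two short amino-acid strings.
features, X = vectorize(["ACTTGGS", "ACTTWQQ"])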
3.2. Integration of feature selection and semi-supervised learning
The characterization task consists of two integrated steps, feature selection and semi-
supervised learning, based on the work by Ren et al. (2008). Concretely, we use a wrapper-
based method with the SFFS (Sequential Forward Floating Selection) algorithm for feature
selection and a self-training technique for semi-supervised learning. The idea of this task is
shown in Algorithm 1, where nSelect denotes a predefined number of features,
selectedFeatures denotes the output features, mat and labels denote the training dataset and
their labels respectively. The algorithm begins with the empty set of selected features. While
the number of selected features has not yet reached the predefined number, each unselected
feature is pseudo-added to the set of selected features in turn. The feature whose addition
increases the accuracy of the learner the most is then officially added to the set of output
features; if no feature improves the accuracy, the algorithm stops.
Algorithm 1 The sequential feature selection algorithm
Input: mat, labels, nSelect
Output: selectedFeatures
selectedFeatures ← ∅
accuracy ← 0
selected ← 0
while selected < nSelect do
    for i ← 0 to length(features) − 1 do
        if features[i] ∉ selectedFeatures then
            Pseudo-add features[i] to selectedFeatures
            mat_i ← mat restricted to selectedFeatures
            accuracy_i ← Learning from mat_i, labels
            Remove features[i] from selectedFeatures
        end if
    end for
    if max_i(accuracy_i) ≤ accuracy then
        break
    else
        accuracy ← max_i(accuracy_i)
        idx ← argmax_i(accuracy_i)
        Officially add features[idx] to selectedFeatures
        selected ← selected + 1
    end if
end while
return selectedFeatures
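A minimal Python rendering of Algorithm 1 is sketched below. The paper gives only the pseudocode, so the choice of classifier (a linear SVM) and the use of cross-validation accuracy as the learning score are our assumptions.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_select(X, y, n_select):
    """Greedy wrapper feature selection following Algorithm 1 (a sketch)."""
    X = np.asarray(X)
    selected, best_acc = [], 0.0
    while len(selected) < n_select:
        scores = {}
        for i in range(X.shape[1]):
            if i in selected:
                continue
            trial = selected + [i]  # pseudo-add feature i
            clf = SVC(kernel="linear", C=1.0)
            scores[i] = cross_val_score(clf, X[:, trial], y, cv=5).mean()
        if not scores:
            break
        best_i = max(scores, key=scores.get)
        if scores[best_i] <= best_acc:  # no feature improves the accuracy
            break
        best_acc = scores[best_i]
        selected.append(best_i)  # officially add the best feature
    return selected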
In the prediction task, we use the SVM (Support Vector Machine) method to learn a
linear function which separates the data into two classes, SVR and non-SVR. We investigate
the effectiveness of SVM because no previous study has predicted SVR/non-SVR in the
case of a dataset with few, very short sequences. Furthermore, SVM is one of the most
widely used models in machine learning.
3.3. Comparative analysis
In this section, we discuss how to find subsequences that characterize the SVR/non-
SVR class of sequences. That means we want to know which subsequences are potentially
discriminative for a class. We firstly obtain a set of potentially discriminative subsequences
from the results of the cross-validation experiment and then find discriminative
subsequences by estimating and contrasting their contributions to classes. In order to get the
final set of subsequences, our idea is to combine the results from different folds of the
experiment. Concretely, we keep subsequences that appear in at least τ folds. The contribution
of subsequences to classes can be approximated by counting the frequency of subsequences
in the SVR and non-SVR sequences. After that, we contrast these frequencies to see the class
to which a subsequence contributes significantly. In practice, the subsequences in which we
are most interested must have a high contribution.
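The counting and contrasting step can be written in a few lines; the sketch below is our own illustration, and the denominator used for coverage is an assumption, since the paper does not state it explicitly.

def contrast(subseqs, svr_seqs, nonsvr_seqs):
    """For each subsequence, count the sequences of each class containing it,
    and derive its coverage and accuracy with respect to the SVR class."""
    table = {}
    n_total = len(svr_seqs) + len(nonsvr_seqs)  # assumed coverage denominator
    for ss in subseqs:
        pos = sum(ss in s for s in svr_seqs)      # SVR sequences containing ss
        neg = sum(ss in s for s in nonsvr_seqs)   # non-SVR sequences containing ss
        coverage = pos / n_total
        accuracy = pos / (pos + neg) if pos + neg else 0.0
        table[ss] = (pos, neg, coverage, accuracy)
    return table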
4. Experiment
4.1. Dataset
In this work, the data are pre-treatment sequences belonging to HCV-1b. The
dataset consists of (a) 43 sequences including 21 SVR and 22 non-SVR sequences
downloaded from Chiba University; (b) 254 sequences including 141 SVR and 113 non-
SVR sequences taken from five published studies (Enomoto et al., 1996; Chayama et al.,
1997; Yoon et al., 2007; Rueda et al., 2008; El-Shamy et al., 2011), and (c) 1,444 unlabeled
sequences downloaded from GenBank (https://www.ncbi.nlm.nih.gov/genbank/).
4.2. Experimental setting
In the semi-supervised feature selection (called SemiFS for short) experiment, a set of
parameters consists of sizeFS, startfn, maxIteration, samplingTimes, samplingRate, and
fnstep, where sizeFS is the number of output features; startfn is the number of initial features
in the labeled dataset that are used to train a classifier; maxIteration is the number of times
for learning and predicting labels for the unlabeled dataset; samplingTimes is the number of
times for adding unlabeled data to labeled data; samplingRate is the proportion of unlabeled
data with predicted labels added in order to find more useful features; and fnstep is the
number of features selected after adding unlabeled data. They are set to 30, 5, 30, 10, 50%, and 6,
respectively. We performed SemiFS many times to obtain these best parameters.
In Algorithm 1, the parameter nSelect is first set to startfn when learning from the
labeled data only and is then set to fnstep when unlabeled data are added to the labeled data.
We also initialize the parameter accuracy of the learner to 0, its default value.
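For concreteness, the best parameter set reported above can be written as a plain configuration dictionary; the key names follow the text, and the dictionary itself is our own presentation.

# Best SemiFS parameters found by repeated runs (see above).
SEMIFS_PARAMS = {
    "sizeFS": 30,         # number of output features
    "startfn": 5,         # initial features learned from labeled data
    "maxIteration": 30,   # rounds of learning and predicting labels for unlabeled data
    "samplingTimes": 10,  # rounds of adding unlabeled data to labeled data
    "samplingRate": 0.5,  # proportion of unlabeled data (with predicted labels) added
    "fnstep": 6,          # features selected after adding unlabeled data
}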
4.3. Accuracy of prediction
To evaluate the effectiveness of our framework, we conduct predictions of SVR and
non-SVR with three methods, Perceptron, SVM-linear, and k-NN, implemented with
scikit-learn (Pedregosa et al., 2011), a tool for data mining and analysis. The perceptron
learning algorithm is a classification algorithm for the simplest case of only two classes. In
our experiments with SemiFS, we find the setting random_state=0 to be a reasonable choice
and use default values for the other parameters. The SVM-linear classifier is a parametric
method used with a hypothesis space in which the dataset is linearly separable. We choose
the regularization parameter C=1.0 and use the default setting for the parameter gamma.
The k-NN, a non-parametric method, is a k nearest neighbor classifier whose parameter k
is a small positive integer; in practice, we chose k=5. We also compare the SemiFS
framework with MEME (Bailey et al., 2010), a previous study of discriminative motif
finding. MEME is a popular and powerful web-based software package for discovering
motifs in biology, so we easily conduct the experiment with the following parameters: the
length of a motif is between 3 and 32 residues, the occurrence frequency of a single motif
per sequence is set to zero or one, and the maximum number of motifs is 10. For all
prediction methods, we perform a 5-fold cross-validation experiment, and the prediction
accuracy is averaged over the five folds.
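Under the settings just described, the three classifiers can be evaluated with a few lines of scikit-learn. This is a sketch of the evaluation protocol under our assumptions, not the authors' exact script.

from sklearn.linear_model import Perceptron
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def evaluate(X, y):
    """5-fold cross-validation with the three classifiers of Section 4.3."""
    models = {
        "Perceptron": Perceptron(random_state=0),
        # gamma is left at its default and is ignored by the linear kernel.
        "SVM-linear": SVC(kernel="linear", C=1.0),
        "k-NN": KNeighborsClassifier(n_neighbors=5),
    }
    return {name: cross_val_score(m, X, y, cv=5).mean()
            for name, m in models.items()}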
Table 1. Accuracies of three methods for prediction

            Perceptron              SVM-linear                k-NN
          Full  SemiFS  MEME     Full  SemiFS  MEME     Full  SemiFS  MEME
Fold 1    0.64   0.57   0.59     0.52   0.50   0.54     0.49   0.44   0.40
Fold 2    0.47   0.56   0.52     0.52   0.57   0.56     0.47   0.47   0.57
Fold 3    0.64   0.62   0.50     0.52   0.54   0.52     0.47   0.62   0.40
Fold 4    0.44   0.59   0.44     0.52   0.57   0.55     0.47   0.62   0.44
Fold 5    0.51   0.50   0.55     0.51   0.55   0.51     0.43   0.53   0.48
Avg. Acc. 0.54   0.57   0.52     0.52   0.55   0.54     0.46   0.54   0.46
From Table 1, the average accuracy of each classifier working on our framework is
better than the average accuracy with MEME. For example, with the k-NN classifier, the
accuracy of SemiFS is 54%, while the accuracy of MEME is 46% (about 8% improvement).
In a similar manner, SemiFS has a 57% accuracy with the perceptron classifier, while MEME
has a 52% accuracy (about 5% improvement). However, with the SVM-linear classifier, the
accuracy of SemiFS is just 1% higher than the accuracy of MEME. Table 1 also shows the
effectiveness of SemiFS compared to classification without feature selection (called Full
for short). With the three classifiers, SemiFS is from 3% to 8% more accurate than Full.
This gives a strong significance to our work because the features, or subsequences, found
by SemiFS are potentially discriminative ones. They increase the classification accuracy,
which means that they can contribute to characterizing the classes in a dataset. In our case,
the classes are SVR and non-SVR.
4.4. Discriminative subsequences
To find reliable subsequences characterizing SVR and non-SVR sequences, we
conduct a subsequence analysis. We collect the selected subsequences discovered by SemiFS
in five experiments and then choose the collection of subsequences appearing in at least
three folds. Table 2 presents 10 subsequences along with the number of SVR sequences and
non-SVR sequences containing these subsequences. These numbers are calculated on the
whole labeled dataset.
Table 2. Discriminative subsequences characterizing SVR and non-SVR by SemiFS

Subsequence    No. of SVR sequences    No. of non-SVR sequences
ACTT                    6                         0
AHH                     4                         0
DANLLWRQEM              7                         0
GGS                     6                         0
HRDSPDA                 2                         0
MGGS                    6                         0
QHDSPDADLI              1                         0
RDSPDA                  2                         0
VDLV                    4                         0
WQQ                     6                         0
It can be observed from Table 2 that the 10 subsequences appear several times in SVR
sequences and rarely, if ever, in non-SVR sequences. Typically, subsequences such as
“WQQ,” “ACTT,” or “DANLLWRQEM” do not appear in non-SVR sequences at all. Their
coverage is from 3.2% to 3.8% (6/184 ≈ 0.032 and 7/184 ≈ 0.038), and their accuracy is
100% (6/(6+0) = 1 and 7/(7+0) = 1). These subsequences are likely to discriminate the SVR
property. In addition, subsequences such as “ACC,” “KAA,” “VSL,” “LSLKA,” or
“GGDITR” have a higher coverage, from 3.2% to 7.6%, but a lower accuracy, from 76%
to 89%. However, they also help to discriminate the SVR property thanks to a majority rule.
These subsequences can serve as discriminative motifs for predicting the SVR property in
an HCV study because they are considered significant to the SVR class and not significant
to the non-SVR class.
A comparison with the results of MEME is presented in Table 3. After five runs of
MEME, we collected around 10 motifs, and most of them appear in both SVR
and non-SVR sequences. MEME found only three motifs, “RGK,” “TAC,” and
“SLKATCTFHHDSPDADLIEANLLWRQEMGGNI,” that appear in the SVR class and
hardly appear in the non-SVR class. The coverage of “TAC” is 3.8% while the coverage of
the other two motifs is 0.5%, a very low coverage. The rest of the motifs in Table 3 are found
many times in both classes; for example, “SLK” appears 121 times in SVR and 147 times
in non-SVR, and “THHDSPDADLIEANLLWRQEMGGNITRVESEN” appears 29 times in
SVR and 37 times in non-SVR. In our opinion, the motifs discovered by MEME are not
good enough to differentiate the SVR and non-SVR characteristics. Therefore, MEME works
effectively only for finding common motifs that describe certain characteristics of a large
sequence dataset.
Table 3. Discriminative subsequences characterizing SVR and non-SVR by MEME

Subsequence                           No. of SVR sequences    No. of non-SVR sequences
DLI                                           149                      126
ESE                                           140                      154
RGK                                             1                        0
SLK                                           121                      147
TAC                                             7                        1
TRVESEN                                       136                      151
THHDSPDADLIEANLLWRQEMGGNITRVESEN               29                       37
SLKATCTFHHDSPDADLIEANLLWRQEMGGNI                1                        0
ATCTTHHDSPDADLIEANLLWRQEMGGNITRV               27                       37
CTTHHDSPDADLIEANLLWRQEMGGNITRVES               27                       37
5. Conclusion
We developed a framework for the characterization and prediction of HCV treatment
outcomes by using a semi-supervised feature selection. Our approach was demonstrated to
represent sequence data well as numeric vectors and to analyze and interpret the results of
the computational process clearly. The approach works effectively with data containing
short sequences that are highly similar to one another, a case that traditional methods could
not handle. Furthermore, it has shown itself to be a general and flexible method that can be
applied to other kinds of sequence data. The potentially discriminative motifs that we found
can be good patterns for predicting SVR/non-SVR sequences after being verified by
physicians.
Conflict of Interest: Authors have no conflict of interest to declare.
Acknowledgements: The authors would like to thank Dr. Tatsuo Kanda from the Graduate
School of Medicine, Chiba University for his data support.
This work was supported by the university-level research project, T2017-02, at University
of Science, Vietnam National University Ho Chi Minh City.
REFERENCES
Bailey, T. L., Boden, M. B., Whitington, T., & Machanick, P. (2010). The value of position-specific
priors in motif discovery using MEME. BMC Bioinformatics, 11(1).
Chayama, K., Tsubota, A., Kobayashi, M., Okamoto, K., Hashimoto, M., Miyano, Y., & Kumada,
H. (1997). Pretreatment virus load and multiple amino acid substitutions in the interferon
sensitivity-determining region predict the outcome of interferon treatment in patients with
chronic genotype 1b hepatitis C virus infection. Journal of Hepatology, 25(3), 745-749.
Chen, X., Nie, F., Yuan, G., & Huang, J. Z. (2017). Semi-supervised feature selection via rescaled
linear regression. Proceedings of the 26th International Joint Conference on Artificial
Intelligence.
Chin, A., Mirzal, A., Haron, H., & Hamed, H. (2016). Supervised, unsupervised, and semi-supervised
feature selection: A review on gene selection. IEEE/ACM Transactions on Computational
Biology and Bioinformatics, 13.
El-Shamy, A., Shoji, I., Saito, T., Watanabe, H., Ide, Y., Deng, L., & Hotta, H. (2011). Sequence
heterogeneity of NS5A and core proteins of hepatitis C virus and virological responses to
pegylated-interferon/ribavirin combination therapy. Microbiology and Immunology, 55, 418-426.
Enomoto, N., Sakuma, N., Asahina, I., Kurosaki, Y., Murakami, M., Yamamoto, T., & Sato, C.
(1996). Mutations in nonstructural protein 5A gene and response to interferon in
patients with chronic hepatitis C virus 1b infection. The New England Journal of Medicine,
334, 77-81.
Gao, M., Nettles, R. E., Belema, M., Snyder, L. B., Nguyen, V. N., Fridell, R. A., & Hamann, L.
G. (2010). Chemical genetics strategy identifies an HCV NS5A inhibitor with a potent clinical
effect. Nature, 465, 96-100.
Han, J., & Kamber, M. (2006). Data mining: Concepts and techniques. Morgan Kaufmann.
Kim, J. K., & Choi, S. (2011). Probabilistic models for semi-supervised discriminative motif
discovery in DNA sequences. IEEE/ACM Transactions on Computational Biology and
Bioinformatics, 8(5).
Lin, T., Murphy, R. F., & Bar-Joseph, Z. (2011). Discriminative motif finding for predicting protein
subcellular localization. IEEE/ACM Transactions on Computational Biology and
Bioinformatics, 8(2).
Manns, M., McHutchison, J. G., Gordon, S. C., Rustgi, V. K., Shiffman, M., Reindollar, R., &
Albrecht, J. K. (2001). Peginterferon alfa-2b plus ribavirin compared with interferon alfa-2b
plus ribavirin for initial treatment of chronic hepatitis C: A randomised trial. The Lancet, 358,
958-965.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., & Duchesnay, E.
(2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12,
2825-2830.
Ren, J., Qiu, Z., Fan, W., Cheng, H., & Yu, P. S. (2008). Forward semi-supervised feature selection.
Proceedings of the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining.
Rueda, P. M., Casado, J., Paton, R., Quintero, D., Palacios, A., Gila, A., & Salmeron J. (2008).
Mutations in E2-PePHD, NS5A-PKRBD, NS5A-ISDR, and NS5A-V3 of hepatitis C virus
genotype 1 and their relationship to pegylated interferon-ribavirin treatment responses.
Journal of Virology, 82, 6644-6653.
Sami, A., & Nagatomi, R. (2008). A new definition and look at DNA motif. Intech.
Sheikhpour, R., Sarram, M. A., Gharaghani, S., & Chahooki, M. A. Z. (2017). A survey on semi-
supervised feature selection methods. Pattern Recognition, 64.
Vens, C., Rosso, M. N., & Danchin, E. G. J. (2011). Identifying discriminative classification-based
motifs in biological sequences. Bioinformatics, 27(9), 1231-1238.
Wu, J., & Xie, J. (2010). Hidden Markov model and its application in motif findings. Statistical
Methods in Molecular Biology, 620, 405-416.
Xu, Z., King, I., Lyu, M. R. T., & Jin, R. (2010). Discriminative semi-supervised feature selection
via manifold regularization. IEEE Transactions on Neural Networks, 21.
Yoon, J., Lee, J. I., Baik, S. K., Lee, K. H., Sohn, J. Y., Lee, H. W., & Yeh, B. I. (2007). Predictive
factors for interferon and ribavirin combination therapy in patients with chronic hepatitis C.
World Journal of Gastroenterology, 13(46), 6236-6242.
Zhao, Z., & Liu, H. (2007). Semi-supervised feature selection via spectral analysis. Proceedings of
the 7th SIAM International Conference on Data Mining.
FINDING DISCRIMINATIVE MOTIFS TO PREDICT HCV TREATMENT OUTCOMES
WITH A SEMI-SUPERVISED FEATURE SELECTION METHOD
Nguyễn Thị Tường Vy1, Lê Thị Nhàn2*
1University of Lyon, France
2University of Science, Vietnam National University Ho Chi Minh City, Vietnam
*Corresponding author: Lê Thị Nhàn – Email: ltnhan@fit.hcmus.edu.vn
Received: January 08, 2020; Revised: February 27, 2020; Accepted: June 02, 2020
ABSTRACT
Hepatitis C treatment currently faces many challenges, such as high treatment costs, drug
side effects, and low success rates with hepatitis C virus genotype 1b (HCV-1b). To determine which
characteristics of HCV-1b cause drug resistance, many sequence analysis methods have been
conducted to find biological markers that help predict failure rates. However, the results may still
be imprecise when these methods are run on a small dataset of labeled sequences of short length. In
this paper, we aim to predict HCV-1b treatment outcomes and characterize HCV-1b by combining
two methods, feature selection and semi-supervised learning. Our proposed method improves the
prediction accuracy by about 5% to 8% compared with previous methods. In addition, we find a set
of good discriminative motifs that can be regarded as biological signals for predicting response or
resistance in HCV-1b treatment.
Keywords: discriminative motif; hepatitis C virus; sequential forward floating selection;
semi-supervised feature selection