FOREWORD
CHAPTER 1. INTRODUCTION TO GENE EXPRESSION DATA
  1.1. Gene expression
  1.2. DNA microarray experiments
  1.3. High-throughput microarray technology
  1.4. Microarray data analysis
    1.4.1. Pre-processing step on raw data
      1.4.1.1. Processing missing values
      1.4.1.2. Data transformation and discretization
      1.4.1.3. Data reduction
      1.4.1.4. Normalization
    1.4.2. Data analysis tasks
      1.4.2.1. Classification on gene expression data
      1.4.2.2. Feature selection
      1.4.2.3. Performance assessment
  1.5. Research topics on cDNA microarray data
CHAPTER 2. GRAPH BASED RANKING ALGORITHMS WITH GENE NETWORKS
  2.1. Graph based ranking algorithms
  2.2. Introduction to gene networks
    2.2.1. The Boolean network model
    2.2.2. Probabilistic Boolean networks
    2.2.3. Bayesian networks
    2.2.4. Additive regulation models
CHAPTER 3. REAL DATA ANALYSIS AND DISCUSSION
  3.1. The proposed scheme for gene selection in the sample classification problem
  3.2. Developing environment
  3.3. Analysis results
REFERENCES
then be either eliminated from or retained in the final subset depending on whether it satisfies some predefined criterion, such as an information-gain or entropy-based measure, a statistical test, or an interdependence analysis. In most situations, the set of variables obtained by such selection methods may still contain correlated genes. Moreover, some genes that are filtered out only reveal their usefulness in conjunction with other genes (variables).
Multivariate feature selection methods, such as cluster analysis techniques and multivariate decision trees, take more than one gene (variable) into account at once; they compute a correlation matrix or covariance matrix to detect redundant and correlated variables. In the covariance matrix, variables that tend to take large values also tend to have large covariance scores. The correlation matrix is calculated in the same fashion, but its elements are normalized into the interval [-1, 1] to eliminate this effect of large-valued variables [35].
The original set of genes (variables) can be reduced by a procedure that merges each subset of highly correlated genes (variables) into a single variable, so that the derived set contains mutually largely uncorrelated variables while still preserving the original information content. For example, a set of highly correlated gene or array profiles can be replaced by an average profile that conveys most of the profiles' information.
In addition, Principal Component Analysis (PCA), which summarizes patterns of correlation and provides a basis for predictive models, is a feature-merging method commonly used to reduce microarray data [26].
1.4.1.4. Normalization
Ideally, the expression matrix contains the true level of transcript abundance for each measured gene-sample combination. In practice, however, measurement conditions are inherently biased, so the measured values usually deviate from the true expression level by some amount: measured level = true level + error, where the error arises from the systematic tendency of the measurement instrument to report values that are too low or too high [35] and from random measurement mistakes. The former component is called bias and the latter variance, so the error is the sum of bias and variance. The variance component is often normally distributed, meaning that erroneous measurements in both directions are equally frequent and that small deviations are more frequent than large ones.
Normalization is a numerical method designed to deal with such measurement errors and with biological variation. After the raw data are pre-processed with a transformation procedure, e.g., a base-2 logarithm, the resulting matrix can be normalized by multiplying each element of an array by an array-specific factor such that the mean value is the same for all arrays. A stronger requirement is to rescale each array so that its mean equals 0 and its standard deviation equals 1.
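A minimal sketch of this per-array normalization in C++ (the language used later for the thesis implementation) is given below; the log2 transform followed by zero-mean, unit-standard-deviation scaling per array is shown, and the row-per-array layout of the matrix is an assumption made for illustration.

```cpp
#include <cmath>
#include <vector>
#include <cstddef>

// arrays[a][g]: raw intensity of gene g measured on array a.
// Log2-transform each value, then scale each array to mean 0 and
// standard deviation 1 so that arrays become comparable.
void normalizeArrays(std::vector<std::vector<double>>& arrays) {
    for (std::vector<double>& a : arrays) {
        for (double& v : a) v = std::log2(v);          // base-2 log transformation
        double mean = 0.0;
        for (double v : a) mean += v;
        mean /= a.size();
        double sd = 0.0;
        for (double v : a) sd += (v - mean) * (v - mean);
        sd = std::sqrt(sd / (a.size() - 1));
        for (double& v : a) v = (v - mean) / sd;       // mean 0, standard deviation 1
    }
}
```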
1.4.2. Data analysis tasks
Once the data pre-processing step has been carried out, a numerical analysis method is applied according to the scientific analysis task. The elementary tasks can be divided into two categories: prediction and pattern detection (Figure 1.10). Given the scope of this thesis, only two topics, classification and gene regulatory networks, are discussed in the following sections.
Figure 1.10: Two classes of data analysis tasks for microarray data. Prediction tasks include classification, regression or estimation, and time-series prediction; pattern-detection tasks include clustering, correlation analysis, association analysis, deviation detection, and visualization.
1.4.2.1. Classification on gene expression data
Classification is a prediction or supervised-learning problem in which data objects are assigned to one of k predefined classes {c1, c2, …, ck}. Each data object is characterized by a set of g measurements that form the feature vector, or vector of predictor variables, X = (x1, …, xg), and is associated with a dependent variable (class label) Y ∈ {1, 2, …, k}. The classification is called binary if k = 2 and multi-class otherwise. Informally, a classifier C can be thought of as a partition of the feature space into k disjoint and exhaustive subsets A1, …, Ak containing the data objects assigned to classes c1, …, ck respectively.
Classifiers are derived from a training set L = {(x1, y1), …, (xn, yn)} in which each data object is known to belong to a certain class. The notation C(.; L) denotes a classifier built from the learning set L [24]. For gene expression data, the data objects are the biological samples to be classified, the features correspond to the expression measures of the different genes across the samples studied, and the classes correspond to different types of tumors (e.g., nodal-positive vs. negative breast tumors, or tumors with good vs. bad prognosis). The process of classifying tumor samples is closely tied to the gene selection mentioned above, i.e., the identification of marker genes that characterize the different tumor classes.
For the classification problem on microarray data, one has to assign each sample profile to one of the predefined tumor types. Each gene corresponds to a feature variable whose value domain contains all possible gene expression levels. The expression levels may be either absolute (e.g., Affymetrix oligonucleotide arrays) or relative to the expression levels of a well-defined common reference sample (e.g., two-color cDNA microarrays). The main obstacle encountered in the classification of microarray data is the very large number of genes (variables) relative to the number of tumor samples, the so-called "large p, small n" problem. Typical expression data contain from 5,000 to 10,000 genes for fewer than 100 tumor samples.
The problem of classifying biological samples using gene expression data has become a key issue in cancer research. Successful cancer diagnosis and treatment require a reliable and precise classification of tumors. Recently, many researchers have published work on statistical aspects of classification in the context of microarray experiments [14,17], focusing mainly on existing methods or variants derived from them. Studies to date suggest that simple methods such as k-nearest neighbors [17] or naïve Bayes classification [13,3] perform as well as more complex approaches such as Support Vector Machines (SVMs) [14]. This section discusses the naïve Bayes and k-nearest-neighbor methods. Finally, we describe the issue of performance assessment.
The naïve Bayes classification
Suppose that the likelihoods p_k(x) = p(x | Y = k) and the class priors π_k are known for all possible class values k. Bayes' theorem can then be used to compute the posterior probability p(k | x) of class k given the feature vector x as

$$p(k \mid x) = \frac{\pi_k\, p_k(x)}{\sum_{l=1}^{K} \pi_l\, p_l(x)}$$

The naïve Bayes classifier predicts the class C_B(x) of an object x by maximizing this posterior probability:

$$C_B(x) = \arg\max_k\, p(k \mid x)$$
Depending on whether parametric or non-parametric estimation is used, there are two general schemes for estimating the class posterior probabilities p(k | x): density estimation and direct function estimation. In the density estimation approach, the class-conditional densities p_k(x) = p(x | Y = k) (and the priors π_k) are estimated separately for each class and Bayes' theorem is applied to obtain estimates of p(k | x). Maximum likelihood discriminant rules (Fisher, 1922), learning vector quantization [18], and Bayesian belief networks [8] are examples of the density estimation approach. In the direct function estimation approach, the posteriors p(k | x) are estimated directly, typically with regression techniques [19]. Examples of this approach are logistic regression [19], neural networks [19], classification trees [20], and nearest neighbor classifiers [17].
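As an illustration of the density estimation scheme, a minimal Gaussian naïve Bayes sketch in C++ follows; the assumption that each class-conditional density factorizes into independent normal distributions per gene is the usual naïve Bayes simplification, and the per-class means, variances, and priors are supposed to have been estimated from the training set beforehand.

```cpp
#include <cmath>
#include <vector>
#include <cstddef>

// Gaussian naive Bayes sketch: class priors pi_k and per-class,
// per-feature means/variances are assumed to be estimated from the training set.
struct GaussianNB {
    std::vector<double> prior;                      // pi_k for each class k
    std::vector<std::vector<double>> mean, var;     // [class][feature]

    // log(pi_k * p_k(x)); the denominator of Bayes' theorem is shared
    // by all classes and can therefore be ignored when maximizing.
    double logScore(std::size_t k, const std::vector<double>& x) const {
        const double PI = 3.14159265358979323846;
        double s = std::log(prior[k]);
        for (std::size_t g = 0; g < x.size(); ++g) {
            double d = x[g] - mean[k][g];
            s -= 0.5 * std::log(2.0 * PI * var[k][g]) + d * d / (2.0 * var[k][g]);
        }
        return s;
    }

    // C_B(x) = argmax_k p(k | x)
    std::size_t classify(const std::vector<double>& x) const {
        std::size_t best = 0;
        for (std::size_t k = 1; k < prior.size(); ++k)
            if (logScore(k, x) > logScore(best, x)) best = k;
        return best;
    }
};
```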
Nearest Neighbor Classifiers
Nearest neighbor classifiers were developed by Fix and Hodges (1951). Based on a distance function for pairs of samples, such as the Euclidean distance, the basic k-nearest-neighbor (kNN) classifier classifies a new object on the basis of the learning set. First, it finds the k samples in the learning set closest to the new object. Then, it predicts the class by majority vote, i.e., it chooses the class that is most common among those k nearest neighbors.
In kNN, the number of neighbors k should be chosen carefully so as to maximize the performance of the classifier, which is still a challenging problem in most cases. A common approach is to select several candidate values of k and to run the classifier on the training set for each of them. The error rate for each value of k is estimated by leave-one-out cross-validation, and the value of k with the smallest cross-validation error rate is chosen as the best parameter for the classifier. The nearest neighbor rule can be refined and extended to deal with unequal class priors, differential misclassification costs, and feature selection. Instead of treating all neighbors equally, the refined version introduces a weighted voting scheme in which each neighbor is assigned a weight reflecting its contribution to the decision on the new object's class.
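A minimal unweighted kNN sketch in C++ follows; Euclidean distance and simple majority voting are assumed, and the choice of k (e.g., by the leave-one-out procedure described above) is left to the caller.

```cpp
#include <algorithm>
#include <cmath>
#include <map>
#include <vector>
#include <cstddef>

struct Sample { std::vector<double> x; int label; };

// Classify a new profile by majority vote among its k nearest training samples.
int knnClassify(const std::vector<Sample>& train,
                const std::vector<double>& query, std::size_t k) {
    std::vector<std::pair<double, int>> dist;             // (distance, label)
    for (const Sample& s : train) {
        double d2 = 0.0;
        for (std::size_t g = 0; g < query.size(); ++g) {
            double d = s.x[g] - query[g];
            d2 += d * d;
        }
        dist.push_back({std::sqrt(d2), s.label});
    }
    if (k > dist.size()) k = dist.size();
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
    std::map<int, int> votes;                              // label -> vote count
    for (std::size_t i = 0; i < k; ++i) ++votes[dist[i].second];
    return std::max_element(votes.begin(), votes.end(),
        [](const std::pair<const int, int>& a, const std::pair<const int, int>& b) {
            return a.second < b.second;
        })->first;
}
```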
1.4.2.2. Feature selection
Feature selection, as mentioned above, is one of the most important issues in classification, especially when applied to microarray data. In machine learning [24], feature selection methods fall into two categories: filter and wrapper methods. In the former, feature selection is performed before the classifier is built. Univariate test statistics such as t- or F-statistics, ad hoc signal-to-noise statistics [17], non-parametric Wilcoxon statistics [19], or p-values are used to produce a subset of genes that serves as the feature set. The selection is carried out according to a predefined number of genes or a cut-off on the test statistic. In wrapper methods, feature selection is an inherent part of the classifier-building procedure, and different classifiers use different approaches to it. For example, in classification trees (CART; Breiman et al., 1984), features are selected at each step by pruning the tree using cross-validation. In nearest neighbor classification, automatic feature selection can be obtained by suitably modifying the distance function.
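A minimal sketch of a filter-type selection step in C++ follows; it ranks genes by a two-sample t-like statistic and keeps a fixed number of top genes, both of which are illustrative choices, and only the two-class case is handled.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>
#include <cstddef>

// expr[g][s]: expression of gene g in sample s; label[s] must be 0 or 1.
// Returns the indices of the `keep` genes with the largest |t| statistics.
std::vector<std::size_t> filterSelect(const std::vector<std::vector<double>>& expr,
                                      const std::vector<int>& label,
                                      std::size_t keep) {
    std::vector<std::pair<double, std::size_t>> score;     // (|t|, gene index)
    for (std::size_t g = 0; g < expr.size(); ++g) {
        double m[2] = {0.0, 0.0}, v[2] = {0.0, 0.0};
        int n[2] = {0, 0};
        for (std::size_t s = 0; s < label.size(); ++s) { m[label[s]] += expr[g][s]; ++n[label[s]]; }
        m[0] /= n[0]; m[1] /= n[1];
        for (std::size_t s = 0; s < label.size(); ++s) {
            double d = expr[g][s] - m[label[s]];
            v[label[s]] += d * d;
        }
        v[0] /= (n[0] - 1); v[1] /= (n[1] - 1);             // unbiased class variances
        double t = (m[0] - m[1]) / std::sqrt(v[0] / n[0] + v[1] / n[1]);
        score.push_back({std::fabs(t), g});
    }
    std::sort(score.rbegin(), score.rend());                // largest |t| first
    std::vector<std::size_t> selected;
    for (std::size_t i = 0; i < keep && i < score.size(); ++i)
        selected.push_back(score[i].second);
    return selected;
}
```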
1.4.2.3. Performance assessment
Obviously, different classifiers exhibit different accuracy rates, so a technique is needed to estimate the classification error reliably; this guarantees that the best classifier is chosen for later deployment. It is particularly important for the problem of classifying tumor samples, because a misclassification could lead to misdiagnosis and assignment to an improper treatment protocol.
For the purpose of performance assessment, we need a test set of labeled objects that is independent of the available learning set. The classifier is applied to the test set and the classification error rate is computed as the proportion of test cases whose predictions disagree with the true class labels. In practice, the original learning set L is randomly divided into V subsets Lv, v = 1, …, V, of nearly equal size. The set L - Lv is used as the training set for building the classifier, and the error rate is computed on the validation set Lv. This procedure is repeated over each subset Lv and the average error is reported. This scheme for performance estimation is called cross-validation. Its limitation is that it reduces the effective sample size available for training, which matters especially for microarray data, where the number of samples is relatively small.
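A minimal V-fold cross-validation sketch in C++ follows; the classifier is abstracted as a callable that trains on one set of sample indices and returns predictions for another, which is a simplification made purely for illustration.

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <random>
#include <vector>
#include <cstddef>

// trainAndPredict(trainIdx, testIdx) builds a classifier on the samples in
// trainIdx and returns predicted labels for the samples in testIdx.
double crossValidationError(
    std::size_t n, std::size_t V, const std::vector<int>& trueLabel,
    const std::function<std::vector<int>(const std::vector<std::size_t>&,
                                         const std::vector<std::size_t>&)>& trainAndPredict) {
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), std::mt19937(0));   // random split into V folds
    std::size_t errors = 0;
    for (std::size_t v = 0; v < V; ++v) {
        std::vector<std::size_t> train, test;
        for (std::size_t i = 0; i < n; ++i)
            (i % V == v ? test : train).push_back(idx[i]);
        std::vector<int> pred = trainAndPredict(train, test);
        for (std::size_t i = 0; i < test.size(); ++i)
            if (pred[i] != trueLabel[test[i]]) ++errors;      // count discordant predictions
    }
    return static_cast<double>(errors) / n;                   // averaged error rate
}
```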
1.5. Research topics on cDNA microarray data
cDNA microarray experiments are relevant to many research fields. Differential gene expression is the first field of study; it investigates single gene profiles to find genes that show different expression levels under different experimental conditions, such as different tissue types or different developmental stages of the organism. Its most frequent application is the investigation of gene expression levels in a normal-versus-diseased comparison.
The second research field is the study of gene co-regulation, which mainly focuses on comparing more than one gene profile to identify genes whose expression levels are correlated under certain experimental conditions. Gene co-regulation has two basic forms: positive and negative. Two genes are said to show positive co-regulation if their expression levels exhibit the same increasing or decreasing tendency; they show negative co-regulation if the tendencies are in opposite directions. For example, genes a and c in our four-gene example (Table 1.6) exhibit a negative co-regulation pattern across the ten studied samples.
The third field of study is gene function identification. Here the function of a new gene can be revealed by comparing its expression profile against the profiles of genes with known functions: the known functions of the genes with the highest similarity are used to infer the function of the new gene.
The next topic is the study of the processes inside cells in order to understand gene function; the fourth research field, identification of pathways and gene regulatory networks, addresses this question. It helps biologists understand the processes by which genes and their products (i.e., proteins) play their roles in cells, tissues, and organisms. Given a pathway, the changes in the expression levels of the investigated genes are monitored in order to identify the pathways involved. Moreover, the expression level of a gene is also regulated by the products of other genes, which means that there exists a gene regulatory network through which the activity of one gene is regulated by others. Reconstructing this gene regulatory network is therefore the main goal of this field of study. Such studies require time-course microarray data.
Fifthly, in clinical diagnosis, cDNA microarray data are useful for discovering expression patterns that are characteristic of a particular disease and also for inferring unknown subtypes of known diseases. This is achieved by revealing characteristically different expression profiles that correlate with clinically distinct subtypes of a disease.
For each disease already identified, suppose that we know a chemical compound used to treat it. It is then necessary to study how the gene expression pattern changes in response to different dosages of this compound. This is the goal of the sixth research field, the dose-response study.
Diseases are often correlated with common sequence variations, such as insertions or deletions of a few nucleotides or changes in the number of repeats of a motif, i.e., of any sequence pattern that determines the molecule's function, structural features, or family membership. The seventh research field, called sequence variation, takes advantage of cDNA microarray data to discover these variations. The most common type of sequence variation is the single nucleotide polymorphism (SNP) [35], which occurs with a frequency of roughly 1 per 1,000 nucleotides. Microarray experiments are well suited to analyzing SNP variation when the experiment follows one of the following three designs: (a) arrays including all known SNPs of a human genome sequence, (b) microarrays containing a sample of SNPs located across the entire human genome, and (c) devices for re-sequencing a sample of the human genomic sequence.
Finally, cDNA microarray data are sometimes time series, in which the expression level of each gene is repeatedly measured at intervals of time from the same source. The eighth research field, the time-course study, was born for the purpose of handling this special temporal characteristic of time-series data. Such studies are also useful for investigating cell-cycle phenomena and for reconstructing gene networks.
Chapter 2
Graph based ranking algorithms with gene networks
2.1. Graph Based Ranking Algorithms
Ranking is the problem of ordering a given set such that more important elements come first in the ordered list. With X denoting the set of objects under consideration, the ranking function is usually defined as a function f: X → R that assigns a ranking value to each object such that more important objects receive higher ranking scores.
The problem of ranking has recently gained much attention in machine learning [10,11,15,34]. Although the main form of data considered so far is vector-valued data, there sometimes exist intrinsic relationships among the objects in the set. Webpages, journals and conferences, publications, and scholarly authors are the best illustrations of such intrinsic relationships, i.e., link or citation relationships [7,21,30]. Here each object corresponds to a node in a graph and the intrinsic relationships are described by directed edges between the nodes. Graph-based ranking algorithms take this characteristic into account to improve the accuracy of the ranking result. When used in information retrieval systems, graph-based ranking algorithms exploit the anchor links among webpages to generate, from millions of pages, a list of webpages with the highest relevance to the topics requested by end users. Graph-based ranking algorithms are also used to compute the impact factor of journals and conferences: each journal or conference is represented as a vertex, and each weighted edge between vertices represents the total number of citations from one journal or conference to another. Apart from this, a reputation ranking of paper authors can also be calculated.
PageRank by Brin and Page [7] and Hypertext Induced Topic Selection (HITS) by Kleinberg [21,22] are the most successful graph-based ranking algorithms. For the rest of the thesis we use the following notation.
G = (V, E) denotes a directed graph with vertex set V and edge set E. Inode(v) denotes the set of all nodes pointing to node v, and Onode(v) denotes the set of all nodes that node v points to (Figure 2.1).
HITS Algorithm
Originally, the HITS algorithm was applied to web documents: the documents are the vertices of the graph and the links between them are the edges. The HITS algorithm assigns each vertex vi a so-called authority score a(vi) and a hub score h(vi). The algorithm defines a good hub as a vertex with many outgoing edges, and a good authority as a vertex receiving many incoming edges. Hubs and authorities exhibit a mutual relationship: a better hub links to many good authorities, and a better authority is pointed to by many good hubs. The authority and hub scores of each vertex are computed recursively from the global information of all vertices in the graph, rather than from local information about the node alone, as follows:
$$a(v_i) = \sum_{v_j \in Inode(v_i)} h(v_j), \qquad h(v_i) = \sum_{v_j \in Onode(v_i)} a(v_j)$$
The authority and hub scores have been proven to converge after a number of iterations. Moreover, each edge connecting two vertices may reflect a different degree of influence of the pointing node on the pointed-to node. If each edge is assigned a particular weight, the HITS algorithm is extended as follows:
$$a(v_i) = \sum_{v_j \in Inode(v_i)} w_{ji}\, h(v_j), \qquad h(v_i) = \sum_{v_j \in Onode(v_i)} w_{ij}\, a(v_j)$$

Figure 2.1: A graph G with the incoming edges from Inode(v) and the outgoing edges to Onode(v) of a vertex v.
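A minimal sketch of the unweighted HITS iteration in C++ follows; the fixed number of iterations and the per-iteration L2 normalization are illustrative choices rather than details taken from the thesis implementation.

```cpp
#include <cmath>
#include <utility>
#include <vector>
#include <cstddef>

// inode[i]: indices of nodes pointing to i; onode[i]: indices that i points to.
// Returns (authority, hub) scores after `iters` mutual-update iterations.
std::pair<std::vector<double>, std::vector<double>>
hits(const std::vector<std::vector<std::size_t>>& inode,
     const std::vector<std::vector<std::size_t>>& onode, int iters = 50) {
    std::size_t n = inode.size();
    std::vector<double> a(n, 1.0), h(n, 1.0);
    auto normalize = [](std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x * x;
        s = std::sqrt(s);
        if (s > 0.0) for (double& x : v) x /= s;
    };
    for (int t = 0; t < iters; ++t) {
        std::vector<double> a2(n, 0.0), h2(n, 0.0);
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j : inode[i]) a2[i] += h[j];   // a(v_i) = sum of hub scores of in-neighbors
            for (std::size_t j : onode[i]) h2[i] += a[j];   // h(v_i) = sum of authority scores of out-neighbors
        }
        normalize(a2); normalize(h2);
        a = a2; h = h2;
    }
    return {a, h};
}
```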
PageRank Algorithm
Introduced by Page et al. [25] as a solution for ranking webpages, the PageRank algorithm computes the importance of a page based on the following principle: "a page with high PageRank is a page referenced by many pages with high PageRank". Being simple and robust, the algorithm is implemented as part of many of today's commercial search engines, such as Google. The PageRank score of a vertex can be regarded as a "vote" for its importance cast by all the other vertices pointing to it. A vertex v_j with |Onode(v_j)| outgoing links conveys, along each of these links, the vote value PR(v_j)/|Onode(v_j)| to the vertex it points to (Figure 2.2). The PageRank PR(.) of a page is therefore calculated from the PR(.) of its incoming nodes as follows:
$$PR(v_i) = (1 - d) + d \sum_{v_j \in Inode(v_i)} \frac{PR(v_j)}{|Onode(v_j)|}$$
where d is a predefined parameter called the damping factor, usually set to the default value 0.85. Recently, several methods such as exponential damping, quadratic hyperbolic damping, general hyperbolic damping, and empirical damping have been proposed for efficiently selecting a suitable value of the damping factor [4].
Figure 2.2: The idea behind the PageRank algorithm: node v_j passes the vote PR(v_j)/|Onode(v_j)| along each of its outgoing edges.
Starting from arbitrary initial values assigned to each node in the graph, the PR values are recursively recomputed until convergence. It has been proven that the final values are not influenced by the initial values, although the number of iterations needed to reach convergence may change [7].
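For comparison with HITS, a minimal PageRank iteration sketch in C++ follows; the convergence tolerance and the iteration cap are illustrative assumptions.

```cpp
#include <cmath>
#include <vector>
#include <cstddef>

// inode[i]: in-neighbors of node i; outdeg[j]: |Onode(v_j)|.
std::vector<double> pageRank(const std::vector<std::vector<std::size_t>>& inode,
                             const std::vector<std::size_t>& outdeg,
                             double d = 0.85, double eps = 1e-8, int maxIter = 200) {
    std::size_t n = inode.size();
    std::vector<double> pr(n, 1.0);                   // arbitrary initial values
    for (int t = 0; t < maxIter; ++t) {
        std::vector<double> next(n, 0.0);
        double change = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double s = 0.0;
            for (std::size_t j : inode[i])
                if (outdeg[j] > 0) s += pr[j] / outdeg[j];
            next[i] = (1.0 - d) + d * s;              // PR(v_i) = (1-d) + d * sum PR(v_j)/|Onode(v_j)|
            change += std::fabs(next[i] - pr[i]);
        }
        pr = next;
        if (change < eps) break;                      // stop once the scores have converged
    }
    return pr;
}
```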
By the definition above, the PageRank algorithm has some weaknesses. First, a node x receives a higher score if the connected component containing the nodes that point to x contains cycles. An example is illustrated in Figure 2.3: in the graph on the left-hand side, node 0 receives 4 citations, whereas nodes 10 and 6 in the other two graphs receive 3 citations each. Nevertheless, PageRank assigns nodes 10 and 6 higher scores than node 0, because nodes 10 and 6 are part of a cycle.

Figure 2.3: Graph topologies in which PageRank is weak.
The second weakness is that a node's score is influenced more by the scores of its incoming nodes than by the number of incoming edges. In Figure 2.4, for example, although node 1 receives 6 citations while node 0 receives only one, node 0 still obtains a higher score than node 1 under the PageRank algorithm.
Figure 2.4: The second type of graph topology on which PageRank is weak.
2.2. Introduction to Gene Network
Understanding regulation during the process of protein synthesis is one of the central issues in molecular biology [2]. The regulation machinery turns out to be determined by proteins that bind to regulatory regions along the DNA, so the genetic information contained in each gene alone is not sufficient for gaining insight into biological processes. Understanding the mechanisms of a biological system requires knowledge about the regulatory network between genes, the so-called gene regulatory network.
A gene regulatory network describes the interactions between genes, thereby governing the rates at which genes are transcribed into mRNA. A special protein called a regulatory factor binds to a particular gene and then regulates the expression level of this gene (Figure 2.5). Gene regulation is often context sensitive; e.g., gene A upregulates gene C, but only if gene B is present as well. If a large number of measurements of gene expression levels are available, we can model and reverse-engineer the gene regulatory network that controls these expression levels.
cDNA microarray experiments provide the data and a "genomic" viewpoint on gene expression. Two different types of gene expression data can be used to reconstruct gene regulatory networks: time-series and steady-state data. Various modeling techniques have been proposed for them, including Boolean and probabilistic Boolean networks, Bayesian networks, recurrent neural networks, and systems of differential equations. Because a great amount of gene expression data is required, efficient computational methods are essential. Gene network reconstruction is still in its early phase of study, having started only about six years ago [2]. It is now one of the hot topics in systems biology and poses new challenges for reconstructing large-scale gene networks.
Figure 2.5: Illustration of the regulatory relationship between genes.
2.2.1. The Boolean Network Model
Originally introduced by Kauffman (Kauffman, 1969, 1974; Kauffman and Glass, 1973), the Boolean network model has recently received much attention from the biology community [28,29]. In this model, each gene exists in only two states, ON or OFF. The state of each gene at the next time step is determined by a Boolean function of the states of several genes at the current time step. A Boolean network is represented as a pair (V, F), where V = {x1, …, xn} denotes the states of the n genes and F = (f1, …, fn) is a list of Boolean functions. Each state xi may take one of two possible values, 1 or 0, corresponding to gene i being expressed or not, respectively. Boolean networks have been shown to be useful for gaining insight into the behavior of large interconnected networks. Even though Boolean networks use a binary representation of a continuous domain, a variety of algorithms have been developed for inferring Boolean networks (Somogyi et al. [27]). Moreover, several extensions such as noisy Boolean networks have been proposed in the work of Akutsu et al. [1,2].
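A minimal sketch of one synchronous update step of a Boolean network in C++ follows; representing each Boolean function fi as a callable on the whole current state vector, and the three-gene example wiring in the comment, are assumptions made purely for illustration.

```cpp
#include <functional>
#include <vector>
#include <cstddef>

// state[i] is the ON/OFF value of gene i; f[i] computes the next value of
// gene i from the current states of all genes (synchronous update).
std::vector<bool> booleanStep(const std::vector<bool>& state,
                              const std::vector<std::function<bool(const std::vector<bool>&)>>& f) {
    std::vector<bool> next(state.size());
    for (std::size_t i = 0; i < state.size(); ++i)
        next[i] = f[i](state);
    return next;
}

// Example wiring (illustrative only): x0' = x1 AND x2, x1' = NOT x0, x2' = x0 OR x1.
```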
Figure 2.6: A simple Boolean Network Model
2.2.2. Probabilistic Boolean Networks
In a Boolean network, the state of the target gene is determined with full certainty from the states of the regulatory genes. However, there is always uncertainty in gene expression data. Built on Boolean networks, probabilistic Boolean networks (PBNs) [28,29] are capable of accommodating these uncertainties both in the data and in model selection. A PBN is obtained by extending the Boolean network to allow more than one Boolean function for each node. We denote by Fi = {fj}, j = 1, …, li, the set of all possible Boolean functions for gene xi, and by li the number of such functions. At a given time point the PBN is characterized by a vector of Boolean functions, whose ith element is selected as the predictor for gene xi at that time point. This vector is in fact a standard Boolean network operating at that time and is called a "realization" of the PBN. Here the main problem is to learn the individual selection probabilities Cj of each element in a realization and the mechanism for computing the probability Pi that the ith Boolean network (realization) is selected. Shmulevich et al. (2002) pointed out that Pi can easily be expressed in terms of the individual selection probabilities Cj.
2.2.3. Bayesian Networks
Based on the assumption of conditional independence, Bayesian networks (Pearl, 1988) are becoming a promising tool for analyzing gene expression patterns, including the reconstruction of gene regulatory networks. A Bayesian network constructed for a set of random variables X is a pair B = (G, Q). The first component, G, is a directed acyclic graph (DAG) in which each vertex corresponds to one of the random variables x1, …, xn and each edge represents a direct dependency between two variables. The graph G satisfies the Markov assumption that each variable xi is independent of its non-descendants given its parents in G. This allows the joint distribution to be factorized according to the conditional independencies encoded in G, which reduces the number of required parameters. That is,

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Pa_i)$$

where Pa_i denotes the set of all parents of variable X_i in the graph G.

The second component, Q, is the set of network parameters describing the conditional distribution of each variable given its parents in G. That is,

$$\theta_{ijk} = P(X_i = x_i^k \mid Pa_i = pa_i^j)$$

for each possible state x_i^k of X_i and each configuration pa_i^j of Pa_i.
There are two main problems concerning Bayesian networks: learning the network and inference. The former, learning the Bayesian network, is the process of inferring from the data the topology that may have generated the data, together with the corresponding uncertainty distribution. More specifically, learning a Bayesian network has two phases: the first constructs the network structure as a directed acyclic graph (DAG), whereas the second learns the parameters of that network. Inferring a suitable structure is clearly more important than estimating accurate parameters. The number of Bayesian network topologies grows extremely quickly with the number of variables; in particular, the number of different DAGs over n nodes is given by Robinson's formula [16].
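This count is commonly written as the recurrence (quoted here from the standard result on counting labeled DAGs rather than reproduced from [16] directly):

$$r(n) = \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k}\, 2^{k(n-k)}\, r(n-k), \qquad r(0) = 1$$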
Chickering et al. (1994) showed that determining an optimal Bayesian network structure is NP-hard [8]. Therefore, greedy search strategies such as hill climbing or beam search are often used to generate the topology. The key idea behind hill climbing is to move to the best network among the neighbors of the current network generated by traversal operators. Instead of following only one search direction as in hill climbing, beam search follows several directions, each in turn being a hill-climbing search by itself. The greedy strategies tend to become trapped in local maxima and depend on the step size. Hence, several variations have been proposed to overcome these disadvantages. For example, background knowledge, such as the whole or a partial structure of the gene network or an ordering of the variables, can be incorporated into the learning algorithm. Genetic algorithms are another strategy that can be applied to learning Bayesian network structures [16].
2.2.4. Additive regulation models
The additive regulation models [2] assume that the change in each variable over time is given by a linear weighted sum of all other variables (Figure 2.7), as follows:
Figure 2.7: Additive regulation models
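In its commonly used discrete-time form (quoting the standard additive-model formulation), the model reads:

$$y_i(t+1) = \sum_{j} w_{ji}\, y_j(t) + b_i$$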
where yi is the level of the ith variable, bi is a bias constant indicating whether gene i is expressed in the absence of regulatory inputs, and the weight wji represents the influence of variable j on variable i. For a continuous-time system we obtain the following differential equation:
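In its standard form this reads:

$$\frac{dy_i(t)}{dt} = \sum_{j} w_{ji}\, y_j(t) + b_i$$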
Since most genes exhibit a sigmoidal dose-response curve, i.e., gene activation at first increases slowly, then more rapidly, and finally saturates at a maximum level, the above formula is modified with a sigmoidal transfer function:
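In its standard form this modification reads:

$$\frac{dy_i(t)}{dt} = S\!\Big(\sum_{j} w_{ji}\, y_j(t) + b_i\Big)$$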
where S(.) is the sigmoidal function, e.g.
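the logistic function (a standard choice, assumed here):

$$S(z) = \frac{1}{1 + e^{-z}}$$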
Moreover, the decay rate Di of the product of gene i plays an important role in its regulation, so this factor also needs to be integrated into the model of the gene regulatory network. With this extension, the additive regulatory model becomes:
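in its standard form,

$$\frac{dy_i(t)}{dt} = S\!\Big(\sum_{j} w_{ji}\, y_j(t) + b_i\Big) - D_i\, y_i(t)$$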
Chapter 3
Real data analysis and discussion
3.1. The proposed scheme for gene selection in the sample classification problem
Every genetic disease is caused by sequence variants of many genes; however, among these genes only a few play an important role in causing the disease. For the purpose of identifying disease genes, Lude Franke et al. (2006) developed a functional human gene network that comprises known interactions derived from the Biomolecular Interaction Network Database (BIND), the Human Protein Reference Database (HPRD), Reactome, and the Kyoto Encyclopedia of Genes and Genomes (KEGG). An empirical p-value is then calculated for each gene based on the gene network and used as a score to rank all genes related to the disease type in question.
Starting from the viewpoint of graph-based ranking algorithms, this thesis proposes a method for ranking the genes that cause a specific type of disease, i.e., the special characteristics of samples under a particular condition. It employs graph-based ranking algorithms on gene regulatory networks to rank all genes. Suppose that gene A is regulated by gene B; then gene B is said to play a more important role than gene A. We therefore use the hub score as the importance measure of each gene: the more genes a particular gene regulates, the more important that gene is.
In gene expression analysis, a gene is considered a factor characterizing a particular sample type after its expected expression value in that class is compared with its expected value in the other class. A large difference indicates differential expression, meaning that the gene is a decisive factor. However, a high expected value may also be produced by genes whose individual expression values differ widely, and with the same expected value in different sample classes a gene may be informative for one class but not for the other. Starting from this observation, we focus on genes whose expression values are stable over all samples belonging to a particular class. Stable expression is represented statistically by the variance: the smaller the variance, the more stably the gene is expressed. We then apply hypothesis testing to check whether the variance is less than or equal to a given small threshold θ. The hypothesis is accepted or rejected based on the sample variance S_x^2 computed from a random sample of size n, using the test statistic

$$x = \frac{(n-1)\, S_x^2}{\theta}$$

which follows a chi-square distribution with n - 1 degrees of freedom. From this test we obtain the list of genes whose expression values are stable across all samples of each class. We then take the intersection of this list with the list of top-ranking genes as the decisive factors for the disease type. The remainder of the selected set is filled with high-ranking genes whose correlation coefficient with some gene in the stable set is greater than a predefined threshold (Figure 3.1).
Figure 3.1: High-score genes in association with the stability feature
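A minimal sketch of this stability test in C++ follows; it computes the statistic (n-1)·S_x^2/θ for one gene within one class and compares it to a caller-supplied chi-square critical value, since a chi-square quantile routine is not included in this sketch.

```cpp
#include <vector>
#include <cstddef>

// Returns true if the gene's expression values in x (n >= 2 samples of one
// class) are judged "stable", i.e. the statistic (n-1) * S_x^2 / theta does
// not exceed the chi-square critical value for n-1 degrees of freedom.
bool isStableGene(const std::vector<double>& x, double theta, double chi2Critical) {
    std::size_t n = x.size();
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= n;
    double s2 = 0.0;                        // sample variance S_x^2
    for (double v : x) s2 += (v - mean) * (v - mean);
    s2 /= (n - 1);
    double stat = (n - 1) * s2 / theta;     // ~ chi-square with n-1 d.o.f. under H0
    return stat <= chi2Critical;
}
```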
For the problem of classifying tumor samples using gene expression data, it is obvious that among the large number of genes only a small subset is informative for determining the sample class; hence a gene selection process is needed. Figure 3.2 explains the proposed gene selection scheme. For each sample class ci (e.g., a cancer type), after the graph-based ranking algorithm has been applied, only a predefined number ki of genes with the highest ranking scores are selected as the representative feature set fi of class ci. This is repeated over all sample classes. Afterwards, all the representative feature sets fi are aggregated to generate a final representative feature set. Based on the idea that genes scored as important by the graph-based ranking algorithm for more than one class type are unlikely to be decisive factors for any particular class, the aggregation operation is chosen as the union of the parts of each feature set fi that are not shared with the other sets; this is denoted as the XOR operation in Table 3.1. The UNION operation (Table 3.1), in which the final feature set f is simply the union of all feature sets fi, is also considered.
3.2. Developing Environment
We have developed a program in the C++ language. Missing values are pre-processed simply by replacing each of them with the expected value (mean) of the remaining expression levels in the corresponding gene profile.
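A minimal sketch of this imputation step in C++ follows; missing entries are assumed to be encoded as NaN, which is an assumption about the data representation rather than a detail given in the thesis.

```cpp
#include <cmath>
#include <vector>
#include <cstddef>

// Replace each missing (NaN) entry of a gene profile by the mean of the
// remaining, observed expression levels of that profile.
void imputeMissing(std::vector<double>& profile) {
    double sum = 0.0;
    std::size_t count = 0;
    for (double v : profile)
        if (!std::isnan(v)) { sum += v; ++count; }
    if (count == 0) return;                 // nothing observed; leave the profile as is
    double mean = sum / count;
    for (double& v : profile)
        if (std::isnan(v)) v = mean;
}
```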
It is easy to see that the problem of reconstructing gene regulatory networks is extremely hard and time-consuming, especially when the number of genes reaches the thousands. Many research works have recently been published on building gene regulatory networks efficiently; however, these methods are still restricted to at most a few hundred genes.
Figure 3.2: The proposed method for gene selection in sample classification. For each class ci, a gene network is constructed from all samples belonging to that class and all genes are ranked; the resulting representative feature sets fi (f1, …, fn) are combined by an aggregation operator into the final feature set f.
Owing to this difficulty, a thorough study of reconstructing the gene regulatory network is beyond the scope of this thesis. For simplicity, the thesis builds the gene regulatory network from the measured correlation between pairs of genes. Two strategies are implemented: one that uses the graph-based ranking algorithm alone, and another that takes the stability of gene expression values into account together with the graph-based ranking algorithm. To assess the efficiency of the proposed gene selection scheme, the kNN (k-nearest neighbors) classifier is used.
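A minimal sketch of the correlation-based network construction in C++ follows; the Pearson correlation and the fixed absolute-correlation threshold are illustrative choices, since the thesis does not specify the exact correlation measure or cut-off used.

```cpp
#include <cmath>
#include <vector>
#include <cstddef>

// Pearson correlation between two gene profiles of equal length.
double pearson(const std::vector<double>& a, const std::vector<double>& b) {
    std::size_t n = a.size();
    double ma = 0.0, mb = 0.0;
    for (std::size_t s = 0; s < n; ++s) { ma += a[s]; mb += b[s]; }
    ma /= n; mb /= n;
    double cov = 0.0, va = 0.0, vb = 0.0;
    for (std::size_t s = 0; s < n; ++s) {
        cov += (a[s] - ma) * (b[s] - mb);
        va  += (a[s] - ma) * (a[s] - ma);
        vb  += (b[s] - mb) * (b[s] - mb);
    }
    return cov / std::sqrt(va * vb);
}

// Build an undirected gene network: connect two genes whenever the absolute
// value of their pairwise correlation exceeds `threshold`.
std::vector<std::vector<std::size_t>>
buildCorrelationNetwork(const std::vector<std::vector<double>>& expr, double threshold) {
    std::size_t nGenes = expr.size();
    std::vector<std::vector<std::size_t>> adj(nGenes);
    for (std::size_t i = 0; i < nGenes; ++i)
        for (std::size_t j = i + 1; j < nGenes; ++j)
            if (std::fabs(pearson(expr[i], expr[j])) > threshold) {
                adj[i].push_back(j);
                adj[j].push_back(i);
            }
    return adj;
}
```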
The first dataset is the gene expression data of the yeast Saccharomyces cerevisiae (Microarray Laboratory at Stanford University). It comprises 2,467 genes measured over samples belonging to two class types, the experimental conditions CDC and ELU. The analysis randomly splits the dataset into a training set and a testing set. The training data contain 16 samples, of which 8 belong to the CDC class and 8 to the ELU class. The testing data consist of 13 samples, of which 7 belong to the CDC class and 6 to the ELU class.
The second analysis uses the benchmark gene expression dataset of two subtypes of leukemia, denoted ALB and ALT for B-cells and T-cells respectively. In this dataset, 19 samples belong to class ALB and 8 samples belong to class ALT. The dataset is freely available at cancer/.
3.3. Analysis results
Table 3.1 presents the results of the analysis conducted with the kNN classifier. Each cell value denotes the number of false-positive misclassifications occurring in the test of the corresponding row when the kNN classifier uses the value of k given in the corresponding column. The table shows that the proposed gene selection scheme yields fewer false-positive misclassifications when k is greater than 10, especially when k equals 15. Moreover, combining the stability of gene expression values with the graph-based ranking algorithm gives even better results than the variant without the stability computation.
Table 3.1: Experimental results of the kNN classifier combined with gene selection
A good gene selection scheme is especially valuable for cDNA microarray datasets, where the number of genes usually exceeds one thousand, because almost all classification algorithms require a great amount of computation when processing data objects with a large number of variables (genes). As a result of the proposed scheme (Table 3.1), the number of genes is reduced significantly, to hundreds of genes compared with the original set of more than one thousand. Table 3.1 also shows that even when the number of genes is below 1,000, the classification accuracy in some cases is still equivalent to or better than that obtained without gene selection.
The scheme described in Figure 3.2 is implemented on an undirected network of genes. This structure clearly does not fully reflect a regulatory gene network, in which all edges are directed. This is probably the main reason why the proposed scheme (Table 3.1) does not always improve the result but does so only in some cases. Nevertheless, the proposed scheme remains meaningful in the literature.
Apart from reducing the computational burden by decreasing the number of genes (Golub, T.R., 1999), the proposed method also makes sense from a biological point of view. The method can be applied to find important genes causing a particular disease type. Because the graph-based algorithms consider global information, the method is especially well suited to cases where the expression level of a particular gene depends on the occurrences of all other genes. The obtained gene scores are therefore meaningful to a certain extent: genes with the highest scores may be considered key factors leading to the development of the cancer type in the studied samples. This can clearly help medical researchers find appropriate methods to best prevent the effects of the diseases.
The analysis ranked 999 genes in each of the two gene expression datasets of the two leukemia subtypes, ALB and ALT for B-cells and T-cells respectively, and identified the 10 genes with the highest hub ranking scores in each. Within each set of 10 genes there is one gene that has been observed to play an important role in causing leukemia [36]: cytoplasmic dynein light chain 1, accession number U32944, for the ALB subtype, and NADPH-flavin reductase, accession number D26308, for the ALT subtype (Figure 3.1). Although the characterization of the gene network is still limited, one gene known to cause leukemia was recovered among the top ten genes with the highest hub ranking scores in each subtype. This shows the benefit of the proposed method in discovering the important genes causing the cancer type.
Figure 3.1: The ten genes with the highest ranking scores obtained from the experiments
Conclusion and Future Works
In summary, this thesis has introduced the concepts related to cDNA microarray technology, an interesting field of study that offers many challenges for long-term research. cDNA microarray technology has been shown to have many applications in the life sciences, especially in cancer research.

Moreover, the thesis has proposed a new approach to identifying genes that play an important role in causing cancers using gene expression data, and a new scheme for the gene selection process has been introduced. The efficiency and biological relevance of the method were tested and discussed in Chapter 3. The gene selection scheme is useful not only for classifying tumor samples but also for further processing by other machine learning techniques such as clustering and pattern or model discovery, since the number of genes remaining after the gene selection process is significantly reduced.
Due to time limitations, only the HITS algorithm has been implemented to evaluate the proposed method; the PageRank algorithm should be deployed in future work. With its undirected-graph characteristic, the current gene network is similar to a social network, so strategies developed for social networks could also be employed.
Moreover, the analysis results are compared only with those obtained from the kNN classifier; this can certainly be extended to other classifiers such as naïve Bayes, SVM, and decision trees. Finally, as technology to measure protein abundance matures, the proposed method can surely be applied to analyze protein expression data as well.
REFERENCES
[1]. Akutsu, T., Miyano, S., and Kuhara, S., Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. In Altman et al. 16, pp. 17-28.
[2]. Akutsu, T., Miyano, S., and Kuhara, S. Algorithms for inferring qualitative
models of biological networks. Proc. Pacific Symposium on Biocomputing,
290-301(2000)
[3]. Andrew D Keller, Michel Schummer, Walter L Ruzzo, Lee Hood, "Bayesian
classification of DNA array expression data", 08-01 (Computer Science and
Engineering, Univ Washington, Aug 2000).
[4]. Baeza-Yates, R., Boldi, P. and Castillo, C., Generalizing PageRank: Damping
functions for link-based ranking algorithms. In Proceedings of SIGIR, Seattle,
Washington, USA, ACM Press, August 2006.
[6]. Bengtsson, H., Introduction to cDNA microarray analysis, Lund University, Sweden.
[7]. Brin, S., Page, L., The Anatomy of a Large-scale Hypertextual Web Search
Engine, Proceedings 7th WWW Conference, 107–117 (1998).
[8]. Chickering, D. M., Geiger, D., and Heckerman, D.. Learning Bayesian
networks is NP-hard. Technical Report MSR-TR-94-17, Microsoft Research,
1994.
[9]. Chow, M.L., Moler, E.J., and Mian, I.S. Identifying marker genes in
transcription profiles data using a mixture of feature relevance experts.
Physiol. Genomics, 5: 99-111(2001)
[10]. Crammer, K., and Singer, Y., A New Family of Online Algorithms for
Category Ranking, Proceedings of the 25rd Conference on Research and
Development in Information Retrieval (SIGIR), 151-158 (2002). Tampere,
Finland.
[11]. Crammer, K. and Singer, Y., PRanking with Ranking, Proceedings of the
Fourteenth Annual Conference on Neural Information Processing Systems
(NIPS), 641-647 (2001)
[12]. Dang Thanh Hai, Nguyen Thu Trang, Ha Quang Thuy, Graph of Concepts
Based Text Summarization, The 9th National Conference on Information
Technology of Vietnam, 6/2006
[13]. Dang Thanh Hai, Nguyen Huong Giang, Ha Quang Thuy, Naive Bayes text
classification algorithm and problem of specifying clasifying threshold in
search engine, Journal of Computer Science And Cybernetics 21(2):, 152-161
(2005)
[14]. Dudoit, S. Fridlyand, J.& Speed, T. P. Comparison of Discrimination
Methods for the Classification of Tumors Using Gene Expression Data J. Am.
Stat. Assoc. 97: 77–87(2002).
[15]. Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. An efficient boosting
algorithm for combining preferences. Journal of Machine Learning Research
4: 933–969(2003).
[16]. Friedman, N. and Koller, D. Being bayesian about network structure, in
C.Boutilier and M.Godszmidt (eds), Uncertainty in Articial Intelligence,
Morgan Kaufmann Publishers, 201-210 (2000).
[17]. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J., Caligiuri, M., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286: 531–537 (1999).
[18]. Hollmén,J., Tresp, V. and Simula, O., A learning vector quantization
algorithm for probabilistic models. In Proceedings of EUSIPCO 2000 - X
European Signal Processing Conference ,Volume II, 721 –724.
[19].
[20].
[21]. Kleinberg, J., Authoritative Sources in a Hyperlinked Environment, Journal
of the ACM 46 (5): 604–632 (1999).
[22]. Kleinberg, J., Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A., The
Web as a Graph: Measurements, Models, and Methods, Proceedings 5th
COCOON Conference, 1–17 (1999).
[23]. Lempel, R., Moran, S., SALSA: the Stochastic Approach for Link-structure
Analysis, ACM Transactions on Information Systems 19 (2) 131–160 (2001).
[24]. Mitchell, T., Machine Learning, McGraw Hill, 1997.
[25]. Page, L., Brin, S., Motwani, R. and Winograd, T., The PageRank citation
ranking: bringing order to the Web. Tech. report, Stanford University, 1998.
[26]. Raychaudhuri, S., Stuart JM, Altman RB, Principal components analysis to
summarize microarray experiments: application to sporulation time series.
Pac. Symp. Biocomput., 455–466 (2000).
[27]. Somogyi, R., Fuhrman, S., Askenazi, M., and Wuensche, A. The gene
expression matrix: towards the extraction of genetic network architectures.
Proceedings of the Second World Congress of Nonlinear Analysts (WCNA96),
vol. 30 of Nonlinear Analysis, Pergamon Press (1996).
[28]. Shmulevich, E.R. Dougherty, and W. Zhang, From Boolean to probabilistic
Boolean networks as models of genetic regulatory networks, Proceedings of
the IEEE, 90 (11): 1778-1792 (2002).
[29]. Shmulevich, E., Dougherty, R., Kim, S., Zhang, W., Probabilistic Boolean
Networks: A Rule-based Uncertainty Model for Gene Regulatory Networks,
Bioinformatics, 18 (2): 261-274 (2002).
[30]. Sidiropoulos A., Manolopoulos Y., Generalized Comparison of Graph-based
Ranking Algorithms for Publications and Authors, Journal for Systems and
Software, 79 (12): 1679-1700 (2006)
[31]. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T.,Tibshirani,
R., Botstein, D., and Altman, R.B., Missing value estimation methods for
DNA microarrays, Bioinformatics, 17(6): 520-525 (2001)
[32]. van Dijk, S., Thierens, D. and van der Gaag, L. C. Building a GA from design
principles for learning Bayesian networks, Proceedings of the Genetic and
Evolutionary Computation Conference, volume 2723 of Lecture Notes in
Computer Science, 2003.
[33]. von Haeseler Arndt, Introduction to Bioinformatics Course, Lecture Notes,
Hanoi, 10-2005.
[34]. William W. Cohen, Robert E. Schapire, Yoram Singer, Learning to Order
Things. Advances in Neural Information Processing Systems, 1999
[35]. Werner Dubitzky, Martin Granzow, C. Stephen Downes, Daniel Berrar, "Practical Approach to Microarray Analysis", Oxford University Press, 1999.
[36]. Zhong Guan and Hongyu Zhao, A semiparametric approach for marker gene
selection based on gene expression data, Bioinformatics 21(4):529-536 (2005)