FOREWORD
CHAPTER 1. INTRODUCTION TO GENE EXPRESSION DATA
  1.1. Gene expression
  1.2. DNA microarray experiments
  1.3. High-throughput microarray technology
  1.4. Microarray data analysis
    1.4.1. Pre-processing step on raw data
      1.4.1.1. Processing missing values
      1.4.1.2. Data transformation and discretization
      1.4.1.3. Data reduction
      1.4.1.4. Normalization
    1.4.2. Data analysis tasks
      1.4.2.1. Classification on gene expression data
      1.4.2.2. Feature selection
      1.4.2.3. Performance assessment
  1.5. Research topics on cDNA microarray data
CHAPTER 2. GRAPH BASED RANKING ALGORITHMS WITH GENE NETWORKS
  2.1. Graph based ranking algorithms
  2.2. Introduction to gene networks
    2.2.1. The Boolean network model
    2.2.2. Probabilistic Boolean networks
    2.2.3. Bayesian networks
    2.2.4. Additive regulation models
CHAPTER 3. REAL DATA ANALYSIS AND DISCUSSION
  3.1. The proposed scheme for gene selection in the sample classification problem
  3.2. Developing environment
  3.3. Analysis results
REFERENCES
then be either eliminated from or retained in the final subset depending on whether it satisfies some predefined criterion, such as an information-gain or entropy-based measure, a statistical test, or an interdependence analysis. In most situations, the set of variables obtained by such selection methods may still contain correlated genes. Moreover, some genes that are filtered out only reveal their usefulness in conjunction with other genes (variables).
Multivariate feature selection methods, such as cluster analysis techniques and multivariate decision trees, take more than one gene (variable) into account at once; they compute a correlation matrix or covariance matrix to detect redundant and correlated variables. In the covariance matrix, variables that tend to take large values also tend to have large covariance scores. The correlation matrix is calculated in the same fashion, but its elements are normalized into the interval [-1, 1] to eliminate this effect of large-valued variables [35].
The original set of genes (variables) can be reduced by a procedure that merges each subset of highly correlated genes (variables) into a single variable, so that the derived set contains mutually largely uncorrelated variables while still preserving the original information content. For example, a set of highly correlated gene or array profiles can be replaced by an average profile that conveys most of the profiles' information.
In addition, Principal Component Analysis (PCA), which summarizes patterns of correlation and provides a basis for predictive models, is a feature-merging method commonly used to reduce microarray data [26].
1.4.1.4. Normalization
Ideally, the expression matrix contains the true level of transcript abundance for each measured gene-sample combination. In practice, however, measurement conditions are inherently biased, so the measured values usually deviate from the true expression level by some amount: measured level = true level + error, where the error arises from the systematic tendency of the measurement instrument to report values that are too low or too high [35] and from random measurement mistakes. The former component is called bias and the latter variance, so the error is the sum of bias and variance. The variance component is often normally distributed, meaning that erroneous measurements in both directions are equally frequent and that small deviations are more frequent than large ones.
Normalization is a numerical method designed to deal with such measurement errors and with biological variation. After the raw data are pre-processed with a transformation procedure, e.g., a base-2 logarithm, the resulting matrix can be normalized by multiplying each element of an array by an array-specific factor such that the mean value is the same for all arrays. A stronger requirement is to rescale each array so that its mean equals 0 and its standard deviation equals 1.
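A minimal sketch of this per-array normalization in C++ (the language used later for the thesis implementation) is given below; the log2 transform followed by zero-mean, unit-standard-deviation scaling per array is shown, and the row-per-array layout of the matrix is an assumption made for illustration.

```cpp
#include <cmath>
#include <vector>
#include <cstddef>

// arrays[a][g]: raw intensity of gene g measured on array a.
// Log2-transform each value, then scale each array to mean 0 and
// standard deviation 1 so that arrays become comparable.
void normalizeArrays(std::vector<std::vector<double>>& arrays) {
    for (std::vector<double>& a : arrays) {
        for (double& v : a) v = std::log2(v);          // base-2 log transformation
        double mean = 0.0;
        for (double v : a) mean += v;
        mean /= a.size();
        double sd = 0.0;
        for (double v : a) sd += (v - mean) * (v - mean);
        sd = std::sqrt(sd / (a.size() - 1));
        for (double& v : a) v = (v - mean) / sd;       // mean 0, standard deviation 1
    }
}
```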
1.4.2. Data analysis tasks
Once the data pre-processing step has been carried out, a numerical analysis method is applied according to the scientific analysis task. The elementary tasks can be divided into two categories: prediction and pattern detection (Figure 1.10). Given the scope of this thesis, only two topics, classification and gene regulatory networks, are discussed in the following sections.
Figure 1.10: Two classes of data analysis tasks for microarray data. Prediction tasks include classification, regression or estimation, and time-series prediction; pattern-detection tasks include clustering, correlation analysis, association analysis, deviation detection, and visualization.
1.4.2.1. Classification on gene expression data
Classification is a prediction or supervised-learning problem in which data objects are assigned to one of k predefined classes {c1, c2, …, ck}. Each data object is characterized by a set of g measurements that form the feature vector, or vector of predictor variables, X = (x1, …, xg), and is associated with a dependent variable (class label) Y ∈ {1, 2, …, k}. The classification is called binary if k = 2 and multi-class otherwise. Informally, a classifier C can be thought of as a partition of the feature space into k disjoint and exhaustive subsets A1, …, Ak containing the data objects assigned to classes c1, …, ck respectively.
Classifiers are derived from a training set L = {(x1, y1), …, (xn, yn)} in which each data object is known to belong to a certain class. The notation C(.; L) denotes a classifier built from the learning set L [24]. For gene expression data, the data objects are the biological samples to be classified, the features correspond to the expression measures of the different genes across the samples studied, and the classes correspond to different types of tumors (e.g., nodal-positive vs. negative breast tumors, or tumors with good vs. bad prognosis). The process of classifying tumor samples is closely tied to the gene selection mentioned above, i.e., the identification of marker genes that characterize the different tumor classes.
For the classification problem on microarray data, one has to assign each sample profile to one of the predefined tumor types. Each gene corresponds to a feature variable whose value domain contains all possible gene expression levels. The expression levels may be either absolute (e.g., Affymetrix oligonucleotide arrays) or relative to the expression levels of a well-defined common reference sample (e.g., two-color cDNA microarrays). The main obstacle encountered in the classification of microarray data is the very large number of genes (variables) relative to the number of tumor samples, the so-called "large p, small n" problem. Typical expression data contain from 5,000 to 10,000 genes for fewer than 100 tumor samples.
The problem of classifying biological samples using gene expression data has become a key issue in cancer research. Successful cancer diagnosis and treatment require a reliable and precise classification of tumors. Recently, many researchers have published work on statistical aspects of classification in the context of microarray experiments [14,17], focusing mainly on existing methods or variants derived from them. Studies to date suggest that simple methods such as k-nearest neighbors [17] or naïve Bayes classification [13,3] perform as well as more complex approaches such as Support Vector Machines (SVMs) [14]. This section discusses the naïve Bayes and k-nearest-neighbor methods. Finally, we describe the issue of performance assessment.
The naïve Bayes classification
Suppose that the likelihoods p_k(x) = p(x | Y = k) and the class priors π_k are known for all possible class values k. Bayes' theorem can then be used to compute the posterior probability p(k | x) of class k given the feature vector x as

$$p(k \mid x) = \frac{\pi_k\, p_k(x)}{\sum_{l=1}^{K} \pi_l\, p_l(x)}$$

The naïve Bayes classifier predicts the class C_B(x) of an object x by maximizing this posterior probability:

$$C_B(x) = \arg\max_k\, p(k \mid x)$$
Depending on whether parametric or non-parametric estimation is used, there are two general schemes for estimating the class posterior probabilities p(k | x): density estimation and direct function estimation. In the density estimation approach, the class-conditional densities p_k(x) = p(x | Y = k) (and the priors π_k) are estimated separately for each class and Bayes' theorem is applied to obtain estimates of p(k | x). Maximum likelihood discriminant rules (Fisher, 1922), learning vector quantization [18], and Bayesian belief networks [8] are examples of the density estimation approach. In the direct function estimation approach, the posteriors p(k | x) are estimated directly, typically with regression techniques [19]. Examples of this approach are logistic regression [19], neural networks [19], classification trees [20], and nearest neighbor classifiers [17].
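As an illustration of the density estimation scheme, a minimal Gaussian naïve Bayes sketch in C++ follows; the assumption that each class-conditional density factorizes into independent normal distributions per gene is the usual naïve Bayes simplification, and the per-class means, variances, and priors are supposed to have been estimated from the training set beforehand.

```cpp
#include <cmath>
#include <vector>
#include <cstddef>

// Gaussian naive Bayes sketch: class priors pi_k and per-class,
// per-feature means/variances are assumed to be estimated from the training set.
struct GaussianNB {
    std::vector<double> prior;                      // pi_k for each class k
    std::vector<std::vector<double>> mean, var;     // [class][feature]

    // log(pi_k * p_k(x)); the denominator of Bayes' theorem is shared
    // by all classes and can therefore be ignored when maximizing.
    double logScore(std::size_t k, const std::vector<double>& x) const {
        const double PI = 3.14159265358979323846;
        double s = std::log(prior[k]);
        for (std::size_t g = 0; g < x.size(); ++g) {
            double d = x[g] - mean[k][g];
            s -= 0.5 * std::log(2.0 * PI * var[k][g]) + d * d / (2.0 * var[k][g]);
        }
        return s;
    }

    // C_B(x) = argmax_k p(k | x)
    std::size_t classify(const std::vector<double>& x) const {
        std::size_t best = 0;
        for (std::size_t k = 1; k < prior.size(); ++k)
            if (logScore(k, x) > logScore(best, x)) best = k;
        return best;
    }
};
```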
Nearest Neighbor Classifiers
Nearest neighbor classifiers were developed by Fix and Hodges (1951). Based on a distance function for pairs of samples, such as the Euclidean distance, the basic k-nearest-neighbor (kNN) classifier classifies a new object on the basis of the learning set. First, it finds the k samples in the learning set closest to the new object. Then, it predicts the class by majority vote, i.e., it chooses the class that is most common among those k nearest neighbors.
In kNN, the number of neighbors k should be chosen carefully so as to maximize the performance of the classifier, which is still a challenging problem in most cases. A common approach is to select several candidate values of k and to run the classifier on the training set for each of them. The error rate for each value of k is estimated by leave-one-out cross-validation, and the value of k with the smallest cross-validation error rate is chosen as the best parameter for the classifier. The nearest neighbor rule can be refined and extended to deal with unequal class priors, differential misclassification costs, and feature selection. Instead of treating all neighbors equally, the refined version introduces a weighted voting scheme in which each neighbor is assigned a weight reflecting its contribution to the decision on the new object's class.
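A minimal unweighted kNN sketch in C++ follows; Euclidean distance and simple majority voting are assumed, and the choice of k (e.g., by the leave-one-out procedure described above) is left to the caller.

```cpp
#include <algorithm>
#include <cmath>
#include <map>
#include <vector>
#include <cstddef>

struct Sample { std::vector<double> x; int label; };

// Classify a new profile by majority vote among its k nearest training samples.
int knnClassify(const std::vector<Sample>& train,
                const std::vector<double>& query, std::size_t k) {
    std::vector<std::pair<double, int>> dist;             // (distance, label)
    for (const Sample& s : train) {
        double d2 = 0.0;
        for (std::size_t g = 0; g < query.size(); ++g) {
            double d = s.x[g] - query[g];
            d2 += d * d;
        }
        dist.push_back({std::sqrt(d2), s.label});
    }
    if (k > dist.size()) k = dist.size();
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());
    std::map<int, int> votes;                              // label -> vote count
    for (std::size_t i = 0; i < k; ++i) ++votes[dist[i].second];
    return std::max_element(votes.begin(), votes.end(),
        [](const std::pair<const int, int>& a, const std::pair<const int, int>& b) {
            return a.second < b.second;
        })->first;
}
```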
1.4.2.2. Feature selection
Feature selection, as mentioned above, is one of the most important issues in classification, especially when applied to microarray data. In machine learning [24], feature selection methods fall into two categories: filter and wrapper methods. In the former, feature selection is performed before the classifier is built. Univariate test statistics such as t- or F-statistics, ad hoc signal-to-noise statistics [17], non-parametric Wilcoxon statistics [19], or p-values are used to produce a subset of genes that serves as the feature set. The selection is carried out according to a predefined number of genes or a cut-off on the test statistic. In wrapper methods, feature selection is an inherent part of the classifier-building procedure, and different classifiers use different approaches to it. For example, in classification trees (CART; Breiman et al., 1984), features are selected at each step by pruning the tree using cross-validation. In nearest neighbor classification, automatic feature selection can be obtained by suitably modifying the distance function.
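A minimal sketch of a filter-type selection step in C++ follows; it ranks genes by a two-sample t-like statistic and keeps a fixed number of top genes, both of which are illustrative choices, and only the two-class case is handled.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>
#include <cstddef>

// expr[g][s]: expression of gene g in sample s; label[s] must be 0 or 1.
// Returns the indices of the `keep` genes with the largest |t| statistics.
std::vector<std::size_t> filterSelect(const std::vector<std::vector<double>>& expr,
                                      const std::vector<int>& label,
                                      std::size_t keep) {
    std::vector<std::pair<double, std::size_t>> score;     // (|t|, gene index)
    for (std::size_t g = 0; g < expr.size(); ++g) {
        double m[2] = {0.0, 0.0}, v[2] = {0.0, 0.0};
        int n[2] = {0, 0};
        for (std::size_t s = 0; s < label.size(); ++s) { m[label[s]] += expr[g][s]; ++n[label[s]]; }
        m[0] /= n[0]; m[1] /= n[1];
        for (std::size_t s = 0; s < label.size(); ++s) {
            double d = expr[g][s] - m[label[s]];
            v[label[s]] += d * d;
        }
        v[0] /= (n[0] - 1); v[1] /= (n[1] - 1);             // unbiased class variances
        double t = (m[0] - m[1]) / std::sqrt(v[0] / n[0] + v[1] / n[1]);
        score.push_back({std::fabs(t), g});
    }
    std::sort(score.rbegin(), score.rend());                // largest |t| first
    std::vector<std::size_t> selected;
    for (std::size_t i = 0; i < keep && i < score.size(); ++i)
        selected.push_back(score[i].second);
    return selected;
}
```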
1.4.2.3. Performance assessment
Obviously, different classifiers exhibit different accuracy rates, so a technique is needed to estimate the classification error reliably; this guarantees that the best classifier is chosen for later deployment. It is particularly important for the problem of classifying tumor samples, because a misclassification could lead to misdiagnosis and assignment to an improper treatment protocol.
For the purpose of performance assessment, we need a test set of labeled objects that is independent of the available learning set. The classifier is applied to the test set and the classification error rate is computed as the proportion of test cases whose predictions disagree with the true class labels. In practice, the original learning set L is randomly divided into V subsets Lv, v = 1, …, V, of nearly equal size. The set L - Lv is used as the training set for building the classifier, and the error rate is computed on the validation set Lv. This procedure is repeated over each subset Lv and the average error is reported. This scheme for performance estimation is called cross-validation. Its limitation is that it reduces the effective sample size available for training, which matters especially for microarray data, where the number of samples is relatively small.
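A minimal V-fold cross-validation sketch in C++ follows; the classifier is abstracted as a callable that trains on one set of sample indices and returns predictions for another, which is a simplification made purely for illustration.

```cpp
#include <algorithm>
#include <functional>
#include <numeric>
#include <random>
#include <vector>
#include <cstddef>

// trainAndPredict(trainIdx, testIdx) builds a classifier on the samples in
// trainIdx and returns predicted labels for the samples in testIdx.
double crossValidationError(
    std::size_t n, std::size_t V, const std::vector<int>& trueLabel,
    const std::function<std::vector<int>(const std::vector<std::size_t>&,
                                         const std::vector<std::size_t>&)>& trainAndPredict) {
    std::vector<std::size_t> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::shuffle(idx.begin(), idx.end(), std::mt19937(0));   // random split into V folds
    std::size_t errors = 0;
    for (std::size_t v = 0; v < V; ++v) {
        std::vector<std::size_t> train, test;
        for (std::size_t i = 0; i < n; ++i)
            (i % V == v ? test : train).push_back(idx[i]);
        std::vector<int> pred = trainAndPredict(train, test);
        for (std::size_t i = 0; i < test.size(); ++i)
            if (pred[i] != trueLabel[test[i]]) ++errors;      // count discordant predictions
    }
    return static_cast<double>(errors) / n;                   // averaged error rate
}
```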
1.5. Research topics on cDNA microarray data
cDNA microarray experiments are relevant to many research fields. Differential gene expression is the first field of study; it investigates single gene profiles to find genes that show different expression levels under different experimental conditions, such as different tissue types or different developmental stages of the organism. Its most frequent application is the investigation of gene expression levels in a normal-versus-diseased comparison.
The second research field is the study of gene co-regulation, which mainly focuses on comparing more than one gene profile to identify genes whose expression levels are correlated under certain experimental conditions. Gene co-regulation has two basic forms: positive and negative. Two genes are said to show positive co-regulation if their expression levels exhibit the same increasing or decreasing tendency; they show negative co-regulation if the tendencies are in opposite directions. For example, genes a and c in our four-gene example (Table 1.6) exhibit a negative co-regulation pattern across the ten studied samples.
The third field of study is gene function identification. Here the function of a new gene can be revealed by comparing its expression profile against the profiles of genes with known functions: the known functions of the genes with the highest similarity are used to infer the function of the new gene.
The next topic is the study of the processes inside cells in order to understand gene function; the fourth research field, identification of pathways and gene regulatory networks, addresses this question. It helps biologists understand the processes by which genes and their products (i.e., proteins) play their roles in cells, tissues, and organisms. Given a pathway, the changes in the expression levels of the investigated genes are monitored in order to identify the pathways involved. Moreover, the expression level of a gene is also regulated by the products of other genes, which means that there exists a gene regulatory network through which the activity of one gene is regulated by others. Reconstructing this gene regulatory network is therefore the main goal of this field of study. Such studies require time-course microarray data.
Fifthly, in clinical diagnosis, cDNA microarray data are useful for discovering expression patterns that are characteristic of a particular disease and also for inferring unknown subtypes of known diseases. This is achieved by revealing characteristically different expression profiles that correlate with clinically distinct subtypes of a disease.
For each disease already identified, suppose that we know a chemical compound used to treat it. It is then necessary to study how the gene expression pattern changes in response to different dosages of this compound. This is the goal of the sixth research field, the dose-response study.
Diseases are often correlated with common sequence variations, such as insertions or deletions of a few nucleotides or changes in the number of repeats of a motif, i.e., of any sequence pattern that determines the molecule's function, structural features, or family membership. The seventh research field, called sequence variation, takes advantage of cDNA microarray data to discover these variations. The most common type of sequence variation is the single nucleotide polymorphism (SNP) [35], which occurs with a frequency of roughly 1 per 1,000 nucleotides. Microarray experiments are well suited to analyzing SNP variation when the experiment follows one of the following three designs: (a) arrays including all known SNPs of a human genome sequence, (b) microarrays containing a sample of SNPs located across the entire human genome, and (c) devices for re-sequencing a sample of the human genomic sequence.
Finally, cDNA microarray data are sometimes time series, in which the expression level of each gene is repeatedly measured at intervals of time from the same source. The eighth research field, the time-course study, was born for the purpose of handling this special temporal characteristic of time-series data. Such studies are also useful for investigating cell-cycle phenomena and for reconstructing gene networks.
Chapter 2
Graph based ranking algorithms with gene networks
2.1. Graph Based Ranking Algorithms
Ranking is the problem of ordering a given set such that more important elements come first in the ordered list. With X denoting the set of objects under consideration, the ranking function is usually defined as a function f: X → R that assigns a ranking value to each object such that more important objects receive higher ranking scores.
The problem of ranking has recently gained much attention in machine learning [10,11,15,34]. Although the main form of data considered so far is vector-valued data, there sometimes exist intrinsic relationships among the objects in the set. Webpages, journals and conferences, publications, and scholarly authors are the best illustrations of such intrinsic relationships, i.e., link or citation relationships [7,21,30]. Here each object corresponds to a node in a graph and the intrinsic relationships are described by directed edges between the nodes. Graph-based ranking algorithms take this characteristic into account to improve the accuracy of the ranking result. When used in information retrieval systems, graph-based ranking algorithms exploit the anchor links among webpages to generate, from millions of pages, a list of webpages with the highest relevance to the topics requested by end users. Graph-based ranking algorithms are also used to compute the impact factor of journals and conferences: each journal or conference is represented as a vertex, and each weighted edge between vertices represents the total number of citations from one journal or conference to another. Apart from this, a reputation ranking of paper authors can also be calculated.
PageRank by Brin and Page [7] and Hypertext Induced Topic Selection (HITS) by Kleinberg [21,22] are the most successful graph-based ranking algorithms. For the rest of the thesis we use the following notation.
G = (V, E) denotes a directed graph with vertex set V and edge set E. Inode(v) denotes the set of all nodes pointing to node v, and Onode(v) denotes the set of all nodes that node v points to (Figure 2.1).
HITS Algorithm
Originally, the HITS algorithm was applied to web documents: the documents are the vertices of the graph and the links between them are the edges. The HITS algorithm assigns each vertex vi a so-called authority score a(vi) and a hub score h(vi). The algorithm defines a good hub as a vertex with many outgoing edges, and a good authority as a vertex receiving many incoming edges. Hubs and authorities exhibit a mutual relationship: a better hub links to many good authorities, and a better authority is pointed to by many good hubs. The authority and hub scores of each vertex are computed recursively from the global information of all vertices in the graph, rather than from local information about the node alone, as follows:
$$a(v_i) = \sum_{v_j \in Inode(v_i)} h(v_j), \qquad h(v_i) = \sum_{v_j \in Onode(v_i)} a(v_j)$$
The authority and hub scores have been proven to converge after a number of iterations. Moreover, each edge connecting two vertices may reflect a different degree of influence of the pointing node on the pointed-to node. If each edge is assigned a particular weight, the HITS algorithm is extended as follows:
$$a(v_i) = \sum_{v_j \in Inode(v_i)} w_{ji}\, h(v_j), \qquad h(v_i) = \sum_{v_j \in Onode(v_i)} w_{ij}\, a(v_j)$$

Figure 2.1: A graph G with the incoming edges from Inode(v) and the outgoing edges to Onode(v) of a vertex v.
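A minimal sketch of the unweighted HITS iteration in C++ follows; the fixed number of iterations and the per-iteration L2 normalization are illustrative choices rather than details taken from the thesis implementation.

```cpp
#include <cmath>
#include <utility>
#include <vector>
#include <cstddef>

// inode[i]: indices of nodes pointing to i; onode[i]: indices that i points to.
// Returns (authority, hub) scores after `iters` mutual-update iterations.
std::pair<std::vector<double>, std::vector<double>>
hits(const std::vector<std::vector<std::size_t>>& inode,
     const std::vector<std::vector<std::size_t>>& onode, int iters = 50) {
    std::size_t n = inode.size();
    std::vector<double> a(n, 1.0), h(n, 1.0);
    auto normalize = [](std::vector<double>& v) {
        double s = 0.0;
        for (double x : v) s += x * x;
        s = std::sqrt(s);
        if (s > 0.0) for (double& x : v) x /= s;
    };
    for (int t = 0; t < iters; ++t) {
        std::vector<double> a2(n, 0.0), h2(n, 0.0);
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j : inode[i]) a2[i] += h[j];   // a(v_i) = sum of hub scores of in-neighbors
            for (std::size_t j : onode[i]) h2[i] += a[j];   // h(v_i) = sum of authority scores of out-neighbors
        }
        normalize(a2); normalize(h2);
        a = a2; h = h2;
    }
    return {a, h};
}
```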
PageRank Algorithm
Introduced by Page et al. [25] as a solution for ranking webpages, the PageRank algorithm computes the importance of a page based on the following principle: "a page with high PageRank is a page referenced by many pages with high PageRank". Being simple and robust, the algorithm is implemented as part of many of today's commercial search engines, such as Google. The PageRank score of a vertex can be regarded as a "vote" for its importance cast by all the other vertices pointing to it. A vertex v_j with |Onode(v_j)| outgoing links conveys, along each of these links, the vote value PR(v_j)/|Onode(v_j)| to the vertex it points to (Figure 2.2). The PageRank PR(.) of a page is therefore calculated from the PR(.) of its incoming nodes as follows:
$$PR(v_i) = (1 - d) + d \sum_{v_j \in Inode(v_i)} \frac{PR(v_j)}{|Onode(v_j)|}$$
where d is a predefined parameter called the damping factor, usually set to the default value 0.85. Recently, several methods such as exponential damping, quadratic hyperbolic damping, general hyperbolic damping, and empirical damping have been proposed for efficiently selecting a suitable value of the damping factor [4].
Figure 2.2: The idea behind the PageRank algorithm: node v_j passes the vote PR(v_j)/|Onode(v_j)| along each of its outgoing edges.
Starting from arbitrary initial values assigned to each node in the graph, the PR values are recursively recomputed until convergence. It has been proven that the final values are not influenced by the initial values, although the number of iterations needed to reach convergence may change [7].
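For comparison with HITS, a minimal PageRank iteration sketch in C++ follows; the convergence tolerance and the iteration cap are illustrative assumptions.

```cpp
#include <cmath>
#include <vector>
#include <cstddef>

// inode[i]: in-neighbors of node i; outdeg[j]: |Onode(v_j)|.
std::vector<double> pageRank(const std::vector<std::vector<std::size_t>>& inode,
                             const std::vector<std::size_t>& outdeg,
                             double d = 0.85, double eps = 1e-8, int maxIter = 200) {
    std::size_t n = inode.size();
    std::vector<double> pr(n, 1.0);                   // arbitrary initial values
    for (int t = 0; t < maxIter; ++t) {
        std::vector<double> next(n, 0.0);
        double change = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double s = 0.0;
            for (std::size_t j : inode[i])
                if (outdeg[j] > 0) s += pr[j] / outdeg[j];
            next[i] = (1.0 - d) + d * s;              // PR(v_i) = (1-d) + d * sum PR(v_j)/|Onode(v_j)|
            change += std::fabs(next[i] - pr[i]);
        }
        pr = next;
        if (change < eps) break;                      // stop once the scores have converged
    }
    return pr;
}
```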
By the definition above, the PageRank algorithm has some weaknesses. First, a node x receives a higher score if the connected component containing the nodes that point to x contains cycles. An example is illustrated in Figure 2.3: in the graph on the left-hand side, node 0 receives 4 citations, whereas nodes 10 and 6 in the other two graphs receive 3 citations each. Nevertheless, PageRank assigns nodes 10 and 6 higher scores than node 0, because nodes 10 and 6 are part of a cycle.

Figure 2.3: Graph topologies in which PageRank is weak.
The second weakness is that a node's score is influenced more by the scores of its incoming nodes than by the number of incoming edges. In Figure 2.4, for example, although node 1 receives 6 citations while node 0 receives only one, node 0 still obtains a higher score than node 1 under the PageRank algorithm.
Figure 2.4: The second type of graph topology on which PageRank is weak.
2.2. Introduction to Gene Network
Understanding regulation during the process of protein synthesis is one of the central issues in molecular biology [2]. The regulation machinery turns out to be determined by proteins that bind to regulatory regions along the DNA, so the genetic information contained in each gene alone is not sufficient for gaining insight into biological processes. Understanding the mechanisms of a biological system requires knowledge about the regulatory network between genes, the so-called gene regulatory network.
A gene regulatory network describes the interactions between genes, thereby governing the rates at which genes are transcribed into mRNA. A special protein called a regulatory factor binds to a particular gene and then regulates the expression level of this gene (Figure 2.5). Gene regulation is often context sensitive; e.g., gene A upregulates gene C, but only if gene B is present as well. If a large number of measurements of gene expression levels are available, we can model and reverse-engineer the gene regulatory network that controls these expression levels.
cDNA microarray experiments provide the data and a "genomic" viewpoint on gene expression. Two different types of gene expression data can be used to reconstruct gene regulatory networks: time-series and steady-state data. Various modeling techniques have been proposed for them, including Boolean and probabilistic Boolean networks, Bayesian networks, recurrent neural networks, and systems of differential equations. Because a great amount of gene expression data is required, efficient computational methods are essential. Gene network reconstruction is still in its early phase of study, having started only about six years ago [2]. It is now one of the hot topics in systems biology and poses new challenges for reconstructing large-scale gene networks.
Figure 2.5: Illustration of the regulatory relationship between genes.
2.2.1. The Boolean Network Model
Originally introduced by Kauffman (Kauffman, 1969, 1974; Kauffman and Glass, 1973), the Boolean network model has recently received much attention from the biology community [28,29]. In this model, each gene exists in only two states, ON or OFF. The state of each gene at the next time step is determined by a Boolean function of the states of several genes at the current time step. A Boolean network is represented as a pair (V, F), where V = {x1, …, xn} denotes the states of the n genes and F = (f1, …, fn) is a list of Boolean functions. Each state xi may take one of two possible values, 1 or 0, corresponding to gene i being expressed or not, respectively. Boolean networks have been shown to be useful for gaining insight into the behavior of large interconnected networks. Even though Boolean networks use a binary representation of a continuous domain, a variety of algorithms have been developed for inferring Boolean networks (Somogyi et al. [27]). Moreover, several extensions such as noisy Boolean networks have been proposed in the work of Akutsu et al. [1,2].
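A minimal sketch of one synchronous update step of a Boolean network in C++ follows; representing each Boolean function fi as a callable on the whole current state vector, and the three-gene example wiring in the comment, are assumptions made purely for illustration.

```cpp
#include <functional>
#include <vector>
#include <cstddef>

// state[i] is the ON/OFF value of gene i; f[i] computes the next value of
// gene i from the current states of all genes (synchronous update).
std::vector<bool> booleanStep(const std::vector<bool>& state,
                              const std::vector<std::function<bool(const std::vector<bool>&)>>& f) {
    std::vector<bool> next(state.size());
    for (std::size_t i = 0; i < state.size(); ++i)
        next[i] = f[i](state);
    return next;
}

// Example wiring (illustrative only): x0' = x1 AND x2, x1' = NOT x0, x2' = x0 OR x1.
```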
Figure 2.6: A simple Boolean Network Model
2.2.2. Probabilistic Boolean Networks
In a Boolean network, the state of the target gene is determined with full certainty from the states of the regulatory genes. However, there is always uncertainty in gene expression data. Built on Boolean networks, probabilistic Boolean networks (PBNs) [28,29] are capable of accommodating these uncertainties both in the data and in model selection. A PBN is obtained by extending the Boolean network to allow more than one Boolean function for each node. We denote by Fi = {fj}, j = 1, …, li, the set of all possible Boolean functions for gene xi, and by li the number of such functions. At a given time point the PBN is characterized by a vector of Boolean functions, whose ith element is selected as the predictor for gene xi at that time point. This vector is in fact a standard Boolean network operating at that time and is called a "realization" of the PBN. Here the main problem is to learn the individual selection probabilities Cj of each element in a realization and the mechanism for computing the probability Pi that the ith Boolean network (realization) is selected. Shmulevich et al. (2002) pointed out that Pi can easily be expressed in terms of the individual selection probabilities Cj.
2.2.3. Bayesian Networks
Based on the assumption of conditional independence, Bayesian networks (Pearl, 1988) are becoming a promising tool for analyzing gene expression patterns, including the reconstruction of gene regulatory networks. A Bayesian network constructed for a set of random variables X is a pair B = (G, Q). The first component, G, is a directed acyclic graph (DAG) in which each vertex corresponds to one of the random variables x1, …, xn and each edge represents a direct dependency between two variables. The graph G satisfies the Markov assumption that each variable xi is independent of its non-descendants given its parents in G. This allows the joint distribution to be factorized according to the conditional independencies encoded in G, which reduces the number of required parameters. That is,

$$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Pa_i)$$

where Pa_i denotes the set of all parents of variable X_i in the graph G.

The second component, Q, is the set of network parameters describing the conditional distribution of each variable given its parents in G. That is,

$$\theta_{ijk} = P(X_i = x_i^k \mid Pa_i = pa_i^j)$$

for each possible state x_i^k of X_i and each configuration pa_i^j of Pa_i.
There are two main problems concerning Bayesian networks: learning the network and inference. The former, learning the Bayesian network, is the process of inferring from the data the topology that may have generated the data, together with the corresponding uncertainty distribution. More specifically, learning a Bayesian network has two phases: the first constructs the network structure as a directed acyclic graph (DAG), whereas the second learns the parameters of that network. Inferring a suitable structure is clearly more important than estimating accurate parameters. The number of Bayesian network topologies grows extremely quickly with the number of variables; in particular, the number of different DAGs over n nodes is given by Robinson's formula [16].
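This count is commonly written as the recurrence (quoted here from the standard result on counting labeled DAGs rather than reproduced from [16] directly):

$$r(n) = \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k}\, 2^{k(n-k)}\, r(n-k), \qquad r(0) = 1$$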
Chickering et al. (1994) showed that determining an optimal Bayesian network structure is NP-hard [8]. Therefore, greedy search strategies such as hill climbing or beam search are often used to generate the topology. The key idea behind hill climbing is to move to the best network among the neighbors of the current network generated by traversal operators. Instead of following only one search direction as in hill climbing, beam search follows several directions, each in turn being a hill-climbing search by itself. The greedy strategies tend to become trapped in local maxima and depend on the step size. Hence, several variations have been proposed to overcome these disadvantages. For example, background knowledge, such as the whole or a partial structure of the gene network or an ordering of the variables, can be incorporated into the learning algorithm. Genetic algorithms are another strategy that can be applied to learning Bayesian network structures [16].
2.2.4. Additive regulation models
The additive regulation models [2] assume that the change in each variable over time is given by a linear weighted sum of all other variables (Figure 2.7), as follows:
Figure 2.7: Additive regulation models
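In its commonly used discrete-time form (quoting the standard additive-model formulation), the model reads:

$$y_i(t+1) = \sum_{j} w_{ji}\, y_j(t) + b_i$$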
where yi is the level of the ith variable, bi is a bias constant indicating whether gene i is expressed in the absence of regulatory inputs, and the weight wji represents the influence of variable j on variable i. For a continuous-time system we obtain the following differential equation:
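In its standard form this reads:

$$\frac{dy_i(t)}{dt} = \sum_{j} w_{ji}\, y_j(t) + b_i$$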
Since most genes exhibit a sigmoidal dose-response curve, i.e., gene activation at first increases slowly, then more rapidly, and finally saturates at a maximum level, the above formula is modified with a sigmoidal transfer function:
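In its standard form this modification reads:

$$\frac{dy_i(t)}{dt} = S\!\Big(\sum_{j} w_{ji}\, y_j(t) + b_i\Big)$$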
where S(.) is the sigmoidal function, e.g.
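the logistic function (a standard choice, assumed here):

$$S(z) = \frac{1}{1 + e^{-z}}$$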
Moreover, the decay rate Di of the product of gene i plays an important role in its regulation, so this factor also needs to be integrated into the model of the gene regulatory network. With this extension, the additive regulatory model becomes:
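in its standard form,

$$\frac{dy_i(t)}{dt} = S\!\Big(\sum_{j} w_{ji}\, y_j(t) + b_i\Big) - D_i\, y_i(t)$$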
Chapter 3
Real data analysis and discussion
3.1. The proposed scheme for gene selection in the sample classification problem
Every genetic disease is caused by sequence variants of many genes; however, among these genes only a few play an important role in causing the disease. For the purpose of identifying disease genes, Lude Franke et al. (2006) developed a functional human gene network that comprises known interactions derived from the Biomolecular Interaction Network Database (BIND), the Human Protein Reference Database (HPRD), Reactome, and the Kyoto Encyclopedia of Genes and Genomes (KEGG). An empirical p-value is then calculated for each gene based on the gene network and used as a score to rank all genes related to the disease type in question.
Starting from the viewpoint of graph-based ranking algorithms, this thesis proposes a method for ranking the genes that cause a specific type of disease, i.e., the special characteristics of samples under a particular condition. It employs graph-based ranking algorithms on gene regulatory networks to rank all genes. Suppose that gene A is regulated by gene B; then gene B is said to play a more important role than gene A. We therefore use the hub score as the importance measure of each gene: the more genes a particular gene regulates, the more important that gene is.
In gene expression analysis, a gene is considered a factor characterizing a particular sample type after its expected expression value in that class is compared with its expected value in the other class. A large difference indicates differential expression, meaning that the gene is a decisive factor. However, a high expected value may also be produced by genes whose individual expression values differ widely, and with the same expected value in different sample classes a gene may be informative for one class but not for the other. Starting from this observation, we focus on genes whose expression values are stable over all samples belonging to a particular class. Stable expression is represented statistically by the variance: the smaller the variance, the more stably the gene is expressed. We then apply hypothesis testing to check whether the variance is less than or equal to a given small threshold θ. The hypothesis is accepted or rejected based on the sample variance S_x^2 computed from a random sample of size n, using the test statistic

$$x = \frac{(n-1)\, S_x^2}{\theta}$$

which follows a chi-square distribution with n - 1 degrees of freedom. From this test we obtain the list of genes whose expression values are stable across all samples of each class. We then take the intersection of this list with the list of top-ranking genes as the decisive factors for the disease type. The remainder of the selected set is filled with high-ranking genes whose correlation coefficient with some gene in the stable set is greater than a predefined threshold (Figure 3.1).
Figure 3.1: High-score genes in association with the stability feature
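A minimal sketch of this stability test in C++ follows; it computes the statistic (n-1)·S_x^2/θ for one gene within one class and compares it to a caller-supplied chi-square critical value, since a chi-square quantile routine is not included in this sketch.

```cpp
#include <vector>
#include <cstddef>

// Returns true if the gene's expression values in x (n >= 2 samples of one
// class) are judged "stable", i.e. the statistic (n-1) * S_x^2 / theta does
// not exceed the chi-square critical value for n-1 degrees of freedom.
bool isStableGene(const std::vector<double>& x, double theta, double chi2Critical) {
    std::size_t n = x.size();
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= n;
    double s2 = 0.0;                        // sample variance S_x^2
    for (double v : x) s2 += (v - mean) * (v - mean);
    s2 /= (n - 1);
    double stat = (n - 1) * s2 / theta;     // ~ chi-square with n-1 d.o.f. under H0
    return stat <= chi2Critical;
}
```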
For the problem of classifying tumor samples using gene expression data, it is obvious that among the large number of genes only a small subset is informative for determining the sample class; hence a gene selection process is needed. Figure 3.2 explains the proposed gene selection scheme. For each sample class ci (e.g., a cancer type), after the graph-based ranking algorithm has been applied, only a predefined number ki of genes with the highest ranking scores are selected as the representative feature set fi of class ci. This is repeated over all sample classes. Afterwards, all the representative feature sets fi are aggregated to generate a final representative feature set. Based on the idea that genes scored as important by the graph-based ranking algorithm for more than one class type are unlikely to be decisive factors for any particular class, the aggregation operation is chosen as the union of the parts of each feature set fi that are not shared with the other sets; this is denoted as the XOR operation in Table 3.1. The UNION operation (Table 3.1), in which the final feature set f is simply the union of all feature sets fi, is also considered.
3.2. Developing Environment
We have developed a program in the C++ language. Missing values are pre-processed simply by replacing each of them with the expected value (mean) of the remaining expression levels in the corresponding gene profile.
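A minimal sketch of this imputation step in C++ follows; missing entries are assumed to be encoded as NaN, which is an assumption about the data representation rather than a detail given in the thesis.

```cpp
#include <cmath>
#include <vector>
#include <cstddef>

// Replace each missing (NaN) entry of a gene profile by the mean of the
// remaining, observed expression levels of that profile.
void imputeMissing(std::vector<double>& profile) {
    double sum = 0.0;
    std::size_t count = 0;
    for (double v : profile)
        if (!std::isnan(v)) { sum += v; ++count; }
    if (count == 0) return;                 // nothing observed; leave the profile as is
    double mean = sum / count;
    for (double& v : profile)
        if (std::isnan(v)) v = mean;
}
```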
It is easy to see that the problem of reconstructing gene regulatory networks is extremely hard and time-consuming, especially when the number of genes reaches the thousands. Many research works have recently been published on building gene regulatory networks efficiently; however, these methods are still restricted to at most a few hundred genes.
Figure 3.2: The proposed method for gene selection in sample classification. For each class ci, a gene network is constructed from all samples belonging to that class and all genes are ranked; the resulting representative feature sets fi (f1, …, fn) are combined by an aggregation operator into the final feature set f.
Owing to this difficulty, a thorough study of reconstructing the gene regulatory network is beyond the scope of this thesis. For simplicity, the thesis builds the gene regulatory network from the measured correlation between pairs of genes. Two strategies are implemented: one that uses the graph-based ranking algorithm alone, and another that takes the stability of gene expression values into account together with the graph-based ranking algorithm. To assess the efficiency of the proposed gene selection scheme, the kNN (k-nearest neighbors) classifier is used.
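A minimal sketch of the correlation-based network construction in C++ follows; the Pearson correlation and the fixed absolute-correlation threshold are illustrative choices, since the thesis does not specify the exact correlation measure or cut-off used.

```cpp
#include <cmath>
#include <vector>
#include <cstddef>

// Pearson correlation between two gene profiles of equal length.
double pearson(const std::vector<double>& a, const std::vector<double>& b) {
    std::size_t n = a.size();
    double ma = 0.0, mb = 0.0;
    for (std::size_t s = 0; s < n; ++s) { ma += a[s]; mb += b[s]; }
    ma /= n; mb /= n;
    double cov = 0.0, va = 0.0, vb = 0.0;
    for (std::size_t s = 0; s < n; ++s) {
        cov += (a[s] - ma) * (b[s] - mb);
        va  += (a[s] - ma) * (a[s] - ma);
        vb  += (b[s] - mb) * (b[s] - mb);
    }
    return cov / std::sqrt(va * vb);
}

// Build an undirected gene network: connect two genes whenever the absolute
// value of their pairwise correlation exceeds `threshold`.
std::vector<std::vector<std::size_t>>
buildCorrelationNetwork(const std::vector<std::vector<double>>& expr, double threshold) {
    std::size_t nGenes = expr.size();
    std::vector<std::vector<std::size_t>> adj(nGenes);
    for (std::size_t i = 0; i < nGenes; ++i)
        for (std::size_t j = i + 1; j < nGenes; ++j)
            if (std::fabs(pearson(expr[i], expr[j])) > threshold) {
                adj[i].push_back(j);
                adj[j].push_back(i);
            }
    return adj;
}
```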
The first dataset is the gene expression data of the yeast Saccharomyces cerevisiae (Microarray Laboratory at Stanford University). It comprises 2,467 genes measured over samples belonging to two class types, the experimental conditions CDC and ELU. The analysis randomly splits the dataset into a training set and a testing set. The training data contain 16 samples, of which 8 belong to the CDC class and 8 to the ELU class. The testing data consist of 13 samples, of which 7 belong to the CDC class and 6 to the ELU class.
The second analysis uses the benchmark gene expression dataset of two subtypes of leukemia, denoted ALB and ALT for B-cells and T-cells respectively. In this dataset, 19 samples belong to class ALB and 8 samples belong to class ALT. The dataset is freely available at cancer/.
3.3. Analysis results
Table 3.1 presents the results of the analysis conducted with the kNN classifier. Each cell value denotes the number of false-positive misclassifications occurring in the test of the corresponding row when the kNN classifier uses the value of k given in the corresponding column. The table shows that the proposed gene selection scheme yields fewer false-positive misclassifications when k is greater than 10, especially when k equals 15. Moreover, combining the stability of gene expression values with the graph-based ranking algorithm gives even better results than the variant without the stability computation.
Table 3.1: Experimental results of the kNN classifier combined with gene selection
A good gene selection scheme is especially valuable for cDNA microarray datasets, where the number of genes usually exceeds one thousand, because almost all classification algorithms require a great amount of computation when processing data objects with a large number of variables (genes). As a result of the proposed scheme (Table 3.1), the number of genes is reduced significantly, to hundreds of genes compared with the original set of more than one thousand. Table 3.1 also shows that even when the number of genes is below 1,000, the classification accuracy in some cases is still equivalent to or better than that obtained without gene selection.
The scheme described in Figure 3.2 is implemented on an undirected network of genes. This structure clearly does not fully reflect a regulatory gene network, in which all edges are directed. This is probably the main reason why the proposed scheme (Table 3.1) does not always improve the result but does so only in some cases. Nevertheless, the proposed scheme remains meaningful in the literature.
Apart from reducing the computational burden by decreasing the number of genes (Golub, T.R., 1999), the proposed method also makes sense from a biological point of view. The method can be applied to find important genes causing a particular disease type. Because the graph-based algorithms consider global information, the method is especially well suited to cases where the expression level of a particular gene depends on the occurrences of all other genes. The obtained gene scores are therefore meaningful to a certain extent: genes with the highest scores may be considered key factors leading to the development of the cancer type in the studied samples. This can clearly help medical researchers find appropriate methods to best prevent the effects of the diseases.
The analysis ranked 999 genes in each of the two gene expression datasets of the two leukemia subtypes, ALB and ALT for B-cells and T-cells respectively, and identified the 10 genes with the highest hub ranking scores in each. Within each set of 10 genes there is one gene that has been observed to play an important role in causing leukemia [36]: cytoplasmic dynein light chain 1, accession number U32944, for the ALB subtype, and NADPH-flavin reductase, accession number D26308, for the ALT subtype (Figure 3.1). Although the characterization of the gene network is still limited, one gene known to cause leukemia was recovered among the top ten genes with the highest hub ranking scores in each subtype. This shows the benefit of the proposed method in discovering the important genes causing the cancer type.
Figure 3.1: The ten genes with the highest ranking scores obtained from the experiments
Conclusion and Future Works
In summary, this thesis has introduced the concepts related to cDNA microarray technology, an interesting field of study that offers many challenges for long-term research. cDNA microarray technology has been shown to have many applications in the life sciences, especially in cancer research.

Moreover, the thesis has proposed a new approach to identifying genes that play an important role in causing cancers using gene expression data, and a new scheme for the gene selection process has been introduced. The efficiency and biological relevance of the method were tested and discussed in Chapter 3. The gene selection scheme is useful not only for classifying tumor samples but also for further processing by other machine learning techniques such as clustering and pattern or model discovery, since the number of genes remaining after the gene selection process is significantly reduced.
Due to time limitations, only the HITS algorithm has been implemented to evaluate the proposed method; the PageRank algorithm should be deployed in future work. With its undirected-graph characteristic, the current gene network is similar to a social network, so strategies developed for social networks could also be employed.
Moreover, the analysis results are compared only with those obtained from the kNN classifier; this can certainly be extended to other classifiers such as naïve Bayes, SVM, and decision trees. Finally, as technology to measure protein abundance matures, the proposed method can surely be applied to analyze protein expression data as well.
REFERENCES
[1]. Akutsu, T., Miyano, S., and Kuhara, S., Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. In Altman et al. 16, pp. 17-28.
[2]. Akutsu, T., Miyano, S., and Kuhara, S. Algorithms for inferring qualitative
models of biological networks. Proc. Pacific Symposium on Biocomputing,
290-301(2000)
[3]. Andrew D Keller, Michel Schummer, Walter L Ruzzo, Lee Hood, "Bayesian
classification of DNA array expression data", 08-01 (Computer Science and
Engineering, Univ Washington, Aug 2000).
[4]. Baeza-Yates, R., Boldi, P. and Castillo, C., Generalizing PageRank: Damping
functions for link-based ranking algorithms. In Proceedings of SIGIR, Seattle,
Washington, USA, ACM Press, August 2006.
[6]. Bengtsson, H., Introduction to cDNA microarray analysis, Lund University, Sweden.
[7]. Brin, S., Page, L., The Anatomy of a Large-scale Hypertextual Web Search
Engine, Proceedings 7th WWW Conference, 107–117 (1998).
[8]. Chickering, D. M., Geiger, D., and Heckerman, D.. Learning Bayesian
networks is NP-hard. Technical Report MSR-TR-94-17, Microsoft Research,
1994.
[9]. Chow, M.L., Moler, E.J., and Mian, I.S. Identifying marker genes in
transcription profiles data using a mixture of feature relevance experts.
Physiol. Genomics, 5: 99-111(2001)
[10]. Crammer, K., and Singer, Y., A New Family of Online Algorithms for
Category Ranking, Proceedings of the 25rd Conference on Research and
Development in Information Retrieval (SIGIR), 151-158 (2002). Tampere,
Finland.
[11]. Crammer, K. and Singer, Y., PRanking with Ranking, Proceedings of the
Fourteenth Annual Conference on Neural Information Processing Systems
(NIPS), 641-647 (2001)
[12]. Dang Thanh Hai, Nguyen Thu Trang, Ha Quang Thuy, Graph of Concepts
Based Text Summarization, The 9th National Conference on Information
Technology of Vietnam, 6/2006
[13]. Dang Thanh Hai, Nguyen Huong Giang, Ha Quang Thuy, Naive Bayes text
classification algorithm and problem of specifying clasifying threshold in
search engine, Journal of Computer Science And Cybernetics 21(2):, 152-161
(2005)
[14]. Dudoit, S. Fridlyand, J.& Speed, T. P. Comparison of Discrimination
Methods for the Classification of Tumors Using Gene Expression Data J. Am.
Stat. Assoc. 97: 77–87(2002).
[15]. Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. An efficient boosting
algorithm for combining preferences. Journal of Machine Learning Research
4: 933–969(2003).
[16]. Friedman, N. and Koller, D. Being bayesian about network structure, in
C.Boutilier and M.Godszmidt (eds), Uncertainty in Articial Intelligence,
Morgan Kaufmann Publishers, 201-210 (2000).
[17]. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J., Caligiuri, M., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286: 531–537 (1999).
[18]. Hollmén,J., Tresp, V. and Simula, O., A learning vector quantization
algorithm for probabilistic models. In Proceedings of EUSIPCO 2000 - X
European Signal Processing Conference ,Volume II, 721 –724.
[19].
[20].
[21]. Kleinberg, J., Authoritative Sources in a Hyperlinked Environment, Journal
of the ACM 46 (5): 604–632 (1999).
[22]. Kleinberg, J., Kumar, R., Raghavan, P., Rajagopalan, S., Tomkins, A., The
Web as a Graph: Measurements, Models, and Methods, Proceedings 5th
COCOON Conference, 1–17 (1999).
[23]. Lempel, R., Moran, S., SALSA: the Stochastic Approach for Link-structure
Analysis, ACM Transactions on Information Systems 19 (2) 131–160 (2001).
[24]. Mitchell, T., Machine Learning, McGraw Hill, 1997.
[25]. Page, L., Brin, S., Motwani, R. and Winograd, T., The PageRank citation
ranking: bringing order to the Web. Tech. report, Stanford University, 1998.
[26]. Raychaudhuri, S., Stuart JM, Altman RB, Principal components analysis to
summarize microarray experiments: application to sporulation time series.
Pac. Symp. Biocomput., 455–466 (2000).
[27]. Somogyi, R., Fuhrman, S., Askenazi, M., and Wuensche, A. The gene
expression matrix: towards the extraction of genetic network architectures.
Proceedings of the Second World Congress of Nonlinear Analysts (WCNA96),
vol. 30 of Nonlinear Analysis, Pergamon Press (1996).
[28]. Shmulevich, E.R. Dougherty, and W. Zhang, From Boolean to probabilistic
Boolean networks as models of genetic regulatory networks, Proceedings of
the IEEE, 90 (11): 1778-1792 (2002).
[29]. Shmulevich, E., Dougherty, R., Kim, S., Zhang, W., Probabilistic Boolean
Networks: A Rule-based Uncertainty Model for Gene Regulatory Networks,
Bioinformatics, 18 (2): 261-274 (2002).
[30]. Sidiropoulos A., Manolopoulos Y., Generalized Comparison of Graph-based
Ranking Algorithms for Publications and Authors, Journal for Systems and
Software, 79 (12): 1679-1700 (2006)
[31]. Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T.,Tibshirani,
R., Botstein, D., and Altman, R.B., Missing value estimation methods for
DNA microarrays, Bioinformatics, 17(6): 520-525 (2001)
[32]. van Dijk, S., Thierens, D. and van der Gaag, L. C. Building a GA from design
principles for learning Bayesian networks, Proceedings of the Genetic and
Evolutionary Computation Conference, volume 2723 of Lecture Notes in
Computer Science, 2003.
[33]. von Haeseler Arndt, Introduction to Bioinformatics Course, Lecture Notes,
Hanoi, 10-2005.
[34]. William W. Cohen, Robert E. Schapire, Yoram Singer, Learning to Order
Things. Advances in Neural Information Processing Systems, 1999
[35]. Werner Dubitzky, Martin Granzow, C. Stephen Downes, Daniel Berrar, "Practical Approach to Microarray Analysis", Oxford University Press, 1999.
[36]. Zhong Guan and Hongyu Zhao, A semiparametric approach for marker gene
selection based on gene expression data, Bioinformatics 21(4):529-536 (2005)