Table of Contents
Introduction
Chapter 1. The Problem of Modeling Text Corpora and Hidden Topic Analysis
1.1. Introduction
1.2. The Early Methods
1.2.1. Latent Semantic Analysis
1.2.2. Probabilistic Latent Semantic Analysis
1.3. Latent Dirichlet Allocation
1.3.1. Generative Model in LDA
1.3.2. Likelihood
1.3.3. Parameter Estimation and Inference via Gibbs Sampling
1.3.4. Applications
1.4. Summary
Chapter 2. Frameworks of Learning with Hidden Topics
2.1. Learning with External Resources: Related Works
2.2. General Learning Frameworks
2.2.1. Frameworks for Learning with Hidden Topics
2.2.2. Large-Scale Web Collections as Universal Dataset
2.3. Advantages of the Frameworks
2.4. Summary
Chapter 3. Topics Analysis of Large-Scale Web Dataset
3.1. Some Characteristics of Vietnamese
3.1.1. Sound
3.1.2. Syllable Structure
3.1.3. Vietnamese Word
3.2. Preprocessing and Transformation
3.2.1. Sentence Segmentation
3.2.2. Sentence Tokenization
3.2.3. Word Segmentation
3.2.4. Filters
3.2.5. Remove Non Topic-Oriented Words
3.3. Topic Analysis for VnExpress Dataset
3.4. Topic Analysis for Vietnamese Wikipedia Dataset
3.5. Discussion
3.6. Summary
Chapter 4. Deployments of General Frameworks
4.1. Classification with Hidden Topics
4.1.1. Classification Method
4.1.2. Experiments
4.2. Clustering with Hidden Topics
4.2.1. Clustering Method
4.2.2. Experiments
4.3. Summary
Conclusion
Achievements throughout the thesis
Future Works
References
Vietnamese References
English References
Appendix: Some Clustering Results
                
              
                                            
                                
            
 
            
                
President) 0.0084 
Đất nước (Country) 0.0070 
Quyền lực (Power) 0.0069 
Dân chủ (Democratic) 0.0068 
Chính quyền (Government) 0.0067
Ủng hộ (Support) 0.0065 
Chế độ (System) 0.0063 
Kiểm soát (Control) 0.0058 
Lãnh thổ (Territory) 0.0058 
Liên bang (Federal) 0.0051
Động vật (Animal) 0.0220 
Chim (Bird) 0.0146 
Lớp (Class) 0.0123 
Cá sấu (Crocodiles) 0.0116 
Côn trùng (Insect) 0.0113 
Trứng (Eggs) 0.0093 
Cánh (Wing) 0.0092 
Vây (Fin) 0.0077 
Xương (Bone) 0.0075 
Phân loại (Classify) 0.0054 
Môi trường (Environment) 0.0049
Xương sống (Spine) 0.0049
Topic 8
Nguyên tố (Element) 0.0383
Nguyên tử (Atom) 0.0174
Hợp chất (Compound) 0.0172
Hóa học (Chemical) 0.0154
Đồng vị (Isotope) 0.0149
Kim loại (Metal) 0.0148
Hidro (Hydrogen) 0.0142
Phản ứng (Reaction) 0.0123
Phóng xạ (Radioactivity) 0.0092
Tuần hoàn (Circulation) 0.0086
Hạt nhân (Nuclear) 0.0078
Điện tử (Electronics) 0.0076

Topic 9
Trang (page) 0.0490
Web (Web) 0.0189
Google (Google) 0.0143
Thông tin (information) 0.0113
Quảng cáo (advertisement) 0.0065
Người dùng (user) 0.0058
Yahoo (Yahoo) 0.0054
Internet (Internet) 0.0051
Cơ sở dữ liệu (database) 0.0044
Rss (RSS) 0.0041
HTML (html) 0.0039
Dữ liệu (data) 0.0038

Topic 17
Lực (Force) 0.0487
Chuyển động (Move) 0.0323
Định luật (Law) 0.0289
Khối lượng (Mass) 0.0203
Quy chiếu (Reference) 0.0180
Vận tốc (Velocity) 0.0179
Quán tính (Inertia) 0.0173
Vật thể (Object) 0.0165
Newton (Newton) 0.0150
Cơ học (Mechanics) 0.0149
Hấp dẫn (Attractive) 0.0121
Tác động (Influence) 0.0114
3.5. Discussion 
The hidden topic analysis using LDA has shown satisfactory results for both the VnExpress and the Vietnamese Wikipedia datasets. While the VnExpress dataset is more suitable for analyzing daily-life topics, the Vietnamese Wikipedia dataset is better for modeling scientific topics. Which one suits a given task depends largely on the domain of application.
The experiments show that the number of topics should be appropriate to the nature of the dataset and the domain of application. If we choose a large number of topics, the analysis can generate many topics that are semantically too close to each other. On the other hand, if we choose a small number of topics, the resulting topics can be too general. Hence, the learning process benefits less from this topic information.
When conducting topic analysis, one should examine the data very carefully. Preprocessing and transformation are important steps because noise words can have negative effects. For Vietnamese, attention should be paid to word segmentation and stop-word filtering. Common Vietnamese personal names should also be removed. In some cases, it is necessary either to remove all Vietnamese sentences written without tone marks (a writing style that is quite common in online Vietnamese data) or to recover the tones for them. Vietnamese language identification, encoding conversion, and similar steps should also be considered because of the wide variety of online data.
3.6. Summary 
This chapter summarized the major issues in topic analysis of two specific Vietnamese datasets. We first reviewed some characteristics of Vietnamese, which matter for dataset preprocessing and transformation in the subsequent steps. We then described each step of preprocessing and transforming the data, highlighting important notes, including those specific to Vietnamese. In the last part, we presented the results of topic analysis using LDA on some Vietnamese datasets. The results show that LDA is a promising method for topic analysis in Vietnamese.
Chapter 4. Deployments of General Frameworks 
This chapter describes in detail how the general frameworks are deployed for two tasks: classification and clustering of Vietnamese Web search results. The evaluation and analysis of our proposals are also presented in the following sections.
4.1. Classification with Hidden Topics 
4.1.1. Classification Method 
Figure 4.1. Classification with VnExpress topics 
The objective of classification is to automatically categorize newly arriving documents into one of k classes. Given a moderate-sized training dataset, an estimated topic model, and k classes, we would like to build a classifier based on the framework in Figure 4.1. Here, we use the model estimated from the VnExpress dataset with LDA (see Section 3.3 for more details). The following subsections discuss the important issues of this deployment.
a. Data Description 
For training and testing data, we first submit queries to Google and get the results through the Google API [19]. The numbers of query phrases and snippets in the training and testing datasets are shown in Table 4.1. Note that the search phrases for the training and testing data are designed to be as exclusive as possible.
b. Combining Data with Hidden Topics 
The outputs of topic inference for training/new data are topic distributions, each of which corresponds to one snippet. We now have to combine each snippet with its hidden topics. This can be done by a simple procedure in which the number of occurrences of a topic in the combination depends on its probability. For example, a topic with probability greater than 0.03 and less than 0.05 has 2 occurrences, while a topic with probability less than 0.01 is not included in the combination. An example is shown in Figure 4.2.
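The combination procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text fixes two bins (below 0.01 gives no occurrence, between 0.03 and 0.05 gives 2 occurrences), while the remaining bin edges, the `topic_<id>` pseudo-word format, and the function names are hypothetical.

```python
from bisect import bisect_right

# Assumed bin edges. Only "below 0.01 -> 0 occurrences" and
# "between 0.03 and 0.05 -> 2 occurrences" come from the text;
# the 1- and 3-occurrence bins are hypothetical.
BIN_EDGES = [0.01, 0.03, 0.05]


def topic_occurrences(prob, edges=BIN_EDGES):
    """Number of times a topic is repeated, given its probability."""
    return bisect_right(edges, prob)


def combine(snippet_words, topic_dist):
    """Append topic pseudo-words (e.g. 'topic_1') to the words of a snippet."""
    combined = list(snippet_words)
    for topic_id, prob in enumerate(topic_dist):
        combined.extend([f"topic_{topic_id}"] * topic_occurrences(prob))
    return combined
```

With this sketch, a snippet whose inferred distribution gives topic 1 a probability of 0.04 receives the pseudo-word `topic_1` twice, so the classifier sees topics as ordinary (weighted) words.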
Table 4.1. Google search results as training and testing datasets.
The search phrases for training and testing data are designed to be exclusive.
                      Training dataset        Testing dataset
Domains               #phrases  #snippets     #phrases  #snippets
Business                    50      1,479            9        270
Culture-Arts                49      1,350           10        285
Health                      45      1,311            8        240
Laws                        52      1,558           10        300
Politics                    32        957            9        270
Science-Education           41      1,229            9        259
Life-Society                19        552            8        240
Sports                      45      1,267            9        223
Technologies                51      1,482            9        270
c. Maximum Entropy Classifier 
The motivating idea behind maximum entropy [34][35] is that one should prefer the most uniform model that also satisfies any given constraints. For example, consider a four-class text classification task where we are told only that, on average, 40% of the documents containing the word “professor” are in the faculty class. Intuitively, when given a document with “professor” in it, we would say it has a 40% chance of being a faculty document and a 20% chance for each of the other three classes. If a document does not contain “professor”, we would guess the uniform class distribution, 25% each. This model is exactly the maximum entropy model that conforms to our known constraint.
Although maximum entropy can be used to estimate any probability distribution, we consider only the classification task here; thus we limit the problem to learning conditional distributions from labeled training data. Specifically, we would like to learn the conditional distribution of the class label given a document.
Figure 4.2 Combination of one snippet with its topics: an example 
Constraints and Features 
In maximum entropy, training data is used to set constraints on the conditional distribution. Each constraint expresses a characteristic of the training data that should also be present in the learned distribution. Any real-valued function of the document and the class can be a feature: $f_i(d, c)$. Maximum entropy enables us to restrict the model distribution to have the same expected value for this feature as seen in the training data, $D$. Thus, we stipulate that the learned conditional distribution $P(c \mid d)$ (here, c stands for class, and d represents document) must have the following form:
$$P(c \mid d) = \frac{1}{Z(d)} \exp\left(\sum_i \lambda_i f_i(d, c)\right) \qquad (4.1)$$
where each $f_i(d, c)$ is a feature, $\lambda_i$ is a parameter to be estimated, and $Z(d)$ is simply the normalizing factor that ensures a proper probability:
$$Z(d) = \sum_c \exp\left(\sum_i \lambda_i f_i(d, c)\right)$$
There are several methods for estimating a maximum entropy model from training data, such as IIS (improved iterative scaling), GIS (generalized iterative scaling), L-BFGS, and so forth.
Maximum Entropy for Classification 
In order to apply maximum entropy, we need to select a set of features. For this work, we use the words in documents as our features. More specifically, for each word-class combination, we instantiate a feature as:
$$f_{w,c'}(d, c) = \begin{cases} 1 & \text{if } c = c' \text{ and } d \text{ contains } w \\ 0 & \text{otherwise} \end{cases}$$
Here, c' is a class, w is a specific word, and d is the current document. This feature checks whether “document d contains the word w and belongs to the class c'”. The predicate stating that “document d contains the word w” is called the “context predicate” of the feature.
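To make the role of these binary features concrete, the following Python sketch evaluates the conditional distribution of Eq. 4.1 for word-class features. It is not the JMaxent implementation: the function name, the dictionary layout of the weights, and the convention that missing weights are zero are assumptions made for illustration, and training (estimating the weights) is not shown.

```python
import math


def maxent_probs(doc_words, classes, weights):
    """Return P(c | d) for a maxent model with binary word-class features.

    weights maps (word, class) -> lambda; pairs that are absent count as 0.
    Because f_{w,c'}(d, c) is 1 only when c == c' and d contains w, the
    exponent for class c is the sum of lambda_{w,c} over the distinct words of d.
    """
    words = set(doc_words)
    scores = {c: sum(weights.get((w, c), 0.0) for w in words) for c in classes}
    z = sum(math.exp(s) for s in scores.values())  # normalizing factor Z(d)
    return {c: math.exp(s) / z for c, s in scores.items()}
```

With no active features the distribution is uniform, which matches the intuition that maximum entropy prefers the most uniform model consistent with the constraints.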
4.1.2. Experiments 
a. Experimental Settings 
All the experiments are based on the hidden topic analysis with LDA described in the previous chapter. We conduct several experiments: one for learning without hidden topics and the others for learning with different topic models of the VnExpress dataset, which are generated by topic analysis with 60, 80, 100, 120, 140, and 160 topics.
To train the maximum entropy classifier, we use JMaxent [39] and set the context predicate and feature thresholds to zero; the other parameters are left at their defaults.
b. Evaluation 
Traditional classification uses precision, recall, and the F-1 measure to evaluate the performance of a system. These measures are defined below.
The precision of a classifier with respect to a class is the number of examples correctly categorized into that class divided by the number of examples classified into that class:
$$\text{Precision}_c = \frac{\#\,\text{examples correctly classified as class } c}{\#\,\text{examples classified as class } c}$$
The recall of a classifier with respect to a class is the number of examples correctly categorized into that class divided by the number of examples that belong to that class (by human assignment):
$$\text{Recall}_c = \frac{\#\,\text{examples correctly classified as class } c}{\#\,\text{all examples belonging to class } c}$$
The performance of a classifier is usually measured by the F-1 measure, which is the harmonic mean of precision and recall:
$$\text{F-1}_c = \frac{2 \times \text{Precision}_c \times \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}$$
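As a quick check of these definitions, the following minimal Python sketch recomputes the Business row of Table 4.2 from its raw counts (Human = 270, Model = 347, Match = 203). The function and argument names are illustrative and are not taken from the thesis code.

```python
def precision_recall_f1(n_human, n_model, n_match):
    """Per-class precision, recall and F-1 from raw counts.

    n_human: examples that belong to the class (human assignment)
    n_model: examples the classifier assigned to the class
    n_match: examples correctly assigned to the class
    """
    precision = n_match / n_model
    recall = n_match / n_human
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = precision_recall_f1(n_human=270, n_model=347, n_match=203)
print(f"{100 * p:.2f} {100 * r:.2f} {100 * f1:.2f}")  # 58.50 75.19 65.80
```

The sketch assumes that all three counts are positive; a class with no predicted or no matching examples would need an explicit guard against division by zero.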
c. Experimental Results and Discussion 
[Figure 4.3: chart of Precision, Recall, and F1-measure (vertical axis from 62 to 74) for seven settings: without topics, and with 60, 80, 100, 120, 140, and 160 topics.]
Figure 4.3. Learning with different topic models of VnExpress dataset; and the baseline (without topics)
[Figure 4.4: chart of F1 measure (%) against the size of labeled training data (x1000 examples, from 1.35 to 11.13) for two series: with hidden topic inference, and baseline (without topics).]
Figure 4.4. Test-out-of-train with increasing numbers of training examples. Here, the number of topics is set at 60.
Table 4.2. Experimental results of baseline (learning without topics)
Class          Human   Model   Match    Pre.    Rec.  F1-score
Business         270     347     203   58.50   75.19     65.80
Culture-Arts     285     260     183   70.38   64.21     67.16
Health           240     275     179   65.09   74.58     69.51
Laws             300     374     246   65.78   82.00     73.00
Politics         270     244     192   78.69   71.11     74.71
Science          259     187     121   64.71   46.72     54.26
Society          240     155     106   68.39   44.17     53.67
Sports           223     230     175   76.09   78.48     77.26
Technologies     270     285     164   57.54   60.74     59.10
Avg.1                                  67.24   66.35     66.79
Avg.2           2357    2357    1569   66.57   66.57     66.57

Table 4.3. Experimental results of learning with 60 topics of VnExpress dataset
Class          Human   Model   Match    Pre.    Rec.  F1-score
Business         270     275     197   71.64   72.96     72.29
Culture-Arts     285     340     227   66.76   79.65     72.64
Health           240     256     186   72.66   77.50     75.00
Laws             300     386     252   65.28   84.00     73.47
Politics         270     242     206   85.12   76.30     80.47
Science          259     274     177   64.60   68.34     66.42
Society          240     124      97   78.23   40.42     53.30
Sports           223     205     173   84.39   77.58     80.84
Technologies     270     255     180   70.59   66.67     68.57
Avg.1                                  73.25   71.49     72.36
Avg.2           2357    2357    1695   71.91   71.91     71.91
[Figure 4.5: bar chart of F1-measure (vertical axis from 50 to 100) for each class (Business, Culture-Arts, Health, Laws, Politics, Science, Society, Sports, Technologies) and on average, for two series: without topics, and with hidden topics inference.]
Figure 4.5. F1-Measure for classes and average (over all classes) in learning with 60 topics
Figure 4.3 shows the results of learning with different settings (without topics, and with 60, 80, 100, 120, 140, and 160 topics), among which learning with 60 topics achieved the highest F-1 measure (71.91%, compared with 66.57% for the baseline; see Table 4.2 and Table 4.3). When the number of topics increases, the F-1 measure varies around 70-71% (learning with 100, 120, 140 topics). This shows that learning with hidden topics improves the performance of the classifier for every number of topics tested.
Figure 4.4 depicts the results of learning with 60 topics and different numbers of training examples. Because the testing and training datasets are relatively exclusive, the performance does not always improve when the training size increases. In all cases, the results for learning with topics are better than those for learning without topics. Even with a small training dataset (1300 examples), the F-1 measure of learning with topics is quite good (70.68%). Also, the variation of the F-1 measure in the experiments with topics (2%, from 70% to 72%) is smaller than in those without topics (8%, from 62% to 66%). From these observations, we see that our method is effective even with little training data.
4.2. Clustering with Hidden Topics 
4.2.1. Clustering Method 
Figure 4.6. Clustering with Hidden Topics from VnExpress and Wikipedia data 
Web search clustering is a solution for reorganizing search results in a way that is more convenient for users. For example, when a user submits the query “jaguar” to Google and wants search results related to “big cats”, s/he has to go to the 10th, 11th, 32nd, and 71st results. If there is a group named “big cats”, the four relevant results can be ranked high in the corresponding list. Among previous works, the most notable and successful clustering system is Vivisimo [49], whose techniques have not been disclosed. This section considers the deployment issues of clustering Web search results with hidden topics in Vietnamese.
a. Topic Inference and Similarity 
For each snippet, topic inference gives us the probability distribution of topics over the snippet. From that topic distribution, we construct the topic vector of the snippet as follows: the weight of a topic is set to zero if its probability is less than a predefined “cut-off threshold”, and to its probability otherwise. Suppose that the weights of words in the term vector of the snippet have been normalized in some way (tf, tf-idf, etc.); the combined vector corresponding to the i-th snippet then has the following form:
$$d_i = \{t_1, t_2, \dots, t_K, w_1, \dots, w_{|V|}\} \qquad (4.2)$$
Here, $t_k$ is the weight of the k-th topic among the K analyzed topics (K is a constant parameter of LDA), and $w_t$ is the weight of the t-th word/term in the vocabulary V of all snippets.
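The construction of the combined vector in Eq. 4.2 can be sketched as follows. The cut-off value 0.01 and the TF weighting are the settings reported in Table 4.6; the function names and the use of plain Python lists are assumptions made for illustration.

```python
def topic_part(topic_dist, cutoff=0.01):
    """Keep a topic's probability as its weight, or 0 if it is below the cut-off."""
    return [p if p >= cutoff else 0.0 for p in topic_dist]


def word_part(snippet_words, vocabulary):
    """Raw term frequency of each vocabulary term in the snippet (TF weighting)."""
    return [snippet_words.count(term) for term in vocabulary]


def combined_vector(topic_dist, snippet_words, vocabulary, cutoff=0.01):
    """Concatenate the K topic weights and the |V| word weights of one snippet."""
    return topic_part(topic_dist, cutoff) + word_part(snippet_words, vocabulary)
```

Keeping the two parts in one vector, with the first K positions reserved for topics, makes it easy to compare topic-parts and word-parts separately later on.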
Next, for the i-th and j-th snippets, we use the cosine similarity to measure the similarity between the topic-parts as well as between the term-parts of the two vectors:
$$sim_{i,j}(\text{topic-parts}) = \frac{\sum_{k=1}^{K} t_{i,k}\, t_{j,k}}{\sqrt{\sum_{k=1}^{K} t_{i,k}^2}\ \sqrt{\sum_{k=1}^{K} t_{j,k}^2}}$$
$$sim_{i,j}(\text{word-parts}) = \frac{\sum_{t=1}^{|V|} w_{i,t}\, w_{j,t}}{\sqrt{\sum_{t=1}^{|V|} w_{i,t}^2}\ \sqrt{\sum_{t=1}^{|V|} w_{j,t}^2}}$$
We then propose the following combination to measure the final similarity between them:
$$sim(d_i, d_j) = \lambda \times sim(\text{topic-parts}) + (1 - \lambda) \times sim(\text{word-parts}) \qquad (4.3)$$
Here, λ is a mixture constant. If λ = 0, we calculate the similarity without the support of hidden topics. If λ = 1, we measure the similarity between two snippets from their hidden topic distributions alone, without considering the words in the snippets.
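A minimal Python sketch of the combined similarity of Eq. 4.3 is given below. The default λ = 0.4 is the value reported in Table 4.6; the function names and the convention of returning 0 when one of the vectors is all zeros are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def snippet_similarity(topics_i, words_i, topics_j, words_j, lam=0.4):
    """Mix topic-part and word-part cosine similarities with weight lam."""
    return lam * cosine(topics_i, topics_j) + (1 - lam) * cosine(words_i, words_j)
```

For two snippets that share no words but have identical topic vectors, the sketch returns λ, which is exactly the contribution that hidden topics add over a purely word-based similarity.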
b. Agglomerative Hierarchical Clustering 
Hierarchical clustering [48] builds (agglomerative) or breaks up (divisive) a hierarchy of clusters. The traditional representation of this hierarchy is a tree called a dendrogram. Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters.
Cutting the tree at a given height gives a clustering at a selected precision. In the example in Figure 4.7, cutting after the second row generates the clusters {a} {b c} {d e} {f}. Cutting after the third row yields the clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.
The method builds the hierarchy from the individual elements by progressively merging clusters. In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge into a cluster. Usually, we want to take the two closest elements, according to the chosen similarity.
Optionally, one can construct a similarity matrix at this stage, where the number in the i-th row and j-th column is the similarity/distance between the i-th and j-th elements. Then, as clustering progresses, rows and columns are merged as the clusters are merged, and the similarities are updated. This is a common way to implement this type of clustering, and it has the benefit of caching the distances between clusters.
Figure 4.7. Dendrogram in Agglomerative Hierarchical Clustering 
Suppose that we have merged the two closest elements b and c. We now have the clusters {a}, {b, c}, {d}, {e}, and {f}, and want to merge them further. To do that, we need to measure the similarity/distance between {a} and {b c}, or more generally the similarity between two clusters. Usually, the similarity between two clusters A and B is calculated as one of the following:
- The minimum similarity between elements of each cluster (also called complete linkage clustering):
$$\min \{\, sim(x, y) : x \in A,\ y \in B \,\}$$
- The maximum similarity between elements of each cluster (also called single linkage clustering):
$$\max \{\, sim(x, y) : x \in A,\ y \in B \,\}$$
- The mean similarity between elements of each cluster (also called average linkage clustering):
$$\frac{1}{|A|\,|B|} \sum_{x \in A} \sum_{y \in B} sim(x, y)$$
Each agglomeration occurs at a smaller similarity between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (similarity criterion) or when there is a sufficiently small number of clusters (number criterion).
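The agglomerative procedure with average linkage and the similarity criterion can be sketched as follows. This is a deliberately naive version that recomputes all pairwise cluster similarities at every step instead of caching them in a similarity matrix; the function names and the `merge_threshold` argument name are illustrative (the experiments below call the corresponding parameter mergeThreshold).

```python
def average_linkage(a, b, sim):
    """Mean pairwise similarity between the elements of clusters a and b."""
    return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))


def hac(items, sim, merge_threshold):
    """Merge the two most similar clusters until none reaches merge_threshold."""
    clusters = [[x] for x in items]  # start with one cluster per element
    while len(clusters) > 1:
        best_sim, best_pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = average_linkage(clusters[i], clusters[j], sim)
                if best_sim is None or s > best_sim:
                    best_sim, best_pair = s, (i, j)
        if best_sim < merge_threshold:
            break  # similarity criterion: remaining clusters are too far apart
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is still valid
    return clusters
```

For example, with numbers as elements and `sim = lambda x, y: 1 / (1 + abs(x - y))`, calling `hac([0, 1, 10, 11], sim, 0.3)` merges 0 with 1 and 10 with 11, then stops because the two remaining clusters are less similar than the threshold.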
c. Labeling Clusters
Given a set of clusters for a text collection, our goal is to generate understandable semantic labels for each cluster. We state the problem of cluster labeling similarly to the “topic labeling problem” [27] as follows:
Definition 1: A cluster $c$ ($c \in C$) in a text collection has a set of “close” snippets (here, we consider snippets to be small documents). Each cluster is characterized by an “expected topic distribution” $\theta_c$, which is the average of the topic distributions of all snippets in the cluster.
Definition 2: A “cluster label” or “label” $l$ for a cluster $c \in C$ is a sequence of words which is semantically meaningful and covers the latent meaning of $\theta_c$. Words, phrases, and sentences are all valid labels under this definition.
Definition 3 (Relevance Score): The relevance score of a label to a cluster $\theta_c$, $s(l, \theta_c)$, measures the semantic similarity between the label and the topic model. Given that both $l_1$ and $l_2$ are meaningful candidate labels, $l_1$ is a better label for $c$ than $l_2$ if $s(l_1, \theta_c) > s(l_2, \theta_c)$.
With these definitions, the problem of cluster labeling can be defined as follows: Let $C = \{c_1, c_2, \dots, c_N\}$ be a set of N clusters, and $L_i = \{l_{i,1}, l_{i,2}, \dots, l_{i,s}\}$ be the set of candidate cluster labels for cluster number i in C. Our goal is to select the most likely label for each cluster.
Candidate Label Generation 
Candidate label generation is the first phase of cluster labeling. In this work, we generate candidates based on “Ngram Testing”, which extracts meaningful phrases from word n-grams based on statistical tests. There are many methods for testing whether an n-gram is a meaningful collocation/phrase or whether its words just co-occur by chance. Some methods depend on statistical measures such as mutual information. Others rely on hypothesis testing techniques. The null hypothesis usually assumes that “the words in an n-gram are independent”, and different test statistics have been proposed to test the significance of violating the null hypothesis.
For the experiments, we use n-gram hypothesis testing (n <= 2) based on the chi-square test [11] to find meaningful phrases. In other words, there are two types of label candidates: (1) non-stop words (1-grams); and (2) phrases of 2 consecutive words (2-grams) whose chi-square value, calculated from a large collection of text, is greater than a threshold, the “colocThreshold”.
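The 2-gram test can be illustrated with the usual chi-square statistic for a 2x2 contingency table of bigram counts. The thesis does not spell out the exact formula it uses, so the formula below is an assumption (the standard one for collocation testing), and the function name and the token-list input are hypothetical.

```python
from collections import Counter


def bigram_chi_square(tokens):
    """Chi-square statistic of each adjacent word pair in a token sequence.

    For a pair (w1, w2) the 2x2 table is:
        o11: w1 followed by w2        o12: w1 followed by another word
        o21: another word before w2   o22: neither w1 first nor w2 second
    """
    bigrams = list(zip(tokens, tokens[1:]))
    n = len(bigrams)
    pair_count = Counter(bigrams)
    first_count = Counter(w1 for w1, _ in bigrams)
    second_count = Counter(w2 for _, w2 in bigrams)
    scores = {}
    for (w1, w2), o11 in pair_count.items():
        o12 = first_count[w1] - o11
        o21 = second_count[w2] - o11
        o22 = n - o11 - o12 - o21
        denom = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        scores[(w1, w2)] = n * (o11 * o22 - o12 * o21) ** 2 / denom if denom else 0.0
    return scores
```

A 2-gram would then be kept as a label candidate when its score exceeds the colocThreshold; on a large corpus the scores of strong collocations become very large because the statistic grows with the number of bigrams.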
Table 4.4. Some collocations with highest values of chi-square statistic
Collocation (Meaning in English)                      Chi-square value
TP HCM (HCM city)                                     2.098409912555148E11
Monte Carlo (Monte Carlo)                             2.3750623868571806E9
Thuần_phong mỹ_tục (Habits and Customs)               8.404755120045843E8
Bin Laden (Bin Laden)                                 5.938943787195972E8
Bộ Vi_xử_lý (Center Processing)                       3.5782968749839115E8
Thép_miền_nam Cảng_Sài_gòn (a football club)          2.5598593044043452E8
Trận chung_kết (Final Match)                          1.939850017618072E8
Đất_khách quê_người (Foreign Land)                    1.8430912500609657E8
Vạn_lý trường_thành (the Great Wall of China)         1.6699845099865612E8
Đi_tắt đón_đầu (Take a short-cut, Wait in front)      1.0498738800702788E8
Xướng_ca vô_loài                                      1.0469589600052954E8
Ổ cứng (Hard Disk)                                    9.693021145946936E7
Sao_mai Điểm_hạn (a music competition)                8.833816801460913E7
Bảng xếp_hạng (Ranking Table)                         8.55072554114269E7
Sơ_yếu lý_lịch (Curriculum Vitae)                     8.152866670394194E7
Vốn điều_lệ (Charter Capital)                         5.578214903954915E7
Xứ_sở sương_mù (England)                              4.9596802405895464E7
Windows XP (Windows XP)                               4.8020442441390194E7
Thụ_tinh ống_nghiệm (Test-tube Fertilization)         4.750102933435161E7
Outlook Express (Outlook Express)                     3.490459668749844E7
Công_nghệ thông_tin (Information Technology)          1587584.1576983468
Hệ_thống thông_tin (Information System)               19716.68246929993
Silicon Valley (Silicon Valley)                       1589327.942940336
Relevance Score 
We borrowed the ideas of the simple score and the inter-cluster score from [27]. The simple score is the relevance of a label to a specific cluster without considering the other clusters. The inter-cluster score of a label and a cluster, on the other hand, looks not only at the cluster of interest but also at the other clusters. As a result, labels chosen using the inter-cluster score discriminate between clusters better than those chosen using the simple score.
To get the relevance between a label candidate (l) and a cluster (c) using the simple score, we use 3 types of features: the topic similarity (topsim) between the topic distribution of the label candidate and the “expected topic distribution” of the cluster, the length of the candidate ($len_l$), and the number of snippets in cluster c containing this phrase ($cdf_{l,c}$). More concretely, given a candidate label $l = w_1 w_2 \dots w_{len_l}$, we first infer the topic distribution $\theta_l$ of the label, again using the estimated model of the universal dataset. Next, the simple relevance score of l with respect to a cluster $\theta_c$ is measured by using cosine similarity for the topic similarity [48]:
$$splscore(l, \theta_c) = \alpha \times cosine(\theta_l, \theta_c) + \beta \times cdf_{l,c} + \gamma \times len_l \qquad (4.4)$$
A good cluster label is not only relevant to the current cluster but also helps to distinguish this cluster from the others. So, it is very useful to penalize the relevance score of a label with respect to the current cluster c by the relevance scores of that label with respect to the other clusters c' ($c' \in C$ and $c' \neq c$). Thus, we get the inter-cluster scoring function as follows:
$$lscore(l, \theta_c) = splscore(l, \theta_c) - \mu \times \sum_{c' \in C,\ c' \neq c} splscore(l, \theta_{c'}) \qquad (4.5)$$
The candidate labels of a cluster are sorted in descending order of relevance, and the 4 most relevant candidates are then chosen as labels for the cluster.
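The two scoring functions of Eq. 4.4 and Eq. 4.5 can be sketched in Python as follows. The default weights (α = 10, β = 2, γ = 2, μ = 0.35) are the values reported in Table 4.6. The function signatures, the list-based representation of clusters, and the use of raw counts for the cdf and length features are assumptions; the thesis does not state whether these features are normalized.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def splscore(theta_l, theta_c, cdf, length, alpha=10.0, beta=2.0, gamma=2.0):
    """Simple score: topic similarity, snippet count and label length (Eq. 4.4)."""
    return alpha * cosine(theta_l, theta_c) + beta * cdf + gamma * length


def lscore(theta_l, length, thetas, cdfs, c, mu=0.35):
    """Inter-cluster score of a label for cluster index c (Eq. 4.5).

    thetas[k] is the expected topic distribution of cluster k and
    cdfs[k] the number of snippets in cluster k that contain the label.
    """
    own = splscore(theta_l, thetas[c], cdfs[c], length)
    others = sum(splscore(theta_l, thetas[k], cdfs[k], length)
                 for k in range(len(thetas)) if k != c)
    return own - mu * others
```

The penalty term lowers the score of labels that fit several clusters equally well, so a label that is specific to one cluster ranks above a generic one with the same simple score.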
d. Ranking within Cluster 
We reorder the documents in each cluster by the relevance between their topic distributions and the “expected topic distribution” of the cluster. If the relevance measures of 2 snippets within one cluster are the same, their old ranks in the complete list determined by Google are used to specify the final ranks. In other words, if 2 snippets have the same relevance score, the one with the higher old rank gets the higher rank in the cluster under consideration.
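The re-ranking rule can be expressed as a sort with a two-part key. In this sketch the relevance between a snippet and the cluster is assumed to be the cosine similarity of their topic distributions (the text does not name the measure), and the function names and the `(google_rank, theta)` pair layout are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expected_topic_distribution(thetas):
    """Average of the topic distributions of all snippets in a cluster."""
    n = len(thetas)
    return [sum(column) / n for column in zip(*thetas)]


def rank_within_cluster(snippets):
    """Order (google_rank, theta) pairs by relevance, ties broken by Google rank.

    A smaller google_rank means the snippet appeared earlier in the original list.
    """
    center = expected_topic_distribution([theta for _, theta in snippets])
    return sorted(snippets, key=lambda s: (-cosine(s[1], center), s[0]))
```

Sorting on the pair (negative relevance, original rank) implements the tie-breaking rule directly: equal relevance scores fall back to the order given by Google.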
4.2.2. Experiments 
a. Experimental Settings 
For the experiments, we first submit 10 ambiguous queries to Google and get back about 200 search result snippets (see Table 4.5). We choose these queries because they are likely to contain multiple sub-topics; thus, we will benefit more from clustering the search results.
Table 4.5. Queries submitted to Google
Types             Queries
General Terms     Sản phẩm (products), thị trường (market), triển lãm (exhibition), công nghệ (technology), đầu tư (investment), hàng hóa (goods)
Ambiguous Terms   Ma trận (matrix), tài khoản (account), hoa hồng (rose/money), ngôi sao (star)
For each query, we cluster the search results using HAC and hidden topics discovered from the collection of VnExpress and Wikipedia data (see the previous chapter) with 200 topics. The parameters for clustering are shown in the following table.
Table 4.6. Parameters for clustering web search results
Parameters                   Values           Meaning
Normalized                   TF               Method for constructing the word-part vector of the snippet
Similarity between Clusters  Average Linkage  How to calculate the similarity between 2 clusters based on the similarities between pairs of snippets
Cut-off                      0.01             A topic with probability less than this value is weighted as zero in the topic-part vector of the snippet
Lambda                       0.4              Mixture of topic-part and word-part in the vector of the snippet
mergeThreshold               0.13             The smallest similarity with which two clusters can be merged (similarity criterion) [see 4.2.1. b.]
Alpha                        10               Weight of the topic-similarity feature in the simple scoring method [Eq.4.4]
Beta                         2                Weight of the “cdf” feature in the simple scoring method [Eq.4.4]
Gamma                        2                Weight of the “len” feature in the simple scoring method [Eq.4.4]
Mu                           0.35             Parameter in the inter-cluster scoring method [Eq.4.5]
colocThreshold               2500.0           The smallest chi-square value that a 2-gram must reach to form a collocation (a meaningful phrase for labeling)
b. Evaluation 
In order to evaluate the clustering method, for each query we specify “good clusters”, which are clusters whose snippets tell us about a coherent topic. We count the number of snippets in good clusters for each query and calculate the “coverage” as follows:
$$coverage = \frac{\text{number of snippets in selected “good clusters”}}{\text{number of snippets for this query}} \qquad (4.6)$$
For each good cluster of a query, we evaluate both the quality of the cluster and the ranking policy by calculating P@5, P@10, and P@20, which are the precision at the top 5, top 10, and top 20 snippets respectively.
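Both evaluation measures are simple ratios, as the following Python sketch shows. The function names and input layouts are illustrative. One assumption is labeled in the code: when a cluster holds fewer than k snippets, the sketch divides by the number of snippets actually present rather than by k, a choice the text does not specify.

```python
def coverage(good_cluster_sizes, n_snippets):
    """Fraction of a query's snippets that fall into the selected good clusters (Eq. 4.6)."""
    return sum(good_cluster_sizes) / n_snippets


def precision_at_k(relevant_flags, k):
    """Precision over the top-k snippets of a ranked cluster.

    relevant_flags[i] is True when the i-th ranked snippet is judged on-topic.
    Assumption: clusters shorter than k are scored over their actual length.
    """
    top = relevant_flags[:k]
    return sum(top) / len(top) if top else 0.0
```

For instance, three good clusters of 30, 20, and 10 snippets out of 200 snippets for a query give a coverage of 0.3.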
c. Experimental Results and Discussions 
[Figure 4.8: bar chart of precision (vertical axis from 0 to 1) with series P@5, P@10, and P@20 for the queries công nghệ, đầu tư, hàng hóa, hoa hồng, ma trận, ngôi sao, sản phẩm, tài khoản, thị trường, and triển lãm.]
Figure 4.8 Precision of top 5 (and 10, 20) in best clusters for each query
[Figure 4.9: bar chart of coverage (%) (vertical axis from 0 to 1) with series Top 5 Coverage and Top 10 Coverage for the queries công nghệ, đầu tư, hàng hóa, hoa hồng, ma trận, ngôi sao, sản phẩm, tài khoản, thị trường, and triển lãm.]
Figure 4.9 Coverage of the top 5 (and 10) good clusters for each query
Figure 4.8 shows the precision of the top 5 (and 10, 20) snippets in the best clusters for each query. Although the performance depends heavily on the search results returned by the Web search engine, the overall quality is satisfactory (the precision is above 80% on average). For some queries such as “công nghệ” (technology), the returned snippets focus mostly on the topic of “information technology”, so the clustering system has to depend heavily on word similarities to determine clusters. As a result, the clustering quality is not as good as for other queries. For queries such as “ma trận” (matrix), the search results span multiple domains (movie, game, mathematics, and technology, such as the matrix in cameras), which makes topic information really beneficial, and the performance for them is quite good (for “ma trận”, P@5, P@10, and P@20 are 100%, 98%, and 96% respectively).
The coverage of the best 5 (and 10) clusters for each query is shown in Figure 4.9. From this figure, we can see that the coverage of the 10 best clusters for each query is around 40-50% (of about 250 snippets). This means that these clusters can help users navigate efficiently through about 10 pages returned from Google (supposing that the number of snippets per page is 10, which is Google's default).
[Figure 4.10, chart panels: (1) bar chart (vertical axis from 0 to 3.5) for Snippets 1-3 over the words phát hiện, hành tinh, mặt trời, nhà khoa học, quỹ đạo, sao, Hải vương, trắng, trạng thái, vũ trụ, vật chất, lùn, khối lượng; (2) bar chart of weight (scaled and rounded, vertical axis from 0 to 40) for Snippets 1-3 over Topic 107, Topic 141, Topic 9, and Topic 185.]
Topic 107 (astronomy): mặt_trời trái_đất hành_tinh quỹ_đạo vệ_tinh quan_sát quay mặt_trăng ngôi_sao vũ_trụ vật_thể thiên_thể thiên_văn khối_lượng
Although much work remains to verify our method, these results partly demonstrate its effectiveness. The main advantage of our clustering method is that it considers as similar not only snippets that share many words, but also those that share hidden topics. As a result, it goes beyond the limitations of different word choices (see Figure 4.10).
Snippet 1 = Phát hiện 28 hành_tinh mới ngoài hệ mặt_trời Tin_tức sự_kiện Trong_số 28 
hành_tinh mới các nhà_khoa_học phát_hiện một hành_tinh 
Snippet 2 = Sao lùn cực nhẹ là những có khối_lượng nhỏ hơn 0,3 lần khối_lượng mặt_trời 
Trong vùng lân_cận mặt_trời hiện còn vô_số sao lùn có khối_lượng cực 
Snippet 3 = Bên_trong các sao lùn trắng vật_chất ở trạng_thái siêu_đặc như_vậy 
Figure 4.10. Word and Topic sharing among 3 snippets of the same cluster about astronomy 
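The idea that two snippets can be similar through shared topics even when they share no words can be sketched as below. This is only an illustration: it assumes cosine similarity over word counts and over topic weights, mixed by a hypothetical parameter `lam`. The similarity actually used by the clustering method is defined in Section 4.2.1 and may differ, and the vectors are made-up values, not those plotted in Figure 4.10.

```python
import math
from typing import Mapping


def cosine(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_similarity(words_a: Mapping[str, float],
                        topics_a: Mapping[str, float],
                        words_b: Mapping[str, float],
                        topics_b: Mapping[str, float],
                        lam: float = 0.5) -> float:
    """Mix topic similarity and word similarity (lam is an assumed weight)."""
    return (lam * cosine(topics_a, topics_b)
            + (1.0 - lam) * cosine(words_a, words_b))


# Made-up example: no shared words, but the same dominant topic.
words_1 = {"hành_tinh": 3.0, "mặt_trời": 1.0}
words_3 = {"sao": 1.0, "lùn": 1.0}
topics_1 = {"topic_107": 30.0}
topics_3 = {"topic_107": 25.0}

print(cosine(words_1, words_3))
print(combined_similarity(words_1, topics_1, words_3, topics_3))
```

With word overlap alone these two toy snippets look unrelated; the topic part is what pulls them into the same cluster.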
4.3. Summary 
This chapter describes the deployment of the general frameworks for classification and 
clustering in Vietnamese. Good results were observed in the experiments on both tasks. 
We obtain an improvement of about 8% for classification with sparse data. The 
topic-oriented clustering method has shown its effectiveness in improving the quality of 
search result clustering as well as in labeling and re-ranking clusters. These results can 
be seen as practical evidence for our arguments in the previous chapters. 
Conclusion 
Achievements throughout the thesis 
The main contributions of this thesis are as follows: 
- Chapter 1 summarizes major text modeling methods and hidden topic models, with 
particular attention to LDA, which has recently proved successful in many 
applications such as entity resolution, classification, and feature selection. 
These models are the basis for our proposals in the subsequent chapters. 
- In Chapter 2, two general frameworks are proposed for learning with the 
support of hidden topics. The main motivation is to benefit from huge 
sources of online data in order to enhance the quality of text/Web clustering and 
classification. Unlike previous studies of learning with external resources, we 
approach this issue from the point of view of text/Web data analysis based 
on recently successful latent topic analysis models such as LSA, pLSA, and LDA. The 
underlying idea of the frameworks is that, for each learning task, we collect a very 
large external data collection called the “universal dataset”, and then build a learner 
on both the learning data and the rich set of hidden topics discovered from that 
collection. 
- In Chapter 3, we discuss important issues and results of topic analysis with LDA 
for two datasets: VnExpress (199MB) and Wikipedia (270M). Significant 
considerations about preprocessing and transformation for Vietnamese, as well as 
topic analysis, are highlighted. The experimental results show that 
LDA is a suitable method for topic analysis in Vietnamese. 
- Chapter 4 describes two deployments of the general frameworks for two tasks: 
classifying and clustering search results in Vietnamese. The significant improvements 
in classification and clustering show the success of our proposed methods. 
Future Works 
Topic analysis attracts many researchers because of its wide range of applications in 
various areas and the variety of research directions it opens. In the future, the 
following research directions could be considered: 
- Deployment of the general frameworks for page ranking and summarization in search 
engines. In the page ranking task, we consider a query as a short document and infer 
its topic distribution. Based on the topic distributions of the returned pages, we then 
order them by their relevance to the topic distribution of the query, 
thus providing users with topic-oriented ranking. In the summarization problem, for each 
result page, we can take the sentences that are closest in topic distribution to the 
query and contain keywords as the summary of that page. For 
implementation, topic inference can be done offline for all the web pages 
stored in the search engine, which reduces online computation. 
- Tracking online news over time using Dynamic Topic Models (DTM): DTM is an 
extension of Latent Dirichlet Allocation proposed by Blei and Lafferty (2006). 
It is a useful tool for tracking and visualizing the development of topics over 
time. One application is tracking business news in order to answer 
questions such as “during which period will attention be paid to a given business 
field?”. This can help stockbrokers make their investment decisions. 
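The topic-oriented ranking proposed in the first direction above can be sketched as follows. The sketch assumes that topic distributions for the pages have already been inferred offline and that relevance is measured by cosine similarity; the thesis does not fix a particular measure, so the measure, the function names and the toy distributions are all assumptions.

```python
import math
from typing import Dict, List, Sequence


def cosine(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine similarity between two topic distributions of equal length."""
    dot = sum(x * y for x, y in zip(p, q))
    norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0


def rank_pages(query_topics: Sequence[float],
               page_topics: Dict[str, Sequence[float]]) -> List[str]:
    """Return page ids ordered from most to least similar to the query."""
    return sorted(page_topics,
                  key=lambda page_id: cosine(query_topics, page_topics[page_id]),
                  reverse=True)


# Toy example with three topics; page distributions are precomputed offline.
query = [0.8, 0.1, 0.1]
pages = {
    "a": [0.7, 0.2, 0.1],
    "b": [0.1, 0.8, 0.1],
    "c": [0.3, 0.3, 0.4],
}
print(rank_pages(query, pages))
```

Only the topic inference for the query has to run online in this setting; a divergence between distributions could replace cosine similarity without changing the structure.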
References 
Vietnamese References 
[1]. Mai, N.C., Vu, D.N., Hoang, T.P. (1997), “Cơ sở ngôn ngữ học và tiếng Việt”, Nhà 
xuất bản Giáo dục 
English References 
[2]. Andrieu, C., de Freitas, N., Doucet, A. and Jordan, M.I. (2003), “An Introduction to 
MCMC for Machine Learning”, Machine Learning. 50, pp. 5-43. 
[3]. Banerjee, S., Ramanathan, K., and Gupta, A (2007), “Clustering Short Texts Using 
Wikipedia”, In Proceedings of ACL. 
[4]. Bhattacharya, I. and Getoor, L. (2006), “A Latent Dirichlet Allocation Model for 
Entity Resolution”, In Proceedings of 6th SIAM Conference on Data Mining, 
Maryland, USA. 
[5]. Blei, D.M., Ng, A.Y. and Jordan, M.I. (2003), “Latent Dirichlet Allocation”, Journal 
of Machine Learning Research. 3, pp. 993-1022 
[6]. Blei, D. and Jordan, M. (2003), “Modeling annotated data”, In Proceedings of the 
26th annual International ACM SIGIR Conference on Research and Development in 
Information Retrieval 127–134. ACM Press, New York, NY. 
[7]. Blei, D.M. and Lafferty, J. (2006), “Dynamic Topic Models”, In Proceedings of the 
23rd International Conference on Machine Learning, Pittsburgh, PA. 
[8]. Blei, D.M. and Lafferty, J. (2007), “A Correlated Topic Model of Science”, The 
Annals of Applied Statistics. 1, pp. 17-35 
[9]. Blum, A. and Mitchell, T. (1998), “Combining Labeled and Unlabeled Data with 
Co-training”, In Proceedings of COLT. 
[10]. Bollegala, D., Matsuo, Y., and Ishizuka, M. (2007), “Measuring Semantic 
Similarity between Words using Web Search Engines”, In Proceedings of WWW. 
[11]. Manning, C.D. and Schütze, H. (1999), Foundations of Statistical Natural 
Language Processing. 
[12]. Chuang, S.L., and Chien, L.F. (2005), “Taxonomy Generation for Text Segments: a 
Practical Web-based Approach”, ACM Transactions on Information Systems. 23, 
pp.363-386. 
[13]. Deerwester, S., Furnas, G.W., Landauer, T.K., and Harshman, R.(1990), “Indexing 
by Latent Semantic Analysis”, Journal of the American Society for Information 
Science. 41, 391-407. 
[14]. Dhillon, I., Mallela, S., and Kumar, R. (2002), “Enhanced Word Clustering for 
Hierarchical Text Classification”, In Proceedings of ACM SIGKDD 
[15]. Erosheva, E., Fienberg, S. and Lafferty, J. (2004). Mixed-membership models of 
scientific publications. Proc. Natl. Acad. Sci. USA 97 11885–11892. 
[16]. Gabrilovich, E. and Markovitch, S. (2007), “Computing Semantic Relatedness using 
Wikipedia-based Explicit Semantic Analysis”, In Proceedings of IJCAI 
[17]. Girolami, M. and Kaban, A. (2004), “Simplicial mixtures of Markov chains: 
Distributed modelling of dynamic user profiles”, In Advances in Neural Information 
Processing Systems 16 9–16. MIT Press, Cambridge, MA. 
[18]. Griffiths, T., Steyvers, M., Blei, D. and Tenenbaum, J. (2005), “Integrating topics 
and syntax”, In Advances in Neural Information Processing Systems 17 537–544, 
MIT Press, Cambridge, MA. 
[19]. Google Search, 2007.  
[20]. Heinrich, G., “Parameter Estimation for Text Analysis”, Technical Report. 
[21]. Hofmann, T., “Probabilistic Latent Semantic Analysis”, In Proceedings of UAI 
[22]. Hofmann, T., (2001), “Unsupervised Learning by Probabilistic Latent Semantic 
Analysis”, Machine Learning. 42, pp. 177-196 
[23]. Lawrie, D. J. and Croft, W.B. (2003), “Generating Hierarchical Summaries for Web 
Searches”, In Proceedings of ACM SIGIR. 
[24]. Letsche, T. A. and Berry, M. W. (1997), “Large-Scale Information Retrieval with 
Latent Semantic Analysis”, Information Sciences. 100, pp. 105-137 
[25]. Liu, B., Chin, C. W., and Ng, H. T., “Mining Topic-Specific Concepts and 
Definitions on the Web”, In Proceedings of WWW 
[26]. Latent Semantic Analysis,  
[27]. Mei, Q., Shen, X., and Zhai, C., “Automatic Labeling of Multinomial Topic 
Models”, In Proceedings of ACM SIGKDD, 2007 
[28]. Modha, D.S. and Spangler, W.S., “Clustering Hypertext with Applications to Web 
Searching”, In Proceedings of the 11th ACM on Hypertext and Hypermedia 
[29]. McCallum, A., Corrada-Emmanuel, A. and Wang, X. (2004), “The author–recipient–topic 
model for topic and role discovery in social networks: Experiments 
with Enron and academic email”, Technical report, Univ. Massachusetts, Amherst. 
[30]. Nigam, K., McCallum, A., Thrun, S., and Mitchell, T., “Text Classification from 
Labeled and Unlabeled Documents using EM”, Machine Learning. 39, pp. 103-134 
[31]. Nguyen, C.T., Nguyen, T.K, Phan, X.H., Nguyen, L.M. and Ha, Q.T., “Vietnamese 
Word Segmentation with CRFs and SVMs: An Investigation”, In Proceedings of the 
20th Pacific Asia Conference on Language, Information and Computation 
(PACLIC20), pp.215-222, Wuhan, China, 1-3 November 2006 
[32]. Nguyen, C.T., “JVnSegmenter: A Java-based Vietnamese Word Segmentation 
Tool”,  2007 
[33]. Nguyen, C.T., Tran, T.O., Ha, Q.T., Phan, X.H, “Named Entity Recognition in 
Vietnamese Free-Text and Web Documents Using Conditional Random Fields”, The 
Workshop on Asian Applied NLP and language resource development, Sirindhorn 
International Institute of Technology, Pathumthani, Thailand, March 13, 2007. 
[34]. Nguyen, V.C., Nguyen, T.T.L., Ha, Q.T., Phan, X.H. (2006), “A Maximum Entropy 
Model for Text Classification”, In Proceedings of International Conference on 
Internet Information Retrieval, pp. 143-149, Korea. 
[35]. Nigam, K., Lafferty, J., McCallum, A. (1999), “Using Maximum Entropy for Text 
Classification”, In Proceeding of the International Joint Conference on Artificial 
Intelligence. 
[36]. Nutch: an open-source search engine,  
[37]. Phan, X.H, “JWebPro: A Java-based Web Processing Toolkit” 
 2007 
[38]. Phan, X.H, “GibbsLDA++: A C/C++ and Gibbs Sampling based Implementation of 
Latent Dirichlet Allocation (LDA)”,  2007 
[39]. Phan, X.H, “JTextPro: A Java-based Text Processing Toolkit”, 
[40]. Papadimitriou, C., Tamaki, H., Raghavan, P., and Vempala, S., “Latent Semantic 
Indexing: A probabilistic analysis”, pages 159-168, 1998 
[41]. Popescul, A., Ungar, L., Pennock, D., and Lawrence, S., (2001) “Probabilistic 
Models for Unified Collaborative and Content-based Recommendation in Sparse-
data environments”, In Uncertainty in Artificial Intelligence, Proceeding of the 
Seventeenth Conference. 
[42]. Rosen-Zvi, M., Griffiths, T., Steyvers, M. and Smyth, P. (2004), “The author-topic 
model for authors and documents”, In AUAI’04: Proceedings of the 20th Conference 
on Uncertainty in Artificial Intelligence 487–494. AUAI Press, Arlington, VA. 
[43]. Sahami, M., and Heilman, T.D. (2006), “A Web-based Kernel Function for 
Measuring the Similarity of Short Text Snippets”, In Proceedings of WWW 
[44]. Sato, I. and Nakagawa, H., “Knowledge Discovery of Multiple-Topic Document 
using Parametric Mixture Model with Dirichlet Prior”, In Proceedings of ACM 
SIGKDD, 2007 
[45]. Schönhofen, P., “Identifying Document Topics using the Wikipedia Category 
Network”, In Proceedings of IEEE/WIC/ACM International Conference on Web 
Intelligence, 2006 
[46]. Sivic, J., Russell, B., Efros, A., Zisserman, A. and Freeman, W. (2005), “Discovering 
object categories in image collections”, Technical report, CSAIL, Massachusetts 
Institute of Technology 
[47]. VnExpress: the online Vietnamese news,  
[48]. Vector Space Model,  
[49]. Vivisimo Web Search,  
[50]. Wikipedia: the free encyclopedia,  2007 
[51]. Zamir, O. and Etzioni, O., “Grouper: a Dynamic clustering interface to Web search 
results”, In Proceedings of WWW. 
[52]. Zeng, H.J., He, Q.C., Chen, Z., Ma, W.Y., and Ma, J., “Learning to Cluster Web 
Search Results”, In Proceedings of ACM SIGIR, 2004 
Appendix: Some Clustering Results 
Clustering results for the query “ma trận” (matrix) 
Cluster 1: hệ_phương_trình (18) 
Keywords: tuyến_tính, đại_số, định_thức, số_phức, bài_toán, thuần_nhất, lượng_giác, số_ảo, khả 
1. Bửu_bối giúp đậu mấy môn toán đại_cương phương_pháp tính hàm ... lấy giới_hạn ma_trận số_thực ma_trận số_ảo lượng_giác thực lượng_giác ảo 
2. Đại_số Diễn_Đàn Sinh_Viên Quy_Nhơn Bài_4 Cho ma_trận vuông có các phần_tử trên đường_chéo chính là 2007 các phần_tử Giải hệ_phương_trình tuyến_tính thuần_nhất với là ma_trận cột 
3. Đề_cương môn_học Biết cách biểu_diễn các đồng cấu bởi ma_trận đồng_thời biết sử_dụng các phép_toán trên ma_trận và biết cách giải một hệ_phương_trình đại_số tuyến_tính ứng 

Cluster 2: phim (12) 
Keywords: bộ phim, diễn_viên, vai diễn, điện_ảnh thế_giới, ngôi_sao điện_ảnh, phim ăn_khách, phim_truyện, ngôi_sao, diễn 
1. phim truyen han_quoc ma tran tai_hien dvd Những khách_hàng mua Phim_Truyện Hàn_Quốc Ma_Trận Tái_Hiện DVD HQ_Pro cũng mua những món_hàng sau_đây 
2. 24h.com.vn Giới_thiệu các ngôi_sao điện_ảnh thế_giới và Việt_nam Với vai diễn Neo trong The_Matrix Ma_trận 1999 Reeves đã đưa bộ phim trở thành_bộ phim ăn_khách và được nhiều người săn_lùng 
3. Ma_Trận I Ii Iii Tổng_Hợp 3 Phần Của_Phim VN Zoom_forum Phim The_Matrix quả_là một phim_hay không_chỉ bởi diễn_viên bởi kỹ_xảo hình_ảnh âm_thanh 

Cluster 3: game (8) 
Keywords: game, game trực_tuyến, game nhập_vai, trò_chơi, game online, picachu, phiêu_lưu, tải game, giải_trí trực_tuyến, chơi game 
1. Game Online Cuộc Chiến Ma_Trận Cuộc Chiến Ngoài Hành_Tinh Cuộc Phiêu_Lưu Của Chó_Con Cướp_Biển_Caribê Dap_Tho Đi Tìm Cún_Con Lau_Kính 
2. Giới_thiệu các trò_chơi mới hướng_dẫn chơi game Thất_bại vẫn tiếp_tục đeo_bám Ma_Trận khi The Matrix_Online game nhập_vai trực_tuyến đầy tham_vọng 
3. game_online tải game game_flash game_mini game hay game trực_tuyến Tên_Game Ma_trận Description Chúng_ta đang ở trong ma_trận hãy tiêu_diệt hết bọn nhân_bản nào 

Cluster 4: led (5) 
Keywords: led, máy_ảnh, cảm_biến, nét, tat ca, kiểu ma_trận, linh_kiện, tu van, đo, sáng 
1. ĐIỀU_KHIỂN MA_TRẬN LED VÀ BÀN_PHÍM HEX LED MATRIX_ 
2. Nội_dung các con chíp_LED nguyên_vật_liệu đèn phát_quang các thiết_bị năng_lượng cao linh_kiện điện_tử có độ_phân_giải cao điểm ma_trận 
3. Nikon D2Xs Máy_ảnh chuyên_dụng tốt nhất DIỄN_ĐÀN CÔNG_NGHỆ VIỆT_NAM Nikon còn trang_bị mạch đo sáng kiểu ma_trận 3D 

Cluster 5: máy_in (2) 
Keywords: máy_in, đòi_hỏi, ứng_dụng rộng_rãi, hóa đơn, http, môi_trường, series, tốc_độ, in_nhanh, raovat 
1. MÁY_IN HÓA ĐƠN POSIFLEX_PP 5600_SERIES Mua_ban rao_vat Raovat Tốc_độ in_nhanh in theo phương_pháp ma_trận điểm 
2. Các máy_in ma_trận dòng_T6212 T6215 và T6218 là những máy_in tốc_độ cao phù_hợp trong môi_trường đòi_hỏi in_ấn với số_lượng lớn Tốc_độ tương_ứng 
Clustering results for the query “ngôi sao” (star) 
Cluster 1: điện_ảnh (23) 
Keywords: ngôi_sao điện_ảnh, điện_ảnh thế_giới, phim, bộ phim, nữ diễn_viên, ngôi_sao phim, thế_giới điện_ảnh, diễn_viên, thế_giới 
1. Giới_thiệu các ngôi_sao điện_ảnh thế_giới và Việt_nam Khi công_bố nữ diễn_viên chính sẽ đóng cặp với ngôi_sao điện_ảnh xứ_Hàn Bae_Yong_Joon là Lee_Ji_Ah khiến ai cũng ngỡ_ngàng 
2. Hai ngôi_sao võ_thuật của Trung_Quốc góp_mặt trong bộ phim mới Hai ngôi_sao võ_thuật của Trung_Quốc góp_mặt trong bộ phim mới King of_Kungfu 
3. phim nhiều thể loại thế_giới điện_ảnh với các thông_tin nóng_hổi Phim đời_tư các ngôi_sao điện_ảnh của thế_giới và việt_nam Phim bình_luận về các bộ phim hay Lịch chiếu_phim trên_HBO CINEMAX STAR_MOVIE 

Cluster 2: ca_nhạc (12) 
Keywords: ngôi_sao ca_nhạc, ban_nhạc, nhạc pop, thể_thao, yêu_mến, mariah, mariah_carey, nhạc_sĩ, pop 
1. Các ngôi_sao ca_nhạc ban_nhạc nổi_tiếng Trang chân_dung nghệ_sĩ ca_sĩ nhạc_sĩ và ban_nhạc nổi_tiếng 
2. Năm 2005 Eva ký hợp_đồng với Oréal hãng mỹ_phẩm nổi_tiếng của Pháp hợp_đồng đã đưa cô lên ngang_hàng với các ngôi_sao như ca_sĩ da_màu Beyonce 
3. Các ngôi_sao ca_nhạc ban_nhạc nổi_tiếng Mariah Tuy_nhiên lợi_thế trẻ_trung của teen_star vẫn chưa đủ_sức làm lu_mờ một số ngôi_sao gạo_cội trong_đó có Mariah_Carey Người_ta gọi cô là Diva quên tuổi 

Cluster 3: bóng_đá (10) 
Keywords: sân_cỏ, ngôi_sao sân_cỏ, premiership, vòng chung_kết, vòng thi, trung_tâm, tphcm, giải, ronaldo 
1. Milan chưa từ_bỏ ý_định mua_Ronaldinho Chủ_tịch Milan Silvio_Berlusconi cho_biết ông sẵn_sàng nối_lại đàm phám với Barca về trường_hợp của Ronaldinho khi đội_bóng chủ sân 
2. Tin nhanh bóng_đá Ngôi_sao sân_cỏ Cháy_mãi Ronaldo 24h.com.vn Tin nhanh bóng_đá 
3. 24h.com.vn Tin nhanh bóng_đá Ngôi_sao sân_cỏ Premiership 24h.com.vn Tin nhanh bóng_đá Ngôi_sao sân_cỏ Các tin khác của mục Ngôi_sao sân_cỏ Tổng_hợp 24H 

Cluster 4: mặt_trời (5) 
Keywords: quỹ_đạo, quanh, thiên_hà, địa_cầu, hành_tinh, get the, lùn, single_network, vietnamkhoa_học 
1. Phát hiện 28 hành_tinh mới ngoài hệ mặt_trời 
2. Người góp_chìa Sao lùn cực nhẹ là những ngôi_sao có khối_lượng nhỏ hơn 0,3 lần khối_lượng mặt_trời 
3. an binh hanh_phuc Kích_thước của các ngôi_sao và khoảng_cách của chúng đối_với trái_đất vượt Mặt_trời của chúng_ta là ngôi_sao gần chúng_ta nhất cách địa_cầu độ_chừng 

Cluster 5: mặc đẹp (3) 
Keywords: mặc, thời_trang, mot so, nhat, beyonce knowles, tạp_chí life, biên_tập_viên, knowles, nữ_ca_sĩ 
1. VnExpress Anh 10 ngoi sao thoi trang nhat the_gioi Nữ_ca_sĩ Beyonce Knowles được các biên_tập_viên của tạp_chí Life Style Mỹ bầu chọn là ngôi_sao mặc đẹp và cá_tính nhất năm_nay 
2. 10 Ngôi_Sao Quyến_Rũ Nhất Trung_Quốc eVietBay 10 Ngôi_Sao Quyến_Rũ Nhất Trung_Quốc Tin_Tức về Thời_Trang Điện_Ảnh 
Clustering results for the query “thị trường” (market) 
Cluster 1: otc (29) 
Keywords: thị_trường otc, niêm_yết, cổ_phiếu otc, cổ_phiếu, một_số cổ_phiếu, công_ty niêm_yết 
1. Chứng_khoán Ngân_hàng THÔNG_TIN THỊ_TRƯỜNG NGHIÊN_CỨU PHÂN_TÍCH TIN VCBS 
2. Vietstock Vietnam Stock Market News and_Information Thong_tin Thị_trường bất_động_sản Cơ_hội vàng cho các nhà đầu_tư 
3. Chứng_khoán Biển_Việt Báo_Cáo Tổng_Quan Thị_Trường Cổ_Phiếu 

Cluster 2: kinh_tế thị_trường (20) 
Keywords: nền kinh_tế, tăng_trưởng kinh_tế, tốc_độ tăng_trưởng, tăng_trưởng mạnh_mẽ, động_thái 
1. Làm điếm trong nền kinh_tế thị_trường H-A O 
2. Các chuyên_gia phân_tích thị_trường nhận_định thị_trường ĐTDĐ trong năm_nay sẽ không duy_trì được tốc_độ tăng_trưởng hai con_số như trong quý_IV 
3. Chính_phủ và thị_trường Diễn_đàn X cafe Vận_hành nền kinh_tế thị_trường có_nghĩa là nhiều vấn_đề sẽ 

Cluster 3: đất (17) 
Keywords: đất, thị_trường bất_động_sản, bất_động_sản, căn_hộ chung_cư, đất dự_án, từ_liêm, quy_hoạch, lô_đất, giá đất, nhà_ở 
1. Saigon bất_động_sản Bất_động_sản Nhà_đất Địa_ốc Xây_dựng Bước vào quý_ 
2. Thị_trường bất_động_sản Thành_phố Hồ_Chí_Minh vẫn đang sôi VietNamNet_Bridge 
3. Thị_trường BĐS TP.HCM Vùng ven lên_giá Cotec Group Website Đất các quận_huyện vùng ven_như Thủ_Đức Cần_Giờ Bình_Chánh 

Cluster 4: điện_thoại di_động (11) 
Keywords: điện_thoại di_động, fpt, viễn_thông di_động, công_nghệ cdma, dịch_vụ điện_thoại 
1. MOBILENET Mobile Online Magazine Cú đột_phá trên thị_trường điện_thoại Năm 2006 khi E Com dịch_vụ điện_thoại cố_định 
2. Thị trường viễn thông di động sẽ có sự thay đổi lớn kể từ năm 2006 
3. Thị trường di động Việt Nam nửa đầu năm 2007 