Introduction
Human history has witnessed many great advances, but few are comparable to the growth of networks in general and of the WWW in particular.
Over the last decades, the databases of large stores and companies have grown exponentially. In the past, the main difficulty was finding enough data to meet one's needs. The picture has since reversed, and the daily problem is now how to find relevant data within the information that is constantly accumulated. The need therefore arises for approaches that can handle complex models in a reasonable amount of time, because the classical models of statistics, machine learning and traditional data analysis fail to cope with this level of complexity. This field is called data mining, and Web mining is one of its important branches.
The key inspiration for Web mining is the growth of the WWW. There is no doubt that the WWW is the most valuable source of information, but it is not easy to use if you do not know your way around it. Several studies have tried to estimate the size of the Web, and most of them agree that it contains billions of pages. Moreover, the rate at which the Web changes is even more dramatic [4]. According to [4], the size of the Web has doubled in less than two years, and this growth rate is projected to continue for the next two years. Although an enormous amount of information is available on the Web, you can hardly find the appropriate information without any support. For that reason, you need a search engine to make the task easier.
We can roughly consider search engines as systems that store information and, after some processing, return what we need. Results returned by a search engine are sorted in descending order, meaning that the entries appearing first are the ones deemed most relevant. Although this seems natural, much effort has been made to get this step right, and many methods approach the problem in different ways. The core task is to determine an importance score for each page in order to sort it within the list of pages [16, 6]. The most famous approach is PageRank™ from Google [1]. This thesis focuses on PageRank and on modifications that speed up its computation.
With the aim of understanding network phenomena and, especially, PageRank computation, this thesis is organized as follows:
• Chapter 1: Presents the rank concept and PageRank. This chapter provides the background on which the following chapters build.
• Chapter 2: Deals with some PageRank-related problems, such as the acceleration problem and spam. A new method, which we have published [19], is also proposed in this chapter (section 2.2).
• Chapter 3: Concentrates on the practical side of PageRank with implementations and experiments. Conclusions and future work are also mentioned in this chapter.
Contents

List of Figures
List of Tables
Introduction
Acknowledgement
Abstract
List of Glossaries

1 Objects' ranks and applications to WWW
  1.1 Rank of objects
    1.1.1 Rank for objects different from web pages
    1.1.2 Linkage usage in search engines
  1.2 Basis of PageRank
    1.2.1 Mathematical basis
    1.2.2 Practical issues
  Conclusion of chapter

2 Some PageRank-related problems
  2.1 Accelerating problems
    2.1.1 Related works
    2.1.2 Exploiting block structure of the Web
  2.2 Connected-component PageRank approach
    2.2.1 Initial ideas
    2.2.2 Mathematical basis of CCP
    2.2.3 On practical side of CCP
  2.3 Spam and spam detection
    2.3.1 Introduction to Web spam
    2.3.2 Spam techniques
  Conclusion of chapter

3 Implementations and experimental results
  3.1 Search engine Nutch
  3.2 Experiments
    3.2.1 Infrastructure and data sets
    3.2.2 Results
  Conclusion of chapter

Conclusions and future works
References
in block I. That is, if $P$ is the web graph and $l_i$ is the local PageRank of page $i$ in block $I$, then the weight of edge $B_{IJ}$ is given by:
$$B_{IJ} = \sum_{i \in I,\, j \in J} P_{ij} \cdot l_i$$
This can be written in matrix notation. Define the local PageRank matrix $L$ to be the $n \times k$ matrix whose columns are the local PageRank vectors $l_J$:
$$L = \begin{pmatrix} l_1 & 0 & \cdots & 0 \\ 0 & l_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & l_k \end{pmatrix}$$
Define the matrix $S$ to be the $n \times k$ matrix that has the same structure as $L$, but whose nonzero entries are all replaced by 1. Then the block matrix $B$ is the $k \times k$ matrix:
$$B = L^T P S$$
Notice that $B$ is a transition matrix in which the element $B_{IJ}$ represents the transition probability from block $I$ to block $J$, that is:
$$B_{IJ} = p(J \mid I)$$
Once we have the $k \times k$ transition matrix $B$, we may use the standard PageRank algorithm on this reduced matrix to compute the BlockRank vector $b$. In this setting, each block is treated as a web page and the PageRank algorithm is applied just as it is for web pages.
Using relative importance of blocks for computing PageRank
At this point, there are two sets of rankings: one contains the relative ranks of the web pages within each block, and the other contains the relative ranks of the blocks in the web graph. Within each block $J$, we have the local PageRanks $l_J$ of the pages in the block. We also have the BlockRank vector $b$ whose elements $b_J$ are the BlockRanks of the blocks, measuring their relative importance. We may now approximate the global PageRank of a page $j \in J$ as its local PageRank $l_j$, weighted by the BlockRank $b_J$ of the block in which it resides. That is,
$$\pi_j^{(0)} = l_j \times b_J$$
In matrix notation:
$$\pi^{(0)} = L \times b$$
And $\pi^{(0)}$ is used as the start vector for the regular PageRank algorithm.
Conclusion of approach
More specifically, this approach is presented in Algorithm 2.2.
Algorithm 2.2 Specific version of BlockRank
1: procedure BLOCKRANK(G)                 ▷ G is the whole Web graph
2:   Sort the web graph lexicographically   ▷ to induce intra-host blocks
3:   for all blocks J do                    ▷ compute the local PageRank vector of block J
4:     l_J = PageRank(J)
5:   end for
6:   B = L^T G S                            ▷ B is the block transition matrix
7:   b = PageRank(B)                        ▷ B is understood as a graph of blocks
8:   π^(0) = L × b
9:   π = PageRank(G)                        ▷ π^(0) is used as the start vector
10:  return π
11: end procedure
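To make the aggregation step concrete, the following is a minimal Python sketch of Algorithm 2.2 under simplifying assumptions: the web graph is small enough to hold as a dense, row-stochastic NumPy matrix, and the blocks are given as lists of page indices. The helper names (pagerank, blockrank_start_vector) are ours and purely illustrative; this is not the implementation of [14].

# Minimal BlockRank sketch (illustrative assumptions, not the code of [14]).
import numpy as np

def pagerank(P, alpha=0.85, tol=1e-8, max_iter=1000):
    # Power iteration for a row-stochastic matrix P with uniform teleportation.
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = alpha * (pi @ P) + (1.0 - alpha) / n
        if np.abs(new - pi).sum() < tol:          # L1 residual
            return new
        pi = new
    return pi

def blockrank_start_vector(P, blocks, alpha=0.85):
    # blocks: list of index arrays, one per (host) block of the web graph P.
    n, k = P.shape[0], len(blocks)
    L = np.zeros((n, k))                          # columns are local PageRank vectors
    S = np.zeros((n, k))                          # same structure, nonzeros replaced by 1
    for J, idx in enumerate(blocks):
        idx = np.asarray(idx)
        L[idx, J] = pagerank(P[np.ix_(idx, idx)], alpha)   # local PageRank of block J
        S[idx, J] = 1.0
    B = L.T @ P @ S                               # k x k block transition matrix
    b = pagerank(B, alpha)                        # BlockRank vector
    return L @ b                                  # pi(0): start vector for the full graph

The returned vector corresponds to π^(0) in line 8 of Algorithm 2.2 and would then be fed to the standard PageRank iteration on the full graph (line 9).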
This approach has shown that the hyperlink graph of the web has a nested block structure, something that had not yet been thoroughly investigated in studies of the web. This structure is exploited to compute PageRank quickly by the so-called BlockRank algorithm. As shown in Figure 2.1 (convergence rates for standard PageRank, solid line, vs. BlockRank, dotted line; the x-axis is the number of iterations and the y-axis the log of the L1 residual), BlockRank speeds up PageRank computations by factors of 2 and higher, depending on the particular scenario. There are a number of areas for future work: finding the "best" blocks for BlockRank by splitting up what would be slow-mixing blocks with internal nested block structure; using the
block structure for hyperlink-based algorithms other than web search, such as
in clustering or classification; and exploring more fully the topics of updates
and personalized PageRank.
2.2 Connected-component PageRank approach
In this section, we discuss the advantages of using connected components, show how the connected-component concept is applied to the computation of PageRank, and prove some results on the theoretical side. We also propose an approach that utilizes the connected-component idea from graph theory. The proposal is called CCP, short for "Connected Component PageRank".
2.2.1 Initial ideas
When studying the Markov model [17], we pay special attention to the fact that the states of a Markov chain can be divided into classes: states that communicate with one another are grouped into one class. This concept is similar to the concept of connected components in graph theory. Moreover, as mentioned, the web graph is represented by an adjacency matrix. This gives us a chance to apply the connected-component concept to PageRank computation.
The proposed algorithm is given below.
Algorithm 2.3 General steps of the CCP algorithm
1: function CCP(G)                          ▷ G is the Web graph
2:   Split the web graph into blocks by connected component
3:   Compute the local PageRanks for each block
4:   Form an approximate global PageRank vector π for the web pages by combining the eigenvectors of the blocks
5: end function
Figure 2.2 illustrates the idea of CCP.
[Figure 2.2: Unarranged matrix and arranged matrix. (a) Unarranged adjacency matrix; (b) Arranged adjacency matrix.]
Applying connected components to PageRank computation, we gain the following advantages.
1. Regarding computational complexity: our method follows the trend of reducing the cost needed to compute PageRank. We already know that the time needed to multiply a vector by a matrix is $O(n^2)$, where $n$ is the size of the common dimension. For that reason, if we compute PageRank by constructing the full adjacency matrix, we face a hard problem when $n$ reaches billions. By using connected components in computing PageRank, we can reduce the time needed. Here are some fundamental details about how we use connected components in the PageRank problem. Assume that our web graph has about $k$ connected components; we therefore obtain $k$ small adjacency matrices (which we call blocks) from the matrix $P$. Let $n_{\max}$ be the largest number of pages in any block. For each block, the time needed to compute its PageRank vector is at most $O(n_{\max}^2)$, which is much smaller than for the whole web graph. Over all blocks, the complexity cannot exceed $k \times O(n_{\max}^2)$, which is smaller than $O(n^2)$, where $n$ is the size of the graph. Generally, on the theoretical side, our method is more efficient than trivial PageRank.
2. Regarding the convenience of finding connected components: in [18], we can find many effective algorithms for finding the connected components of a graph. The time needed to determine the connected components is $O(m+n)$, where $m$ is the number of edges and $n$ is the number of vertices. This additional time is negligible compared to the PageRank computation itself; a BFS-based sketch is given after this list.
3. Regarding combination with other accelerating methods: we can achieve even better performance if we run our method in hybrid mode with other accelerating methods. In each block, we can use MAP or the Extrapolation method to reduce the computation time for that block, so the total time saved is considerable.
4. In terms of programming techniques, [14] refers to a programming approach that splits the web graph matrix into small parts for parallel computation. Using our method, PageRank computation can be parallelized easily: in the simplest case we can assign each block to one processor, which is very natural. There certainly exist many ways to parallelize the PageRank work, and CCP is a very promising one.
Note that, at this point, these advantages have not yet been backed by a theoretical proof or by practical experiments.
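As a concrete illustration of the $O(m+n)$ claim in the second advantage above, here is a small breadth-first-search sketch that labels the connected components of a graph given as adjacency lists, treating hyperlinks as undirected edges (weak connectivity). The function and variable names are ours, chosen for illustration; any standard algorithm from [18] would serve equally well.

# Illustrative O(m + n) connected-component labelling via BFS
# (a sketch under our own naming, not the exact procedure of [18]).
from collections import deque

def connected_components(adj):
    # adj: dict mapping each vertex to an iterable of its (undirected) neighbours.
    # Returns a list of components, each a list of vertices.
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        queue, comp = deque([start]), []
        seen.add(start)
        while queue:                      # each vertex and edge is visited once: O(m + n)
            v = queue.popleft()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        components.append(comp)
    return components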
If we split the web graph matrix into smaller blocks corresponding to connected components, we have to answer these questions:
1. Is the CCP method correct?
2. How do we use the method effectively?
We concentrate on a more detailed description of the CCP method in the following sections, 2.2.2 and 2.2.3.
2.2.2 Mathematical basis of CCP
How do we use connected components in computing the PageRank vector? Firstly, we repeat the PageRank computation as stated in Chapter 1, but with a slight modification. Suppose that we have a web graph $G$ with corresponding adjacency matrix $P$. We do not remove the leak nodes; instead, we define
$$\tilde{P} = \alpha P + \frac{1-\alpha}{n} J \qquad (2.1)$$
The principal eigenvector of $\tilde{P}$ is computed as follows:
$$\pi = \pi \tilde{P} \qquad (2.2)$$
Secondly, let $k$ be the number of connected components in $G$, and let $P_i$ be the adjacency matrix of size $n_i \times n_i$ corresponding to the $i$th component, with $\sum_{i=1}^{k} n_i = n$.
Thus, we can rearrange $P$ as follows:
$$P = \begin{pmatrix} P_1 & 0 & \cdots & 0 \\ 0 & P_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & P_k \end{pmatrix} \qquad (2.3)$$
We define the matrices:
$$\tilde{P}_i = \alpha P_i + \frac{1-\alpha}{n_i} J_i, \quad i = 1, \ldots, k, \text{ where } J_i \text{ is of size } n_i \times n_i \qquad (2.4)$$
For each block $P_i$, $i = 1, \ldots, k$, the formula for the principal eigenvector is:
$$\pi^i = \pi^i \tilde{P}_i \qquad (2.5)$$
Proposition 2.1. Given (2.1), (2.2), (2.3), (2.4) and (2.5),
$$\tilde{\pi} = \left( \frac{n_1}{n}\pi^1, \frac{n_2}{n}\pi^2, \ldots, \frac{n_k}{n}\pi^k \right) \qquad (2.6)$$
is an eigenvector of the matrix $\tilde{P}$.
Proof. To prove that $\tilde{\pi}$ is an eigenvector of the matrix $\tilde{P}$, we have to prove that $\tilde{\pi} = \left( \frac{n_1}{n}\pi^1, \frac{n_2}{n}\pi^2, \ldots, \frac{n_k}{n}\pi^k \right)$ satisfies equation (2.2). Substituting $\tilde{\pi}$ into equation (2.2), we have:
$$\tilde{\pi} = \tilde{\pi}\tilde{P} = \tilde{\pi}\left[\alpha P + \frac{1-\alpha}{n}J\right] = \tilde{\pi}\left[\alpha \begin{pmatrix} P_1 & 0 & \cdots & 0 \\ 0 & P_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & P_k \end{pmatrix} + \frac{1-\alpha}{n}J\right] \qquad (2.7)$$
$$\Leftrightarrow \left(\frac{n_1}{n}\pi^1, \ldots, \frac{n_k}{n}\pi^k\right) = \left(\frac{n_1}{n}\pi^1, \ldots, \frac{n_k}{n}\pi^k\right) \times \begin{pmatrix} \alpha P_1 + \frac{1-\alpha}{n}J_{11} & \frac{1-\alpha}{n}J_{12} & \cdots & \frac{1-\alpha}{n}J_{1k} \\ \frac{1-\alpha}{n}J_{21} & \alpha P_2 + \frac{1-\alpha}{n}J_{22} & \cdots & \frac{1-\alpha}{n}J_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1-\alpha}{n}J_{k1} & \frac{1-\alpha}{n}J_{k2} & \cdots & \alpha P_k + \frac{1-\alpha}{n}J_{kk} \end{pmatrix} \qquad (2.8)$$
in which $J_{ij}$ is the $n_i \times n_j$ matrix of all 1s.
Multiplying out the right-hand side of equation (2.8) and considering the first component, we have:
$$\frac{n_1}{n}\pi^1 = \frac{n_1}{n}\pi^1\left(\alpha P_1 + \frac{1-\alpha}{n}J_{11}\right) + \frac{n_2}{n}\pi^2\,\frac{1-\alpha}{n}J_{21} + \cdots + \frac{n_k}{n}\pi^k\,\frac{1-\alpha}{n}J_{k1} \qquad (2.9)$$
For every block $P_i$, the PageRank $\pi^i$ of that block satisfies equation (2.5), that is:
$$\pi^i = \pi^i\tilde{P}_i = \pi^i\left(\alpha P_i + \frac{1-\alpha}{n_i}J_i\right) \qquad (2.10)$$
$$\Leftrightarrow \frac{n_i}{n}\pi^i = \frac{n_i}{n}\pi^i\left(\alpha P_i + \frac{1-\alpha}{n_i}J_i\right) \qquad (2.11)$$
Setting $i = 1$, we get:
$$\frac{n_1}{n}\pi^1 = \frac{n_1}{n}\pi^1\left(\alpha P_1 + \frac{1-\alpha}{n_1}J_1\right) \qquad (2.12)$$
From (2.9) and (2.12) we have:
$$\frac{n_1}{n}\pi^1\left(\alpha P_1 + \frac{1-\alpha}{n_1}J_1\right) = \frac{n_1}{n}\pi^1\left(\alpha P_1 + \frac{1-\alpha}{n}J_{11}\right) + \frac{n_2}{n}\pi^2\,\frac{1-\alpha}{n}J_{21} + \cdots + \frac{n_k}{n}\pi^k\,\frac{1-\alpha}{n}J_{k1} \qquad (2.13)$$
Since $J_1 = J_{11}$,
$$(2.13) \Leftrightarrow \pi^1 J_{11} = \frac{n_1}{n}\pi^1 J_{11} + \frac{n_2}{n}\pi^2 J_{21} + \cdots + \frac{n_k}{n}\pi^k J_{k1} \qquad (2.14)$$
$$\Leftrightarrow \mathbf{1} = \mathbf{1} \times \sum_{i=1}^{k}\frac{n_i}{n}, \quad \text{in which } \mathbf{1} \text{ is a vector of all 1s} \qquad (2.15)$$
Note that, by our definition, $\sum_{i=1}^{k} n_i = n$, so (2.15) always holds. Applying the same process to component $i$, $i = 2, \ldots, k$, we get the expected result. Consequently, the proposition is true.
With the proposition proved, the first question is answered: CCP is correct. In the next section, we briefly suggest a way of using CCP effectively.
2.2.3 On practical side of CCP
The CCP method can be divided into three main steps:
1. First step: extract the connected components from the whole web graph and generate the corresponding adjacency matrices.
2. Second step: compute each block's eigenvector to obtain its local PageRank vector.
3. Third step: combine these into the final PageRank vector as shown in Proposition 2.1.
A more detailed version of CCP can be found in Algorithm 2.4.
Algorithm 2.4 Connected-component PageRank algorithm
1: function CCP(G)                          ▷ G is the whole Web graph
2:   Construct matrix P                     ▷ P is the adjacency matrix of G
3:   Determine the connected components P_i of G   ▷ i = 1, ..., k
4:   for all P_i do
5:     compute P_i's eigenvector π^i        ▷ P_i is the ith connected component
6:     compute (n_i/n) π^i                  ▷ n_i × n_i is the size of block P_i
7:   end for
8:   π̃_i ← (n_i/n) π^i                     ▷ π̃ = (n_1/n π^1, n_2/n π^2, ..., n_k/n π^k)
9:   return π̃
10: end function
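To make Algorithm 2.4 concrete, the following is a minimal sketch under simplifying assumptions: each block $P_i$ is a dense, row-stochastic NumPy submatrix, and the pagerank and connected_components helpers sketched earlier are reused. It illustrates the combination step of Proposition 2.1, not a production implementation.

# Sketch of CCP: per-component PageRank vectors scaled by n_i / n and
# concatenated into the global vector (reuses pagerank() and
# connected_components() from the earlier sketches).
import numpy as np

def ccp_pagerank(P, adj_undirected, alpha=0.85):
    # P: n x n row-stochastic adjacency matrix of the whole web graph.
    # adj_undirected: the same graph with links treated as undirected,
    # used only to split it into (weakly) connected components.
    n = P.shape[0]
    pi_global = np.zeros(n)
    for comp in connected_components(adj_undirected):
        idx = np.array(sorted(comp))
        P_i = P[np.ix_(idx, idx)]               # block P_i of size n_i x n_i
        pi_i = pagerank(P_i, alpha)             # local eigenvector, eq. (2.5)
        pi_global[idx] = (len(idx) / n) * pi_i  # scale by n_i / n, eq. (2.6)
    return pi_global

Because each local vector π^i sums to 1, the concatenated vector sums to Σ_i n_i/n = 1, matching Proposition 2.1.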
Note that we introduced some advantages in the previous section, but all of them are either obvious or derived from other approaches. Listed here are the major advantages of the proposal over the standard PageRank algorithm.
Advantage 1 The major speedup that this algorithm can obtain comes from one of the main directions in PageRank research – how to fit the PageRank problem into memory. All of the blocks in our crawl are small enough that each block graph fits in main memory, and the vector of ranks for the active block largely fits in the CPU cache. As the full graph does not fit entirely in main memory, the local PageRank iterations (one per block) thus require less disk I/O than the global computations. Indeed, if the crawling process were designed specifically for CCP, rather than simply feeding the crawl to the standard PageRank algorithm, the result would be even better.
Advantage 2 In CCP algorithm, the local PageRank vectors for many blocks
will converge quickly; thus the computations of those blocks may be ter-
minated after only a few iterations. This increases the effectiveness of the
local PageRank computation by allowing it to expend more computation
on slowly converging blocks, and less computation on faster converging
blocks. We will discuss more on this advantage in Chapter 3. In the stan-
dard PageRank algorithm, iterations operate on the whole graph; thus
the convergence bottleneck is largely due to the slowest blocks. Much
computation is wasted recomputing the PageRank of blocks whose local
computation has already converged.
Advantage 3 The local PageRank computations in the second step of CCP can be performed in a completely parallel or distributed fashion. That is, the local PageRanks for each block P_i can be computed on a separate processor or computer. The only communication required is that, at the end of the second step, each computer sends its local PageRank vector to a central computer that computes the global PageRank vector. If our graph consists of n total pages, the net communication cost is 8n bytes (using 8-byte double-precision floating-point values). A naive parallelization of the computation that does not exploit the block structure would require a transfer of 8n bytes after each iteration, a significant penalty.
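As a rough illustration of Advantage 3, the second step of the sketch above can be distributed over a process pool; only the scaled local vectors, about 8n bytes of doubles in total, have to be gathered at the end. The names below are again our own illustrative assumptions, not measured or production code.

# Illustrative parallel variant of the second CCP step (a sketch only;
# reuses pagerank() from the earlier sketch).
from multiprocessing import Pool
import numpy as np

def _local_rank(args):
    P_i, idx, n, alpha = args
    return idx, (len(idx) / n) * pagerank(P_i, alpha)    # scaled local vector

def ccp_pagerank_parallel(P, components, alpha=0.85, workers=4):
    n = P.shape[0]
    jobs = []
    for comp in components:
        idx = np.array(sorted(comp))
        jobs.append((P[np.ix_(idx, idx)], idx, n, alpha))
    pi_global = np.zeros(n)
    with Pool(workers) as pool:                          # one task per block
        for idx, scaled in pool.map(_local_rank, jobs):  # gather ~8n bytes of doubles
            pi_global[idx] = scaled
    return pi_global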
2.3 Spam and spam detection
Web spamming refers to actions intended to mislead search engines into rank-
ing some pages higher than they deserve. Recently, the amount of web spam
has increased dramatically, leading to a degradation of search results. In this section we go through a comprehensive taxonomy of current spamming techniques and some techniques for fighting spam.
We use the term spamming (also spamdexing) to refer to any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for some web page, considering the page's true value. Commonly, the word spam is used to mark all those web objects (page content items or links) that are the result of some form of spamming. People who perform spamming are called spammers [11, 10].
2.3.1 Introduction to Web spam
As more and more people rely on the wealth of information available online,
increased exposure on the WWW may yield significant financial gains for indi-
viduals or organizations. Most frequently, search engines are the entryways to
the Web; that is why some people try to mislead search engines, so that their
pages would rank high in search results, and thus, capture user attention.
Just as with emails, we can talk about the phenomenon of spamming the
Web. The primary consequence of web spamming is that the quality of search
results decreases. For instance, a site may contain only a few lines of useful in-
formation, but consists of thousands of pages, each repeating the same content
and pointing to dozens of other pages. All the pages in that site were probably
created to boost the rankings of some others, and none of them seems to be
particularly useful for anyone looking for useful information. The secondary
consequence of spamming is that search engine indexes are inflated with
useless pages, increasing the cost of each processed query.
To provide low-cost, quality services, it is critical for search engines to address web spam. Search engines currently fight spam with a variety of often manual techniques, but as far as we know, they still lack a fully effective set of tools for combating it. The very first step in combating spam is understanding it, that is, analyzing the techniques the spammers use to mislead search engines. A proper understanding of spamming can then guide the development of appropriate countermeasures. To that end, in [10] web spamming techniques are organized into a taxonomy that can provide a framework for combating spam; [10] also gives very useful information about spam, such as the ways in which spam is created and an in-depth discussion of it. Furthermore, Gyöngyi et al. [10] propose a method, namely TrustRank, to combat web spam. A newly emerged spam technique, blog spam, is handled with Language Model Disagreement in [15]. One can also find details of several specific techniques on the Web itself.
2.3.2 Spam techniques
To build a taxonomy, authors of [10] worked closely with experts at one of
the major search engine companies, relying on their experience, while at the
same time investigating numerous spam instances on their own. As a separate work, [11] agrees with [10] on most ideas. We investigate the common parts of these papers to get a basic understanding of web spam.
There are two categories of techniques associated with web spam. The first category includes the boosting techniques, i.e., methods through which one seeks to achieve high relevance and/or importance for some pages. The second category includes the hiding techniques, methods that by themselves do not influence the search engine's ranking algorithms, but that are used to hide the adopted boosting techniques from the eyes of human web users.
Figure 2.3: Boosting techniques
Term spamming
Term spamming [10], or text spamming [11], is the manipulation of web pages with redundant occurrences of specific terms, or the tailoring of their text fields, in order to obtain higher ranks and make spam pages relevant for some queries. As we know, when a query is issued, its content and the contents of web pages are evaluated to obtain a relevance score. The relevance score is calculated based on term frequencies, term positions and so on (refer to Chapter 1 for more details). Among these, term positions and term frequencies are relatively easy criteria to modify so that pages obtain better ranks. In evaluating relevance, term positions receive particular attention; [10] calls each type of location in which a term can appear a field. The
common text fields for a page p are the document body, the title, the meta
tags in the HTML header, and page p’s URL. In addition, the anchor texts
associated with URLs that point to p are also considered belonging to page p
(anchor text field), since they often describe very well the content of p. The
terms in p’s text fields are used to determine the relevance of p with respect to
a specific query (a group of query terms), often with different weights given to
different fields.
The authors of [10, 11] mention the following techniques, which are grouped based on the text field in which the spamming occurs:
• Body spam: In this case, the spam terms are included in the document
body. This spamming technique is among the simplest and most popular
ones, and it is almost as old as search engines themselves.
• Title spam: Today’s search engines usually give a higher weight to terms
that appear in the title of a document. Hence, it makes sense to include
the spam terms in the document title.
• Meta tag spam: The HTML meta tags that appear in the document header
have always been the target of spamming. Because of the heavy spam-
ming, search engines currently give low priority to these tags, or even
ignore them completely.
• Anchor text spam: Just as with the document title, search engines as-
sign higher weight to anchor text terms, as they are supposed to offer a
summary of the pointed document. Therefore, spam terms are sometimes
included in the anchor text of the HTML hyperlinks to a page. Please note
that this spamming technique is different from the previous ones, in the
sense that the spam terms are added not to a target page itself, but the
other pages that point to the target. As anchor text gets indexed for both
pages, spamming it has impact on the ranking of both the source and
target pages.
• URL spam: URLs of pages may be broken down by search engines in
order to compute relevance scores. Spammers can exploit this by creating
long URLs that contain spam terms.
We have introduced the positions that spammers use to achieve better ranks for their web pages. Now, considering the number of occurrences of terms, we distinguish several approaches:
• Repetition of one or a few specific words in the content of a page. If an issued query matches, this page will receive a higher relevance score.
• Dumping a large number of words so that the page becomes relevant to many different queries.
• Weaving spam terms into copied content. The copied content can be news from a reliable source into which spam terms are randomly inserted.
• Phrase stitching, a technique in which sentences or phrases from many different sources are woven together. With this method, spammers can become relevant to many kinds of queries because the sentences are collected from different sources.
Link spamming
Web spam involves not only term spamming but also link spamming. Many major search engines use link-based algorithms to determine the ranks of web pages, and the most famous of these is PageRank. Many spammers create link structures that exploit link-based algorithms to obtain better positions in search results.
Spammers can use both outgoing and incoming links as tools to achieve their goal. They can add a number of outgoing links to a popular page, or gather many incoming links to a specific page or a group of pages.
Outgoing links
Spammers might add numerous outgoing links to popular pages in the hope of getting better hub scores for their own pages. A spammer can use the directory cloning technique to reach this goal. On the WWW, there are directory sites that contain links on many topics. For example, DMOZ is one of the most famous: we can find content on many topics and subtopics by navigating its link lists. Spammers copy this idea and create their own directory site that resembles a popular directory; this is called directory cloning. In this way, a large number of outgoing links are easily generated.
Incoming links
Spammers might deploy one of the following strategies:
• Using a honey pot as a tool to attract users. A honey pot may be a page that provides free or pirated software, or useful resources in general. This honey pot also contains links to the spam page(s), hidden from users. The more people use the honey pot, the more popular it becomes, and many users may then link to it. When users point their links to the honey pot, they indirectly increase the ranking of the spam page(s). Note that directory sites can also be used as honey pots.
• Infiltrating web directories: many web directories allow people to post links. It may happen that the editors of these directories do not strictly manage the submitted content, so spammers post links to their target pages, which get boosted in both PageRank and authority scores.
• Posting links to blogs, wikis and similar sites. This is a hard problem for search engines to resolve. Links can be placed directly in blog comments, wikis and so on, or can be hidden using anchor text. With this technique, spammers make their target pages more popular.
• Exchanging links: a group of spammers creates a link structure in which pages point to one another in a cycle. Both source and target pages in this strategy rise in PageRank and popularity.
• Using expired domains: when a domain has just expired, spammers buy it. Since the domain has only recently expired, incoming links to it still exist; moreover, spammers can repopulate the old domain with various tricks. They then create links from this expired domain to their target pages, so the target pages take advantage of the old links pointing to the expired domain.
• Creating spam farms: spammers nowadays control many sites and can create link structures that boost the ranking of target pages.
Link spamming is used more and more because the PageRank algorithm is still deployed by Google. For that reason, many approaches have appeared to deal with this problem.
Conclusion of chapter
This chapter has discussed some problems that arise with PageRank computation. The central problem is the acceleration problem, which touches many fields. We can benefit from storage techniques to reduce the time taken for PageRank computation. Besides, we can take advantage of the structure of the web to speed up the computation. Furthermore, numerical methods give us many promising approaches for which we only have to slightly modify the way of programming to get a good result.
In the next chapter, experimental results are presented so that we have a clearer picture of how PageRank and the other methods behave.
CHAPTER 3
Implementations and experimental results
Nowadays, a newly obtained result in any field of informatics can hardly gain acceptance without experimental figures, and this thesis is no exception. In this chapter, an open-source search engine, Nutch, is introduced. We used Nutch, with some modifications and additions to its code, to crawl data; the obtained data was then processed to get the experimental results.
3.1 Search engine Nutch
As we all know, building a search engine is very hard because of the enormous amount of work that needs to be done; that is why an effective search engine such as Google is so successful in the market. But commercial effort tends towards making more and more profit, not towards academic purposes. How can researchers who want to study search engines run their experiments? The answer may be Nutch [2]. This section concentrates on some general information about Nutch and the work needed to utilize Nutch for our purpose.
Nutch is an open source Java implementation of a search engine. It provides
all of the tools you need to run your own search engine. But why would anyone
want to run their own search engine? After all, there is always Google. There
are at least three reasons.
1. Transparency: Nutch is open source, so anyone can see how the ranking algorithms work. With commercial search engines, the precise details of the algorithms are secret, so you can never know why a particular search result is ranked as it is. Furthermore, some search engines allow rankings to be based on payments rather than on the relevance of the site's contents. Nutch is a good fit for academic and government organizations, where the perception of fairness of rankings may be more important.
2. Understanding: we do not have the source code of Google, so Nutch is probably the best we have. It is interesting to see how a large search engine works. Nutch has been built using ideas from academia and industry: for instance, core parts of Nutch are currently being reimplemented to use the MapReduce distributed processing model, which emerged from Google Labs last year. Nutch is also attractive for researchers who want to try out new search algorithms, since it is so easy to extend.
3. Extensibility: do not like the way other search engines display their results? Write your own search engine using Nutch! Nutch is very flexible: it can be customized and incorporated into your application. For developers, Nutch is a great platform for adding search to heterogeneous collections of information, for customizing the search interface, and for extending the out-of-the-box functionality through the plug-in mechanism. For example, you can integrate it into your site to add a search capability.
Nutch installations typically operate at one of three scales: local file system,
intranet, or whole web. All three have different characteristics. For instance,
crawling a local file system is reliable compared to the other two, since network
errors do not occur and caching copies of the page content is unnecessary (and
actually a waste of disk space). Whole web crawling lies at the other extreme.
Crawling billions of pages creates a whole host of engineering problems to be
solved: which pages do we start with? How do we partition the work between
a set of crawlers? How often do we recrawl? How do we cope with broken links,
unresponsive sites, and unintelligible or duplicate content? There is another
set of challenges to solve to deliver scalable search–how do we cope with hun-
dreds of concurrent queries on such a large dataset?
This part describes the architecture of Nutch. Nutch has a highly
modular architecture that uses plug-in APIs for media-type parsing, HTML
analysis, data retrieval protocols, and queries. The core has four major compo-
nents:
• Searcher: given a query, it must quickly find a small relevant subset of a corpus of documents, then present them. Finding a large relevant subset is normally done with an inverted index of the corpus; ranking within that set produces the most relevant documents, which then must be summarized for display.
• Indexer: creates the inverted index from which the searcher extracts results. It uses Lucene to store its indexes.
• Database: stores the document contents for indexing and later summarization by the searcher, along with information such as the link structure of the document space and the time each document was last fetched.
• Fetcher: requests web pages, parses them, and extracts links from them. Nutch's robot has been written entirely from scratch.
Figure 3.1 outlines the relationships between elements that refer to each other, placing them in the same box, and those they depend on in a lower layer.
For example, protocol does not depend on net, because protocol is only an
interface point to plug-ins that actually provide much of Nutch’s functionality.
Crawling
An intranet or niche search engine might only take a single machine a few
hours to crawl, while a whole-web crawl might take many machines several
weeks or longer. A single crawling cycle consists of generating a fetch list from
the webdb, fetching those pages, parsing those for links, then updating the
Figure 3.1: Nutch packages dependencies
webdb. Nutch’s crawler supports both a crawl-and-stop and crawl-and-stop-
with-threshold (which requires feedback from scoring and specifying a floor).
It also uses a uniform refresh policy; all pages are refetched at the same inter-
val (30 days, by default) regardless of how frequently they change. The fetching
process must also respect bandwidth and other limitations of the target web-
site. However, any polite solution requires coordination before fetching; Nutch
uses the most straightforward localization of references possible: namely, mak-
ing all fetches from a particular host run on one machine.
Indexing text
Lucene meets the scalability requirements for text indexing in Nutch. Nutch
also takes advantage of Lucene’s multi-field case-folding keyword and phrase
search in URLs, anchor text, and document text. The typical definition of Web
search does not require some index types Lucene does not address, such as
regular-expression matching (using, say, suffix trees) or document-similarity
clustering (using, say, term-vectors).
Indexing hypertext
Lucene provides an inverted-file full-text index, which suffices for indexing
text but not the additional tasks required by a web search engine. In addition
to this, Nutch implements a link database to provide efficient access to the
Web’s link graph, and a page database that stores crawled pages for indexing,
summarizing, and serving to users, as well as supporting other functions such
as crawling and link analysis. Nutch also has to add support for HTML extrac-
tion to Lucene. When a page is fetched, any embedded links are considered
for addition to the fetch list and that link’s anchor text is also stored. What
is eventually indexed by Lucene is only the text of a Web page, though: while
a high-fidelity copy is stored in the page database, font, heading, and other
structural information do not propagate to the Lucene indexing layer.
Removing duplicates
The Nutch dedup command eliminates duplicate documents from a set of
Lucene indices for Nutch segments, so it inherently requires access to all the
segments at once. It’s a batch-mode process that has to be run before running
searches to prevent the search from returning duplicate documents. It uses
temporary files containing the 5-tuple (MD5 hash, float score, int indexID, int docID, int urlLen) for each page. Presumably the documents with the same URLs were fetched
at different times, so Nutch tries to sort the records so that the ones for the
newest fetches come first. A second pass, using hash=MD5(content) and with
slightly different sorting rules, eliminates multiple URLs for the same docu-
ment from the index.
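The following is a simplified, hypothetical illustration of this two-pass idea, keeping the newest fetch per URL hash and then collapsing identical content hashes. It is not Nutch's actual dedup code, and the record fields are invented for the example.

# Hypothetical two-pass de-duplication sketch (not Nutch's implementation).
from dataclasses import dataclass

@dataclass
class Record:
    url_hash: str       # MD5 of the URL
    content_hash: str   # MD5 of the page content
    fetch_time: float   # newer fetches should win
    doc_id: int

def dedup(records):
    # Pass 1: for each URL, keep only the most recently fetched record.
    newest = {}
    for r in records:
        if r.url_hash not in newest or r.fetch_time > newest[r.url_hash].fetch_time:
            newest[r.url_hash] = r
    # Pass 2: among the survivors, keep one record per content hash.
    unique = {}
    for r in newest.values():
        unique.setdefault(r.content_hash, r)
    return list(unique.values())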
Link analysis
Nutch includes a link analysis algorithm similar to PageRank. It is per-
formed by the DistributedAnalysisTool; even the single-machine LinkAnaly-
sisTool merely calls into it. Nutch has already demonstrated the ability to
harness multiple servers to compute link ranking for 100M-page subsets of
the World Wide Web. Distributed link analysis is a bulk synchronous parallel
process. At the beginning of each phase, the list of URLs whose scores must
be updated is divided up into many chunks; in the middle, many processes
produce score-edit files by finding all the links into pages in their particular
chunk. At the end, an updating phase reads the score-edit files one at a time,
merging their results into new scores for the pages in the web database.
Because of its promising features, Nutch can be used for academic purposes as previously indicated. In the next section, we discuss in detail the experiments done with trivial PageRank and CCP.
3.2 Experiments
Nowadays, almost all papers in the field of informatics need experimental figures to show the correctness of a new idea or its advantages over other methods. This section covers the experiments done by the author of the thesis; through the figures, CCP shows itself to be a promising approach that reduces the computation time and the number of iterations needed for PageRank computation.
3.2.1 Infrastructure and data sets
All experiments were done on the IBM Cluster 1350 at the Center for High Performance Computing, Hanoi University of Science. The IBM cluster consists of 8 computing nodes, each with 2 Xeon processors at 3.2 GHz and 2 GB of
RAM. Additionally, the total storage capacity that the cluster can use is about
1TB.
Web pages were downloaded from many sources. These sources are the web sites of universities from all over the world: from the US, the UK and Singapore. Some main sites are from the University of California at Los Angeles, Emporia University and Albany University (US); Hull University, Liverpool University and Lancaster Management University (UK); and Nanyang Technological University (Singapore).
The sets of pages, after de-duplication, contain roughly 1 million pages in total. Not all pages belong to the same data set: we divided the pages into 3 sets according to 3 download runs.
3.2.2 Results
We demonstrate the results in the following way: first, visualizations of all the matrices are shown (connected components and the whole matrix); next, the ranks and the differences between the two methods are discussed; finally, the iterations and time needed are illustrated to show that CCP is the better method.

Set #   Pages from
1       ASTON, UEL, STIR, NTU universities
2       HULL, GRE, LMU, LIU universities
3       LUTON, HUD, LMU, LIU universities
Table 3.1: Each set with corresponding sources

Set #   No. of pages   No. of links
1       129,606        661,687
2       128,009        714,773
3       88,308         612,187
Table 3.2: Each set with corresponding number of pages and links
In Figure 3.2, we draw the matrices from the first set. Figure 3.2(a) shows each component of the whole set, including ASTON, UEL, STIR and NTU. Figure 3.2(b) shows the arranged adjacency matrix.

[Figure 3.2: Matrices from set 1. (a) Small matrices of the individual components: Aston (158,766 links), Uel (143,241 links), Stir (182,459 links), Ntu (177,221 links). (b) Arranged adjacency matrix of the whole set (661,687 links).]

Looking into the figures, we note that every component has a common characteristic: nearly all links are concentrated in the lower part. This phenomenon is quite easy to understand: every site has a primary main page that points to other minor main pages, and from these secondary pages links are distributed to lower pages in the hierarchical tree. For example, as shown in the component of Aston university, Aston has an index page from which we can get to other pages. This main page only links to pages that are representative of member schools, departments or offices. In general, the main page does not link to very minor pages. Pages at lower levels of the hierarchical tree are often minor pages, but not all pages at the lower levels are connected to each other, so there are no links between them. That causes the matrix to be dense underneath.
Figure 3.3 gives the ranks of the web pages computed by the trivial PageRank algorithm and by CCP.
[Figure 3.3: Ranks from PageRank vs. CCP for set 1. (a) Ranks with PageRank; (b) Ranks with CCP.]
Figure 3.4 gives the ranks of the web pages in each block and their differences for set 1. The differences are determined by subtracting the ranks obtained from PageRank and CCP.

[Figure 3.4: Ranks and differences corresponding to each block for set 1. (a) Ranks with CCP for each block; (b) Differences between PageRank and CCP.]
Figures 3.5 and 3.6 show statistics on the time and the number of iterations needed to compute the ranks with the two methods, trivial PageRank and CCP, for set 1.

[Figure 3.5: Time needed to compute the ranks of PageRank and CCP with different decay values for set 1. (a) Computation time of the PageRank method; (b) Computation time of the CCP method. Curves are plotted for ε = 10^-8, 10^-7, 10^-6, 10^-5, 10^-4.]
[Figure 3.6: Number of iterations needed to compute the ranks of PageRank and CCP with different decay values for set 1. (a) Iterations needed for the PageRank method; (b) Iterations needed for the CCP method. Curves are plotted for ε = 10^-8, 10^-7, 10^-6, 10^-5, 10^-4.]
In Figure 3.7, we draw the matrices from the second set. Figure 3.7(a) shows each component of the whole set, including HULL, GRE, LMU and LIU. Figure 3.7(b) shows the arranged adjacency matrix.
[Figure 3.7: Matrices from set 2. (a) Small matrices of the individual components: Gre (167,722 links), Hull (205,644 links), Lmu (180,950 links), Liu (160,457 links). (b) Arranged adjacency matrix of the whole set (714,773 links).]
Figure 3.8 gives the ranks of the web pages computed by the trivial PageRank algorithm and by CCP.
[Figure 3.8: Ranks from PageRank vs. CCP for set 2. (a) Ranks with PageRank; (b) Ranks with CCP.]
Figure 3.9 gives the ranks of the web pages in each block and their differences for set 2.

[Figure 3.9: Ranks and differences corresponding to each block for set 2. (a) Ranks with CCP for each block; (b) Differences between PageRank and CCP.]
Figures 3.10 and 3.11 show statistics on the time and the number of iterations needed to compute the ranks with the two methods, trivial PageRank and CCP, for set 2.
[Figure 3.10: Time needed to compute the ranks of PageRank and CCP with different decay values for set 2. (a) Computation time of the PageRank method; (b) Computation time of the CCP method. Curves are plotted for ε = 10^-8, 10^-7, 10^-6, 10^-5, 10^-4.]
[Figure 3.11: Number of iterations needed to compute the ranks of PageRank and CCP with different decay values for set 2. (a) Iterations needed for the PageRank method; (b) Iterations needed for the CCP method. Curves are plotted for ε = 10^-8, 10^-7, 10^-6, 10^-5, 10^-4.]
In Figure 3.12, we draw the matrices from the third set. Figure 3.12(a) shows each component of the whole set, including Luton, Hud, Lmu and Liu. Figure 3.12(b) shows the arranged adjacency matrix.
Figure 3.13 gives the ranks of the web pages computed by the trivial PageRank algorithm and by CCP.
[Figure 3.12: Matrices from set 3. (a) Small matrices of the individual components: Luton (153,394 links), Hud (117,331 links), Lmu (180,950 links), Liu (160,457 links). (b) Arranged adjacency matrix of the whole set (612,187 links).]
[Figure 3.13: Ranks from PageRank vs. CCP for set 3. (a) Ranks with PageRank; (b) Ranks with CCP.]
Figure 3.14 gives the ranks of the web pages in each block and their differences for set 3.
[Figure 3.14: Ranks and differences corresponding to each block for set 3. (a) Ranks with CCP for each block; (b) Differences between PageRank and CCP for each block.]
Figures 3.15 and 3.16 show statistics on the time and the number of iterations needed to compute the ranks with the two methods, trivial PageRank and CCP, for set 3.
[Figure 3.15: Time needed to compute the ranks of PageRank and CCP with different decay values for set 3. (a) Computation time of the PageRank method; (b) Computation time of the CCP method. Curves are plotted for ε = 10^-8, 10^-7, 10^-6, 10^-5, 10^-4.]
[Figure 3.16: Number of iterations needed to compute the ranks of PageRank and CCP with different decay values for set 3. (a) Iterations needed for the PageRank method; (b) Iterations needed for the CCP method. Curves are plotted for ε = 10^-8, 10^-7, 10^-6, 10^-5, 10^-4.]
As the figures show, CCP really cost less than trivial PageRank in both the time and the number of iterations needed for computation. In Figure 3.17, we can see the mean time and the mean number of iterations for each approach over the 3 sets above.
[Figure 3.17: Mean time and mean number of iterations needed by the two methods over the 3 sets with different decay values. (a) Mean time needed for the 3 sets; (b) Mean number of iterations needed for the 3 sets. Curves are plotted for ε = 10^-8, 10^-7, 10^-6, 10^-5, 10^-4.]
Conclusion of chapter
This chapter has concentrated on the practical side of CCP, with the results of the experiments given. As demonstrated in the figures, CCP is a promising approach that makes use of connected components in computing PageRank. The method needs not only less time but also fewer iterations for each data set; overall, the effort decreases by a cut-off of [???percent]. Furthermore, this approach promises a good opportunity to parallelize one of the biggest problems that a search engine has to deal with – PageRank computation.
Conclusions and future works
This thesis addresses a newly emerging but fast-developing issue in the field of information retrieval: the evaluation of ranks among objects. Searching for relevant information in a huge collection returns a large number of results, especially when searching the Internet with a search engine. Unless a search engine has a good ranking algorithm to sort web pages, it hardly deserves the name.
A ranking algorithm involves many disciplines, from mathematical knowledge to informatics topics such as data storage, compression techniques and algorithm design. Google, one of the best search engines, introduced PageRank, a ranking algorithm that utilizes link analysis. Because PageRank is effective, many studies have been devoted to improving it. These accelerating methods really help speed up PageRank, but none of them uses connected components as its core.
This thesis has proposed a new method that makes use of the connected components of the Web graph. The CCP method is guaranteed by a mathematical proof and is shown to be more efficient than trivial PageRank. In addition to the theory, many experiments were made to strengthen this statement. The experimental results show that CCP runs [???percent] faster, in which the time is reduced by about [???percent] and the iterations are [???percent] fewer.
It would be remiss not to mention future work. With CCP, there are many refinements that can still be made. First of all, CCP must consider the differences between blocks and try to balance them. Another aspect of PageRank that CCP can deal with is spam: we talked about link farms in Chapter 2, and it would be a nice fit for CCP if connected components were used to tackle the mess that spam brings. Last but not least is the parallelization of CCP. CCP is very natural to parallelize, and the author of the thesis intends to make a parallelized version of CCP.
References
[1] Google search engine:
[2] NUTCH project:
[3] A. Arasu, J. Novak, A. Tomkins, and J. Tomlin. PageRank computation
and the structure of the Web: Experiments and algorithms. In Proceed-
ings of the Eleventh International World Wide Web Conference, 2002.
[4] Arvind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke, and Sriram Raghavan. Searching the web. ACM Transactions on Internet Technology, 1:2–43, 2001.
[5] P. Boldi and S. Vigna. WebGraph framework: Compression techniques. In
Proceedings of the Thirteenth International World Wide Web Conference,
2004.
[6] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web
search engine. Computer networks and ISDN systems, 30:107–117, 1998.
[7] A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. L. Wiener. Graph structure in the Web. Computer Networks, 33:309–320, 2000.
[8] J. Cho and S. Roy. Impact of Web search engines on page popularity. In
Proceedings of the Thirteenth International World Wide Web Conference,
2004.
[9] D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study
of the evolution of Web pages. In Proceedings of the Twelfth International
World Wide Web Conference, 2003.
[10] Zoltán Gyöngyi and Hector Garcia-Molina. Web spam taxonomy. Techni-
cal report, Stanford University, 2005.
[11] Monika R. Henzinger, Rajeev Motwani, and Craig Silverstein. Challenges
in Web search engines. Technical report, Google Inc., Oct. 2002.
[12] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Extrap-
olation methods for accelerating PageRank computations. In Proceedings
of the Twelfth International World Wide Web Conference, 2003.
[13] Sepandar Kamvar, Taher Haveliwala, and Gene Golub. Adaptive methods
for the computation of PageRank. Special issue on the conference on the
Numerical solution of Markov chains, pages 51–65, 2003.
[14] Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, and
Gene H. Golub. Exploiting the block structure of the Web for computing
PageRank. Technical report, Stanford University, 2003.
[15] Gilad Mishne, David Carmel, and Ronny Lempel. Blocking blog spam
with Language Model Disagreement. In Proceedings of AIRWeb ’05, 2005.
[16] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The
PageRank citation ranking: Bringing order to the Web. Technical report,
Stanford University, 1998.
[17] Sheldon Ross. Introduction to probability models, 8th Edition. Academic
Press, 2003.
[18] Robert Sedgewick. Algorithms. Addison-Wesley, 1983.
[19] Ha Quang Thuy, Nguyen Hoai Nam, and Nguyen Thu
Trang. Improve performance of PageRank computation with
Connected Component PageRank. In Proceedings of the 1st ICMOCCA
Conference, pages 154–160. IEEE Seoul Section, Aug. 2006.