Introduction
Human history has witnessed many great advances, but few are comparable to the growth of networks in general and of the WWW in particular.
During the last decades, databases in large stores and companies have grown exponentially. In the old days, seekers struggled to find enough data to satisfy their needs. The picture has changed, and now the reverse problem arises daily: how to find relevant data within the information that keeps accumulating. The need therefore arises for better approaches that can handle complex models in a reasonable amount of time, because the old models in statistics, machine learning and traditional data analysis fail to cope with this level of complexity. This field is called data mining, and Web mining is one of its important branches.
The key inspiration for Web mining is the growth of the WWW. There is no doubt that the WWW is the most valuable source of information, but it is not easy to use if you do not know your way around it. Several studies have tried to estimate the size of the Web, and most of them agree that it contains billions of pages. Additionally, the rate at which the Web changes is even more dramatic [4]. According to [4], the size of the Web has doubled in less than two years, and this growth rate is projected to continue for the next two years. Although an enormous amount of information is available on the Web, you can hardly find the appropriate information without any support. For that reason, you need a search engine to make your work easier.
We can roughly consider search engines as places that store information and, after some processing, return what we need. Everything returned by a search engine is sorted in descending order of relevance, meaning that entries appearing first are expected to be more relevant. Although this seems natural, much effort has gone into making this step work well, and there are many methods, following various approaches, that deal with the problem. The task is to determine each page's importance score in order to sort it within the list of pages [16, 6]. The most famous approach is PageRank(TM) from Google [1]. This thesis focuses on PageRank and on modifications that speed up its computation.
With the aim of understanding network phenomena and, especially, PageRank computation, this thesis is organized as follows:

• Chapter 1: Presents the rank concept and PageRank. This chapter is the background on which the following chapters build.

• Chapter 2: Deals with some PageRank-related problems, such as the acceleration problem and spam. A new method, which we have published [19], is also proposed in this chapter (section 2.2).

• Chapter 3: Concentrates on the practical side of PageRank with implementations and experiments. Conclusions and future work are also mentioned in this chapter.
Contents
List of Figures
List of Tables
Introduction
Acknowledgement
Abstract
List of Glossaries
1 Objects' ranks and applications to WWW
  1.1 Rank of objects
    1.1.1 Rank for objects different from web pages
    1.1.2 Linkage usage in search engines
  1.2 Basis of PageRank
    1.2.1 Mathematical basis
    1.2.2 Practical issues
  Conclusion of chapter
2 Some PageRank-related problems
  2.1 Accelerating problems
    2.1.1 Related works
    2.1.2 Exploiting block structure of the Web
  2.2 Connected-component PageRank approach
    2.2.1 Initial ideas
    2.2.2 Mathematical basis of CCP
    2.2.3 On the practical side of CCP
  2.3 Spam and spam detection
    2.3.1 Introduction to Web spam
    2.3.2 Spam techniques
  Conclusion of chapter
3 Implementations and experimental results
  3.1 Search engine Nutch
  3.2 Experiments
    3.2.1 Infrastructure and data sets
    3.2.2 Results
  Conclusion of chapter
Conclusions and future works
References
in block I. That is, if P is the web graph and l_i is the local PageRank of page i in block I, then the weight of edge B_{IJ} is given by:

B_{IJ} = \sum_{i \in I,\, j \in J} P_{ij}\, l_i
They wrote this in matrix notation. Define the local PageRank matrix L to be the n × k matrix whose columns are the local PageRank vectors l_J:

L =
\begin{pmatrix}
l_1 & 0 & \cdots & 0 \\
0 & l_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & l_k
\end{pmatrix}
Define the matrix S to be the n × k matrix that has the same structure as L,
but whose nonzero entries are all replaced by 1. Then the block matrix B is
the k × k matrix:
B = L^{T} P S
Notice that B is a transition matrix in which the element B_{IJ} represents the transition probability from block I to block J. That is:

B_{IJ} = p(J \mid I)

Once we have the k × k transition matrix B, we may use the standard PageRank algorithm on this reduced matrix to compute the BlockRank vector b. In this setting, each block is treated as a web page, and the PageRank algorithm is applied just as it is for web pages.
Using relative importance of blocks for computing PageRank
At this point, there are two sets of rankings: one set contains the relative ranks of web pages within each block, and the other contains the relative ranks of the blocks in the web graph. Within each block J, we have the local PageRanks l_J of the pages in the block. We also have the BlockRank vector b whose elements b_J are the BlockRank of each block J, measuring the relative importance of the blocks. We may now approximate the global PageRank of a page j ∈ J as its local PageRank l_j, weighted by the BlockRank b_J of the block in which it resides. That is,

\pi^{(0)}_j = l_j \times b_J

In matrix notation:

\pi^{(0)} = L \times b

and \pi^{(0)} is used as the start vector for the regular PageRank algorithm.
Conclusion of approach
More specifically, this approach is presented in Algorithm 2.2.
Algorithm 2.2 Specific version of BlockRank
1: procedure BLOCKRANK(G)                ▷ G is the whole Web graph
2:   Sort the web graph lexicographically   ▷ to induce intra-host blocks
3:   for all blocks J do                  ▷ compute the local PageRank vector of block J
4:     l_J = PageRank(J)
5:   end for
6:   B = L^T G S                          ▷ B is a k × k matrix
7:   b = PageRank(B)                      ▷ B is understood as a graph of blocks
8:   \pi^{(0)} = L × b
9:   \pi = PageRank(G)                    ▷ \pi^{(0)} is used as the start vector
10:  return \pi
11: end procedure
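As a concrete illustration of Algorithm 2.2, the following is a minimal Python/NumPy sketch under simplifying assumptions: the graph is small enough to be held as a dense row-stochastic matrix, blocks are given as lists of node indices, dangling rows are handled crudely, and PageRank is a plain power iteration. It illustrates the structure of the algorithm, not the implementation used in [14].

```python
import numpy as np

def pagerank(P, alpha=0.85, tol=1e-8, max_iter=1000, start=None):
    """Power iteration for the left principal eigenvector of a row-stochastic
    matrix P. Dangling-node mass is ignored for brevity."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n) if start is None else start.copy()
    for _ in range(max_iter):
        new = alpha * (pi @ P) + (1.0 - alpha) / n
        if np.abs(new - pi).sum() < tol:
            return new
        pi = new
    return pi

def blockrank(P, blocks, alpha=0.85):
    """Dense sketch of Algorithm 2.2.
    P      : n x n row-stochastic web-graph matrix
    blocks : list of index arrays, one per block (e.g. the pages of one host)"""
    n, k = P.shape[0], len(blocks)
    L = np.zeros((n, k))            # local PageRank vectors, one column per block
    S = np.zeros((n, k))            # same pattern as L, nonzero entries set to 1
    for J, idx in enumerate(blocks):
        P_J = P[np.ix_(idx, idx)].copy()
        row_sums = P_J.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1.0        # crude handling of dangling rows
        P_J /= row_sums                      # make the block row-stochastic
        L[idx, J] = pagerank(P_J, alpha)     # local PageRank l_J
        S[idx, J] = 1.0
    B = L.T @ P @ S                 # k x k block transition matrix
    b = pagerank(B, alpha)          # BlockRank vector
    pi0 = L @ b                     # approximate global PageRank, used as start vector
    return pagerank(P, alpha, start=pi0)     # final standard PageRank run, warm-started
```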
This approach has shown that the hyperlink graph of the web has a nested
block structure, something that has not yet been thoroughly investigated in
studies of the web. This structure is exploited to compute PageRank in a fast manner using an algorithm called BlockRank.

Figure 2.1: Convergence rates for standard PageRank (solid line) vs. BlockRank (dotted line). The x-axis is the number of iterations, and the y-axis is the log of the L1-residual.

As shown in Figure 2.1, BlockRank speeds up PageRank computations by factors of 2 and
higher, depending on the particular scenario. There are a number of areas
for future work: finding the “best” blocks for BlockRank by splitting up what
would be slow-mixing blocks with internal nested block structure; using the
block structure for hyperlink-based algorithms other than web search, such as
in clustering or classification; and exploring more fully the topics of updates
and personalized PageRank.
2.2 Connected-component PageRank approach
In this section, we discuss the advantages of using connected components, how the connected-component concept is applied to the computation of PageRank, and we prove the corresponding theoretical results. We propose an approach that utilizes the connected-component idea from graph theory. The proposal is called CCP, short for "Connected Component PageRank".
2.2.1 Initial ideas
While studying the Markov model [17], we pay special attention to the fact that the states of a Markov chain can be divided into classes: states that communicate with each other are grouped into one class. This concept is similar to the concept of connected components in graph theory. Moreover, as mentioned, the web graph is represented by an adjacency matrix. This gives us a chance to apply the connected-component concept to PageRank computation.
The proposed algorithm is given as follows.
Algorithm 2.3 General steps of the CCP algorithm
1: function CCP(G)                       ▷ G is the Web graph
2:   Split the web into blocks by connected components
3:   Compute the local PageRanks for each block
4:   Form an approximate global PageRank vector π for web pages by combining the eigenvectors of the blocks
5: end function
Figure 2.2 illustrates the idea of CCP.
Figure 2.2: Unarranged adjacency matrix (a) and arranged adjacency matrix (b)
By applying connected components to PageRank computation, we gain the following advantages.

1. Regarding computational complexity: our method follows the trend of reducing the cost needed to compute PageRank. We already know that the time needed to multiply a vector by a matrix is O(n^2), where n is the size of the common dimension. For that reason, if we compute PageRank by constructing the adjacency matrix of the whole web, we face a hard problem when n reaches billions. By using connected components, we can reduce the time needed. Here are some fundamental details about how connected components are used in the PageRank problem. Assume that our web graph has k connected components, so we have k small adjacency matrices (which we call blocks) obtained from the matrix P. Let n_max be the largest number of pages in any block. For every block, the time needed to compute its PageRank vector is at most O(n_max^2), which is much smaller than for the whole web graph. Over all blocks, the complexity cannot exceed k × O(n_max^2), which is smaller than O(n^2), where n is the size of the graph. For example, if n = 10^6 pages split into k = 100 components of n_max = 10^4 pages each, then k · n_max^2 = 10^10, a hundred times smaller than n^2 = 10^12. So, on the theoretical side, our method is more efficient than trivial PageRank.
2. Regarding the convenience of finding connected components: [18] provides many effective algorithms for finding the connected components of a graph. The time needed to determine the connected components is O(m + n), where m is the number of edges and n is the number of vertices. This additional time is negligible compared to the computation above.
3. Regarding combination with other accelerating methods: we can obtain even better performance if we run our method in hybrid mode with other accelerating methods. In each block, we can use MAP or the extrapolation method to reduce the computational time for that block, so the total time saved is considerable.
4. In terms of programming techniques, [14] describes a way of splitting the web-graph matrix into small parts for parallel computation. Using our method, PageRank computation can be parallelized easily: in the simplest case, we can assign each block to one processor, which is very natural. There are certainly many ways to parallelize the PageRank work, and CCP is a very promising one.
Note that, so far, these advantages have not been backed by theoretical proof or practical experiments.
If we split the web-graph matrix into smaller blocks corresponding to connected components, we have to answer these questions:

1. Is the CCP method correct?
2. How do we use the method effectively?

We concentrate on a more detailed description of the CCP method in the following sections, 2.2.2 and 2.2.3.
2.2.2 Mathematical basis of CCP
How do we use connected components in computing the PageRank vector? Firstly, we repeat the PageRank computation as stated in Chapter 1, but with a slight modification. Suppose that we have a web graph G with corresponding adjacency matrix P. We do not remove the leak nodes; instead, we define

\tilde{P} = \alpha P + \frac{1-\alpha}{n} J \qquad (2.1)
32
2.2 Connected-component PageRank approach
The principal eigenvector of \tilde{P} is then computed from

\pi = \pi \tilde{P} \qquad (2.2)
Secondly, let k be the number of connected components in G, and let P_i be the n_i × n_i adjacency matrix corresponding to the i-th component, with \sum_{i=1}^{k} n_i = n. Thus, we can re-arrange P as follows:

P =
\begin{pmatrix}
P_1 & 0 & \cdots & 0 \\
0 & P_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & P_k
\end{pmatrix} \qquad (2.3)
We define the matrices:

\tilde{P}_i = \alpha P_i + \frac{1-\alpha}{n_i} J_i, \qquad i = 1, \ldots, k, \quad J_i \ \text{of size}\ n_i \times n_i \qquad (2.4)
For each block P_i, i = 1, \ldots, k, the principal eigenvector satisfies:

\pi_i = \pi_i \tilde{P}_i \qquad (2.5)
Proposition 2.1. From (2.1), (2.2), (2.3), (2.4) and (2.5),

\tilde{\pi} = \left( \frac{n_1}{n}\pi_1, \frac{n_2}{n}\pi_2, \ldots, \frac{n_k}{n}\pi_k \right) \qquad (2.6)

is an eigenvector of the matrix \tilde{P}.
Proof. To prove that \tilde{\pi} is an eigenvector of the matrix \tilde{P}, we have to prove that \tilde{\pi} = \left( \frac{n_1}{n}\pi_1, \frac{n_2}{n}\pi_2, \ldots, \frac{n_k}{n}\pi_k \right) satisfies equation (2.2). Substituting \tilde{\pi} into equation (2.2), we have:

\tilde{\pi} = \tilde{\pi}\tilde{P}
            = \tilde{\pi}\left[ \alpha P + \frac{1-\alpha}{n} J \right]
            = \tilde{\pi}\left[ \alpha
\begin{pmatrix}
P_1 & 0 & \cdots & 0 \\
0 & P_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & P_k
\end{pmatrix}
+ \frac{1-\alpha}{n} J \right] \qquad (2.7)
\Leftrightarrow \left( \frac{n_1}{n}\pi_1, \frac{n_2}{n}\pi_2, \ldots, \frac{n_k}{n}\pi_k \right)
= \left( \frac{n_1}{n}\pi_1, \frac{n_2}{n}\pi_2, \ldots, \frac{n_k}{n}\pi_k \right) \times
\begin{pmatrix}
\alpha P_1 + \frac{1-\alpha}{n} J_{11} & \frac{1-\alpha}{n} J_{12} & \cdots & \frac{1-\alpha}{n} J_{1k} \\
\frac{1-\alpha}{n} J_{21} & \alpha P_2 + \frac{1-\alpha}{n} J_{22} & \cdots & \frac{1-\alpha}{n} J_{2k} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{1-\alpha}{n} J_{k1} & \frac{1-\alpha}{n} J_{k2} & \cdots & \alpha P_k + \frac{1-\alpha}{n} J_{kk}
\end{pmatrix} \qquad (2.8)

in which J_{ij} is the n_i \times n_j matrix of all 1s.
Multiplying out the right side of equation (2.8) and considering the first component, we have:

\frac{n_1}{n}\pi_1 = \frac{n_1}{n}\pi_1\left( \alpha P_1 + \frac{1-\alpha}{n} J_{11} \right) + \frac{n_2}{n}\pi_2 \frac{1-\alpha}{n} J_{21} + \cdots + \frac{n_k}{n}\pi_k \frac{1-\alpha}{n} J_{k1} \qquad (2.9)
For every block P_i, the PageRank \pi_i of that block satisfies equation (2.5), that means:

\pi_i = \pi_i \tilde{P}_i = \pi_i\left( \alpha P_i + \frac{1-\alpha}{n_i} J_i \right) \qquad (2.10)

\Leftrightarrow \frac{n_i}{n}\pi_i = \frac{n_i}{n}\pi_i\left( \alpha P_i + \frac{1-\alpha}{n_i} J_i \right) \qquad (2.11)
Letting i = 1, we get:

\frac{n_1}{n}\pi_1 = \frac{n_1}{n}\pi_1\left( \alpha P_1 + \frac{1-\alpha}{n_1} J_1 \right) \qquad (2.12)
From (2.9) and (2.12) we have:

\frac{n_1}{n}\pi_1\left( \alpha P_1 + \frac{1-\alpha}{n_1} J_1 \right)
= \frac{n_1}{n}\pi_1\left( \alpha P_1 + \frac{1-\alpha}{n} J_{11} \right) + \frac{n_2}{n}\pi_2 \frac{1-\alpha}{n} J_{21} + \cdots + \frac{n_k}{n}\pi_k \frac{1-\alpha}{n} J_{k1} \qquad (2.13)
Since J_1 = J_{11}, it follows that

(2.13) \Leftrightarrow \pi_1 J_{11} = \frac{n_1}{n}\pi_1 J_{11} + \frac{n_2}{n}\pi_2 J_{21} + \cdots + \frac{n_k}{n}\pi_k J_{k1} \qquad (2.14)

\Leftrightarrow \mathbf{1}_{1 \times n_1} = \mathbf{1}_{1 \times n_1} \times \sum_{i=1}^{k} \frac{n_i}{n} \qquad (2.15)

in which \mathbf{1}_{1 \times n_1} is the row vector of all 1s (each \pi_i sums to 1, so \pi_i J_{i1} = \mathbf{1}_{1 \times n_1}).
34
2.2 Connected-component PageRank approach
Note that, by definition, \sum_{i=1}^{k} n_i = n, so (2.15) always holds.
Applying the same process to each component i, i = 2, \ldots, k, we get the expected result. Consequently, the proposition is true.
With the proposition proved, the first question is answered: CCP is correct. In the next section, we briefly suggest a way of using CCP effectively.
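As a quick numerical sanity check of Proposition 2.1 (a sketch only, with arbitrary illustrative block sizes), the following Python snippet builds a small block-diagonal transition matrix, computes each block's eigenvector from (2.5), assembles the combined vector as in (2.6), and verifies that it satisfies equation (2.2):

```python
import numpy as np

def google_matrix(P, alpha=0.85):
    """P_tilde = alpha*P + (1-alpha)/n * J as in (2.1)/(2.4), J the all-ones matrix."""
    n = P.shape[0]
    return alpha * P + (1.0 - alpha) / n * np.ones((n, n))

def principal_left_eigvec(M, tol=1e-13, max_iter=10000):
    """Power iteration for pi = pi @ M, normalised so that pi sums to 1."""
    n = M.shape[0]
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = pi @ M
        new /= new.sum()
        if np.abs(new - pi).sum() < tol:
            break
        pi = new
    return new

def random_stochastic(n, rng):
    A = rng.random((n, n))
    return A / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
sizes = [4, 6, 5]                         # illustrative n_i values
n = sum(sizes)

P = np.zeros((n, n))                      # block-diagonal P as in (2.3)
pieces, pos = [], 0
for ni in sizes:
    Pi = random_stochastic(ni, rng)
    P[pos:pos + ni, pos:pos + ni] = Pi
    # per-block eigenvector from (2.5), scaled by n_i/n as in (2.6)
    pieces.append(ni / n * principal_left_eigvec(google_matrix(Pi)))
    pos += ni

pi_tilde = np.concatenate(pieces)
P_tilde = google_matrix(P)
# pi_tilde should be (numerically) invariant under P_tilde, as Proposition 2.1 states
print(np.abs(pi_tilde @ P_tilde - pi_tilde).max())    # prints a value near machine precision
```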
2.2.3 On practical side of CCP
The CCP method can be divided into three main steps, as follows:

1. First step: extract the connected components from the whole web graph and generate the corresponding adjacency matrices.
2. Second step: compute each block's eigenvector to get its local PageRank vector.
3. Third step: combine the final PageRank vector as shown in Proposition 2.1.
A more detailed version of CCP can be found in Algorithm 2.4.
Algorithm 2.4 Connected-component PageRank algorithm
1: function CCP(G)                          ▷ G is the whole Web graph
2:   Construct matrix P                     ▷ P is the adjacency matrix of G
3:   Determine the connected components P_i in G    ▷ i = 1, ..., k
4:   for all P_i do
5:     compute P_i's eigenvector \pi_i      ▷ P_i is the i-th connected component
6:     compute (n_i/n) \pi_i                ▷ n_i × n_i is the size of block P_i
7:   end for
8:   \tilde{\pi} ← (n_1/n \pi_1, n_2/n \pi_2, ..., n_k/n \pi_k)
9:   return \tilde{\pi}
10: end function
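The following is a minimal Python sketch of Algorithm 2.4, assuming SciPy's connected_components routine for step 1 and assuming each component is small enough to be densified; it is an illustration of the idea, not the code used for the experiments in Chapter 3.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def block_pagerank(P_block, alpha=0.85, tol=1e-8, max_iter=1000):
    """pi_i = pi_i @ P_tilde_i with P_tilde_i = alpha*P_i + (1-alpha)/n_i * J_i, as in (2.4)-(2.5)."""
    n = P_block.shape[0]
    P_tilde = alpha * P_block + (1.0 - alpha) / n * np.ones((n, n))
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = pi @ P_tilde
        new /= new.sum()                  # keep pi a probability vector
        if np.abs(new - pi).sum() < tol:
            return new
        pi = new
    return pi

def ccp(adjacency, alpha=0.85):
    """Sketch of Algorithm 2.4; blocks are assumed small enough to densify."""
    A = csr_matrix(adjacency)
    n = A.shape[0]
    # Step 1: split the graph into (weakly) connected components
    k, labels = connected_components(A, directed=True, connection='weak')
    pi_tilde = np.zeros(n)
    for c in range(k):
        idx = np.flatnonzero(labels == c)
        block = A[idx][:, idx].toarray().astype(float)
        out = block.sum(axis=1, keepdims=True)
        out[out == 0] = 1.0               # crude handling of pages without out-links
        block /= out                      # row-stochastic block P_i
        # Steps 2-3: local eigenvector, scaled by n_i/n as in Proposition 2.1
        pi_tilde[idx] = len(idx) / n * block_pagerank(block, alpha)
    return pi_tilde
```

A production implementation would keep the blocks sparse and handle dangling nodes properly; the dense per-block matrix here is only for readability.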
Note that the advantages introduced in the previous section are either obvious or carried over from other approaches. Listed here are the major advantages of the proposal over the standard PageRank algorithm.
Advantage 1 The major speedup that this algorithm can obtain comes from one of the main directions in PageRank research: how to fit the PageRank problem into memory. All of the blocks in our crawl are small enough that each block graph fits in main memory, and the vector of ranks for the active block largely fits in the CPU cache. As the full graph does not fit entirely in main memory, the local PageRank iterations (one block at a time) require less disk I/O than the global computation. Indeed, if the crawling process were specially designed for CCP, the resulting input would give even better results than it does for the standard PageRank algorithm.
Advantage 2 In the CCP algorithm, the local PageRank vectors for many blocks
will converge quickly; thus the computations of those blocks may be ter-
minated after only a few iterations. This increases the effectiveness of the
local PageRank computation by allowing it to expend more computation
on slowly converging blocks, and less computation on faster converging
blocks. We will discuss more on this advantage in Chapter 3. In the stan-
dard PageRank algorithm, iterations operate on the whole graph; thus
the convergence bottleneck is largely due to the slowest blocks. Much
computation is wasted recomputing the PageRank of blocks whose local
computation has already converged.
Advantage 3 The local PageRank computations in the second step of CCP can be performed in a completely parallel or distributed fashion. That is, the local PageRanks for each block P_i can be computed on a separate processor or computer. The only communication required is that, at the end of the second step, each computer sends its local PageRank vector to a central computer that computes the global PageRank vector. If our graph consists of n total pages, the net communication cost is 8n bytes (using 8-byte double-precision floating-point values). A naive parallelization of the computation that does not exploit the block structure would require a transfer of 8n bytes after each iteration, a significant penalty.
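Advantage 3 describes a processing pattern rather than specific code; purely as an illustration, the per-block computations could be farmed out with Python's standard multiprocessing module. The function names and the list-of-blocks input format below are assumptions made for the sketch.

```python
import numpy as np
from multiprocessing import Pool

def block_pagerank(P_block, alpha=0.85, tol=1e-8, max_iter=1000):
    """Dense power iteration on one block's P_tilde_i, as in (2.4)-(2.5)."""
    n = P_block.shape[0]
    P_tilde = alpha * P_block + (1.0 - alpha) / n * np.ones((n, n))
    pi = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new = pi @ P_tilde
        new /= new.sum()
        if np.abs(new - pi).sum() < tol:
            return new
        pi = new
    return pi

def solve_block(task):
    """Worker: one block's local PageRank, scaled by n_i / n."""
    idx, P_block, n_total, alpha = task
    return idx, len(idx) / n_total * block_pagerank(P_block, alpha)

def parallel_ccp(blocks, n_total, alpha=0.85, workers=4):
    """blocks: list of (index_array, dense_block_matrix) pairs, one per component."""
    tasks = [(idx, P_b, n_total, alpha) for idx, P_b in blocks]
    pi_tilde = np.zeros(n_total)
    with Pool(workers) as pool:
        # only the final scaled local vectors travel back: roughly 8*n bytes in total
        for idx, scaled in pool.map(solve_block, tasks):
            pi_tilde[idx] = scaled
    return pi_tilde
```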
2.3 Spam and spam detection
Web spamming refers to actions intended to mislead search engines into rank-
ing some pages higher than they deserve. Recently, the amount of web spam
has increased dramatically, leading to a degradation of search results. We will
get through a comprehensive taxonomy of current spamming techniques and
some fighting-spam techniques in this section.
We use the term spamming (also, spamdexing) to refer to any deliberate
human action that is meant to trigger an unjustifiably favorable relevance
or importance for some web page, considering the page's true value. Commonly, the word spam is used to refer to all those web objects (page content items or links) that are the result of some form of spamming. People who perform spamming are called spammers [11, 10].
2.3.1 Introduction to Web spam
As more and more people rely on the wealth of information available online,
increased exposure on the WWW may yield significant financial gains for indi-
viduals or organizations. Most frequently, search engines are the entryways to
the Web; that is why some people try to mislead search engines, so that their
pages would rank high in search results, and thus, capture user attention.
Just as with emails, we can talk about the phenomenon of spamming the
Web. The primary consequence of web spamming is that the quality of search
results decreases. For instance, a site may contain only a few lines of useful in-
formation, but consists of thousands of pages, each repeating the same content
and pointing to dozens of other pages. All the pages in that site were probably
created to boost the rankings of some others, and none of them seems to be
particularly useful for anyone looking for relevant information. The secondary consequence of spamming is that search engine indexes are inflated with useless pages, increasing the cost of each processed query.
To provide low-cost, quality services, it is critical for search engines to ad-
dress web spam. Search engines currently fight spam with a variety of often
manual techniques, but as far as we know, they still lack a fully effective set
of tools for combating it. The very first step in combating spam is understanding it, that is, analyzing the techniques spammers use to mislead search engines. A proper understanding of spamming can then guide the development of appropriate countermeasures. To that end, in [10], web spamming techniques are organized into a taxonomy that can provide a framework for combating spam; [10] also gives very useful information about spam, such as the ways spam is created, along with an in-depth discussion. Furthermore, Gyöngyi et al. [10] propose a method, namely TrustRank, to combat web spam. A more recent spam technique, blog spam, is addressed with Language Model Disagreement in [15]. One can also find details of several specific techniques on the Web itself.
2.3.2 Spam techniques
To build a taxonomy, the authors of [10] worked closely with experts at one of the major search engine companies, relying on their experience, while at the same time investigating numerous spam instances on their own. As a separate work, [11] agrees with [10] on most ideas. We will investigate the common parts of these papers to get a basic understanding of web spam.
There are two categories of techniques associated with web spam. The first
category includes the boosting techniques, i.e., methods through which one
seeks to achieve high relevance and/or importance for some pages. The second
category includes hiding techniques, methods that by themselves do not influence the search engine's ranking algorithms, but that are used to hide the adopted boosting techniques from the eyes of human web users.
Figure 2.3: Boosting techniques

Term spamming

Term spamming [10], or text spamming [11], is the practice of manipulating web pages by repeating specific terms to obtain higher ranks, or of tailoring text fields in order to make spam pages relevant for certain queries. As noted, when a query is issued, its content and the contents of web pages are evaluated to obtain a relevance score. The relevance score is calculated from term frequencies, term positions and so on (refer to Chapter 1 for more details). Among these, term positions and term frequencies are relatively easy criteria to modify so that pages obtain higher ranks. In evaluating relevance, term positions receive particular attention; [10] calls each type of location in which a term can appear a field. The
common text fields for a page p are the document body, the title, the meta
tags in the HTML header, and page p’s URL. In addition, the anchor texts
associated with URLs that point to p are also considered belonging to page p
(anchor text field), since they often describe very well the content of p. The
terms in p’s text fields are used to determine the relevance of p with respect to
a specific query (a group of query terms), often with different weights given to
different fields.
The authors of [10, 11] mention the following techniques, grouped by the text field in which the spamming occurs:
• Body spam: In this case, the spam terms are included in the document
body. This spamming technique is among the simplest and most popular
ones, and it is almost as old as search engines themselves.
• Title spam: Today’s search engines usually give a higher weight to terms
that appear in the title of a document. Hence, it makes sense to include
the spam terms in the document title.
• Meta tag spam: The HTML meta tags that appear in the document header
have always been the target of spamming. Because of the heavy spam-
ming, search engines currently give low priority to these tags, or even
ignore them completely.
• Anchor text spam: Just as with the document title, search engines as-
sign higher weight to anchor text terms, as they are supposed to offer a
summary of the pointed document. Therefore, spam terms are sometimes
included in the anchor text of the HTML hyperlinks to a page. Please note
that this spamming technique is different from the previous ones, in the
sense that the spam terms are added not to a target page itself, but the
other pages that point to the target. As anchor text gets indexed for both
pages, spamming it has impact on the ranking of both the source and
target pages.
• URL spam: URLs of pages may be broken down by search engines in
order to compute relevance scores. Spammers can exploit this by creating long URLs that contain spam terms.
We have introduced the positions that spammers use to achieve better ranks for their web pages. Now, considering the number of occurrences of terms, we distinguish several techniques:

• Repetition of one or a few specific words in the content of a page. If an issued query matches, the page receives a higher relevance score.
• Dumping a large number of words so that the page becomes relevant to many different queries.
• Weaving spam terms into copied content. The copied content can be news from a reliable source into which spam terms are randomly inserted.
• Phrase stitching, a technique in which sentences or phrases from many different sources are woven together. With this method, spammers can achieve relevance to many kinds of queries, because the sentences are collected from different sources.
Link spamming
Web spam involves not only term spamming but also link spamming. Many major search engines use link-based algorithms to determine the ranks of web pages, and the most famous is PageRank. Many spammers create link structures that exploit link-based algorithms to obtain better positions in the search results.
Spammers can use both outgoing and incoming links as tools to reach their goal. They can add numerous outgoing links to a popular page, or gather many incoming links to a specific page or group of pages.
Outgoing links
Spammers might add numerous outgoing links to popular pages in the hope of getting better hub scores for their own pages. A spammer can use the directory cloning technique to reach this goal. On the WWW there are directory sites that contain links on many topics; for example, DMOZ is one of the most famous, in which content on many topics and subtopics can be found by navigating link lists. Spammers copy this idea and create their own directory site that resembles a popular directory site; this is called directory cloning. In this way, a large number of outgoing links is easily generated.
Incoming links
Spammers might deploy one of the following strategies:
• Using a honey pot as a tool to attract users. A honey pot may be a page that provides free, pirated software or other useful resources in general. The honey pot also contains links to the spam page(s), and these links are hidden from users. The more people use the honey pot, the more popular it becomes, so many users may want to link to it. When users point their links to the honey pot, they indirectly increase the ranking of the spam page(s). Note that directory sites can be used as honey pots, too.
• Infiltrating web directories: Many web directories allow people to post links in their directories, and it may happen that the editors of these directories do not strictly review the submitted content. Spammers can therefore post links to their target pages to boost both PageRank and authority scores.
• Posting links to blogs, wikis, etc. This is a hard problem for search engines to resolve. Links can be placed directly in blog comments or wikis, or can be hidden using anchor text. With this technique, spammers make their target pages more popular.
• Exchanging links: A group of spammers creates a link structure in which pages point to one another in cycles. Both source and target pages in this strategy gain in PageRank and popularity.
• Using expired domains: When a domain has just expired, spammers buy it. Since the domain expired only recently, incoming links to it still exist, and spammers can further populate the old domain with various tricks. They then create links from the expired domain to their target pages, so the target pages benefit from the old links pointing to the expired domain.
• Creating spam farms: Spammers nowadays control many sites and can create link structures that boost the ranking of their target pages.

Link spamming is used more and more because the PageRank algorithm is still deployed by Google. For that reason, many approaches have appeared to deal with this growing problem.
Conclusion of chapter
This chapter has discussed some problems that arise with PageRank computation. The central problem is the acceleration problem, which touches many fields. We can benefit from storage techniques to reduce the time taken by the PageRank computation; we can also take advantage of the structure of the web to speed up the computation. Furthermore, numerical methods give us many promising approaches in which only small changes to the implementation yield good results.
In the next chapter, experimental results are presented to give a clearer picture of how PageRank and the other methods behave.
CHAPTER 3
Implementations and experimental results
Nowadays, a new result in any field of informatics can hardly gain acceptance without experimental figures, and this thesis is no exception. In this chapter, an open-source search engine, Nutch, is introduced. We used Nutch, with some modifications and additions to its code, to crawl data; the obtained data was then processed to produce the experimental results.
3.1 Search engine Nutch
As we all know, building a search engine is very hard because of the enormous amount of work that needs to be done; that is why an effective search engine such as Google is so successful in the market. But commercial effort is directed towards making profit, not towards academic purposes. How can researchers who want to study search engines run their experiments? The answer may be Nutch [2]. This section concentrates on some general information about Nutch and on the work needed to utilize Nutch for our purpose.
Nutch is an open source Java implementation of a search engine. It provides
all of the tools you need to run your own search engine. But why would anyone
want to run their own search engine? After all, there is always Google. There
are at least three reasons.
1. Transparency: Nutch is open source, so anyone can see how the ranking algorithms work. With commercial search engines, the precise details of the algorithms are secret, so you can never know why a particular search result is ranked as it is. Furthermore, some search engines allow rankings to be based on payments rather than on the relevance of the site's contents. Nutch is a good fit for academic and government organizations, where the perception of fairness of rankings may be more important.

2. Understanding: We do not have the source code of Google, so Nutch is probably the best we have. It is interesting to see how a large search engine works. Nutch has been built using ideas from academia and industry: for instance, core parts of Nutch are currently being reimplemented to use the MapReduce distributed processing model, which emerged from Google Labs. And Nutch is attractive for researchers who want to try out new search algorithms, since it is so easy to extend.

3. Extensibility: Do not like the way other search engines display their results? Write your own search engine using Nutch! Nutch is very flexible: it can be customized and incorporated into your application. For developers, Nutch is a great platform for adding search to heterogeneous collections of information, for customizing the search interface, or for extending the out-of-the-box functionality through the plug-in mechanism. For example, you can integrate it into your site to add a search capability.
Nutch installations typically operate at one of three scales: local file system,
intranet, or whole web. All three have different characteristics. For instance,
crawling a local file system is reliable compared to the other two, since network
errors do not occur and caching copies of the page content is unnecessary (and
actually a waste of disk space). Whole web crawling lies at the other extreme.
Crawling billions of pages creates a whole host of engineering problems to be
solved: which pages do we start with? How do we partition the work between
a set of crawlers? How often do we recrawl? How do we cope with broken links,
unresponsive sites, and unintelligible or duplicate content? There is another
set of challenges to solve to deliver scalable search–how do we cope with hun-
dreds of concurrent queries on such a large dataset?
This portion describes the architecture of Nutch. Nutch has a highly modular architecture that uses plug-in APIs for media-type parsing, HTML analysis, data retrieval protocols, and queries. The core has four major components:

• Searcher: Given a query, it must quickly find a small relevant subset of a corpus of documents, then present them. Finding a large relevant subset is normally done with an inverted index of the corpus; ranking within that set produces the most relevant documents, which then must be summarized for display.

• Indexer: Creates the inverted index from which the searcher extracts results. It uses Lucene to store its indexes.

• Database: Stores the document contents for indexing and later summarization by the searcher, along with information such as the link structure of the document space and the time each document was last fetched.

• Fetcher: Requests web pages, parses them, and extracts links from them. Nutch's robot has been written entirely from scratch.
Figure 3.1 outlines the relationships between elements: those that refer to each other are placed in the same box, and those they depend on appear in a lower layer. For example, protocol does not depend on net, because protocol is only an interface point for plug-ins that actually provide much of Nutch's functionality.
Crawling
An intranet or niche search engine might only take a single machine a few
hours to crawl, while a whole-web crawl might take many machines several
weeks or longer. A single crawling cycle consists of generating a fetch list from
the webdb, fetching those pages, parsing them for links, then updating the webdb.

Figure 3.1: Nutch package dependencies

Nutch's crawler supports both a crawl-and-stop and crawl-and-stop-
with-threshold (which requires feedback from scoring and specifying a floor).
It also uses a uniform refresh policy; all pages are refetched at the same inter-
val (30 days, by default) regardless of how frequently they change. The fetching
process must also respect bandwidth and other limitations of the target web-
site. However, any polite solution requires coordination before fetching; Nutch
uses the most straightforward localization of references possible: namely, mak-
ing all fetches from a particular host run on one machine.
Indexing text
Lucene meets the scalability requirements for text indexing in Nutch. Nutch also takes advantage of Lucene's multi-field, case-folding keyword and phrase search in URLs, anchor text, and document text. There are index types that Lucene does not address, such as regular-expression matching (using, say, suffix trees) or document-similarity clustering (using, say, term vectors), but these are not required by the typical definition of Web search.
Indexing hypertext
Lucene provides an inverted-file full-text index, which suffices for indexing
text but not the additional tasks required by a web search engine. In addition
to this, Nutch implements a link database to provide efficient access to the
Web’s link graph, and a page database that stores crawled pages for indexing,
summarizing, and serving to users, as well as supporting other functions such
as crawling and link analysis. Nutch also has to add support for HTML extrac-
tion to Lucene. When a page is fetched, any embedded links are considered
for addition to the fetch list and that link’s anchor text is also stored. What
is eventually indexed by Lucene is only the text of a Web page, though: while
a high-fidelity copy is stored in the page database, font, heading, and other
structural information do not propagate to the Lucene indexing layer.
Removing duplicates
The Nutch dedup command eliminates duplicate documents from a set of
Lucene indices for Nutch segments, so it inherently requires access to all the
segments at once. It’s a batch-mode process that has to be run before running
searches to prevent the search from returning duplicate documents. It uses
temporary files containing the 5-tuple ⟨MD5 hash, float score, int indexID, int docID, int urlLen⟩
for each page. Presumably the documents with the same URLs were fetched
at different times, so Nutch tries to sort the records so that the ones for the
newest fetches come first. A second pass, using hash=MD5(content) and with
slightly different sorting rules, eliminates multiple URLs for the same docu-
ment from the index.
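The description above can be read as a two-pass sort-and-keep-first procedure. The sketch below mirrors that logic on in-memory records; it is an illustration of the idea only, not Nutch's actual dedup code, and the record fields and ordering rules are simplified assumptions.

```python
from collections import namedtuple

# Simplified stand-in for the temporary records described above (illustrative only;
# field names and ordering rules are assumptions, not Nutch's actual format).
Rec = namedtuple("Rec", "url_hash content_hash score index_id doc_id fetch_time")

def dedup(records):
    # Pass 1: keep one record per URL hash, preferring the newest fetch
    newest_per_url = {}
    for r in sorted(records, key=lambda r: r.fetch_time, reverse=True):
        newest_per_url.setdefault(r.url_hash, r)
    # Pass 2: keep one record per content hash, preferring the highest score
    best_per_content = {}
    for r in sorted(newest_per_url.values(), key=lambda r: r.score, reverse=True):
        best_per_content.setdefault(r.content_hash, r)
    return list(best_per_content.values())
```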
Link analysis
Nutch includes a link analysis algorithm similar to PageRank. It is per-
formed by the DistributedAnalysisTool; even the single-machine LinkAnaly-
sisTool merely calls into it. Nutch has already demonstrated the ability to
harness multiple servers to compute link ranking for 100Mpage subsets of
the World Wide Web. Distributed link analysis is a bulk synchronous parallel
process. At the beginning of each phase, the list of URLs whose scores must
be updated is divided up into many chunks; in the middle, many processes
produce score-edit files by finding all the links into pages in their particular
chunk. At the end, an updating phase reads the score-edit files one at a time,
merging their results into new scores for the pages in the web database.
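The bulk-synchronous pattern described here (partition the target URLs, let each worker emit score edits for its chunk, then merge the edit files) can be sketched generically. The function and variable names below are illustrative assumptions, not the API of Nutch's DistributedAnalysisTool.

```python
def link_analysis_phase(scores, outdeg, inlinks, chunks, damping=0.85):
    """One bulk-synchronous phase: each chunk of target URLs yields a
    'score-edit' list, and a final pass merges the edits into new scores."""
    edit_files = []
    for chunk in chunks:                       # in a real run, one process per chunk
        edits = []
        for url in chunk:
            contrib = sum(scores[src] / outdeg[src] for src in inlinks.get(url, []))
            edits.append((url, (1.0 - damping) + damping * contrib))
        edit_files.append(edits)
    new_scores = dict(scores)
    for edits in edit_files:                   # updating phase, one edit file at a time
        for url, score in edits:
            new_scores[url] = score
    return new_scores
```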
Because of its promising features, Nutch can be used for academic purposes
as previously indicated. In the next section, we discuss in detail the experiments done with trivial PageRank and CCP.
3.2 Experiments
Nowadays, almost all papers in the field of informatics need experimental figures to show the correctness of a new idea or its advantages over other methods. This section covers the experiments done by the author of the thesis; the figures show that CCP is a promising approach that reduces both the computational time and the number of iterations needed in PageRank computation.
3.2.1 Infrastructure and data sets
All experiments were done on the IBM Cluster 1350 of the Center for High Performance Computing, Hanoi University of Science. The cluster consists of 8 computing nodes, each with 2 Xeon processors at 3.2 GHz and 2 GB of RAM. The total storage capacity available to the cluster is about 1 TB.

Web pages were downloaded from many sources: web sites of universities from all over the world, from the US, the UK and Singapore. Some of the main sites are from the University of California at Los Angeles, Emporia University and Albany University (US); Hull University, Liverpool University and Lancaster Management University (UK); and Nanyang Technological University (Singapore).

The full set of pages, after de-duplication, contains roughly 1 million pages. Not all pages belong to the same data set; we divided the pages into 3 sets according to the 3 download attempts.
3.2.2 Results
We will proceed in the following way: first, visualizations of all the matrices are shown (connected components and the whole matrix); next, the ranks and the differences between the two methods of computing PageRank are discussed; finally, the number of iterations and the time needed are illustrated to show that CCP is the better one.

Set #   Pages from
1       ASTON, UEL, STIR, NTU universities
2       HULL, GRE, LMU, LIU universities
3       LUTON, HUD, LMU, LIU universities

Table 3.1: Each set with its corresponding sources

Set #   No. of pages   No. of links
1       129,606        661,687
2       128,009        714,773
3        88,308        612,187

Table 3.2: Each set with its corresponding number of pages and links
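The measurements behind the figures that follow can be reproduced in spirit with a small harness that sweeps the decay factor and the tolerance ε, counting iterations and wall-clock time for a given transition matrix. The sketch below is a generic Python illustration, not the thesis's actual measurement code.

```python
import time
import numpy as np

def power_iteration(P_tilde, eps, max_iter=10000):
    """Returns (pi, iterations) for the fixed point of pi = pi @ P_tilde."""
    n = P_tilde.shape[0]
    pi = np.full(n, 1.0 / n)
    for it in range(1, max_iter + 1):
        new = pi @ P_tilde
        if np.abs(new - pi).sum() < eps:
            return new, it
        pi = new
    return pi, max_iter

def sweep(P, alphas=(0.70, 0.75, 0.80, 0.85, 0.90),
          epsilons=(1e-4, 1e-5, 1e-6, 1e-7, 1e-8)):
    """Measure iterations and wall-clock time for each (decay, tolerance) pair."""
    n = P.shape[0]
    results = []
    for alpha in alphas:
        P_tilde = alpha * P + (1.0 - alpha) / n * np.ones((n, n))
        for eps in epsilons:
            t0 = time.perf_counter()
            _, iters = power_iteration(P_tilde, eps)
            results.append((alpha, eps, iters, time.perf_counter() - t0))
    return results
```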
In Figure 3.2, we draw the matrices from the first set. Figure 3.2(a) shows each component of the whole set, namely ASTON (158,766 links), UEL (143,241 links), STIR (182,459 links) and NTU (177,221 links); Figure 3.2(b) shows the arranged adjacency matrix (661,687 links in total).

Figure 3.2: Matrices from set 1. (a) Small matrices; (b) arranged adjacency matrix.

Looking at the figures, we note that every component has a common characteristic: nearly all links are concentrated in the lower part. This phenomenon is easy to understand. Every site has a primary main page that points to other, minor main pages, and from these secondary pages links are distributed to lower pages in the hierarchical tree. For example, as shown in the component of Aston University, Aston has an index page from which we can get to other pages. This main page only links to pages which
are representatives of member schools, departments or offices. In general, the main page does not link to very minor pages. Pages at the lower levels of the hierarchical tree are often minor pages, and not all of them are connected to each other, so there are few links among them. That causes the matrix to be dense in its lower part.
Figure 3.3 gives the ranks of web pages computed by the trivial PageRank algorithm and by CCP.

Figure 3.3: Ranks from PageRank vs. CCP for set 1. (a) Ranks with PageRank; (b) ranks with CCP.
Figure 3.4 gives the ranks of web pages in each block and their differences for set 1. Differences are determined by subtracting the ranks obtained from PageRank and CCP.

Figure 3.4: Ranks and differences corresponding to each block for set 1. (a) Ranks with CCP for each block; (b) differences between PageRank and CCP.
Figures 3.5 and 3.6 show statistics on the time and the number of iterations needed to compute ranks with the two methods, trivial PageRank and CCP, for set 1.

Figure 3.5: Time needed to compute ranks of PageRank and CCP with different decay values for set 1. (a) Computation time of the PageRank method; (b) computation time of the CCP method. (x-axis: decay value 0.70-0.90; one curve per tolerance ε = 10^-4, ..., 10^-8.)
Figure 3.6: Number of iterations needed to compute ranks of PageRank and CCP with different decay values for set 1. (a) Iterations for the PageRank method; (b) iterations for the CCP method. (x-axis: decay value 0.70-0.90; one curve per tolerance ε = 10^-4, ..., 10^-8.)
In Figure 3.7, we draw the matrices from the second set. Figure 3.7(a) shows each component of the whole set, namely GRE (167,722 links), HULL (205,644 links), LMU (180,950 links) and LIU (160,457 links); Figure 3.7(b) shows the arranged adjacency matrix (714,773 links in total).

Figure 3.7: Matrices from set 2. (a) Small matrices; (b) arranged adjacency matrix.
Figure 3.8 gives the ranks of web pages computed by the trivial PageRank algorithm and by CCP.

Figure 3.8: Ranks from PageRank vs. CCP for set 2. (a) Ranks with PageRank; (b) ranks with CCP.
Figure 3.9 gives the ranks of web pages in each block and their differences for set 2.

Figure 3.9: Ranks and differences corresponding to each block for set 2. (a) Ranks with CCP for each block; (b) differences between PageRank and CCP.
Figures 3.10 and 3.11 show statistics on the time and the number of iterations needed to compute ranks with the two methods, trivial PageRank and CCP, for set 2.

Figure 3.10: Time needed to compute ranks of PageRank and CCP with different decay values for set 2. (a) Computation time of the PageRank method; (b) computation time of the CCP method. (x-axis: decay value 0.70-0.90; one curve per tolerance ε = 10^-4, ..., 10^-8.)
Figure 3.11: Number of iterations needed to compute ranks of PageRank and CCP with different decay values for set 2. (a) Iterations for the PageRank method; (b) iterations for the CCP method. (x-axis: decay value 0.70-0.90; one curve per tolerance ε = 10^-4, ..., 10^-8.)
In Figure 3.12, we draw the matrices from the third set. Figure 3.12(a) shows each component of the whole set, namely LUTON (153,394 links), HUD (117,331 links), LMU (180,950 links) and LIU (160,457 links); Figure 3.12(b) shows the arranged adjacency matrix (612,187 links in total).
Figure 3.12: Matrices from set 3. (a) Small matrices; (b) arranged adjacency matrix.

Figure 3.13 gives the ranks of web pages computed by the trivial PageRank algorithm and by CCP.

Figure 3.13: Ranks from PageRank vs. CCP for set 3. (a) Ranks with PageRank; (b) ranks with CCP.
Figure 3.14 gives the ranks of web pages in each block and their differences for set 3.

Figure 3.14: Ranks and differences corresponding to each block for set 3. (a) Ranks with CCP for each block; (b) differences between PageRank and CCP for each block.
Figures 3.15 and 3.16 show statistics on the time and the number of iterations needed to compute ranks with the two methods, trivial PageRank and CCP, for set 3.

Figure 3.15: Time needed to compute ranks of PageRank and CCP with different decay values for set 3. (a) Computation time of the PageRank method; (b) computation time of the CCP method. (x-axis: decay value 0.70-0.90; one curve per tolerance ε = 10^-4, ..., 10^-8.)
Figure 3.16: Number of iterations needed to compute ranks of PageRank and CCP with different decay values for set 3. (a) Iterations for the PageRank method; (b) iterations for the CCP method. (x-axis: decay value 0.70-0.90; one curve per tolerance ε = 10^-4, ..., 10^-8.)
As these figures show, CCP really costs less than trivial PageRank in both the time and the number of iterations needed for the computation. Figure 3.17 shows the mean time and the mean number of iterations for each approach over the 3 sets above.
Figure 3.17: Mean time and mean number of iterations needed by the two methods over the 3 sets, with different decay values. (a) Mean time needed for the 3 sets; (b) mean number of iterations needed for the 3 sets. (x-axis: decay value 0.70-0.90; one curve per tolerance ε = 10^-4, ..., 10^-8.)
Conclusion of chapter
This chapter has concentrated on the practical side of CCP and presented experimental results. As demonstrated in the figures, CCP is a promising approach that makes use of connected components in computing PageRank. The method needs not only less time but also fewer iterations for each data set; overall, the total effort decreases by a [???percent] cut-off. Furthermore, this approach promises a good opportunity to parallelize one of the biggest problems that a search engine has to deal with – PageRank computation.
Conclusions and future works
This thesis addresses an emerging and fast-developing issue in the field of information retrieval: rank evaluation among objects. Searching for relevant information in a huge collection returns a large number of results, especially when searching the Internet with a search engine. Unless a search engine has a good ranking algorithm to sort web pages, it hardly deserves the name.

Ranking algorithms involve many disciplines, from mathematics to informatics topics such as data storage, compression techniques and algorithm design. Google, one of the best search engines, introduced PageRank, a ranking algorithm that utilizes link analysis. Because PageRank is effective, many studies have tried to make it better. These accelerating methods really do help speed up PageRank, but none of them uses connected components at its core.

This thesis has proposed a new method that makes use of the connected components of the Web graph. The CCP method is guaranteed by a mathematical proof and is shown to be more efficient than trivial PageRank. In addition to the theoretical side, many experiments were carried out to strengthen this statement. The experimental results show that CCP runs [???percent] faster, with time reduced by about [???percent] and iterations reduced by [???percent].

It would be a fault not to mention future work. With CCP, there are many refinements that can still be made. First of all, CCP must consider the differences between blocks and try to balance them. Another aspect of PageRank that CCP can deal with is spam: we discussed link farms in section 2.3, and CCP would be a good fit if connected components were used to untangle the mess that spam creates. Last but not least is the parallelization of CCP: CCP is very natural to parallelize, and the author of the thesis intends to make a parallelized version of it.
References
[1] Google search engine:
[2] NUTCH project:
[3] A. Arasu, J. Novak, A. Tomkins, and J. Tomlin. PageRank computation
and the structure of the Web: Experiments and algorithms. In Proceed-
ings of the Eleventh International World Wide Web Conference, 2002.
[4] Avrind Arasu, Junghoo Cho, Hector Garcia-Molina, Andreas Paepcke,
and Sriram Raghavan. Searching the web. ACM Transaction on Inter-
net Technology, 1:2–43, 2001.
[5] P. Boldi and S. Vigna. WebGraph framework: Compression techniques. In
Proceedings of the Thirteenth International World Wide Web Conference,
2004.
[6] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web
search engine. Computer networks and ISDN systems, 30:107–117, 1998.
[7] A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan,
R. Stata, A. Tomkins, and J. L. Wiener. Graph structure in the Web. Com-
puter Networks, 33:309–320, 2000.
[8] J. Cho and S. Roy. Impact of Web search engines on page popularity. In
Proceedings of the Thirteenth International World Wide Web Conference,
2004.
[9] D. Fetterly, M. Manasse, M. Najork, and J. L. Wiener. A large-scale study
of the evolution of Web pages. In Proceedings of the Twelfth International
World Wide Web Conference, 2003.
[10] Zoltán Gyöngyi and Hector Garcia-Molina. Web spam taxonomy. Techni-
cal report, Stanford University, 2005.
[11] Monika R. Henzinger, Rajeev Motwani, and Craig Silverstein. Challenges
in Web search engines. Technical report, Google Inc., Oct. 2002.
[12] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Extrap-
olation methods for accelerating PageRank computations. In Proceedings
of the Twelfth International World Wide Web Conference, 2003.
[13] Sepandar Kamvar, Taher Haveliwala, and Gene Golub. Adaptive methods
for the computation of PageRank. Special issue on the conference on the
Numerical solution of Markov chains, pages 51–65, 2003.
[14] Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, and
Gene H. Golub. Exploiting the block structure of the Web for computing
PageRank. Technical report, Stanford University, 2003.
[15] Gilad Mishne, David Carmel, and Ronny Lempel. Blocking blog spam
with Language Model Disagreement. In Proceedings of AIRWeb ’05, 2005.
[16] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The
PageRank citation ranking: Bringing order to the Web. Technical report,
Stanford University, 1998.
[17] Sheldon Ross. Introduction to probability models, 8th Edition. Academic
Press, 2003.
[18] Robert Sedgewick. Algorithms. Addison - Wesley, 1983.
[19] Ha Quang Thuy, Nguyen Hoai Nam, and Nguyen Thu
Trang. Improve performance of PageRank computation with
Connected Component PageRank. In Proceedings of the 1st ICMOCCA
Conference, pages 154–160. IEEE Seoul Section, Aug. 2006.