Lời cảm ơn
Lời đầu tiên em xin bày tỏ lòng biết ơn sâu sắc tới TS. Nguyễn Hoài Sơn, các
thầy đã hướng dẫn và là nguồn cảm hứng cho quá trình nghiên cứu của em.
Em xin bày tỏ lòng biết ơn tới các thầy, cô giáo trong Khoa Công nghệ thông
tin - Trường Đại học Công nghệ - ĐHQGHN. Các thầy cô đã dạy bảo, chỉ dẫn
chúng em và luôn tạo điều kiện tốt nhất cho chúng em học tập trong suốt quá trình
học đại học đặc biệt là trong thời gian làm khoá luận tốt nghiệp.
Chapter 1
Table of Contents
Abstract Error! Bookmark not defined.
List images . 5
List tables . 7
Chapter 1: Peer to Peer and Ranking Problem 5
1.1. Peer to Peer . 5
1.1.1. Peer to Peer overview . 5
1.1.2. Architecture of Peer to Peer Systems Error! Bookmark not defined.7
1.1.3. Distributed hash tables . 8
1.2. Ranking in Peer to Peer networks . 9
1.2.1. Introduction Error! Bookmark not defined.
1.2.2. Ranking Roles Error! Bookmark not defined.
1.2.3. Research’s important objects . Error! Bookmark not defined.
Chapter 2: Ranking on DHT Peer to Peer Networks . 11
2.1. Chord Protocol 11
2.2. Pagerank 12
2.2.1. Description . 12
2.2.2. Algorithms . 13
2.3. Distributed Computing . 17
2.2.1. Introduction 17
2.2.2. Algorithms . Error! Bookmark not defined.
2.4 if-idf . 18
Chapter 3: Building a new algorithm for ranking in chord networksError! Bookmark not define
3.1. Targets and Missions of Research Error! Bookmark not defined.
3.2. Idea Error! Bookmark not defined.
iii
Research on Node Ranking – Peer to Peer
Hoàng Cường
3.2.1. Major problems to exploit . Error! Bookmark not defined.
3.2.2. Ranking Idea Error! Bookmark not defined.
Chapter 4: Ranking on Details Error! Bookmark not defined.
4.1. Ranking algorithm Error! Bookmark not defined.
4.2. Ranking’s features Error! Bookmark not defined.
Chapter 5: Evaluation . 50
Chapter 6: Related Work 52
Chapter 7: Contributions and future work 53
References . 54
59 trang |
Chia sẻ: banmai | Lượt xem: 2226 | Lượt tải: 0
Bạn đang xem trước 20 trang tài liệu Research on node ranking in peer - To - peer networks, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
The computer program which runs in the distributional system
said that a distributed program, the distribution programming writes such program' s
process. And the distributed computing mentions the use distributional system
explanation estimate question.
In the distributed computing, the question is divided many responsibilities, the
computer explains everybody.
20
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Image 2.3: Distributed Nodes Graph example
We pass use computer’s hope automation; s many responsibilities held
responsible with answer the type: We hope to ask the question, and the computer
should cause the answer. In the computer science theoretically, is called the estimate
question like this voluntarily. It is estimated that the question has each template
including the instance is an explanation officially together. The example is the
question which we asked that and the explanation is anticipates the answer to these
questions.
(How does the theory computer science seek needs to understand the estimate
question possibly through use that the complex theory solution computer (the
computability theory) and high efficiency computation). In the tradition, said the
question perhaps through the use solution computer, if perhaps we design all concrete
instances are correct explanation algorithm causes. Perhaps such algorithm possibly
implements the computer program which runs in an general calculator: Studies from
the input question instance's holiday eye, carries out some computation, and causes the
explanation to adopt the product.
Formalism for example random access ' perhaps the s machine or the universal
Turing machine use the achievement to carry out such algorithm continuously general
calculator' s abstraction model. In many computer situations, consistent and distributed
computing area research similar question or execution interaction process system
computer: Which estimate question how can solve in such network and the high
efficiency place? However, it is not obvious in concurrent or the distributional system
situation, “solves the problem is all meanings”
2.4 Computing PageRank in a distributed system
21
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Lectured the net graph in distribution system's recent research work to divide
into messes up the website or the domain case. The net is molded takes many messes
up the network server. Is divided in net's ultra link two categories, the internal cut-off
link and the mutual server link. The internal server link is between the page link in the
server, and these links use in calculating on each server's place PageRank intermediate
vector. The mutual server link is between the page link with the different server, and
they use in calculating ServerRank. ServerRank surveys the different network server's
relative importance. The server which submits is being merged finally from many
network server's result causes an arrangement ultra link name list.
The ranking algebra proposed that deals with the ranking in the different
granularity level, is utilized possibly also in gathering the place ranking and the stand
ranking obtains the global ranking. Has in one disperses the system fully in the
PageRank approximation work, each of the same generation is autonomous, and
perhaps of the same generation mutually overlaps. Was proposing the JXP algorithm,
each of the same generation calculates the place PageRank score, then meets other of
the same generations and increases it gradually through the exchange information
willfully about the global net graph knowledge, then recomputation in place of the
same generation's PageRank score.
This conference and the recomputation process is duplicated, collects the
enough information until of the same generation. If of the same generation meets the
sufficient number of times exchange information finally, JXP score polymerization to
the real global PageRank score. Supposes is each page of out degree in global graph
awareness. However, these operations are providing the approximation the focal point
are the global graph, in centralized system or distribution system.
2.5. tf-idf
The tf–idf weight (term frequency–inverse document frequency) is a weight
often used in information retrieval and text mining. This weight is a statistical measure
used to evaluate how important a word is to a document in a collection or corpus. The
importance increases proportionally to the number of times a word appears in the
document but is offset by the frequency of the word in the corpus. Variations of the tf–
idf weighting scheme are often used by search engines as a central tool in scoring and
ranking a document's relevance given a user query.
One of the simplest ranking functions is estimated by summing the tf-idf for
each query term; many more sophisticated ranking functions are variants of this simple
model.
Motivation
Suppose we have a set of English text documents and wish to determine which
document is most relevant to the query "the brown cow." A simple way to start out is
by eliminating documents that do not contain all three words "the," "brown," and
"cow," but this still leaves many documents. To further distinguish them, we might
count the number of times each term occurs in each document and sum them all
together; the number of times a term occurs in a document is called its term frequency.
22
Research on Node Ranking – Peer to Peer …. Hoàng Cường
However, because the term "the" is so common, this will tend to incorrectly emphasize
documents which happen to use the word "the" more, without giving enough weight to
the more meaningful terms "brown" and "cow". Also the term "the" is not a good
keyword to distinguish relevant and non-relevant documents and terms like "brown"
and "cow" that occur rarely are good keywords to distinguish relevant documents from
the non-relevant documents. Hence an inverse document frequency factor is
incorporated which diminishes the weight of terms that occur very frequently in the
collection and increases the weight of terms that occur rarely.
Mathematical details
The term count in the given document is simply the number of times a given
term appears in that document. This count is usually normalized to prevent a bias
towards longer documents (which may have a higher term count regardless of the
actual importance of that term in the document) to give a measure of the importance of
the term ti within the particular document dj. Thus we have the term frequency,
defined as follows.
where ni,j is the number of occurrences of the considered term (ti) in document dj, and
the denominator is the sum of number of occurrences of all terms in document dj.
The inverse document frequency is a measure of the general importance of the
term (obtained by dividing the total number of documents by the number of
documents containing the term, and then taking the logarithm of that quotient).
with
• | D | : total number of documents in the corpus
• : number of documents where the term ti appears (that is
). If the term is not in the corpus, this will lead to a division-by-zero. It
is therefore common to use
Then
A high weight in tf–idf is reached by a high term frequency (in the given document)
and a low document frequency of the term in the whole collection of documents; the
weights hence tend to filter out common terms. The tf-idf value for a term will always
be greater than or equal to zero.
23
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Example
Consider a document containing 100 words wherein the word cow appears 3 times.
Following the previously defined formulas, the term frequency (TF) for cow is then
0.03 (3 / 100). Now, assume we have 10 million documents and cow appears in one
thousand of these. Then, the inverse document frequency is calculated as log(10 000
000 / 1 000) = 4. The TF-IDF score is the product of these quantities: 0.03 × 4 = 0.12.
Chapter 3:
Building a new algorithm for ranking nodes in chord
networks
3.1 Targets and Missions of Research
The P2P deployment is the building distributed search network. Proposed that
the system support discovers with retrieval all results, but lacks the essential
information to arrange them. User, however, is mainly to most is related and is not
all most possible results to be interested. The use random sampling, we expand the
well-known information retrieval ranking algorithm class they may apply like this in
this distributed establishment. How do we analyze the ceiling our method, and the
quota our system scale along with the document, the system size, the inquiry
document to ties the mapping correctly (uniform to non-uniform) and the type
increases the digit (rarely to universal deadline). Our analysis and the imitation
indicated that a) these extensions are the high efficiency, and possibly calls with a
ceiling to the large-scale system, and b) uses the result accuracy which and the
centralized implementation the distributed ranking obtains is comparable.
3.2 Idea:
24
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Table 3.2.1: The Pagerank converge and HITS converge
When I write a small code, I see that’s the converges of two algorithms: HITS-
Pagerank. HITS’s better converge more than Pagerank. It’s can be see as the
experiment’s image..
experiment test with Graph 1000 nodes.
The blue line is the converge ( iterators ) of Pagerank
The red line is the converge ( iterators ) of HITS
Table 3.2.2: The Pagerank converge increasing to fast
25
Research on Node Ranking – Peer to Peer …. Hoàng Cường
I also see that’s the converges of the algorithm: Pagerank.
It’s can be see as the experiment’s image that’s the time to calculate converge
of Pagerank algorithm. It goes to much faster. It can’t be done in peer to peer network
( the fast increase of time is more than the fast increase of graph’s size a lot )
The When I write a small code, I see that’s the converges of two algorithms:
HITS-Pagerank. HITS’s better converge more than Pagerank. It’s can be see as the
experiment’s image..
The Graph with 1000 nodes.
The blue line is the converge ( iterators ) of Pagerank
The red line is the converge ( iterators ) of HITS
Image taken from the experimental results, the convergence of the Pagerank
algorithm. 1000 nodes network simulation, 2000 nodes, 4000 nodes. (Execution time
calculation)
Light blue line is the only way to calculate the time convergence of the
Pagerank algorithm is at the network node 4000. Dark blue line is the only way to
calculate the time convergence of the Pagerank algorithm is at the network node
2000. The red dot is the only way to calculate the time convergence of the Pagerank
algorithm is at the network node 1000.
Computing time increases when the number of exponential increase network node
Want to calculate all the nodes in the network takes several performance (larger
network node -> lose greater efficiency (greater than performance node added)
Table 3.2.3: Pagerank convergence are not steady when Epsilon small
Image taken from the experimental results of the convergence Pagerank
algorithm. Network Simulation 1000 node. (Execution time calculation)
26
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Dark blue line includes the red dot is the only way to calculate the time
convergence of the Pagerank algorithm is at the network node 1000 under the
different conditions randomly.
With a little error Epsilon, convergence of k Pagerank is completely
stable. (Change gap)
3.2.4: HITS convergence ( steady+ take lots of time more than Pagerank)
Image taken from the experimental results the convergence of HITS
algorithm. Network simulation is 1000 nodes. (Terms Iterator going to converge).
Dark blue line includes the red dot is the only way to calculate the time
convergence of HITS algorithm is at 1000 node network under the different conditions
randomly.
With a little error Epsilon, convergence of HITS stable than PageRank.
Idea:
Combined HITS and Pagerank. HITS method n nodes filter out content
authentication best results) and used to calculate the Pagerank that n nodes. (n very
little compared to the total number of network nodes)
Advantages of this new approach:
• Exact result
• Accurate
• Computing easier.
• Easy feasible
• faster
Let’s going to see what’s happened to get a system which has some features like that
in deeply later. Basically, let’s analyze a simple example in Google search engine:
27
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Image 3.2: Google almost is not exact
In the results we can see that the topic results. They’re different, and no relate. And so
some are exact, but some results are not exactly ( many ).
3.2 Ranking Idea:
Customized semantic query answering, personalized search (Google is the first
of the crucial search engines to take personalized results on a massive scale. Weighing
a number of factors including but not limited to user history, bookmarks, community
behavior and site click-through rate and stickiness, Google is providing results that are
specific to what they believe you are searching for. Currently this service is only
available to those who are logged into their Google account), focused crawlers and
localized search engines frequently focus on ranking the Nodes contained within a
sub-graph of the global Peer to Peer Nodes graph. The claim for these utilizations is to
estimate PageRank-style scores expeditiously on the sub-graph, i.e., the ranking must
manifest the global link structure of the Peer to Peer Node graph exactly ( which
means calculating returns true global pagerank-scored order ) but it must do so without
bearing the high overhead affiliated to global computation. We propose a framework
of an exact unraveling and an analogous solution for computing ranking on a sub-
graph. without running PageRank on the whole Nodes Graph. So, Approaching sub-
graph and multiple sub-graphs respectively is the main discovery in this ranking
research.
28
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Image 3.2.2: Intersect idea
The Rank_local_idea algorithm is an rigorous scientific method with the
supposing that the scores of “outside” Nodes are known. We analyse that the
Rank_local_idea values for nodes in the sub-graph converge. Since the PageRank-
style values of out nodes may not expectedly be achievable, we introduce the
Pagerank_local algorithm to evaluate score results for the sub-graph.
We use graph EG ( Outside Graph ) to symbolize a major graph. Both
Rank_local_idea and Pagerank_local put on the set of outside nodes with an outside
node graph EG and magnify the sub-graph with links to graph EG. They also modify
the PageRank-style transition matrix ( The transition possibility distribution be
symbolizeed by a matrix ) with respect to graph EG.
We analyze the L1 distance between Rank_local_idea scores and
Pagerank_local scores of the sub-graph and show that it is within a constant factor of
the L1 distance of the outside nodes (e.g., the true PageRank scores and uniform
scores assumed by Pagerank_local). We compare Pagerank_local and a stochastic
complementation approach (SC), a current best solution for this problem, on different
types of sub-graphs. Pagerank_local has similar or superior performance to SC and
typically improves on the runtime performance of SC by an order of magnitude or
better. We demonstrate that Pagerank_local provides a good approximation to
PageRank for a variety of sub-graphs.
3.3 Idea finding a subset
Many assignments have to find salient and diverse items to satisfy the user’s
information need. This problem has been addressed in document summarization,
information retrieval, and various data mining applications. In this paper, we present a
new method that can select salient and diverse items from many information contents.
We formulate the mission as a graph summarization problem: given a weighted graph,
how can we find a subset of salient and diverse vertices to summarize the graph,
subject to some pre-specified constraints. We present a linear programming based
approximate algorithm to optimize an objective function which takes into account both
salience and diversity. The method was applied to two missions:
29
Research on Node Ranking – Peer to Peer …. Hoàng Cường
• multi-document summarization, which brings out salient sentences from
sentence graphs;
• mining symbolizeative experts in the data-mining community from co-author
graphs. In comparison to the state-of-the-art graphed based methods, our
method produces superior results in these applications.
3.4: Finding ranking factors
Image 3.4: Factor Percent
30
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Chapter 4:
Ranking algorithm details
4 .1: Idea
The explosion of files sharing available on the Peer to Peer Networks has made
the ranking of peer to peer systems an high-priced but unavoidable component. The
more users try to search, the more system overloads. Since hyper-bandwidth-links
from one node to another usually points to an “Authorization” or “Commendation”,
bandwidth-link analysis plays a essential role in elect the “importance” of nodes in
network. PageRank and HITS are two foremost approaches in this area. PageRank
iteratively estimates the score of a page based on the scores of its parent
Image 4.1: Bandwidth is the key of ranking trusted
Data as above equation
31
Research on Node Ranking – Peer to Peer …. Hoàng Cường
HITS separates the role of each node into a hub or authority. In the HITS
algorithm, the first step is to retrieve the set of results to the search query. The
computation is performed only on this result set, not across all Graph Nodes.
Authority and hub values are defined in terms of one another in a mutual
recursion. An authority value is computed as the sum of the scaled hub values that
point to that page. A hub value is the sum of the scaled authority values of the pages it
points to. Some implementations also consider the relevance of the linked nodes.
The algorithm performs a series of iterations, each consisting of two basic
steps:
• Authority Update: Update each node's Authority score to be equal to the
sum of the Hub Scores of each node that points to it. That is, a node is
given a high authority score by being linked to by nodes that are
recognized as Hubs for information.
• Hub Update: Update each node's Hub Score to be equal to the sum of
the Authority Scores of each node that it points to. That is, a node is
given a high hub score by linking to nodes that are considered to be
authorities on the subject.
The hub score estimates the value of its bandwidth-links to other nodes and the
authority score estimates the importance of the node. These algorithms are expensive
because of the number of node data/objects involved in the computation. On 15
November 2008, The Pirate Bay announced that it had reached over 25 million unique
peers. And according to the Pirate Bay statistics, as of December 2009, The Pirate Bay
has over 4 million registered users, although registration is not necessary to download
torrents. To make ranking feasible, and to manifest the diversity of node users'
information needs, peer to peer networking applications such as semantic search
("Semantic search seeks to improve search accuracy by understanding searcher intent
and the contextual meaning of terms as they appear in the searchable dataspace,
whether on the Web or within a closed system, to generate more relevant results.
Author Seth Grimes lists "11 approaches that join semantics to search", and
Hildebrand et al. provide an overview that lists semantic search systems and identifies
other uses of semantics in the search process.), focused crawlers (A focused crawler or
topical crawler is a web crawler that attempts to download only web data that are
relevant to a pre-defined topic or set of topics. Topical crawling generally assumes that
only the topic is given, while focused crawling also assumes that some labeled
examples of relevant and not relevant data are available. ), localized search engines,
and personalized search have emerged. They all have a common objective to rank a
sub-graph.
The first intriguing application is a focused crawler, also called a topical
crawler. A focused crawler is interested in collecting a subset of the node data that are
related to a specific topic. Compared to a standard crawler which can easily get lost
and waste resources, a focused crawler acquires relevant data using a Best First
Search; it selects links based on their scores. In contrast to focused crawlers which are
32
Research on Node Ranking – Peer to Peer …. Hoàng Cường
topic specific, a localized search engine indexes a subset of node data that are within a
specific domain. The node piece retrieved by the focused crawler (or localized search
engine) is a sub-graph of the global node graph. Only PageRank scores for local data
in the sub-graph are of interest to users. Users submit queries to the sub-graph
collected by a focused crawler and local query answers are returned to the user. The
ranking on this local graph, however, should reflect the link structure of all node data.
Another interesting scenario is semantic ranking. ObjectRank creates a schema
graph to model the semantic connections between entity sets, e.g., authors or
conferences. The semantic connections are associated with an authority transfer
assignment which can be arbitrarily set by a domain expert based on her interpretation
of the domain. Figure 2 shows an authority transfer schema graph for DBLP.
Building on previous work on how to model contextual information for desktop
search and how to implement semantically rich information exchange in social
networks. Peer-Sensitive ObjectRank for ranking resources on the desktop. The
algorithm takes into account different trust values for each peer, generalizing previous
biasing PageRank algorithms.
The ObjectRank system applies the random walk model, the effectiveness of
which is proven by Google's PageRank, to keyword search in databases modeled as
labeled graphs. The system ranks the database objects with respect to the user-
provided keywords.
The PageRank technique assigns to each page p a score based on the score of
the pages pointing to p. Hence, pages pointed by many high-score pages receive a high
score as well. Alternatively, the score of p is equal to the possibility that a random
surfer, starting from a random page, will be at p at a specific time. For more
information on PageRank see the original PageRank paper.
While ObjectRank is flexible and allows the tuning of ObjectRank scores by a
domain expert, it leads to computational challenges if a search engine has to consider
all possible combinations of keywords and authority transfer assignments.
Recent research on reformulating ObjectRank scores based on individual user
feedback and a graph exploration framework for the biological node highlights the
optimization challenges of query answering and ranking for the semantic Node. If we
can model a sub-graph to contain the subset of data associated with the entity sets of
interest to some domain expert, we can then define the ObjectRank problem as a
problem of ranking a sub-graph. This problem, too, is to exploit existing PageRank
scores for other regions of the graph that are not of interest to the domain expert, and
whose scores may also remain largely unchanged. Figure 3 shows an example where a
sub-graph is associated with an authority transfer assignment and the outside Node
are beyond the focus of the domain expert.
33
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Another application that involves ranking a sub-graph is peer-to-peer networks.
The advent of peer-to-peer(P2P) technology has further encouraged data information
retrieval by jump on distributed computing power, storage, and connectivity. A
distributed or decentralized system has multiple peers or servers, each of which puts in
storage its own sub-graph of the Peer to Peer nodes. A user may take queries on one
peer and ranked query answers that are available locally are presented to the user. The
ranking relies upon the context of the query.
A final scenario is a absorption of the constant change of the Peer to Peer nodes
- Peer to Peer node data. The ranking of data needs to be updated frequently,
especially for the sub-graph of the nodes that experiences the most change. This sub-
graph can be either a set of dangling files that crawlers have not as yet crawled,
referred to as the web “frontier”, or the set of data that are most affected by updates. It
is desirable that any strategy to update the ranking of this sub-graph exploits existing
PageRank scores for other regions of the graph which may remain largely unchanged.
In response to these many motivating applications, we address the problem
of computing ranking scores for a sub-graph. For ease of presentation, and to
compare with existing approaches, we use the PageRank metric for explanation
and experiments. However, our general approaches can be applied to estimate
ObjectRank scores as well.
We note that current ranking techniques (to be discussed in the next
section) either bear the cost of a global computation to get an accurate ranking,
or they have to solve another potentially difficult problem: to determine a
relevant super-graph of web data that impact the rank of the sub-graph. Our
challenge is to obtain an accurate ranking that reflects the global link structure of
the Web graph and to do so without bearing the high overhead associated with a
global PageRank computation or having to solve the difficult problem of
identifying a relevant supergraph. We would also like to exploit pre-estimated
34
Research on Node Ranking – Peer to Peer …. Hoàng Cường
PageRank scores for outside data if and when they are available and appropriate
for use.
We propose a framework based on an exact and an approximate solution to
estimate PageRank on a sub-graph. The Rank_local_idea algorithm is an exact
solution. It assumes that the PageRank scores of outside data are known. We prove
that the Rank_local_idea scores for data in the sub-graph converge to the true
PageRank scores.
Since the PageRank scores of outside data may not typically be available, we
propose the Pagerank_local algorithm to estimate PageRank scores for the sub-graph.
Both Rank_local_idea and Pagerank_local symbolize the set of outside data with an
outside node graph EG and extend the sub-graph with links to graph EG. They also
modify the PageRank transition matrix with respect to (the links to) graph EG.
The Rank_local_idea and Pagerank_local framework formalizes the problem of
ranking a sub-graph. It allows us to model multiple scenarios where ranking a sub-
graph is important. Rank_local_idea can be used to model scenarios where PageRank
scores of the global graph are known a priori and can potentially be re-used. This
includes the case where the sub-graph symbolizes the data that have been updated, or
the sub-graph symbolizes the data that contain all the semantic types of interest to a
domain expert for a personalized or semantic ranking such as ObjectRank.
Pagerank_local can be applied in general to all these problems, when we do not know
the PageRank scores of outside data.
We compare our approach with the stochastic complementation (SC) approach.
SC builds a super-graph by carefully examining candidate outside data and adding
them into the super-graph if adding this page has a significant influence on the
PageRank scores of the sub-graph. Our approach in contrast models the outside data
using a node graph EG, and it can be used in situations when a super-graph cannot be
obtained easily. Our approach also avoids the cost of a global computation. The
Pagerank_local computation is also much cheaper than SC since SC bears the high
cost of constructing the super-graph.
We experimentally study the effect of size and type of the sub-graphs on the
accuracy of Pagerank_local. We study several types of sub-graphs including domain
specific sub-graphs, topic specific sub-graphs, and sub-graphs gathered by a Breadth
First Search crawler. We compare our results with SC and two baseline ranking
algorithms; one was discussed in [18], and the other is local PageRank on the sub-
graph (ignoring the outside data). We show that Pagerank_local has similar or superior
ranking accuracy to SC and typically its runtime performance is an order of magnitude
better than SC. Pagerank_local also outperforms the two baseline algorithms on
ranking accuracy. Our contributions are as follows:
• We define an efficient algorithm, Rank_local_idea, to estimate PageRank
scores for a sub-graph when PageRank scores of the outside data are known.
The random walk defined by Rank_local_idea utilizes these scores.
• We prove that the Rank_local_idea scores converge to the true PageRank
scores for all local data in the sub-graph, and the Rank_local_idea score for the
outside node graph EG converges to the sum of true PageRank scores for all
outside data.
35
Research on Node Ranking – Peer to Peer …. Hoàng Cường
• When PageRank scores of outside data are not known, we define an efficient
algorithm Pagerank_local. We provide important properties of Pagerank_local
scores.
• We show through empirical results that the Pagerank_local ranking accuracy is
similar (sometimes superior) to the best competitor SC, and it overwhelmingly
outperforms the runtime efficiency of SC.
4.2 Ranking - RANK_LOCAL_IDEA APPROACH
Image 4.2: Eigenvalue
We officially define the Rank_local_idea algorithm to estimate PageRank
scores for a local graph. Our approach is passional by research on collapsing matrices
with the same eigenvector. (If x is an eigenvector of the linear transformation A with
eigenvalue λ, then any scalar multiple αx is also an eigenvector of A with the same
eigenvalue. Similarly if more than one eigenvector shares the same eigenvalue λ, any
linear combination of these eigenvectors will itself be an eigenvector with eigenvalue
λ.)
Rank_local_idea performs a random walk( Often, the walk is in discrete time,
and indexed by the natural numbers, as in . However, some walks
take their steps at random times, and in that case the position Xt is defined for the
continuum of times ) on a modified local graph called the extended local graph,
where an outside node of graph EG is added to the local graph. graph EG symbolizes
the set of data that are not local.
36
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Image 4.2.2: Random walk
The transition matrix probabilities of Rank_local_idea are derived from the
transition matrix of PageRank for the global graph. Rank_local_idea assumes that the
PageRank scores of all outside data in graph EG are known. This assumption will be
relaxed in the next section where we present an approximate solution. Consider two
graphs; a global graph of size N, and a local graph of size n. The local graph is a sub-
graph of the global graph. The data in the local graph are called local data while data
in the global graph and that are not in the local graph are called outside data. The goal
is to provide the true PageRank for the local graph without running PageRank on the
global graph.
List the symbols used to define our algorithms.
EG : outside nodes, the artificial node symbolizeing all outside nodes.
G1 A sub-graph of the graph nodes with n nodes
Gg The global nodes graph with N nodes.
Ge The extended local graph with n + 1 nodes
The Rank_local_idea algorithm
There are edges between the graph and the local node edge outside knot
according to the global character. However, this kind of explanation is
impossible to distinguish like sees at a link instance or between the place page
and between exterior data many links in the below example:
37
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Let Figure 4 be a global graph. Node A,B,C, and D are local data, and node X,
Y and Z in the cloud are outside data. Figure 5 provides an example of adding an
outside node to symbolize the exterior data
The edges added from the local data to the outside node do not need
the strategy to modify the primitive PageRank transition matrix to reflect each
such edge to symbolize in global graph many edges.
When computing the standard Pagerank algorithm on this graph, the possibility
continuance from a node is proportional to the inverse of its out-degree. node C which
has 3 incoming edges from the outside data is treated similarly to node D which has
only 1 incoming edge from the outside data. Intuitively, however, we should expect a
higher possibility of following links from the outside data to node C. Similarly, the
possibility of following links from page A to graph EG is 1=3. This too is lower than
the transition possibility based on the global graph.
Rank_local_idea addresses this shortcoming with the following solution: The
first step is to add an outside node graph EG to the sub-graph to symbolize all outside
data. The second step is to construct the extended local graph Ge, the graph EG
enriched graph of size n+ 1. There is an edge from graph EG to a local page in Ge if
there is an edge from an outside page to that local page. The same hold for edges out
of local data. Similarly, there is an edge from graph EG to graph EG if there is an edge
between outside data. The next step is to define a transition matrix A_deal_matrix and
a personalization vector Pideal.
Let’s consider a little about the jargon “personalization vector”. At each time
step, with possibility (1 −α), a surfer visiting any node will jump to a random Web
Node (rather than following an outlink). The destination of the random jump is chosen
according to the possibility distribution given in v. For this reason, we refer to v as the
personalization vector Example of personalization vector value in Pagerank formula:
38
Research on Node Ranking – Peer to Peer …. Hoàng Cường
The details will be discussed next section. Finally, a random walk is performed
on Ge. The Rank_local_idea vector Rank_ideal is defined as follows:
Algorithm Rank_local_idea (Gl, Gg)
1. Add outside node graph EG to G1.
Image 4.2.4: (n+1) graph nodes
2. Create edge associated with the graph graph EG and get Ge.
3. Assign values to P_ideal and A_ideal.
4. Perform a random walk on the extended local graph according to Formula.
Although swarming scales well to tolerate flash crowds for popular content, it is less useful for
unpopular content. Peers arriving after the initial rush might find the content unavailable and need to
wait for the arrival of a seed in order to complete their downloads. The seed arrival, in turn, may take
long to happen, since maintaining seeds for unpopular content entails high bandwidth and
administrative costs, which runs counter to the goals of publishers that value BitTorrent as a cheap
alternative to a client-server approach. A strategy adopted by many publishers which significantly
increases availability of unpopular content consists of bundling multiple files in a single swarm.
So it’s only way as the key of ranking system – bandwith analyze.
A_deal_matrix and Pideal
We define an (n+1) x (n+1) transition matrix A_deal_matrix and a length (n+1)
personalization vector Pideal. Let A symbolise the N x N transition matrix for
PageRank on the global graph. Entry A[i,j] has the value of the inverse out-degree of
page i, if there is an edge (i; j); the value is the expectation of a random walk
following this edge from i. Without lack the generality, we assume the local data to be
the first close n data in A and the outside data are indexed from n + 1 to N in A.
Assume that the PageRank scores for all outside data are known. The values are
PageRank (node n+1, node n+2, …, node N ) = { R[n + 1];R[n + 2]; … ;R[N] }
39
Research on Node Ranking – Peer to Peer …. Hoàng Cường
(graph Ge có các node từ 1-> n node, và outside graph có (n+1 -> N) node)
respectively. Let
A_ideal is defined as follows, based on the entries in the original PageRank
transition matrix A:
Next we explain the elements in A_deal_matrix. These values are as follows:
1. The n x n sub-matrix at upper left is identical to the equating factors in
transition matrix A for the global graph. They symbolize the possibility of
transition between edges in the local graph.
2. The n x 1 sub-matrix at upper right symbolizes the possibility continuance
from a local page to the node graph EG. We note that the possibility of
reaching graph EG is the sum of the possibility of reaching any outside node
from the local node. For local node k, the value is
3. The 1 x n sub-matrix at lower left represents to the possibility
continuance from graph EG to local data. For local node k, the value is
4. The entry at the lower right corner denotes the possibility continuance from
graph EG to graph EG.
The last row has entries that are each a weighted sum of probabilities summed
over all outside data. The weight is determined by the PageRank score of the outside
node. This is a key feature of A_deal_matrix and will be discussed next.
We define A_deal_matrix formally as follows:
A_deal_matrix = Q1AQ2
40
Research on Node Ranking – Peer to Peer …. Hoàng Cường
where Q1 is an (n+1) x N matrix and Q2 is an N x (n+1) matrix. Let Q2 be an
N x (n + 1) matrix as follows:
where In is an n x n identity matrix, B is an n x 1 0-matrix, C is a (N - n) x n 0-
matrix, and D is a (N -n) x 1 matrix with all 1's.
The effect of AQ2 on the ranking vector is to aggregate the authority
continuance from local data to all outside data, which indicates the authority goes to
graph EG.
Let Q1 be the following (n + 1) x N matrix:
where In is an n x n identity matrix, CT is an n x (N - n) 0-matrix and BT is a 1 x
n 0-matrix.
The matrix of interest is E, a 1 x (N - n) matrix. It considers the PageRank
scores for all outside data. Recall that Sum is the sum of PageRank scores for all
outside data
Then, E can be expressed as follows:
Let’s take an example:
41
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Image 4.2.5: Graph example - 6 nodes
the transition Matrix
Ranking value by using Pagerank-Formula:
42
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Q1 is an N x ( n + 1) matrix:
Assume N = 6, n = 4;
In matrix
Q2 Matrix
43
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Tương tự:
Giả sử 2 node outside là node 5 và node 6.
N = 6, n = 4, 2 node outside.
R[5] = 0.206
R[6] = 0.2862
Sum = 0.206 + 0.2862 = 0.4922
Matrix E
E = ( , = (0.418529053, 0.581470947)
A_deal_matrix = Q2AQ1 =
44
Research on Node Ranking – Peer to Peer …. Hoàng Cường
=
[ 0. 0.5 0.5 0. 0. ]
[ 0 0 0 0 0 ]
[ 1/3 1/3 0 0 1/3 ]
[ 0. 0 0 0. 1. ]
[ 0. 0 0 0.79073547 0.20926453]
Xét lại công thức :
Với
R[5] = 0.206
R[6] = 0.2862
A[5][5] = 0
45
Research on Node Ranking – Peer to Peer …. Hoàng Cường
A[5][6] =1/2
A[6][5] = 0
A[6][6] = 0
A[5][4] = 1/2
A[6][4] = 1
Sum = 0.4922
=
[ 0. 0.5 0.5 0 0. = 0+0 ]
[ 0 0 0 0 0 = 0 + 0 ]
[ 1/3 1/3 0 0 1/3 = 1/3+0 ]
[ 0. 0 0 0 1 = 1/2 + 1/2 ]
[ 0=0+0 0=0+0 0=0+0 0.79073547 0.20926453 ]
0.79073547 = (0.206*0.5+0.2826)/0.4922 (đúng)
0.20926453 = (0.206*0 + 0.206*1/2 + 0.2826*0+ 0.2826*0)/0.4922 ( đúng)
Và ma trận A
46
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Image 4.2.6: Multiplication result example
The idea of multiplying the values of entries in A with the two matrices Q1 and
Q2, where Q1 derived from the ranking vector for outside data, is key to the approach
of A_deal_matrix. It has the effect of distributing the possibility continuance from the
outside nodes, in a manner that is proportional to the importance of each of the outside
data in the original PageRank vector. Recall that the personalization vector in the
original PageRank is defined as a uniform vector
Instead, for Rank_local_idea we define the personalization vector Pideal
according to the number of outside data and total number of data in the graph. More
specifically, the i-th entry of Pideal, Pideal[i] can be expressed as follows:
Summary, according to the equalization:
47
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Convergence of Rank_local_idea
Let Rank_ideal be the final ranking vector of Rank_local_idea, where the first n
elements are scores for local data and the (n+1)-th element is the score for the outside
node graph EG. We show that the scores of first n elements are identical to the true
PageRank scores.
Theorem 1: In Rank_ideal, scores for the first n data converge to the true PageRank
scores. The score for the (n + 1) th element, graph EG, converges to the sum of true
PageRank scores for all outside data.
Proof:
Let R be the true PageRank vector such that
R is the converged stationary distribution for A. Let R’ = be a vector with
n + 1 entries.
We also know that R = .
It is obvious that R’[i] = R[i] for first n elements and
We will show that R’ is the Rank_local_idea vector.
We know that Next consider a left multiply with to
obtain the following:
Since A_deal_matrix is stochastic and Markov Chain defined by
Rank_local_idea is irreducible and aperiodic, there is a unique stationary distribution
for A_deal_matrix. Therefore, R0 = Rank_ideal.
The Rank_local_idea algorithm addresses several applications. One is where
some sub-graph of the Web graph has been updated. A second case is when the
personalized authority transfer is limited to the sub-graph. In these cases, the
knowledge of PageRank scores can be potentially relied on to estimate new ranking
scores.
48
Research on Node Ranking – Peer to Peer …. Hoàng Cường
THE PAGERANK_LOCAL ALGORITHM
Unlike the previous scenario where PageRank values for outside data are
known, we now consider scenarios where the PageRank scores are not known a priori.
To cover this state, our framework has an approximate solution Pagerank_local. The
key difference is that for Pagerank_local, the algorithm is not able to differentiate the
(previously weighted) contribution of authority from each individual outside page
(since these PageRank scores are unknown). Instead, Pagerank_local will consider the
authority continuance from outside data assuming they are equally important. We
analyze the L1 distance between Rank_local_idea scores and Pagerank_local scores of
the sub-graph and reveal that it is within a constant factor of L1 distance between the
true PageRank scores and uniform scores of the outside data. We will show through
experiments that Pagerank_local is a good approximation.
A. The Pagerank_local algorithm
The Pagerank_local vector Rapprox is defined as follows
Pagerank_local adopts the same personalization vector as Rank_local_idea. It
however, defines its own transition matrix Matrix_A_approx.
B. Matrix_A_approx definition
Matrix_A_approx is an (n + 1) x (n + 1) matrix. It is defined as follows:
For example( the above graph)
The matrix A
49
Research on Node Ranking – Peer to Peer …. Hoàng Cường
New matrix Approx:
[ 0 0.5 0.5 0 0 ]
[ 0 0 0 0 0 ]
= [1/3 1/3 0 0 1/3 ]
[ 0 0 0 0.5 1 ]
[ 0 0 0 0.75 1/4]
Calculating local-pagerank
Choose alpha = 0.85 ( according to Pagerank )
At iterator 0:
Rapprox = [1/6, 1/6, 1/6, 1/6, 1/3]
Pideal = [1/6, 1/6, 1/6, 1/6, 1/3]
At iterator 1:
Rapprox = 0.85* *R approx + 0.25* [0.25, 0.25, 0.25, 0.25, 0.5]
…
Program results after 10 iterators:
50
Research on Node Ranking – Peer to Peer …. Hoàng Cường
Image 4.2.7 : Multiplication result example at iterators
We Can be clearly seen:
Order ( node 1-> 4)
And order
Are the same.
Matrix_A_approx is different from A_deal_matrix in the last row, since
Rank_local_idea does not utilize knowledge about PageRank scores of outside data in
the first n rows.
For the first n entries in the last row, the value symbolizes the (average)
possibility continuance accumulated from (N - n) outside data to each local page. The
last entry in this n-th row of the matrix is the (average) possibility continuance from
outside data to other outside data. Similar to A_deal_matrix = Q1AQ2,
Matrix_A_approx can be formally defined as Matrix_A_approx = Q’ 1AQ2, where the
vector E is replaced by a vector Eapprox in Q’
51
Research on Node Ranking – Peer to Peer …. Hoàng Cường
In approx, the values at the last row are as follows:
1) For the first n values, (1 <= k <= n), the possibility from graph EG to a local page k
is assigned the summation of
continuance from all outside data to k, divided by the number of outside data. For
local page k, it is
2) For the (n+1)-th value, the possibility for the self-loop edge is determined by the
total authority continuance among outside data, divided by the number of outside data.
Given the global graph example in Figure 4, the probabilities assigned by
Matrix_A_approx are shown in Figure 6. We provide some examples of edge weight
calculation following these rules. According to rule 1, the authority continuance on
edge AB, AC, CB, BD, CD, DA are the outdegree inverse. Since A points to page X,
Z, the authority continuance on edge (A;graph EG) is 1/2. The authority continuance
on edge (graph EG;C)
The self-loop edge authority continuance will be
An advantageous quality about Pagerank_local is that it is suitable to adopt
precomputation for various sub-graphs. With the same global graph, Approx can be
figured out easily from the difference between the local values and the global values.
This is especially beneficial for applications where there are multiple sub-graphs.
52
Research on Node Ranking – Peer to Peer …. Hoàng Cường
53
Pagerank_local scores converge to a unique vector R approx. There are two reasons.
First, the transition matrix
is a column stochastic matrix, as the sum of each column is 1. Second, since we
complement the random walk with jumps from dangling data, the Markov Chain we
defined is irreducible and aperiodic. RANK satisfies the two conditions of being
irreducible and aperiodic of the E
rgodic Theorem for Markov chains. Next we will
vestigate how close is Rapprox to Rank_ideal, which we have shown to be the true
ageRank scores for local data.
in
P
Research on Node Ranking – Peer to Peer …. Hoàng Cường
54
Chapter 5
Evaluation
The
results show that our approach is robust to ranking, and converge fast, feasible.
I evaluate our ranking technique in a simulator using real document sets from
the small program simulation – peer to peer networks. As below is some results.
Table 5.1: the number of iterators which local_pagerank converges
It’s Depending on the number of nodes need to rank. But is limited.
Research on Node Ranking – Peer to Peer …. Hoàng Cường
55
Chapter 6.
Related Work
raditional centralized
pproaches, and search strategies over structured P2P networks.
Our paper builds on prior work on efficient lookup and storage schemes. We
assume the existence of a lookup protocol provided by the underlying system. Such
lookup protocols have been studied in detail both in a structured setting (e.g., Chord,
Pastry, Kademlia, Viceroy and Skipnet).
Providing a useful search facility has been an important area of research. Prior
work in searching can broadly be classified into two categories: t
a
Research on Node Ranking – Peer to Peer …. Hoàng Cường
56
Chapter 7.
Conclusions and Future Work
h
s based on asynchronous iteration
•
ur
hile our
ys oaches.
We plan to investigate these approaches in our future work.
7.1 Contributions
In this paper, we have presented a distributed algorithm for ranking searc
results. Our solution demonstrates that distributed ranking is feasible with little
network overhead. Some of contributions:
• Distributed computation of Pagerank
o Application in P2P systems
o Application on Internet scale
Very large scale asynchronous iteration computation
There are several areas worthy of further investigation. Performance could
potentially be improved by mechanisms such as relevance feedback and caching. O
analysis could be extended to account for popularity-based replication. W
s tem provides a first solution to distributed ranking, other appr
Research on Node Ranking – Peer to Peer …. Hoàng Cường
57
References
1. Google. Google search engine.
Sergey Brin, Lawrence Page. “The Anatomy of a Large-Scale
Hypertextual
2.
Web Search Engine,” In Proc. 7th Int. World Wide Web
and
val.
R etrieval, pp. 334-342.
4. Wikipedia.
Conf., 1998.
3. Amy N. Langville & Carl D. Meyer (2006) Google's PageRank
Beyond: The Science of Search Engine Rankings information retrie
esearch and Development in Information R
Các file đính kèm theo tài liệu này:
- KLTN_HoangCuong_06020044.pdf