Cơ sở dữ liệu - Chapter 19: Information retrieval
Solution: use number of hyperlinks to a site as a measure of the popularity or
prestige of the site
● Count only one hyperlink from each site (why? see previous slide)
● Popularity measure is for site, not for individual page
But, most hyperlinks are to root of site
Also, concept of “site” difficult to define since a URL prefix like
cs.yale.edu contains many unrelated pages of varying popularity
■ Refinements
● When computing prestige based on links to a site, give more weight to
links from sites that themselves have higher prestige
Definition is circular
Set up and solve system of simultaneous linear equations
● Above idea is basis of the Google PageRank ranking mechanism
25 trang |
Chia sẻ: huyhoang44 | Lượt xem: 667 | Lượt tải: 0
Bạn đang xem trước 20 trang tài liệu Cơ sở dữ liệu - Chapter 19: Information retrieval, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
Database System Concepts
©Silberschatz, Korth and Sudarshan
See www.dbbook.com for conditions on reuse
Chapter 19: Information Retrieval
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Chapter 19: Information Retrieval
n Relevance Ranking Using Terms
n Relevance Using Hyperlinks
n Synonyms., Homonyms, and Ontologies
n Indexing of Documents
n Measuring Retrieval Effectiveness
n Web Search Engines
n Information Retrieval and Structured Data
n Directories
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Information Retrieval Systems
n Information retrieval (IR) systems use a simpler data model than
database systems
l Information organized as a collection of documents
l Documents are unstructured, no schema
n Information retrieval locates relevant documents, on the basis of user
input such as keywords or example documents
l e.g., find documents containing the words “database systems”
n Can be used even on textual descriptions provided with nontextual
data such as images
n Web search engines are the most familiar example of IR systems
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Information Retrieval Systems (Cont.)
n Differences from database systems
l IR systems don’t deal with transactional updates (including
concurrency control and recovery)
l Database systems deal with structured data, with schemas that
define the data organization
l IR systems deal with some querying issues not generally
addressed by database systems
Approximate searching by keywords
Ranking of retrieved answers by estimated degree of relevance
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Keyword Search
n In full text retrieval, all the words in each document are considered to be
keywords.
l We use the word term to refer to the words in a document
n Informationretrieval systems typically allow query expressions formed using
keywords and the logical connectives and, or, and not
l Ands are implicit, even if not explicitly specified
n Ranking of documents on the basis of estimated relevance to a query is critical
l Relevance ranking is based on factors such as
Term frequency
– Frequency of occurrence of query keyword in document
Inverse document frequency
– How many documents the query keyword occurs in
» Fewer give more importance to keyword
Hyperlinks to documents
– More links to a document document is more important
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Relevance Ranking Using Terms
n TFIDF (Term frequency/Inverse Document frequency) ranking:
l Let n(d) = number of terms in the document d
l n(d, t) = number of occurrences of term t in the document d.
l Relevance of a document d to a term t
The log factor is to avoid excessive weight to frequent terms
l Relevance of document to query Q
n(d)
n(d, t)
1 +TF (d, t) = log
r (d, Q) = ∑ TF (d, t)n(t)t∈Q
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Relevance Ranking Using Terms (Cont.)
n Most systems add to the above model
l Words that occur in title, author list, section headings, etc. are
given greater importance
l Words whose first occurrence is late in the document are given
lower importance
l Very common words such as “a”, “an”, “the”, “it” etc are eliminated
Called stop words
l Proximity: if keywords in query occur close together in the
document, the document has higher importance than if they occur
far apart
n Documents are returned in decreasing order of relevance score
l Usually only top few documents are returned, not all
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Similarity Based Retrieval
n Similarity based retrieval retrieve documents similar to a given
document
l Similarity may be defined on the basis of common words
E.g. find k terms in A with highest TF (d, t ) / n (t ) and use
these terms to find relevance of other documents.
n Relevance feedback: Similarity can be used to refine answer set to
keyword query
l User selects a few relevant documents from those retrieved by
keyword query, and system finds other documents similar to these
n Vector space model: define an ndimensional space, where n is the
number of words in the document set.
l Vector for document d goes from origin to a point whose i th
coordinate is TF (d,t ) / n (t )
l The cosine of the angle between the vectors of two documents is
used as a measure of their similarity.
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Relevance Using Hyperlinks
n Number of documents relevant to a query can be enormous if only term
frequencies are taken into account
n Using term frequencies makes “spamming” easy
E.g. a travel agency can add many occurrences of the words
“travel” to its page to make its rank very high
n Most of the time people are looking for pages from popular sites
n Idea: use popularity of Web site (e.g. how many people visit it) to rank
site pages that match given keywords
n Problem: hard to find actual popularity of site
l Solution: next slide
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Relevance Using Hyperlinks (Cont.)
n Solution: use number of hyperlinks to a site as a measure of the popularity or
prestige of the site
l Count only one hyperlink from each site (why? see previous slide)
l Popularity measure is for site, not for individual page
But, most hyperlinks are to root of site
Also, concept of “site” difficult to define since a URL prefix like
cs.yale.edu contains many unrelated pages of varying popularity
n Refinements
l When computing prestige based on links to a site, give more weight to
links from sites that themselves have higher prestige
Definition is circular
Set up and solve system of simultaneous linear equations
l Above idea is basis of the Google PageRank ranking mechanism
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Relevance Using Hyperlinks (Cont.)
n Connections to social networking theories that ranked prestige of people
l E.g. the president of the U.S.A has a high prestige since many people
know him
l Someone known by multiple prestigious people has high prestige
n Hub and authority based ranking
l A hub is a page that stores links to many pages (on a topic)
l An authority is a page that contains actual information on a topic
l Each page gets a hub prestige based on prestige of authorities that
it points to
l Each page gets an authority prestige based on prestige of hubs that
point to it
l Again, prestige definitions are cyclic, and can be got by
solving linear equations
l Use authority prestige when ranking answers to a query
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Synonyms and Homonyms
n Synonyms
l E.g. document: “motorcycle repair”, query: “motorcycle maintenance”
need to realize that “maintenance” and “repair” are synonyms
l System can extend query as “motorcycle and (repair or maintenance)”
n Homonyms
l E.g. “object” has different meanings as noun/verb
l Can disambiguate meanings (to some extent) from the context
n Extending queries automatically using synonyms can be problematic
l Need to understand intended meaning in order to infer synonyms
Or verify synonyms with user
l Synonyms may have other meanings as well
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
ConceptBased Querying
n Approach
l For each word, determine the concept it represents from context
l Use one or more ontologies:
Hierarchical structure showing relationship between concepts
E.g.: the ISA relationship that we saw in the ER model
n This approach can be used to standardize terminology in a specific
field
n Ontologies can link multiple languages
n Foundation of the Semantic Web (not covered here)
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Indexing of Documents
n An inverted index maps each keyword Ki to a set of documents Si that
contain the keyword
l Documents identified by identifiers
n Inverted index may record
l Keyword locations within document to allow proximity based ranking
l Counts of number of occurrences of keyword to compute TF
n and operation: Finds documents that contain all of K1, K2, ..., Kn.
l Intersection S1∩ S2 ∩..... ∩ Sn
n or operation: documents that contain at least one of K1, K2, , Kn
l union, S1∩ S2 ∩..... ∩ Sn,.
n Each Si is kept sorted to allow efficient intersection/union by merging
l “not” can also be efficiently implemented by merging of sorted lists
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Measuring Retrieval Effectiveness
n Informationretrieval systems save space by using index structures
that support only approximate retrieval. May result in:
l false negative (false drop) some relevant documents may not
be retrieved.
l false positive some irrelevant documents may be retrieved.
l For many applications a good index should not permit any false
drops, but may permit a few false positives.
n Relevant performance metrics:
l precision what percentage of the retrieved documents are
relevant to the query.
l recall what percentage of the documents relevant to the query
were retrieved.
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Measuring Retrieval Effectiveness (Cont.)
n Recall vs. precision tradeoff:
Can increase recall by retrieving many documents (down to a low
level of relevance ranking), but many irrelevant documents would be
fetched, reducing precision
n Measures of retrieval effectiveness:
l Recall as a function of number of documents fetched, or
l Precision as a function of recall
Equivalently, as a function of number of documents fetched
l E.g. “precision of 75% at recall of 50%, and 60% at a recall of 75%”
n Problem: which documents are actually relevant, and which are not
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Web Search Engines
n Web crawlers are programs that locate and gather information on the
Web
l Recursively follow hyperlinks present in known documents, to find
other documents
Starting from a seed set of documents
l Fetched documents
Handed over to an indexing system
Can be discarded after indexing, or store as a cached copy
n Crawling the entire Web would take a very large amount of time
l Search engines typically cover only a part of the Web, not all of it
l Take months to perform a single crawl
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Web Crawling (Cont.)
n Crawling is done by multiple processes on multiple machines, running
in parallel
l Set of links to be crawled stored in a database
l New links found in crawled pages added to this set, to be crawled
later
n Indexing process also runs on multiple machines
l Creates a new copy of index instead of modifying old index
l Old index is used to answer queries
l After a crawl is “completed” new index becomes “old” index
n Multiple machines used to answer queries
l Indices may be kept in memory
l Queries may be routed to different machines for load balancing
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Information Retrieval and Structured Data
n Information retrieval systems originally treated documents as a
collection of words
n Information extraction systems infer structure from documents, e.g.:
l Extraction of house attributes (size, address, number of bedrooms,
etc.) from a text advertisement
l Extraction of topic and people named from a new article
n Relations or XML structures used to store extracted data
l System seeks connections among data to answer queries
l Question answering systems
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Directories
n Storing related documents together in a library facilitates browsing
l users can see not only requested document but also related ones.
n Browsing is facilitated by classification system that organizes logically
related documents together.
n Organization is hierarchical: classification hierarchy
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
A Classification Hierarchy For A Library System
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Classification DAG
n Documents can reside in multiple places in a hierarchy in an
information retrieval system, since physical location is not important.
n Classification hierarchy is thus Directed Acyclic Graph (DAG)
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
A Classification DAG For A Library
Information Retrieval System
©Silberschatz, Korth and Sudarshan19.Database System Concepts 5th Edition, Sep 2, 2005
Web Directories
n A Web directory is just a classification directory on Web pages
l E.g. Yahoo! Directory, Open Directory project
l Issues:
What should the directory hierarchy be?
Given a document, which nodes of the directory are categories
relevant to the document
l Often done manually
Classification of documents into a hierarchy may be done
based on term similarity
Database System Concepts
©Silberschatz, Korth and Sudarshan
See www.dbbook.com for conditions on reuse
End of Chapter
Các file đính kèm theo tài liệu này:
- ch19_3181_2078.pdf