Example of a cache coherency protocol for shared disk systems:
● Before reading/writing to a page, the page must be locked in
shared/exclusive mode.
● On locking a page, the page must be read from disk
● Before unlocking a page, the page must be written to disk if it was
modified.
■ More complex protocols with fewer disk reads/writes exist.
■ Cache coherency protocols for sharednothing systems are similar.
Each database page is assigned a home processor. Requests to fetch
the page or write it to disk are sent to the home processor
43 trang |
Chia sẻ: huyhoang44 | Lượt xem: 835 | Lượt tải: 0
Bạn đang xem trước 20 trang tài liệu Cơ sở dữ liệu - Chapter 21: Parallel databases, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
Database System Concepts, 5th Ed.
©Silberschatz, Korth and Sudarshan
See www.dbbook.com for conditions on reuse
Chapter 21: Parallel Databases
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Chapter 21: Parallel Databases
n Introduction
n I/O Parallelism
n Interquery Parallelism
n Intraquery Parallelism
n Intraoperation Parallelism
n Interoperation Parallelism
n Design of Parallel Systems
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Introduction
n Parallel machines are becoming quite common and affordable
l Prices of microprocessors, memory and disks have dropped
sharply
l Recent desktop computers feature multiple processors and this
trend is projected to accelerate
n Databases are growing increasingly large
l large volumes of transaction data are collected and stored for later
analysis.
l multimedia objects like images are increasingly stored in
databases
n Largescale parallel database systems increasingly used for:
l storing large volumes of data
l processing timeconsuming decisionsupport queries
l providing high throughput for transaction processing
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Parallelism in Databases
n Data can be partitioned across multiple disks for parallel I/O.
n Individual relational operations (e.g., sort, join, aggregation) can be
executed in parallel
l data can be partitioned and each processor can work
independently on its own partition.
n Queries are expressed in high level language (SQL, translated to
relational algebra)
l makes parallelization easier.
n Different queries can be run in parallel with each other.
Concurrency control takes care of conflicts.
n Thus, databases naturally lend themselves to parallelism.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
I/O Parallelism
n Reduce the time required to retrieve relations from disk by partitioning
n the relations on multiple disks.
n Horizontal partitioning – tuples of a relation are divided among many disks
such that each tuple resides on one disk.
n Partitioning techniques (number of disks = n):
Roundrobin:
Send the ith tuple inserted in the relation to disk i mod n.
Hash partitioning:
l Choose one or more attributes as the partitioning attributes.
l Choose hash function h with range 0n 1
l Let i denote result of hash function h applied tothe partitioning attribute
value of a tuple. Send tuple to disk i.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
I/O Parallelism (Cont.)
n Partitioning techniques (cont.):
n Range partitioning:
l Choose an attribute as the partitioning attribute.
l A partitioning vector [vo, v1, ..., vn2] is chosen.
l Let v be the partitioning attribute value of a tuple. Tuples such that
vi ≤ vi+1 go to disk I + 1. Tuples with v < v0 go to disk 0 and tuples
with v ≥ vn2 go to disk n1.
E.g., with a partitioning vector [5,11], a tuple with partitioning attribute
value of 2 will go to disk 0, a tuple with value 8 will go to disk 1,
while a tuple with value 20 will go to disk2.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Comparison of Partitioning Techniques
n Evaluate how well partitioning techniques support the following types
of data access:
1.Scanning the entire relation.
2.Locating a tuple associatively – point queries.
l E.g., r.A = 25.
3.Locating all tuples such that the value of a given attribute lies within a
specified range – range queries.
l E.g., 10 ≤ r.A < 25.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Comparison of Partitioning Techniques (Cont.)
Round robin:
n Advantages
l Best suited for sequential scan of entire relation on each query.
l All disks have almost an equal number of tuples; retrieval work is
thus well balanced between disks.
n Range queries are difficult to process
l No clustering tuples are scattered across all disks
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Comparison of Partitioning Techniques(Cont.)
Hash partitioning:
n Good for sequential access
l Assuming hash function is good, and partitioning attributes form a
key, tuples will be equally distributed between disks
l Retrieval work is then well balanced between disks.
n Good for point queries on partitioning attribute
l Can lookup single disk, leaving others available for answering
other queries.
l Index on partitioning attribute can be local to disk, making lookup
and update more efficient
n No clustering, so difficult to answer range queries
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Comparison of Partitioning Techniques (Cont.)
n Range partitioning:
n Provides data clustering by partitioning attribute value.
n Good for sequential access
n Good for point queries on partitioning attribute: only one disk needs to
be accessed.
n For range queries on partitioning attribute, one to a few disks may need
to be accessed
l Remaining disks are available for other queries.
l Good if result tuples are from one to a few blocks.
l If many blocks are to be fetched, they are still fetched from one to a
few disks, and potential parallelism in disk access is wasted
Example of execution skew.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Partitioning a Relation across Disks
n If a relation contains only a few tuples which will fit into a single disk
block, then assign the relation to a single disk.
n Large relations are preferably partitioned across all the available disks.
n If a relation consists of m disk blocks and there are n disks available in
the system, then the relation should be allocated min(m,n) disks.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Handling of Skew
n The distribution of tuples to disks may be skewed — that is, some
disks have many tuples, while others may have fewer tuples.
n Types of skew:
l Attributevalue skew.
Some values appear in the partitioning attributes of many
tuples; all the tuples with the same value for the partitioning
attribute end up in the same partition.
Can occur with rangepartitioning and hashpartitioning.
l Partition skew.
With rangepartitioning, badly chosen partition vector may
assign too many tuples to some partitions and too few to
others.
Less likely with hashpartitioning if a good hashfunction is
chosen.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Handling Skew in RangePartitioning
n To create a balanced partitioning vector (assuming partitioning attribute
forms a key of the relation):
l Sort the relation on the partitioning attribute.
l Construct the partition vector by scanning the relation in sorted order
as follows.
After every 1/nth of the relation has been read, the value of the
partitioning attribute of the next tuple is added to the partition
vector.
l n denotes the number of partitions to be constructed.
l Duplicate entries or imbalances can result if duplicates are present in
partitioning attributes.
n Alternative technique based on histograms used in practice
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Handling Skew using Histograms
n Balanced partitioning vector can be constructed from histogram in a
relatively straightforward fashion
l Assume uniform distribution within each range of the histogram
n Histogram can be constructed by scanning relation, or sampling (blocks
containing) tuples of the relation
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Handling Skew Using Virtual Processor
Partitioning
n Skew in range partitioning can be handled elegantly using virtual
processor partitioning:
l create a large number of partitions (say 10 to 20 times the number
of processors)
l Assign virtual processors to partitions either in roundrobin fashion
or based on estimated cost of processing each virtual partition
n Basic idea:
l If any normal partition would have been skewed, it is very likely the
skew is spread over a number of virtual partitions
l Skewed virtual partitions get spread across a number of
processors, so work gets distributed evenly!
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Interquery Parallelism
n Queries/transactions execute in parallel with one another.
n Increases transaction throughput; used primarily to scale up a
transaction processing system to support a larger number of
transactions per second.
n Easiest form of parallelism to support, particularly in a sharedmemory
parallel database, because even sequential database systems support
concurrent processing.
n More complicated to implement on shareddisk or sharednothing
architectures
l Locking and logging must be coordinated by passing messages
between processors.
l Data in a local buffer may have been updated at another processor.
l Cachecoherency has to be maintained — reads and writes of data
in buffer must find latest version of data.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Cache Coherency Protocol
n Example of a cache coherency protocol for shared disk systems:
l Before reading/writing to a page, the page must be locked in
shared/exclusive mode.
l On locking a page, the page must be read from disk
l Before unlocking a page, the page must be written to disk if it was
modified.
n More complex protocols with fewer disk reads/writes exist.
n Cache coherency protocols for sharednothing systems are similar.
Each database page is assigned a home processor. Requests to fetch
the page or write it to disk are sent to the home processor.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Intraquery Parallelism
n Execution of a single query in parallel on multiple processors/disks;
important for speeding up longrunning queries.
n Two complementary forms of intraquery parallelism :
l Intraoperation Parallelism – parallelize the execution of each
individual operation in the query.
l Interoperation Parallelism – execute the different operations in a
query expression in parallel.
the first form scales better with increasing parallelism because
the number of tuples processed by each operation is typically more than
the number of operations in a query
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Parallel Processing of Relational Operations
n Our discussion of parallel algorithms assumes:
l readonly queries
l sharednothing architecture
l n processors, P0, ..., Pn1, and n disks D0, ..., Dn1, where disk Di is
associated with processor Pi.
n If a processor has multiple disks they can simply simulate a single disk
Di.
n Sharednothing architectures can be efficiently simulated on shared
memory and shareddisk systems.
l Algorithms for sharednothing systems can thus be run on shared
memory and shareddisk systems.
l However, some optimizations may be possible.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Parallel Sort
RangePartitioning Sort
n Choose processors P0, ..., Pm, where m ≤ n 1 to do sorting.
n Create rangepartition vector with m entries, on the sorting attributes
n Redistribute the relation using range partitioning
l all tuples that lie in the ith range are sent to processor Pi
l Pi stores the tuples it received temporarily on disk Di.
l This step requires I/O and communication overhead.
n Each processor Pi sorts its partition of the relation locally.
n Each processors executes same operation (sort) in parallel with other
processors, without any interaction with the others (data parallelism).
n Final merge operation is trivial: rangepartitioning ensures that, for 1 j
m, the key values in processor Pi are all less than the key values in Pj.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Parallel Sort (Cont.)
Parallel External SortMerge
n Assume the relation has already been partitioned among disks D0, ...,
Dn1 (in whatever manner).
n Each processor Pi locally sorts the data on disk Di.
n The sorted runs on each processor are then merged to get the final
sorted output.
n Parallelize the merging of sorted runs as follows:
l The sorted partitions at each processor Pi are rangepartitioned
across the processors P0, ..., Pm1.
l Each processor Pi performs a merge on the streams as they are
received, to get a single sorted run.
l The sorted runs on processors P0,..., Pm1 are concatenated to get
the final result.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Parallel Join
n The join operation requires pairs of tuples to be tested to see if they
satisfy the join condition, and if they do, the pair is added to the join
output.
n Parallel join algorithms attempt to split the pairs to be tested over
several processors. Each processor then computes part of the join
locally.
n In a final step, the results from each processor can be collected
together to produce the final result.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Partitioned Join
n For equijoins and natural joins, it is possible to partition the two input
relations across the processors, and compute the join locally at each
processor.
n Let r and s be the input relations, and we want to compute r r.A=s.B s.
n r and s each are partitioned into n partitions, denoted r0, r1, ..., rn1 and s0,
s1, ..., sn1.
n Can use either range partitioning or hash partitioning.
n r and s must be partitioned on their join attributes r.A and s.B), using the
same rangepartitioning vector or hash function.
n Partitions ri and si are sent to processor Pi,
n Each processor Pi locally computes ri ri.A=si.B si. Any of the standard
join methods can be used.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Partitioned Join (Cont.)
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
FragmentandReplicate Join
n Partitioning not possible for some join conditions
l e.g., nonequijoin conditions, such as r.A > s.B.
n For joins were partitioning is not applicable, parallelization can be
accomplished by fragment and replicate technique
l Depicted on next slide
n Special case – asymmetric fragmentandreplicate:
l One of the relations, say r, is partitioned; any partitioning
technique can be used.
l The other relation, s, is replicated across all the processors.
l Processor Pi then locally computes the join of ri with all of s using
any join technique.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Depiction of FragmentandReplicate Joins
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
FragmentandReplicate Join (Cont.)
n General case: reduces the sizes of the relations at each processor.
l r is partitioned into n partitions,r0, r1, ..., r n1;s is partitioned into m
partitions, s0, s1, ..., sm1.
l Any partitioning technique may be used.
l There must be at least m * n processors.
l Label the processors as
l P0,0, P0,1, ..., P0,m1, P1,0, ..., Pn1m1.
l Pi,j computes the join of ri with sj. In order to do so, ri is replicated
to Pi,0, Pi,1, ..., Pi,m1, while si is replicated to P0,i, P1,i, ..., Pn1,i
l Any join technique can be used at each processor Pi,j.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
FragmentandReplicate Join (Cont.)
n Both versions of fragmentandreplicate work with any join condition, since
every tuple in r can be tested with every tuple in s.
n Usually has a higher cost than partitioning, since one of the relations (for
asymmetric fragmentandreplicate) or both relations (for general fragment
andreplicate) have to be replicated.
n Sometimes asymmetric fragmentandreplicate is preferable even though
partitioning could be used.
l E.g., say s is small and r is large, and already partitioned. It may be
cheaper to replicate s across all processors, rather than repartition r
and s on the join attributes.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Partitioned Parallel HashJoin
Parallelizing partitioned hash join:
n Assume s is smaller than r and therefore s is chosen as the build
relation.
n A hash function h1 takes the join attribute value of each tuple in s and
maps this tuple to one of the n processors.
n Each processor Pi reads the tuples of s that are on its disk Di, and
sends each tuple to the appropriate processor based on hash function
h1. Let si denote the tuples of relation s that are sent to processor Pi.
n As tuples of relation s are received at the destination processors, they
are partitioned further using another hash function, h2, which is used to
compute the hashjoin locally. (Cont.)
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Partitioned Parallel HashJoin (Cont.)
n Once the tuples of s have been distributed, the larger relation r is
redistributed across the m processors using the hash function h1
l Let ri denote the tuples of relation r that are sent to processor Pi.
n As the r tuples are received at the destination processors, they are
repartitioned using the function h2
l (just as the probe relation is partitioned in the sequential hashjoin
algorithm).
n Each processor Pi executes the build and probe phases of the hash
join algorithm on the local partitions ri and s of r and s to produce a
partition of the final result of the hashjoin.
n Note: Hashjoin optimizations can be applied to the parallel case
l e.g., the hybrid hashjoin algorithm can be used to cache some of
the incoming tuples in memory and avoid the cost of writing them
and reading them back in.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Parallel NestedLoop Join
n Assume that
l relation s is much smaller than relation r and that r is stored by
partitioning.
l there is an index on a join attribute of relation r at each of the
partitions of relation r.
n Use asymmetric fragmentandreplicate, with relation s being
replicated, and using the existing partitioning of relation r.
n Each processor Pj where a partition of relation s is stored reads the
tuples of relation s stored in Dj, and replicates the tuples to every other
processor Pi.
l At the end of this phase, relation s is replicated at all sites that
store tuples of relation r.
n Each processor Pi performs an indexed nestedloop join of relation s
with the ith partition of relation r.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Other Relational Operations
Selection σθ(r)
n If θ is of the form ai = v, where ai is an attribute and v a value.
l If r is partitioned on ai the selection is performed at a single
processor.
n If θ is of the form l <= ai <= u (i.e., θ is a range selection) and the
relation has been rangepartitioned on ai
l Selection is performed at each processor whose partition overlaps
with the specified range of values.
n In all other cases: the selection is performed in parallel at all the
processors.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Other Relational Operations (Cont.)
n Duplicate elimination
l Perform by using either of the parallel sort techniques
eliminate duplicates as soon as they are found during sorting.
l Can also partition the tuples (using either range or hash
partitioning) and perform duplicate elimination locally at each
processor.
n Projection
l Projection without duplicate elimination can be performed as
tuples are read in from disk in parallel.
l If duplicate elimination is required, any of the above duplicate
elimination techniques can be used.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Grouping/Aggregation
n Partition the relation on the grouping attributes and then compute the
aggregate values locally at each processor.
n Can reduce cost of transferring tuples during partitioning by partly
computing aggregate values before partitioning.
n Consider the sum aggregation operation:
l Perform aggregation operation at each processor Pi on those
tuples stored on disk Di
results in tuples with partial sums at each processor.
l Result of the local aggregation is partitioned on the grouping
attributes, and the aggregation performed again at each processor
Pi to get the final result.
n Fewer tuples need to be sent to other processors during partitioning.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Cost of Parallel Evaluation of Operations
n If there is no skew in the partitioning, and there is no overhead due to
the parallel evaluation, expected speedup will be 1/n
n If skew and overheads are also to be taken into account, the time
taken by a parallel operation can be estimated as
Tpart + Tasm + max (T0, T1, , Tn1)
l Tpart is the time for partitioning the relations
l Tasm is the time for assembling the results
l Ti is the time taken for the operation at processor Pi
this needs to be estimated taking into account the skew, and
the time wasted in contentions.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Interoperator Parallelism
n Pipelined parallelism
l Consider a join of four relations
r1 r2 r3 r4
l Set up a pipeline that computes the three joins in parallel
Let P1 be assigned the computation of
temp1 = r1 r2
And P2 be assigned the computation of temp2 = temp1 r3
And P3 be assigned the computation of temp2 r4
l Each of these operations can execute in parallel, sending result
tuples it computes to the next operation even as it is computing
further results
Provided a pipelineable join evaluation algorithm (e.g. indexed
nested loops join) is used
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Factors Limiting Utility of Pipeline
Parallelism
n Pipeline parallelism is useful since it avoids writing intermediate results to
disk
n Useful with small number of processors, but does not scale up well with
more processors. One reason is that pipeline chains do not attain
sufficient length.
n Cannot pipeline operators which do not produce output until all inputs
have been accessed (e.g. aggregate and sort)
n Little speedup is obtained for the frequent cases of skew in which
one operator's execution cost is much higher than the others.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Independent Parallelism
n Independent parallelism
l Consider a join of four relations
r1 r2 r3 r4
Let P1 be assigned the computation of
temp1 = r1 r2
And P2 be assigned the computation of temp2 = r3 r4
And P3 be assigned the computation of temp1 temp2
P1 and P2 can work independently in parallel
P3 has to wait for input from P1 and P2
– Can pipeline output of P1 and P2 to P3, combining
independent parallelism and pipelined parallelism
l Does not provide a high degree of parallelism
useful with a lower degree of parallelism.
less useful in a highly parallel system,
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Query Optimization
n Query optimization in parallel databases is significantly more complex
than query optimization in sequential databases.
n Cost models are more complicated, since we must take into account
partitioning costs and issues such as skew and resource contention.
n When scheduling execution tree in parallel system, must decide:
l How to parallelize each operation and how many processors to
use for it.
l What operations to pipeline, what operations to execute
independently in parallel, and what operations to execute
sequentially, one after the other.
n Determining the amount of resources to allocate for each operation is
a problem.
l E.g., allocating more processors than optimal can result in high
communication overhead.
n Long pipelines should be avoided as the final operation may wait a lot
for inputs, while holding precious resources
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Query Optimization (Cont.)
n The number of parallel evaluation plans from which to choose from is much
larger than the number of sequential evaluation plans.
l Therefore heuristics are needed while optimization
n Two alternative heuristics for choosing parallel plans:
l No pipelining and interoperation pipelining; just parallelize every
operation across all processors.
Finding best plan is now much easier use standard optimization
technique, but with new cost model
Volcano parallel database popularize the exchangeoperator model
– exchange operator is introduced into query plans to partition and
distribute tuples
– each operation works independently on local data on each
processor, in parallel with other copies of the operation
l First choose most efficient sequential plan and then choose how best to
parallelize the operations in that plan.
Can explore pipelined parallelism as an option
n Choosing a good physical organization (partitioning technique) is important
to speed up queries.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Design of Parallel Systems
Some issues in the design of parallel systems:
n Parallel loading of data from external sources is needed in order to
handle large volumes of incoming data.
n Resilience to failure of some processors or disks.
l Probability of some disk or processor failing is higher in a parallel
system.
l Operation (perhaps with degraded performance) should be
possible in spite of failure.
l Redundancy achieved by storing extra copy of every data item at
another processor.
©Silberschatz, Korth and Sudarshan21.Database System Concepts 5th Edition, Aug 22, 2005.
Design of Parallel Systems (Cont.)
n Online reorganization of data and schema changes must be
supported.
l For example, index construction on terabyte databases can take
hours or days even on a parallel system.
Need to allow other processing (insertions/deletions/updates)
to be performed on relation even as index is being constructed.
l Basic idea: index construction tracks changes and ``catches up'‘
on changes at the end.
n Also need support for online repartitioning and schema changes
(executed concurrently with other processing).
Database System Concepts, 5th Ed.
©Silberschatz, Korth and Sudarshan
See www.dbbook.com for conditions on reuse
End of Chapter
Các file đính kèm theo tài liệu này:
- ch21_7877_9351.pdf