Dense Matrix Algorithms
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

Slide 1

Dense Matrix Algorithms
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. To accompany the text "Introduction to Parallel Computing", Addison Wesley, 2003.

Slide 2

Topic Overview
Matrix-Vector Multiplication; Matrix-Matrix Multiplication; Solving a System of Linear Equations.

Slide 3

Matrix Algorithms: Introduction
Due to their regular structure, parallel computations involving matrices and vectors readily lend themselves to data decomposition. Typical algorithms rely on input, output, or intermediate data decomposition. Most algorithms use one- and two-dimensional block, cyclic, and block-cyclic partitionings.

Slide 4

Matrix-Vector Multiplication
We aim to multiply a dense n x n matrix A with an n x 1 vector x to yield the n x 1 result vector y. The serial algorithm requires n^2 multiplications and additions.
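As a point of reference for the parallel formulations, the serial computation can be sketched in a few lines of Python. The function name `matvec_serial` and the operation counter are illustrative assumptions, not part of the original slides.

```python
def matvec_serial(A, x):
    """Return (y, ops) where y = A x and ops counts multiply-add pairs."""
    n = len(A)
    y = [0] * n
    ops = 0
    for i in range(n):
        for j in range(n):
            y[i] += A[i][j] * x[j]  # one multiplication and one addition
            ops += 1
    return y, ops
```

For an n x n matrix the counter ends at n^2, which matches the serial work W used in the later scalability analysis.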

Slide 5

Matrix-Vector Multiplication: Rowwise 1-D Partitioning
The n x n matrix is partitioned among n processors, with each processor storing one complete row of the matrix. The n x 1 vector x is distributed such that each process owns one of its elements.

Slide 6

Matrix-Vector Multiplication: Rowwise 1-D Partitioning
[Figure: Multiplication of an n x n matrix with an n x 1 vector using rowwise block 1-D partitioning. For the one-row-per-process case, p = n.]

Slide 7

Matrix-Vector Multiplication: Rowwise 1-D Partitioning
Since each process starts with only one element of x, an all-to-all broadcast is required to distribute all the elements to all the processes. Process P_i then computes y[i] = sum over j of A[i, j] * x[j]. The all-to-all broadcast and the computation of y[i] both take time Θ(n). Therefore, the parallel time is Θ(n).

Slide 8

Matrix-Vector Multiplication: Rowwise 1-D Partitioning
Consider now the case when p < n and we use block 1-D partitioning. Each process initially stores n/p complete rows of the matrix and a portion of the vector of size n/p. The all-to-all broadcast takes place among p processes and involves messages of size n/p. This is followed by n/p local dot products. Thus, the parallel run time of this procedure is T_P = n^2/p + t_s log p + t_w n. This is cost-optimal.
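The rowwise scheme can be mimicked in a single Python process to make the data movement explicit. This is only a sketch under stated assumptions: p divides n, the "processes" are list indices, and the all-to-all broadcast is modelled as every process receiving a concatenated copy of all vector pieces. The name `matvec_rowwise_1d` is hypothetical.

```python
def matvec_rowwise_1d(A, x, p):
    """Simulate rowwise block 1-D matrix-vector multiplication on p processes."""
    n = len(A)
    assert n % p == 0, "this sketch assumes p divides n"
    b = n // p  # rows (and vector elements) per process

    # Initial distribution: process r owns b complete rows and b elements of x.
    local_rows = [A[r * b:(r + 1) * b] for r in range(p)]
    local_x = [x[r * b:(r + 1) * b] for r in range(p)]

    # All-to-all broadcast: every process ends up with the whole vector.
    full_x = [[v for piece in local_x for v in piece] for _ in range(p)]

    # Each process computes b local dot products.
    local_y = [
        [sum(a * v for a, v in zip(row, full_x[r])) for row in local_rows[r]]
        for r in range(p)
    ]
    # Concatenate the per-process pieces of y for inspection.
    return [v for piece in local_y for v in piece]
```

Calling it with p = n reproduces the one-row-per-process case, and with p < n the block case.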

Slide 9

Matrix-Vector Multiplication: Rowwise 1-D Partitioning
Scalability Analysis: We know that T_o = p T_P - W; therefore, we have T_o = t_s p log p + t_w n p. For isoefficiency, we have W = K T_o, where K = E/(1 - E) for desired efficiency E. From this, we have W = O(p^2) (from the t_w term). There is also a bound on isoefficiency because of concurrency. In this case, p < n; therefore, W = n^2 = Ω(p^2). Overall isoefficiency is W = O(p^2).
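The step from the overhead function to the p^2 isoefficiency can be spelled out as follows. This is a worked sketch that assumes the cost model T_P = n^2/p + t_s log p + t_w n and W = n^2; the individual balance steps are filled in here and are not quoted from the slides.

```latex
\begin{align*}
T_o &= p\,T_P - W = t_s\, p \log p + t_w\, n\, p \\
\text{($t_w$ term)}\quad W = n^2 &= K\, t_w\, n\, p
  \;\Rightarrow\; n = K\, t_w\, p
  \;\Rightarrow\; W = K^2 t_w^2\, p^2 \\
\text{($t_s$ term)}\quad W &= K\, t_s\, p \log p \\
\text{(concurrency)}\quad p \le n &\;\Rightarrow\; W = n^2 = \Omega(p^2)
\end{align*}
```

The p^2 terms dominate p log p asymptotically, which gives the overall isoefficiency W = Θ(p^2).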

Slide 10

Matrix-Vector Multiplication: 2-D Partitioning
The n x n matrix is partitioned among n^2 processors such that each processor owns a single element. The n x 1 vector x is distributed only in the last column of n processors.

Slide 11

Matrix-Vector Multiplication: 2-D Partitioning
[Figure: Matrix-vector multiplication with block 2-D partitioning. For the one-element-per-process case, p = n^2 if the matrix size is n x n.]

Slide 12

Matrix-Vector Multiplication: 2-D Partitioning
We must first align the vector with the matrix appropriately. The first communication step for the 2-D partitioning aligns the vector x along the principal diagonal of the matrix. The second step copies the vector elements from each diagonal process to all the processes in the corresponding column using n simultaneous broadcasts among all processors in the column. Finally, the result vector is computed by performing an all-to-one reduction in each row.

Slide 13

Matrix-Vector Multiplication: 2-D Partitioning
Three basic communication operations are used in this algorithm: one-to-one communication to align the vector along the main diagonal, one-to-all broadcast of each vector element among the n processes of each column, and all-to-one reduction in each row. Each of these operations takes Θ(log n) time and the parallel time is Θ(log n). The cost (process-time product) is Θ(n^2 log n); hence, the algorithm is not cost-optimal.
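The three communication steps can be traced with a small single-process simulation of the one-element-per-process case. It is a sketch only: the grid of "processes" is a nested list, communication is plain assignment, and the name `matvec_2d_fine` is an assumption introduced here.

```python
def matvec_2d_fine(A, x):
    """Simulate 2-D partitioned matrix-vector multiplication with p = n^2."""
    n = len(A)
    # held[i][j] is the vector element currently stored at process (i, j).
    held = [[None] * n for _ in range(n)]
    for i in range(n):
        held[i][n - 1] = x[i]  # x starts in the last process column

    # Step 1: one-to-one communication aligns x[i] onto the diagonal (i, i).
    for i in range(n):
        value = held[i][n - 1]
        held[i][n - 1] = None
        held[i][i] = value

    # Step 2: one-to-all broadcast from (j, j) down column j.
    for j in range(n):
        for i in range(n):
            held[i][j] = held[j][j]

    # Local work: each process multiplies its matrix element by its copy of x[j].
    prod = [[A[i][j] * held[i][j] for j in range(n)] for i in range(n)]

    # Step 3: all-to-one reduction in each row yields y[i].
    return [sum(prod[i]) for i in range(n)]
```

After step 2 every process (i, j) holds x[j], which is exactly the operand it needs for its single multiplication.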

Slide 14

Matrix-Vector Multiplication: 2-D Partitioning
When using fewer than n^2 processors, each process owns an (n/√p) x (n/√p) block of the matrix. The vector is distributed in portions of n/√p elements in the last process column only. In this case, the message sizes for the alignment, broadcast, and reduction are all n/√p. The computation is a product of an (n/√p) x (n/√p) submatrix with a vector of length n/√p.

Slide 15

Matrix-Vector Multiplication: 2-D Partitioning
The first alignment step takes time t_s + t_w n/√p. The broadcast and reductions take time (t_s + t_w n/√p) log(√p). Local matrix-vector products take time t_c n^2/p. Total time is T_P ≈ n^2/p + t_s log p + t_w (n/√p) log p.
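To get a feel for how the 1-D and 2-D formulations compare, the two run-time models can be evaluated numerically. The sketch below assumes the expressions T_P = n^2/p + t_s log p + t_w n (rowwise 1-D) and T_P ≈ n^2/p + t_s log p + t_w (n/√p) log p (block 2-D), with unit computation cost and base-2 logarithms; the function names and the sample parameter values are illustrative, not measured data.

```python
import math


def t_rowwise_1d(n, p, t_s, t_w):
    """Modelled parallel time for rowwise block 1-D partitioning."""
    return n * n / p + t_s * math.log2(p) + t_w * n


def t_block_2d(n, p, t_s, t_w):
    """Modelled parallel time for block 2-D partitioning."""
    return n * n / p + t_s * math.log2(p) + t_w * (n / math.sqrt(p)) * math.log2(p)


if __name__ == "__main__":
    # Hypothetical machine parameters, chosen only to exercise the formulas.
    for p in (16, 64, 256):
        print(p, t_rowwise_1d(1024, p, 10, 1), t_block_2d(1024, p, 10, 1))
```

Under this model the 2-D bandwidth term shrinks with √p while the 1-D term stays at t_w n, so the 2-D formulation pulls ahead once log p is smaller than √p.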

Slide 16

Matrix-Vector Multiplication: 2-D Partitioning
Scalability Analysis: Equating T_o with W, term by term, for isoefficiency, we have W = K^2 t_w^2 p log^2 p as the dominant term. The isoefficiency due to concurrency is O(p). The overall isoefficiency is O(p log^2 p) (due to the network bandwidth). For cost optimality, we have W = n^2 = p log^2 p. For this, we have p = O(n^2 / log^2 n).

Slide 17

Matrix-Matrix Multiplication
Consider the problem of multiplying two n x n dense, square matrices A and B to yield the product matrix C = A x B. The serial complexity is O(n^3). We do not consider better serial algorithms (Strassen's method), although these can be used as serial kernels in the parallel algorithms. A useful concept in this case is called block operations. In this view, an n x n matrix A can be regarded as a q x q array of blocks A_{i,j} (0 ≤ i, j < q) such that each block is an (n/q) x (n/q) submatrix. In this view, we perform q^3 matrix multiplications, each involving (n/q) x (n/q) matrices.
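The block view can be sketched directly in Python. The code below assumes q divides n and uses plain nested lists; `matmul`, `block_matmul`, and the multiplication counter are hypothetical names introduced for illustration.

```python
def matmul(X, Y):
    """Plain triple-loop product of two dense matrices given as nested lists."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def block_matmul(A, B, q):
    """Multiply A and B as q x q arrays of blocks; return (C, block_mults)."""
    n = len(A)
    assert n % q == 0, "this sketch assumes q divides n"
    b = n // q

    def block(M, i, j):
        return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]

    C = [[0] * n for _ in range(n)]
    block_mults = 0
    for i in range(q):
        for j in range(q):
            for k in range(q):
                # C_{i,j} += A_{i,k} * B_{k,j}, an (n/q) x (n/q) product
                P = matmul(block(A, i, k), block(B, k, j))
                block_mults += 1
                for r in range(b):
                    for s in range(b):
                        C[i * b + r][j * b + s] += P[r][s]
    return C, block_mults
```

The counter ends at q^3, and the result equals the ordinary product because each C_{i,j} accumulates the same terms, only grouped by block.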

Slide 18

Matrix-Matrix Multiplication
Consider two n x n matrices A and B partitioned into p blocks A_{i,j} and B_{i,j} (0 ≤ i, j < √p) of size (n/√p) x (n/√p) each. Process P_{i,j} initially stores A_{i,j} and B_{i,j} and computes block C_{i,j} of the result matrix. Computing submatrix C_{i,j} requires all submatrices A_{i,k} and B_{k,j} for 0 ≤ k < √p. All-to-all broadcast blocks of A along rows and B along columns. Perform local submatrix multiplication.
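A single-process simulation of this simple algorithm also makes its memory footprint visible. The sketch assumes a q x q process grid with q = √p dividing n; the all-to-all broadcasts are modelled by letting each process collect a full block row of A and a full block column of B. The name `simple_parallel_matmul` is hypothetical.

```python
def simple_parallel_matmul(A, B, q):
    """Simulate the broadcast-based algorithm; return (C, blocks_per_process)."""
    n = len(A)
    assert n % q == 0, "this sketch assumes q divides n"
    b = n // q

    def block(M, i, j):
        return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]

    C = [[0] * n for _ in range(n)]
    blocks_per_process = 0
    for i in range(q):
        for j in range(q):
            # After the broadcasts, process (i, j) holds block row i of A
            # and block column j of B.
            row_a = [block(A, i, k) for k in range(q)]
            col_b = [block(B, k, j) for k in range(q)]
            blocks_per_process = len(row_a) + len(col_b)
            # Local submatrix multiplications accumulate C_{i,j}.
            for k in range(q):
                for r in range(b):
                    for s in range(b):
                        C[i * b + r][j * b + s] += sum(
                            row_a[k][r][t] * col_b[k][t][s] for t in range(b)
                        )
    return C, blocks_per_process
```

Each process ends up holding 2q = 2√p blocks instead of the two it started with, so total storage grows by a factor of √p over the input size.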

Slide 19

Matrix-Matrix Multiplication
The two broadcasts take time 2(t_s log(√p) + t_w (n^2/p)(√p - 1)). The computation requires √p multiplications of (n/√p) x (n/√p) sized submatrices. The parallel run time is approximately T_P = n^3/p + t_s log p + 2 t_w n^2/√p. The algorithm is cost-optimal and the isoefficiency is O(p^1.5) due to the bandwidth term t_w and concurrency. The major drawback of the algorithm is that it is not memory optimal.

Slide 20

Matrix-Matrix Multiplication: Cannon's Algorithm
In this algorithm, we schedule the computations of the √p processes of the i-th row such that, at any given time, each process is using a different block A_{i,k}. These blocks can be systematically rotated among the processes after every submatrix multiplication so that every process gets a fresh A_{i,k} after each rotation.

Slide 21

Matrix-Matrix Multiplication: Cannon's Algorithm
[Figure: Communication steps in Cannon's algorithm on 16 processes.]

Slide 22

Matrix-Matrix Multiplication: Cannon's Algorithm
Align the blocks of A and B in such a way that each process can multiply its local submatrices. This is done by shifting all submatrices A_{i,j} to the left (with wraparound) by i steps and all submatrices B_{i,j} up (with wraparound) by j steps. Perform local block multiplication. Each block of A moves one step left and each block of B moves one step up (again with wraparound). Perform the next block multiplication, add to the partial result, and repeat until all √p blocks have been multiplied.
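The alignment and compute-and-shift phases can be followed in a single-process simulation. This is a sketch under stated assumptions: q = √p divides n, the process grid is a nested list, and shifts are index rotations rather than messages. The names `cannon` and `matmul` are introduced here for illustration.

```python
def matmul(X, Y):
    """Plain product of two dense matrices given as nested lists."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def cannon(A, B, q):
    """Simulate Cannon's algorithm on a q x q grid of processes."""
    n = len(A)
    assert n % q == 0, "this sketch assumes q divides n"
    b = n // q

    def block(M, i, j):
        return [row[j * b:(j + 1) * b] for row in M[i * b:(i + 1) * b]]

    # Alignment: row i of A shifts left by i, column j of B shifts up by j,
    # so process (i, j) starts with A_{i,(i+j) mod q} and B_{(i+j) mod q, j}.
    a = [[block(A, i, (i + j) % q) for j in range(q)] for i in range(q)]
    bb = [[block(B, (i + j) % q, j) for j in range(q)] for i in range(q)]
    c = [[[[0] * b for _ in range(b)] for _ in range(q)] for _ in range(q)]

    for _ in range(q):
        # Local block multiplication, accumulated into the partial result.
        for i in range(q):
            for j in range(q):
                P = matmul(a[i][j], bb[i][j])
                for r in range(b):
                    for s in range(b):
                        c[i][j][r][s] += P[r][s]
        # Single-step shifts with wraparound: A moves left, B moves up.
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        bb = [[bb[(i + 1) % q][j] for j in range(q)] for i in range(q)]

    # Gather the C blocks into one matrix for inspection.
    return [
        [c[i // b][j // b][i % b][j % b] for j in range(n)] for i in range(n)
    ]
```

At every step each process holds exactly one block of A and one of B, which is why this formulation is memory optimal in contrast to the broadcast-based one.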

Slide 23

Matrix-Matrix Multiplication: Cannon's Algorithm
In the alignment step, since the maximum distance over which a block shifts is √p - 1, the two shift operations require a total of 2(t_s + t_w n^2/p) time. Each of the √p single-step shifts in the compute-and-shift phase of the algorithm takes t_s + t_w n^2/p time. The computation time for multiplying √p matrices of size (n/√p) x (n/√p) is n^3/p. The parallel time is approximately T_P = n^3/p + 2√p t_s + 2 t_w n^2/√p. The cost-efficiency and isoefficiency of the algorithm are identical to those of the first algorithm, except that this one is memory optimal.

Slide 24

Matrix-Matrix Multiplication: DNS Algorithm
Uses a 3-D partitioning. Visualize the matrix multiplication algorithm as a cube: matrices A and B come in on two orthogonal faces and the result C comes out of the other orthogonal face. Each internal node in the cube represents a single add-multiply operation (and thus the complexity). The DNS algorithm partitions this cube using a 3-D block scheme.

Slide 25

Matrix-Matrix Multiplication: DNS Algorithm
Assume an n x n x n mesh of processors. Move the columns of A and rows of B and perform broadcast. Each processor computes a single add-multiply. This is followed by an accumulation along the C dimension. Since each add-multiply takes constant time and accumulation and broadcast take log n time, the total runtime is log n. This is not cost optimal. It can be made cost optimal by using n/log n processors along the direction of accumulation.
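The one-multiplication-per-processor idea can be sketched as follows. The simulation assumes that, after the move and broadcast steps, processor (i, j, k) holds A[i][k] and B[k][j]; the accumulation is modelled as a pairwise tree reduction so that the number of rounds can be counted. The names `tree_sum` and `dns_matmul` are illustrative assumptions.

```python
def tree_sum(values):
    """Pairwise tree reduction; return (total, number_of_rounds)."""
    values = list(values)
    rounds = 0
    while len(values) > 1:
        paired = [values[t] + values[t + 1] for t in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])  # odd element carries over unchanged
        values = paired
        rounds += 1
    return values[0], rounds


def dns_matmul(A, B):
    """Simulate DNS on an n x n x n mesh; return (C, reduction_rounds)."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    rounds = 0
    for i in range(n):
        for j in range(n):
            # Processor (i, j, k) performs the single product A[i][k] * B[k][j].
            products = [A[i][k] * B[k][j] for k in range(n)]
            # Accumulation along the third dimension produces C[i][j].
            C[i][j], rounds = tree_sum(products)
    return C, rounds
```

For n a power of two the reduction takes log2(n) rounds, which is the source of the log n term in the run time.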

Slide 26

Matrix-Matrix Multiplication: DNS Algorithm
[Figure: The communication steps in the DNS algorithm while multiplying 4 x 4 matrices A and B on 64 processes.]

Slide 27

Matrix-Matrix Multiplication: DNS Algorithm
Using fewer than n^3 processors. Assume that the number of processes p is equal to q^3 for some q < n. The two matrices are partitioned into blocks of size (n/q) x (n/q). Each matrix can thus be regarded as a q x q two-dimensional square array of blocks. The algorithm follows fr
