What About Multicore? Michael A. Heroux, Sandia National Laboratories

Slide 1

What About Multicore? Michael A. Heroux, Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.

Slide 2

Outline. Can we use shared memory parallel (SMP) only? Can we use distributed memory parallel (DMP) only? Possibilities for using SMP within DMP. Performance characterization for preconditioned Krylov methods. Possibilities for SMP with DMP for each Krylov operation. Implications for multicore chips. About MPI. If Time Permits: Useful Parallel Abstractions.

Slide 3

SMP-only Possibilities. Question: Is it possible to develop a scalable parallel application using only shared memory parallel programming?

Slide 4

SMP-only Observations. Developing a scalable SMP application requires as much work as DMP: Still must determine ownership of work and data. Inability to assert placement of data on DSM architectures is a big problem, not easily solved. Many studies illustrate this point. An SMP application requires an SMP machine: Much more expensive per processor than a DMP machine. Poorer fault tolerance properties. The number of processors usable by an SMP application is limited by the minimum of: Operating system and programming environment support. Global address space. Physical processor count. Of these, the OS/programming model is the real limiting factor.

Slide 5

SMP-only Possibilities. Question: Is it possible to develop a scalable parallel application using only shared memory parallel programming? Answer: No.

Slide 6

DMP-only Possibilities. Question: Is it possible to develop a scalable parallel application using only distributed memory parallel programming? Answer: Don't need to ask. Scalable DMP applications are clearly possible to O(100K-1M) processors. Thus: DMP is required for scalable parallel applications. Question: Is there still a role for SMP within a DMP application?

Slide 7

SMP-Under-DMP Possibilities. Can we benefit from using SMP within DMP? Example: OpenMP within an MPI process. Let's look at a scattering of data points that may help.

Slide 8

Test Platform: Clovertown. Intel Clovertown, quad-core (actually two dual-cores). Performance results are based on the 1.86 GHz version.

Slide 9

LAMMPS Strong Scaling

Slide 10

HPC Conjugate Gradient

Slide 11

Trilinos/Epetra MPI Results. Bandwidth Usage vs. Core Usage.

Slide 12

SpMV MPI+pthreads. Theme: Programming model doesn't matter if the algorithm is the same.

Slide 13

Double-double dot product MPI+pthreads. Same theme.

Slide 14

Classical DFT code. Parts of code: Speedup is great. Parts: Speedup negligible.

Slide 15

Closer look: 4-8 cores. 1 core: Solver is 12.7 of 289 sec (4.4%). 8 cores: Solver is 7.5 of 16.8 sec (44%).

Slide 16

Summary So Far. MPI-only is sometimes sufficient: LAMMPS, Tramonto (at least parts), and threads might not help solvers. Introducing threads into MPI: Not useful if using the same algorithms. Same conclusion as 12 years ago. Increase in bandwidth requirements: Decreases effective core use. Independent of programming model. Use of threading might be effective if it enables: Change of algorithm. Better load balancing.

Slide 17

Case Study: Linear Equation Solvers. Sandia has many engineering applications. A large fraction of newer applications are implicit in nature: Requires solution of many large nonlinear systems. Boils down to many sparse linear systems. Linear system solves are a large fraction of total time: as small as 30%, as large as 90+%. Iterative solvers are most commonly used. Iterative solvers have a small handful of important kernels. We focus on performance issues for these kernels. Caveat: These parts don't make the whole, but are a good chunk of it…

Slide 18

Problem Definition. A frequent requirement for scientific and engineering computing is to solve: Ax = b, where A is a known large (sparse) matrix, b is a known vector, and x is an unknown vector. Goal: Find x. Method: Use the Preconditioned Conjugate Gradient (PCG) method, or one of many variants, e.g., Preconditioned GMRES. These are called Krylov solvers.

Slide 19

Performance Characteristics of Preconditioned Krylov Solvers. The performance of a parallel preconditioned Krylov solver on any given machine can be characterized by the performance of the following operations: vector updates, dot products, matrix multiplication, and preconditioner application. What can SMP within DMP do to improve performance for these operations?
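To make the four operations concrete, here is a minimal serial sketch of PCG in Python with each kernel in its own function. The Jacobi (diagonal) preconditioner, the row-dictionary matrix storage, the helper names, and the 1D Laplacian test matrix are all assumptions chosen for brevity; none of them come from the slides.

```python
def matvec(A, x):
    # Matrix multiplication kernel: w = A x, with A stored as a list of {col: value} rows.
    return [sum(v * x[j] for j, v in row.items()) for row in A]

def dot(x, y):
    # Dot product kernel.
    return sum(a * b for a, b in zip(x, y))

def axpy(alpha, x, y):
    # Vector update kernel: returns alpha * x + y.
    return [alpha * a + b for a, b in zip(x, y)]

def jacobi_apply(A, r):
    # Preconditioner application kernel: z = M^{-1} r with M = diag(A) (assumed choice).
    return [r[i] / A[i][i] for i in range(len(r))]

def pcg(A, b, tol=1e-10, max_iter=1000):
    x = [0.0] * len(b)
    r = list(b)                      # r = b - A x with x = 0
    z = jacobi_apply(A, r)
    p = list(z)
    rz = dot(r, z)
    for it in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = axpy(alpha, p, x)        # x = x + alpha p
        r = axpy(-alpha, Ap, r)      # r = r - alpha A p
        if dot(r, r) ** 0.5 < tol:
            return x, it
        z = jacobi_apply(A, r)
        rz_new = dot(r, z)
        p = axpy(rz_new / rz, p, z)  # p = z + beta p
        rz = rz_new
    return x, max_iter

def laplacian_1d(n):
    # Tridiagonal [-1, 2, -1] matrix: symmetric positive definite, so CG applies.
    A = []
    for i in range(n):
        row = {i: 2.0}
        if i > 0:
            row[i - 1] = -1.0
        if i < n - 1:
            row[i + 1] = -1.0
        A.append(row)
    return A

A = laplacian_1d(20)
b = matvec(A, [1.0] * 20)            # right-hand side whose exact solution is all ones
x, iterations = pcg(A, b)
```

Each pass through the loop performs one matrix multiplication, two dot products plus a norm, three vector updates, and one preconditioner application, which is why the cost of these kernels dominates the solver.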

Slide 20

Machine Block Diagram. Node 0, Node 1, ..., Node m-1, each with its own memory and processing elements PE 0 through PE n-1. Parallel machine with p = m * n processors: m = number of nodes, n = number of shared memory processors per node. Consider p MPI processes vs. m MPI processes with n threads per MPI process (nested data parallel).

Slide 21

Vector Update Performance. Vector computations are not (positively) impacted by using nested parallelism. These calculations are unaware that they are being done in parallel. Issues of data locality and false cache line sharing can actually degrade performance for the nested approach. Example: What happens if PE 0 must update x[j], PE 1 must update x[j+1], and x[j] and x[j+1] are in the same cache line? Note: These same observations hold for FEM/FVM calculations and many other common data parallel computations.
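The cache line question can be illustrated with a small counting sketch. It assumes 64-byte cache lines, 8-byte doubles, and a vector that starts on a cache line boundary; the cyclic and block ownership functions are hypothetical ways of assigning entries to PEs, not something taken from the slides.

```python
CACHE_LINE_BYTES = 64   # assumed cache line size
DOUBLE_BYTES = 8        # size of one double-precision entry

def cache_line(j):
    # Index of the cache line holding x[j], assuming x starts on a line boundary.
    return (j * DOUBLE_BYTES) // CACHE_LINE_BYTES

def shared_lines(owner, length):
    # Count cache lines whose entries are updated by more than one PE.
    owners_per_line = {}
    for j in range(length):
        owners_per_line.setdefault(cache_line(j), set()).add(owner(j))
    return sum(1 for pes in owners_per_line.values() if len(pes) > 1)

def cyclic_owner(n_pes):
    # PE k updates x[k], x[k + n_pes], ... (neighboring entries go to different PEs).
    return lambda j: j % n_pes

def block_owner(n_pes, length):
    # PE k updates one contiguous chunk of the vector.
    chunk = -(-length // n_pes)  # ceiling division
    return lambda j: j // chunk

length, n_pes = 1024, 4
cyclic_shared = shared_lines(cyclic_owner(n_pes), length)
block_shared = shared_lines(block_owner(n_pes, length), length)
print(cyclic_shared, block_shared)
```

Under these assumptions every cache line is written by several PEs in the cyclic assignment, while the block assignment shares none because each chunk is a whole number of cache lines. Real behavior also depends on alignment and on the hardware's coherence protocol.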

Slide 22

Dot Product Performance. Global dot product performance can be improved using nested parallelism: Compute the partial dot product on each node before going to the binary reduction algorithm: O(log(m)) global synchronization steps vs. O(log(p)) for DMP-only. However, the same can be accomplished using an "SMP-aware" message passing library like LIBSM. Notes: An SMP-aware message passing library addresses many of the initial performance problems when porting an MPI code to SMP nodes. Reason? Not lower latency of intra-node messages but reduced off-node network demand.
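The difference between O(log(p)) and O(log(m)) reduction steps can be seen in a serial simulation. This is a sketch only: processes and nodes are modeled as list entries, the vector length is assumed to be divisible by p, and the function names are hypothetical.

```python
def tree_reduce(values):
    # Pairwise (binary tree) reduction; returns the sum and the number of rounds.
    rounds = 0
    while len(values) > 1:
        values = [sum(values[i:i + 2]) for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds

def chunk_dot(x, y, k, chunk):
    # Partial dot product over the k-th contiguous chunk.
    lo, hi = k * chunk, (k + 1) * chunk
    return sum(a * b for a, b in zip(x[lo:hi], y[lo:hi]))

def flat_dot(x, y, p):
    # DMP-only: all p processes take part in the global tree reduction.
    chunk = len(x) // p
    return tree_reduce([chunk_dot(x, y, k, chunk) for k in range(p)])

def nested_dot(x, y, m, n):
    # Nested: the n PEs of a node first combine their partial sums in shared
    # memory, then only the m node totals enter the global tree reduction.
    chunk = len(x) // (m * n)
    node_totals = []
    for node in range(m):
        node_totals.append(sum(chunk_dot(x, y, node * n + pe, chunk) for pe in range(n)))
    return tree_reduce(node_totals)

m, n = 16, 8
p = m * n
x = [float(i % 7) for i in range(1024)]
y = [float(i % 5) for i in range(1024)]
flat_value, flat_rounds = flat_dot(x, y, p)
nested_value, nested_rounds = nested_dot(x, y, m, n)
print(flat_rounds, nested_rounds)
```

For p = 128 and m = 16 the global reduction shrinks from 7 rounds to 4. As the slide notes, an SMP-aware message passing library can obtain the same effect without changing the application.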

Slide 23

Matrix Multiplication Performance. Typical distributed sparse matrix multiplication requires a "boundary exchange" before computing. Time for the exchange is determined by the longest-latency remote access. Using SMP within a node does not reduce this latency. SMP matrix multiplication has the same cache performance issues as vector updates. Thus SMP within DMP for matrix multiplication is not attractive.
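The structure of "boundary exchange, then local compute" can be sketched serially. The example assumes a 1D tridiagonal [-1, 2, -1] matrix split into contiguous row blocks, with processes modeled as list entries; a real code would move the ghost values with message passing.

```python
def local_spmv_1d(x_owned, left_ghost, right_ghost):
    # y_i = -x_{i-1} + 2 x_i - x_{i+1} on the owned rows, using ghost values
    # at the two ends of the block.
    padded = [left_ghost] + x_owned + [right_ghost]
    return [-padded[i - 1] + 2.0 * padded[i] - padded[i + 1]
            for i in range(1, len(padded) - 1)]

def distributed_spmv_1d(blocks):
    # blocks[k] holds the (non-empty) part of x owned by process k.
    # Step 1: boundary exchange. Each process needs one value from each
    # neighbor; outside the domain the value is zero.
    ghosts = []
    for k in range(len(blocks)):
        left = blocks[k - 1][-1] if k > 0 else 0.0
        right = blocks[k + 1][0] if k < len(blocks) - 1 else 0.0
        ghosts.append((left, right))
    # Step 2: purely local computation on each process.
    return [local_spmv_1d(blocks[k], *ghosts[k]) for k in range(len(blocks))]

x = [float(i * i) for i in range(12)]
blocks = [x[k:k + 3] for k in range(0, 12, 3)]       # 4 processes, 3 rows each
y_blocks = distributed_spmv_1d(blocks)
y_distributed = [v for block in y_blocks for v in block]
y_serial = distributed_spmv_1d([x])[0]               # one process owns everything
```

Step 1 is the part whose time is set by the slowest remote access. Adding threads inside a process would only split the work in step 2.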

Slide 24

Batting Average So Far: 0 for 3. So far there is no compelling reason to consider SMP within a DMP application. Problem: Nothing we have proposed so far provides an essential change in algorithms. Must search for situations where SMP provides a capability that DMP cannot. One possibility: Addressing iteration inflation in (overlapping) Schwarz domain decomposition preconditioning.

Slide 25

Iteration Inflation. Overlapping Schwarz Domain Decomposition (Local ILU(0) with GMRES).

Slide 26

Using Level Scheduling SMP. As the number of subdomains increases, iteration counts go up. Asymptotically, (non-overlapping) Schwarz becomes diagonal scaling. But note: ILU has parallelism due to sparsity of the matrix. We can use parallelism within ILU to reduce the inflation effect.

Slide 27

Defining Levels
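The slide's own definition is not preserved in this transcript. A common definition, assumed here, puts row i of a lower triangular factor in level 0 if it depends on no earlier row, and otherwise one level above the highest level among the rows it depends on. The sketch below applies that rule to the lower triangular pattern of a 5-point stencil on a small 2D grid; the function names and the grid example are illustrative only.

```python
def level_sets(lower_rows):
    # lower_rows[i] lists the column indices j < i where L[i][j] is nonzero.
    level = [0] * len(lower_rows)
    for i, cols in enumerate(lower_rows):
        # Rows with no dependencies get level 0.
        level[i] = 1 + max((level[j] for j in cols), default=-1)
    sets = {}
    for i, lev in enumerate(level):
        sets.setdefault(lev, []).append(i)
    return [sets[lev] for lev in sorted(sets)]

def grid_lower_pattern(nx, ny):
    # Strictly lower triangular pattern of a 5-point stencil with row-major
    # numbering: each grid point depends on its west and south neighbors.
    rows = []
    for r in range(ny):
        for c in range(nx):
            i = r * nx + c
            cols = []
            if c > 0:
                cols.append(i - 1)
            if r > 0:
                cols.append(i - nx)
            rows.append(cols)
    return rows

levels = level_sets(grid_lower_pattern(3, 3))
print(levels)
```

On the 3 x 3 grid the levels are the anti-diagonals of the grid. All rows within one level can be processed in parallel during a forward solve.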

Slide 28

Some Sample Level Schedule Stats. Linear FE basis functions on 3D grid. Avg nnz/level = 5500, Avg rows/level = 173.

Slide 29

Linear Stability Analysis Problem. Unstructured domain, 21K eq, 923K nnz.

Slide 30

Some Sample Level Schedule Stats. Unstructured linear stability analysis problem. Avg nnz/level = 520, Avg rows/level = 23.

Slide 31

Improvement Limits. Assume number of PEs per node = n. Assume speedup for the level-scheduled F/B solve matches the speedup of n MPI solves on the same node. Then the performance improvement is a ratio. For the previous graph and n = 8, p = 128 (m = 16), ratio = 203/142 = 1.43, or 43%.

Slide 32

Practical Limitations. Level scheduling speedup is largely determined by the cost of synchronization on a node. F/B solve requires a synchronization after each level. On machines with a good hardware barrier, this is not a problem and excellent speedup can be expected. On other machines, this can be a problem.
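The synchronization after each level can be shown with a threaded forward solve. This is a structural sketch, not a performance demonstration: Python threads do not give real speedup for this kind of loop, the software barrier stands in for whatever mechanism a machine provides, and the test matrix (diagonal 4, off-diagonals -1 on a 3 x 3 grid pattern) and the function names are assumptions.

```python
import threading

def level_sets(L):
    # L: list of {col: value} rows of a lower triangular matrix, diagonal included.
    level, sets = [0] * len(L), {}
    for i, row in enumerate(L):
        level[i] = 1 + max((level[j] for j in row if j != i), default=-1)
        sets.setdefault(level[i], []).append(i)
    return [sets[lev] for lev in sorted(sets)]

def forward_solve_levels(L, b, n_threads=4):
    # Solve L x = b level by level with n_threads worker threads.
    levels = level_sets(L)
    x = [0.0] * len(b)
    barrier = threading.Barrier(n_threads)

    def worker(tid):
        for rows in levels:
            # Rows within a level are independent, so the threads split them.
            for i in rows[tid::n_threads]:
                s = sum(v * x[j] for j, v in L[i].items() if j != i)
                x[i] = (b[i] - s) / L[i][i]
            # Synchronization after each level: no thread may start the next
            # level until every row of this one has been written.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x

def grid_lower_matrix(nx, ny):
    # Lower triangular part of a 5-point stencil matrix (assumed test case).
    L = []
    for r in range(ny):
        for c in range(nx):
            i = r * nx + c
            row = {i: 4.0}
            if c > 0:
                row[i - 1] = -1.0
            if r > 0:
                row[i - nx] = -1.0
            L.append(row)
    return L

L = grid_lower_matrix(3, 3)
b = [1.0] * 9
x = forward_solve_levels(L, b)
```

The solve performs one barrier per level, so with many small levels (for example, an average of 23 rows per level) the barrier cost can dominate the useful work.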

Slide 33

Reducing Synchronization Restrictions. Use a flexible iterative method such as FGMRES. The preconditioner at each iteration need not be the same, thus no need for syncing after each level. Level updates will still be approximately obeyed. Computational and communication complexity is identical to the DMP-only F/B solve. Iteration counts and cost per iteration go up. Multi-color reordering: Reorder equations to increase level-set sizes. Severe increase in iteration counts. Our motto: The best parallel algorithm is the best parallel implementation of the best serial algorithm.

Slide 34

SMP-Under-DMP Possibilities. Can we benefit from using SMP within DMP? Yes, but: Must be able to take advantage of fine-grain shared memory data access, in a way not feasible for MPI alone. Even so: Nested SMP-under-DMP is very complex to program. Most people answer, "It's not worth it."

Slide 35

Summary So Far. SMP alone is insufficient for scalable parallelism. DMP alone is certainly sufficient, but can we improve by selective use of SMP within DMP? Analysis of key preconditioned Krylov kernels gives insight into the possibilities of using SMP with DMP, and results can be extended to other algorithms. Most of the straightforward techniques for introducing SMP into DMP will not work. Level-scheduled ILU is one possible example of effectively using SMP within DMP (not always satisfactory). The most fruitful use of SMP within DMP seems to have a common theme of allowing multiple processes to have dynamic asynchronous access to large (read-only) data sets.

Slide 36

Implications for Multicore Chips. MPI-only use of multicore is a respectable option. May be the ultimate right answer for scalability and ease of programming. Assumption: MPI is multicore
