High Performance for Parallel Applications with Advanced Scheduling


This paper presents a solution for delivering high performance to parallel applications through advanced scheduling techniques. The authors introduce their approach, covering code generation, computation and data distribution, and communication schemes. Experimental results are analyzed and future work is outlined.


Presentation Transcript


1. Delivering High Performance to Parallel Applications Using Advanced Scheduling. Nikolaos Drosinos, Georgios Goumas, Maria Athanasaki and Nectarios Koziris. National Technical University of Athens, Computing Systems Laboratory. {ndros,goumas,maria,nkoziris}@cslab.ece.ntua.gr, www.cslab.ece.ntua.gr

2. Overview (5/9/2003, Parallel Computing 2003): Introduction; Background; Code Generation: Computation/Data Distribution, Communication Schemes, Summary; Experimental Results; Conclusions; Future Work

3. Introduction. Motivation: a lot of theoretical work has been done on arbitrary tiling, but there are no actual experimental results, and there is no complete method to generate code for non-rectangular tiles.

4. Introduction. Contribution: a complete end-to-end SPMD code generation method for arbitrarily tiled iteration spaces; simulation of blocking and non-blocking communication primitives; experimental evaluation of the proposed scheduling scheme.

5. Overview: Introduction; Background; Code Generation: Computation/Data Distribution, Communication Schemes, Summary; Experimental Results; Conclusions; Future Work

6. Background. Algorithmic model:

FOR j1 = min1 TO max1 DO
  ...
  FOR jn = minn TO maxn DO
    Computation(j1, ..., jn);
  ENDFOR
  ...
ENDFOR

Perfectly nested loops; constant flow data dependencies (dependence matrix D).

7. Background. Tiling: a popular loop transformation that groups iterations into atomic units, enhances locality on uniprocessors, and enables coarse-grain parallelism on distributed memory systems. Valid tiling matrix H: (figure)

8. Tiling Transformation. Example:

FOR j1 = 0 TO 11 DO
  FOR j2 = 0 TO 8 DO
    A[j1, j2] := A[j1-1, j2] + A[j1-1, j2-1];
  ENDFOR
ENDFOR

9. Rectangular Tiling Transformation:

P = | 3  0 |        H = P^(-1) = | 1/3   0  |
    | 0  3 |                     |  0   1/3 |

10. Non-rectangular Tiling Transformation:

P = | 3  3 |        H = P^(-1) = | 1/3  -1/3 |
    | 0  3 |                     |  0    1/3 |

11. Why Non-rectangular Tiling? It reduces communication and enables more efficient scheduling schemes. In the depicted example: 8 communication points (rectangular) vs. 6 (non-rectangular), and 6 time steps vs. 5.

12. Overview: Introduction; Background; Code Generation: Computation/Data Distribution, Communication Schemes, Summary; Experimental Results; Conclusions; Future Work

13. Computation Distribution. We map tiles along the longest dimension to the same processor because it reduces the number of processors required, simplifies message passing, and reduces total execution time when overlapping computation with communication.

14. Computation Distribution (figure: tile space partitioned among processors P1, P2, P3)

15. Data Distribution. Computer-owns rule: each processor owns the data it computes. Applies to arbitrary convex iteration spaces and arbitrary tilings, and yields rectangular local iteration and data spaces.

16. Data Distribution (figure)

17. Data Distribution (figure)

18. Data Distribution (figure)

19. Overview: Introduction; Background; Code Generation: Computation/Data Distribution, Communication Schemes, Summary; Experimental Results; Conclusions; Future Work

20. Communication Schemes: with whom do I communicate? (figure: processors P1, P2, P3)

21. Communication Schemes: with whom do I communicate? (figure)

22. Communication Schemes: what do I send? (figure)

23. Blocking Scheme (figure: tile space j1 x j2, processors P1-P3; 12 time steps)

24. Non-blocking Scheme (figure: tile space j1 x j2, processors P1-P3; 6 time steps)

25. Overview: Introduction; Background; Code Generation: Computation/Data Distribution, Communication Schemes, Summary; Experimental Results; Conclusions; Future Work

26. Code Generation Summary. Advanced Scheduling = Suitable Tiling + Non-blocking Communication Scheme. Flow: Sequential Code -> Dependence Analysis -> Tiling Transformation -> Sequential Tiled Code -> Parallelization (Computation Distribution, Data Distribution, Communication Primitives) -> Parallel SPMD Code.

27. Code Summary: Blocking Scheme (figure)

28. Code Summary: Non-blocking Scheme (figure)

29. Overview: Introduction; Background; Code Generation: Computation/Data Distribution, Communication Schemes, Summary; Experimental Results; Conclusions; Future Work

30. Experimental Results. Platform: 8-node SMP Linux cluster (800 MHz Pentium III, 128 MB RAM, kernel 2.4.20); MPICH v1.2.5 (--with-device=p4, --with-comm=shared); g++ v2.95.4 (-O3); FastEthernet interconnection. Two 3D micro-kernel benchmarks: Gauss Successive Over-Relaxation (SOR) and Texture Smoothing Code (TSC). Simulation of communication schemes.

31. SOR. Iteration space: M x N x N. Dependence matrix: (figure). Rectangular tiling: (figure). Non-rectangular tiling: (figure).

32. SOR (figure)

33. SOR (figure)

34. TSC. Iteration space: T x N x N. Dependence matrix: (figure). Rectangular tiling: (figure). Non-rectangular tiling: (figure).

35. TSC (figure)

36. TSC (figure)

37. Overview: Introduction; Background; Code Generation: Computation/Data Distribution, Communication Schemes, Summary; Experimental Results; Conclusions; Future Work

38. Conclusions. Automatic code generation for arbitrarily tiled spaces can be efficient. High performance can be achieved by means of a suitable tiling transformation and by overlapping computation with communication.

39. Future Work. Application of the methodology to imperfectly nested loops and non-constant dependencies; investigation of hybrid programming models (MPI+OpenMP); performance evaluation on advanced interconnection networks (SCI, Myrinet).

40. Questions? http://www.cslab.ece.ntua.gr/~ndros
