Adaptive MPI: Intelligent Runtime Strategies and Performance Prediction via Simulation

Presentation Transcript

  1. Adaptive MPI: Intelligent Runtime Strategies and Performance Prediction via Simulation. Laxmikant Kale, http://charm.cs.uiuc.edu, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign. ICL@UTK

  2. PPL Mission and Approach
  • To enhance performance and productivity in programming complex parallel applications
  • Performance: scalable to thousands of processors
  • Productivity: of human programmers
  • Complex: irregular structure, dynamic variations
  • Approach: application-oriented yet CS-centered research
  • Develop enabling technology for a wide collection of apps; develop, use, and test it in the context of real applications
  • How? Develop novel parallel programming techniques and embody them in easy-to-use abstractions, so application scientists can use advanced techniques with ease
  • Enabling technology: reused across many apps

  3. Develop abstractions in the context of full-scale applications. The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE: Protein Folding, Quantum Chemistry (QM/MM), Molecular Dynamics, Computational Cosmology, Crack Propagation, Dendritic Growth, Space-time Meshes, and Rocket Simulation, all built on parallel objects, the adaptive runtime system, and its libraries and tools.

  4. Benefits of Migratable Objects (aka Processor Virtualization)
  • Programmer: [over]decomposition into virtual processors; Runtime: assigns VPs to processors, enabling adaptive runtime strategies. Implementations: Charm++, AMPI
  • Software engineering: the number of virtual processors can be controlled independently; separate VPs for different modules
  • Message-driven execution: adaptive overlap of communication; predictability enables automatic out-of-core execution; asynchronous reductions
  • Dynamic mapping: heterogeneous clusters (vacate, adjust to speed, share); automatic checkpointing; changing the set of processors used; automatic dynamic load balancing; communication optimization
  (Figure: user view of migratable objects vs. the system implementation)

  5. Outline
  • Adaptive MPI
  • Load balancing
  • Fault tolerance
  • Projections: performance analysis
  • Performance prediction: BigSim

  6. AMPI: MPI with Virtualization
  • MPI “processes” are implemented as virtual processes (user-level migratable threads), mapped onto real processors
  • Each virtual process is implemented as a user-level thread embedded in a Charm++ object

  7. Making AMPI Work
  • Multiple user-level threads per processor: problems with global variables
  • Solution 1 (automatic): switch the GOT pointer at context switch; available on most machines
  • Solution 2 (manual): replace global variables
  • Solution 3 (automatic): via compiler support (AMPIzer)
  • Migrating stacks: use the isomalloc technique (Mehaut et al.), with memory files and mmap()
  • Heap data: isomalloc heaps, or user-supplied pack/unpack functions for the heap data

  8. ELF and Global Variables
  • In the Executable and Linking Format (ELF), the executable has a Global Offset Table (GOT) containing global data
  • The GOT pointer is stored in the %ebx register; switch this pointer when switching between threads
  • Supported on Linux, Solaris 2.x, and more
  • Integrated in Charm++/AMPI; invoked with the compile-time option -swapglobal

  9. Adaptive Overlap and Modules: SPMD and Message-Driven Modules (from A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, April 1994; and Modularity, Reuse, and Efficiency with Message-Driven Libraries, Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995)

  10. Benefit of Adaptive Overlap. Problem setup: 3D stencil calculation of size 240^3 run on Lemieux. Shows AMPI with virtualization ratios of 1 and 8.

  11. Comparison with Native MPI
  • Performance: slightly worse without optimization; being improved
  • Flexibility: useful when only a small number of PEs is available, or when the algorithm has special requirements
  • Problem setup: 3D stencil calculation of size 240^3 run on Lemieux. AMPI runs on any # of PEs (e.g. 19, 33, 105); native MPI needs a cube number of PEs.

  12. AMPI Extensions • Automatic load balancing • Non-blocking collectives • Checkpoint/restart mechanism • Multi-module programming

  13. Load Balancing in AMPI • Automatic load balancing via MPI_Migrate() • A collective call informing the load balancer that the thread is ready to be migrated, if needed.

  14. Load Balancing Steps (timeline): regular timesteps; instrumented timesteps; detailed, aggressive load balancing (object migration); refinement load balancing

  15. Load Balancing: aggressive load balancing vs. refinement load balancing. Processor utilization against time on 128 and 1024 processors. On 128 processors a single load-balancing step suffices, but on 1024 processors we need a “refinement” step.

  16. Shrink/Expand • Problem: the availability of the computing platform may change • Fit applications onto the platform by object migration. Figure: time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600.

  17. Radix Sort: Optimized All-to-All “Surprise” • Completion time vs. computation overhead: in a 76-byte all-to-all on Lemieux, the CPU is free during most of the time taken by the collective operation • Led to the development of asynchronous collectives, now supported in AMPI. Figure: AAPC completion time (ms) vs. message size (100B to 8KB), for the Mesh and Direct strategies.

  18. Asynchronous Collectives • Our implementation is asynchronous: the collective operation is posted, and the caller tests/waits for its completion • Meanwhile, useful computation can utilize the CPU: MPI_Ialltoall( … , &req); /* other computation */ MPI_Wait(&req, &status);

  19. Fault Tolerance

  20. Motivation • Applications need fast, low-cost, and scalable fault-tolerance support • As machines grow in size, MTBF decreases, and applications have to tolerate faults • Our research: disk-based checkpoint/restart; in-memory double checkpoint/restart; sender-based message logging; proactive response to fault prediction (impending-fault response)

  21. Checkpoint/Restart Mechanism
  • Automatic checkpointing for AMPI and Charm++: migrate objects to disk!
  • Automatic fault detection and restart; now available in the distribution version of AMPI and Charm++
  • Blocking coordinated checkpoint: states of chares are checkpointed to disk via the collective call MPI_Checkpoint(DIRNAME)
  • On failure, the entire job is restarted; virtualization allows restarting on a different # of processors
  • Runtime option: ./charmrun pgm +p4 +vp16 +restart DIRNAME
  • Simple but effective for common cases

  22. In-Memory Double Checkpoint
  • In-memory checkpointing is faster than disk
  • Coordinated checkpoint via the simple call MPI_MemCheckpoint(void); the user can decide what makes up useful state
  • Double checkpointing: each object maintains 2 checkpoints, one on the local physical processor and one on a remote “buddy” processor
  • For jobs with large memory: use local disks! Figure: 32 processors with 1.5GB of memory each.

  23. Restart • A “dummy” process is created: it need not have application data or a checkpoint, but it is necessary for the runtime and starts recovery on all other processors • Other processors: remove all chares, restore the checkpoints lost on the crashed PE, and restore chares from local checkpoints • Load balance after restart

  24. Restart Performance • 10 crashes • 128 processors • Checkpoint every 10 time steps

  25. Scalable Fault Tolerance
  • Motivation: when one processor out of 100,000 fails, the other 99,999 shouldn’t have to run back to their checkpoints! Only the failed processor’s objects recover from checkpoints, playing back their messages, while the others “continue”
  • How? Sender-side message logging; asynchronous checkpoints on buddy processors; latency tolerance mitigates costs; restart can be sped up by spreading out objects from the failed processor
  • Current progress: the basic scheme has been implemented and tested in simple programs; a general-purpose implementation is in progress

  26. Recovery Performance: execution time with increasing number of faults on 8 processors (checkpoint period 30s)

  27. Projections: Performance Visualization and Analysis Tool

  28. An Introduction to Projections • Performance analysis tool for Charm++-based applications • Automatic trace instrumentation • Post-mortem visualization • Multiple advanced features that support data-volume control and generation of additional user data

  29. Trace Generation (automatic instrumentation by the runtime system)
  • Detailed: in log mode, each event is recorded in full detail (including timestamp) in an internal buffer
  • Summary: reduces the size of output files and memory overhead; produces a few lines of output data per processor, recorded in bins corresponding to intervals of 1 ms by default
  • Flexible: APIs and runtime options for instrumenting user events and controlling data generation

  30. The Summary View • Provides a view of the overall utilization of the application • Very quick to load

  31. Graph View • Features: selectively view entry points; convenient means to switch between axis data types

  32. Timeline • The most detailed view in Projections • Useful for understanding critical path issues or unusual entry point behaviors at specific times

  33. Animations

  34.

  35.

  36.

  37. Time Profile • The Time Profile tool identified a portion of CPAIMD (a quantum chemistry code) that ran too early; solved by prioritizing entry methods

  38.

  39.

  40. Overview: one line for each processor, time on the X-axis. White: busy; black: idle; red: intermediate

  41. A boring but good-performance overview

  42. An interesting but pathetic overview

  43. Stretch Removal. Histogram views: number of function executions vs. their granularity (note: log scale on the Y-axis). Before optimizations: over 16 large stretched calls. After optimizations: about 5 large stretched calls, the largest of them much smaller, and almost all calls take less than 3.2 ms.

  44. Miscellaneous Features: Color Selection • Colors are supplied automatically by default • We allow users to select their own colors and save them • These colors can then be restored the next time Projections loads

  45. User APIs
  • Controlling trace generation: void traceBegin(), void traceEnd()
  • Tracing user events: int traceRegisterUserEvent(char *, int), void traceUserEvent(int), void traceUserBracketEvent(int, double, double), double CmiWallTimer()
  • Runtime options: +traceoff, +traceroot <directory>
  • Projections mode only: +logsize <# entries>, +gz-trace
  • Summary mode only: +bincount <# of intervals>, +binsize <interval time quanta (us)>

  46. Performance Prediction via Parallel Simulation

  47. BigSim: Performance Prediction
  • Extremely large parallel machines are already around, or about to be available: ASCI Purple (12k, 100TF), BlueGene/L (64k, 360TF), BlueGene/C (1M, 1PF)
  • How do we write a petascale application? What will its performance be like? Would existing parallel applications scale?
  • The machines are not there yet, and parallel performance is hard to model without actually running the program

  48. Objectives and Simulation Model
  • Objectives: develop techniques to facilitate the development of efficient petascale applications, based on performance prediction of applications on large simulated parallel machines
  • Simulation-based performance prediction: focuses on the Charm++ and AMPI programming models; prediction is based on PDES
  • Supports varying levels of fidelity: processor prediction, network prediction
  • Modes of execution: online and post-mortem

  49. Blue Gene Emulator/Simulator
  • Actually BigSim, for simulation of any large machine using smaller parallel machines
  • Emulator: allows development of the programming environment and algorithms before the machine is built; allowed us to port Charm++ to the real BG/L in 1-2 days
  • Simulator: allows performance analysis of programs running on large machines, without access to those machines; uses parallel discrete event simulation

  50. Architecture of BigNetSim