Propelled Charm Tutorial - PowerPoint PPT Presentation

advanced charm tutorial l.
Skip this Video
Loading SlideShow in 5 Seconds..
Propelled Charm Tutorial PowerPoint Presentation
Propelled Charm Tutorial

play fullscreen
1 / 73
Download Presentation

Propelled Charm Tutorial

Presentation Transcript

  1. Advanced Charm++ Tutorial Charm Workshop Tutorial Sameer Kumar Orion Sky Lawlor 2004/10/19

  2. How to Become a Charm++ Hacker • Advanced Charm++ • Advanced Messaging • Writing system libraries • Groups • Delegation • Communication framework • Advanced load-balancing • Checkpointing • Threads • SDAG

  3. Advanced Messaging

  4. Prioritized Execution • If several messages available, Charm will process the message with highest priority • Otherwise, oldest message (FIFO) • Has no effect: • If only one message is available (common for network-bound applications!) • On outgoing messages • Very useful for speculative work, ordering timesteps, etc...

  5. Priority Classes • Charm++ scheduler has three queues: high, default, and low • As signed integer priorities: • -MAXINT Highest priority -- -1 • 0 Default priority • 1 -- +MAXINT Lowest priority • As unsigned bitvector priorities: • 0x0000 Highest priority -- 0x7FFF • 0x8000 Default priority • 0x8001 -- 0xFFFF Lowest priority

  6. Prioritized Marshalled Messages • Pass “CkEntryOptions” as last parameter • For signed integer priorities: CkEntryOptions opts; opts.setPriority(-1);,y,opts); • For bitvector priorities: CkEntryOptions opts; unsigned int prio[2]={0x7FFFFFFF,0xFFFFFFFF}; opts.setPriority(64,prio);,y,opts);

  7. Prioritized Messages • Number of priority bits passed during message allocation FooMsg * msg = new (size, nbits) FooMsg; • Priorities stored at the end of messages • Signed integer priorities: *CkPriorityPtr(msg)=-1; CkSetQueueing(m, CK_QUEUEING_IFIFO); • Unsigned bitvector priorities CkPriorityPtr(msg)[0]=0x7fffffff; CkSetQueueing(m, CK_QUEUEING_BFIFO);

  8. Advanced Message Features • Read-only messages • Entry method agrees not to modify or delete the message • Avoids message copy for broadcasts, saving time • Expedited messages • Message do not go through the charm++ scheduler (faster) • Immediate messages • Entries are executed in a interrupt or the communication thread • Very fast, but tough to get right

  9. Read-Only, Expedited, Immediate • All declared in the .ci file { .. ... entry [nokeep] void foo_readonly(Msg *); entry [expedited] void foo_exp(Msg *); entry [immediate] void foo_imm(Msg *); .. .. .. }; // Immediate messages only currently work //for NodeGroups

  10. Groups

  11. Object Groups • A collection of objects (chares) • Also called branch office chares • Exactly one representative on each processor • Ideally suited for system libraries • A single proxy for the group as a whole • Similar to arrays: • Broadcasts, reductions, indexing • But not completely like arrays: • Non-migratable; one per processor

  12. Declarations • .ci file group mygroup { entry mygroup(); //Constructor entry void foo(foomsg *); //Entry method }; • C++ file class mygroup : public Group { mygroup() {} void foo(foomsg *m) { CkPrintf(“Do Nothing”);} };

  13. Creating and Calling Groups • Creation p = CProxy_mygroup::ckNew(); • Remote invocation; //broadcast p[1].foo(msg); //asynchronous invocation • Direct local access mygroup *g=p.ckLocalBranch(); g->foo(….); //local invocation • Danger: if you migrate, the group stays behind!

  14. Delegation

  15. Delegation • Enables Charm++ proxy messages to be forwarded to a delegation manager group • Delegation manager can trap calls to proxy sends and apply optimizations • Delegation manager must inherit from CkDelegateMgr • User program must to call • proxy.ckDelegate(mgrID);

  16. Delegation Interface • .ci file group MyDelegateMgr { entry MyDelegateMgr(); //Constructor }; • .h file class MyDelegateMgr : public CkDelegateMgr { MyDelegateMgr(); void ArraySend(...,int ep,void *m,const CkArrayIndexMax &idx,CkArrayID a); void ArrayBroadcast(..); void ArraySectionSend(.., CkSectionID &s); …………….. …………….. }

  17. Communication Optimization

  18. Automatic Communication Optimizations • The parallel-objects Runtime System can observe, instrument, and measure communication patterns • Communication libraries can optimize • By substituting most suitable algorithm for each operation • Learning at runtime • E.g. All to all communication • Performance depends on many runtime characteristics • Library switches between different algorithms • Communication is from/to objects, not processors • Streaming messages optimization

  19. Managing Collective Communication • Communication operation where all (or most) the processors participate • For example broadcast, barrier, all reduce, all to all communication etc • Applications: NAMD multicast, NAMD PME, CPAIMD • Issues • Performance impediment • Naïve implementations often do not scale • Synchronous implementations do not utilize the co-processor effectively

  20. All to All Communication • All processors send data to all other processors • All to all personalized communication (AAPC) • MPI_Alltoall • All to all multicast/broadcast (AAMC) • MPI_Allgather

  21. Strategies For AAPC • Short message optimizations • High software over head (α) • Message combining • Large messages • Network contention

  22. Short Message Optimizations • Direct all to all communication is α dominated • Message combining for small messages • Reduce the total number of messages • Multistage algorithm to send messages along a virtual topology • Group of messages combined and sent to an intermediate processor which then forwards them to their final destinations • AAPC strategy may send same message multiple times

  23. Virtual Topology: Mesh Phase 1: Processors send messages to row neighbors Phase 2: Processors send messages to column neighbors 2* messages instead of P-1 Organizeprocessors in a 2D (virtual) Mesh Message from (x1,y1) to (x2,y2) goes via (x1,y2)

  24. AAPC Performance

  25. Large Message Issues • Network contention • Contention free schedules • Topology specific optimizations

  26. Ring Strategy for Collective Multicast …… …….. 0 1 2 i i+1 P-1 • Performs all to all multicast by sending messages along a ring formed by the processors • Congestion free on most topologies

  27. Streaming Messages • Programs often have streams of short messages • Streaming library combines a bunch of messages and sends them off • Stripping large charm++ header • Short array message packing • Effective message performance of about 3us

  28. Using communication library • Communication optimizations embodied as strategies • EachToManyMulticastStrategy • RingMulticast • PipeBroadcast • Streaming • MeshStreaming

  29. Bracketed vs. Non-bracketed • Bracketed Strategies • Require user to give specific end points for each iteration of message sends • Endpoints declared by calling ComlibBegin() and ComlibEnd() • Examples: EachToManyMulticast • Non bracketed strategies • No such end points necessary • Examples: Streaming, PipeBroadcast

  30. Accessing the Communication Library • From mainchare::main • Creating a strategy Strategy *strat = new EachToManyMulticastStrategy(USE_MESH) Strat = new StreamingStrategy(); Strat->enableShortMessagePacking(); • Associating a proxy with a Strategy ComlibAssociateProxy(strat, myproxy); • myproxy should be passed to all array elements

  31. Sending Messages ComlibBegin(myproxy);//Bracketed Strategies for(.. ..) { .. ..; .. .. } ComlibEnd(); //Bracketed strategies

  32. Handling Migration • Migrating array element PUP’s the comlib associated proxy FooArray::pup(PUP::er &p) { .. .. .. p | myProxy; .. .. .. }

  33. Compiling • You must include compile time option –module commlib

  34. Advanced Load-balancersWriting a Load-balancing Strategy

  35. Advanced load balancing: Writing a new strategy • Inherit from CentralLB and implement the work(…) function class foolb : public CentralLB { public: .. .. .. void work (CentralLB::LDStats* stats, int count); .. .. .. };

  36. LB Database struct LDStats { ProcStats *procs; LDObjData* objData; LDCommData* commData; int *to_proc; //.. .. .. } //Dummy Work function which assigns all objects to //processor 0 //Don’t implement it! void fooLB::work(CentralLB::LDStats* stats,int count){ for(int count=0;count < nobjs; count++) stats.to_proc[count] = 0; }

  37. Compiling and Integration • Edit and run • Creates which is included by the LDB Makefile • Run make depends to correct dependencies • Rebuild charm++

  38. Checkpoint Restart

  39. Checkpoint/Restart • Any long running application must be able to save its state • When you checkpoint an application, it uses the pup routine to store the state of all objects • State information is saved in a directory of your choosing • Restore also uses pup, so no additional application code is needed (pup is all you need)

  40. Checkpointing Job • In AMPI, use MPI_Checkpoint(<dir>); • Collective call; returns when checkpoint is complete • In Charm++, use CkCheckpoint(<dir>,<resume>); • Called on one processor; calls resume when checkpoint is complete

  41. Restart Job from Checkpoint • The charmrun option ++restart <dir> is used to restart • Number of processors need not be the same • You can also restart groups by marking them migratable and writing a PUP routine – they still will not load balance, though

  42. Threads

  43. Why use Threads? • They provide one key feature: blocking • Suspend execution (e.g., at message receive) • Do something else • Resume later (e.g., after message arrives) • Example: MPI_Recv, MPI_Wait semantics • Function call interface more convenient than message-passing • Regular call/return structure (no CkCallbacks) • Allows blocking in middle of deeply nested communication subroutine

  44. Why not use Threads? • Slower • Around 1us context-switching overhead unavoidable • Creation/deletion perhaps 10us • More complexity, more bugs • Breaks a lot of machines! (but we have workarounds) • Migration more difficult • State of thread is scattered through stack, which is maintained by compiler • By contrast, state of object is maintained by users • Thread disadvantages form the motivation to use SDAG (later)

  45. What are (Charm) Threads? • One flow of control (instruction stream) • Machine Registers & program counter • Execution stack • Like pthreads (kernel threads) • Only different: • Implemented at user level (in Converse) • Scheduled at user level; non-preemptive • Migratable between nodes

  46. How do I use Threads? • Many options: • AMPI • Always uses threads via TCharm library • Charm++ • [threaded] entry methods run in a thread • [sync] methods • Converse • C routines CthCreate/CthSuspend/CthAwaken • Everything else is built on these • Implemented using • SYSV makecontext/setcontext • POSIX setjmp/alloca/longjmp

  47. How do I use Threads (example) • Blocking API routine: find array element int requestFoo(int src) { myObject *obj=...; return obj->fooRequest(src) } • Send request and suspend int myObject::fooRequest(int src) { proxy[dest].fooNetworkRequest(thisIndex); stashed_thread=CthSelf(); CthSuspend(); // -- blocks until awaken call -- return stashed_return; } • Awaken thread when data arrives void myObject::fooNetworkResponse(int ret) { stashed_return=ret; CthAwaken(stashed_thread); }

  48. How do I use Threads (example) • Send request, suspend, recv, awaken, return int myObject::fooRequest(int src) { proxy[dest].fooNetworkRequest(thisIndex); stashed_thread=CthSelf(); CthSuspend(); return stashed_return; } void myObject::fooNetworkResponse(int ret) { stashed_return=ret; CthAwaken(stashed_thread); }

  49. The Horror of Thread Migration

  50. Stack Data • The stack is used by the compiler to track function calls and provide temporary storage • Local Variables • Subroutine Parameters • C “alloca” storage • Most of the variables in a typical application are stack data • Users have no control over how stack is laid out