Dynamic Load Balancing on Graphics Processors


A study conducted by Chalmers University of Technology on methods and experimental evaluation of dynamic load balancing on graphics processors. The problem setting involves tasks both offline and online with static load balancing on processors.
- Uploaded on | 6 Views
-
pramitha
About Dynamic Load Balancing on Graphics Processors
PowerPoint presentation about 'Dynamic Load Balancing on Graphics Processors'. This presentation describes the topic on A study conducted by Chalmers University of Technology on methods and experimental evaluation of dynamic load balancing on graphics processors. The problem setting involves tasks both offline and online with static load balancing on processors.. The key topics included in this slideshow are . Download this presentation absolutely free.
Presentation Transcript
1. On Dynamic Load Balancing on Graphics Processors Daniel Cederman and Philippas Tsigas Chalmers University of Technology
2. Overview Motivation Methods Experimental evaluation Conclusion
3. The problem setting Work Task Task Task Task Task Task Task Task Task Task Task Offline Online
4. Static Load Balancing Processor Processor Processor Processor Processor Processor Processor Processor
5. Static Load Balancing Processor Processor Processor Processor Processor Processor Processor Processor Task Task Task Task Task Task Task Task
6. Static Load Balancing Processor Processor Processor Processor Processor Processor Processor Processor Task Task Task Task Task Task Task Task
7. Static Load Balancing Processor Processor Processor Processor Processor Processor Processor Processor Task Task Task Task Task Task Task Task Subtask Subtask Subtask Subtask Subtask Subtask Subtask Subtask
8. Static Load Balancing Processor Processor Processor Processor Processor Processor Processor Processor Task Task Task Task Task Task Task Task Subtask Subtask Subtask Subtask Subtask Subtask Subtask Subtask
9. Dynamic Load Balancing Processor Processor Processor Processor Processor Processor Processor Processor Task Task Task Task Task Task Task Task Subtask Subtask Subtask Subtask Subtask Subtask Subtask Subtask
10. Task sharing Work done? Try to get task New tasks? Perform task Got task? Add task Task Set No, retry Check condition Acquire Task Add Task No, continue Task Task Task Task Task Task Task Task Task Task Done
11. System Model CUDA Global Memory Gather and scatter Compare-And-Swap Fetch-And-Inc Multiprocessors Maximum number of concurrent thread blocks Multi- processor Thread Block Thread Block Thread Block Multi- processor Thread Block Thread Block Thread Block Multi- processor Thread Block Thread Block Thread Block Global Memory
12. Synchronization Blocking Uses mutual exclusion to only allow one process at a time to access the object. Lockfree Multiple processes can access the object concurrently. At least one operation in a set of concurrent operations nishes in a nite number of its own steps. Waitfree Multiple processes can access the object concurrently. Every operation nishes in a nite number of its own steps.
13. Load Balancing Methods Blocking Task Queue Non-blocking Task Queue Task Stealing Static Task List
14. Blocking queue TB 1 TB 1 TB 2 TB 2 TB n TB n Free Free Head Tail
15. Blocking queue TB 1 TB 1 TB 2 TB 2 TB n TB n Free Free Head Tail
16. Blocking queue T 1 TB 1 TB 1 TB 2 TB 2 TB n TB n Free Free Head Tail
17. Blocking queue T 1 TB 1 TB 1 TB 2 TB 2 TB n TB n Free Free Head Tail
18. Blocking queue T 1 TB 1 TB 1 TB 2 TB 2 TB n TB n Free Free Head Tail
19. Non-blocking Queue T 1 T 2 T 3 T 4 TB 1 TB 1 TB 2 TB 2 TB 1 TB 1 TB 2 TB 2 TB n TB n Head Tail Reference P. Tsigas and Y. Zhang, A simple, fast and scalable non- blocking concurrent FIFO queue for shared memory multiprocessor systems [SPAA01]
20. Non-blocking Queue T 1 T 2 T 3 T 4 TB 1 TB 1 TB 2 TB 2 TB 1 TB 1 TB 2 TB 2 TB n TB n Head Tail
21. Non-blocking Queue T 1 T 2 T 3 T 4 TB 1 TB 1 TB 2 TB 2 TB 1 TB 1 TB 2 TB 2 TB n TB n Head Tail
22. Non-blocking Queue T 1 T 2 T 3 T 4 TB 1 TB 1 TB 2 TB 2 TB 1 TB 1 TB 2 TB 2 TB n TB n Head Tail
23. Non-blocking Queue T 1 T 2 T 3 T 4 T 5 TB 1 TB 1 TB 2 TB 2 TB 1 TB 1 TB 2 TB 2 TB n TB n Head Tail
24. Non-blocking Queue T 1 T 2 T 3 T 4 T 5 TB 1 TB 1 TB 2 TB 2 TB 1 TB 1 TB 2 TB 2 TB n TB n Head Tail
25. Task stealing T 1 T 3 T 2 TB 1 TB 1 TB 2 TB 2 TB n TB n Reference Arora N. S., Blumofe R. D., Plaxton C. G. , Thread Scheduling for Multiprogrammed Multiprocessors [SPAA 98]
26. Task stealing T 1 T 4 T 3 T 2 TB 1 TB 1 TB 2 TB 2 TB n TB n
27. Task stealing T 1 T 4 T 5 T 3 T 2 TB 1 TB 1 TB 2 TB 2 TB n TB n
28. Task stealing T 1 T 4 T 3 T 2 TB 1 TB 1 TB 2 TB 2 TB n TB n
29. Task stealing T 1 T 3 T 2 TB 1 TB 1 TB 2 TB 2 TB n TB n
30. Task stealing T 3 T 2 TB 1 TB 1 TB 2 TB 2 TB n TB n
31. Task stealing T 2 TB 1 TB 1 TB 2 TB 2 TB n TB n
32. Static Task List T 1 T 2 T 3 T 4 In
33. Static Task List T 1 T 2 T 3 T 4 In TB 1 TB 1 TB 2 TB 2 TB 3 TB 3 TB 4 TB 4
34. Static Task List T 1 T 2 T 3 T 4 In Out TB 1 TB 1 TB 2 TB 2 TB 3 TB 3 TB 4 TB 4
35. Static Task List T 1 T 2 T 3 T 4 T 5 In Out TB 1 TB 1 TB 2 TB 2 TB 3 TB 3 TB 4 TB 4
36. Static Task List T 1 T 2 T 3 T 4 T 5 T 6 In Out TB 1 TB 1 TB 2 TB 2 TB 3 TB 3 TB 4 TB 4
37. Static Task List T 1 T 2 T 3 T 4 T 5 T 6 T 7 In Out TB 1 TB 1 TB 2 TB 2 TB 3 TB 3 TB 4 TB 4
38. Octree Partitioning Bandwidth bound
39. Octree Partitioning Bandwidth bound
40. Octree Partitioning Bandwidth bound
41. Octree Partitioning Bandwidth bound
42. Four-in-a-row Computation intensive
43. Graphics Processors 8800GT 14 Multiprocessors 57 GB/sec bandwidth 9600GT 8 Multiprocessors 57 GB/sec bandwidth
44. Blocking Queue Octree/9600GT
45. Blocking Queue Octree/8800GT
46. Blocking Queue Four-in-a-row
47. Non-blocking Queue Octree/9600GT
48. Non-blocking Queue Octree/8800GT
49. Non-blocking Queue - Four-in-a-row
50. Task stealing Octree/9600GT
51. Task stealing Octree/8800GT
52. Task stealing Four-in-a-row
53. Static List
54. Octree Comparison
55. Previous work Korch M., Raubert T., A comparison of task pools for dynamic load balancing of irregular algorithms, Concurrency and Computation: Practice & Experience, 16, 2003 Heirich A., Arvo J., A competetive analysis of load balancing strategies for parallel ray tracing, Journal of Supercomputing, 12, 1998 Foley T., Sugerman J., KD-tree acceleration structures for a GPU raytracer , Graphics Hardware 2005
56. Conclusion Synchronization plays a significant role in dynamic load- balancing Lock-free data structures/synchronization scales well and looks promising also in the GPU general purpose programming Locks perform poorly It is good that operations such as CAS and FAA have been introduced in the new GPUs Work stealing could outperform static load balancing
57. Thank you! http://www.cs.chalmers.se/~dcs