# Dynamic Load Balancing on Graphics Processors

A study conducted by Chalmers University of Technology on methods and experimental evaluation of dynamic load balancing on graphics processors. The problem setting involves tasks both offline and online with static load balancing on processors.

- Uploaded on | 6 Views
- pramitha

## About Dynamic Load Balancing on Graphics Processors

PowerPoint presentation about 'Dynamic Load Balancing on Graphics Processors'. This presentation describes the topic on A study conducted by Chalmers University of Technology on methods and experimental evaluation of dynamic load balancing on graphics processors. The problem setting involves tasks both offline and online with static load balancing on processors.. The key topics included in this slideshow are . Download this presentation absolutely free.

## Presentation Transcript

1. On Dynamic Load Balancing on Graphics Processors Daniel Cederman and Philippas Tsigas Chalmers University of Technology

2. Overview Motivation Methods Experimental evaluation Conclusion

3. The problem setting Work Task Task Task Task Task Task Task Task Task Task Task Offline Online

4. Static Load Balancing Processor Processor Processor Processor Processor Processor Processor Processor

5. Static Load Balancing Processor Processor Processor Processor Processor Processor Processor Processor Task Task Task Task Task Task Task Task

6. Static Load Balancing Processor Processor Processor Processor Processor Processor Processor Processor Task Task Task Task Task Task Task Task

7. Static Load Balancing Processor Processor Processor Processor Processor Processor Processor Processor Task Task Task Task Task Task Task Task Subtask Subtask Subtask Subtask Subtask Subtask Subtask Subtask

8. Static Load Balancing Processor Processor Processor Processor Processor Processor Processor Processor Task Task Task Task Task Task Task Task Subtask Subtask Subtask Subtask Subtask Subtask Subtask Subtask

9. Dynamic Load Balancing Processor Processor Processor Processor Processor Processor Processor Processor Task Task Task Task Task Task Task Task Subtask Subtask Subtask Subtask Subtask Subtask Subtask Subtask

10. Task sharing Work done? Try to get task New tasks? Perform task Got task? Add task Task Set No, retry Check condition Acquire Task Add Task No, continue Task Task Task Task Task Task Task Task Task Task Done

11. System Model CUDA Global Memory Gather and scatter Compare-And-Swap Fetch-And-Inc Multiprocessors Maximum number of concurrent thread blocks Multi- processor Thread Block Thread Block Thread Block Multi- processor Thread Block Thread Block Thread Block Multi- processor Thread Block Thread Block Thread Block Global Memory

12. Synchronization Blocking Uses mutual exclusion to only allow one process at a time to access the object. Lockfree Multiple processes can access the object concurrently. At least one operation in a set of concurrent operations nishes in a nite number of its own steps. Waitfree Multiple processes can access the object concurrently. Every operation nishes in a nite number of its own steps.

13. Load Balancing Methods Blocking Task Queue Non-blocking Task Queue Task Stealing Static Task List

14. Blocking queue TB 1 TB 1 TB 2 TB 2 TB n TB n Free Free Head Tail

15. Blocking queue TB 1 TB 1 TB 2 TB 2 TB n TB n Free Free Head Tail

16. Blocking queue T 1 TB 1 TB 1 TB 2 TB 2 TB n TB n Free Free Head Tail

17. Blocking queue T 1 TB 1 TB 1 TB 2 TB 2 TB n TB n Free Free Head Tail

18. Blocking queue T 1 TB 1 TB 1 TB 2 TB 2 TB n TB n Free Free Head Tail

19. Non-blocking Queue T 1 T 2 T 3 T 4 TB 1 TB 1 TB 2 TB 2 TB 1 TB 1 TB 2 TB 2 TB n TB n Head Tail Reference P. Tsigas and Y. Zhang, A simple, fast and scalable non- blocking concurrent FIFO queue for shared memory multiprocessor systems [SPAA01]

20. Non-blocking Queue T 1 T 2 T 3 T 4 TB 1 TB 1 TB 2 TB 2 TB 1 TB 1 TB 2 TB 2 TB n TB n Head Tail

21. Non-blocking Queue T 1 T 2 T 3 T 4 TB 1 TB 1 TB 2 TB 2 TB 1 TB 1 TB 2 TB 2 TB n TB n Head Tail

22. Non-blocking Queue T 1 T 2 T 3 T 4 TB 1 TB 1 TB 2 TB 2 TB 1 TB 1 TB 2 TB 2 TB n TB n Head Tail

23. Non-blocking Queue T 1 T 2 T 3 T 4 T 5 TB 1 TB 1 TB 2 TB 2 TB 1 TB 1 TB 2 TB 2 TB n TB n Head Tail

24. Non-blocking Queue T 1 T 2 T 3 T 4 T 5 TB 1 TB 1 TB 2 TB 2 TB 1 TB 1 TB 2 TB 2 TB n TB n Head Tail

25. Task stealing T 1 T 3 T 2 TB 1 TB 1 TB 2 TB 2 TB n TB n Reference Arora N. S., Blumofe R. D., Plaxton C. G. , Thread Scheduling for Multiprogrammed Multiprocessors [SPAA 98]

26. Task stealing T 1 T 4 T 3 T 2 TB 1 TB 1 TB 2 TB 2 TB n TB n

27. Task stealing T 1 T 4 T 5 T 3 T 2 TB 1 TB 1 TB 2 TB 2 TB n TB n

28. Task stealing T 1 T 4 T 3 T 2 TB 1 TB 1 TB 2 TB 2 TB n TB n

29. Task stealing T 1 T 3 T 2 TB 1 TB 1 TB 2 TB 2 TB n TB n

30. Task stealing T 3 T 2 TB 1 TB 1 TB 2 TB 2 TB n TB n

31. Task stealing T 2 TB 1 TB 1 TB 2 TB 2 TB n TB n

32. Static Task List T 1 T 2 T 3 T 4 In

33. Static Task List T 1 T 2 T 3 T 4 In TB 1 TB 1 TB 2 TB 2 TB 3 TB 3 TB 4 TB 4

34. Static Task List T 1 T 2 T 3 T 4 In Out TB 1 TB 1 TB 2 TB 2 TB 3 TB 3 TB 4 TB 4

35. Static Task List T 1 T 2 T 3 T 4 T 5 In Out TB 1 TB 1 TB 2 TB 2 TB 3 TB 3 TB 4 TB 4

36. Static Task List T 1 T 2 T 3 T 4 T 5 T 6 In Out TB 1 TB 1 TB 2 TB 2 TB 3 TB 3 TB 4 TB 4

37. Static Task List T 1 T 2 T 3 T 4 T 5 T 6 T 7 In Out TB 1 TB 1 TB 2 TB 2 TB 3 TB 3 TB 4 TB 4

38. Octree Partitioning Bandwidth bound

39. Octree Partitioning Bandwidth bound

40. Octree Partitioning Bandwidth bound

41. Octree Partitioning Bandwidth bound

42. Four-in-a-row Computation intensive

43. Graphics Processors 8800GT 14 Multiprocessors 57 GB/sec bandwidth 9600GT 8 Multiprocessors 57 GB/sec bandwidth

44. Blocking Queue Octree/9600GT

45. Blocking Queue Octree/8800GT

46. Blocking Queue Four-in-a-row

47. Non-blocking Queue Octree/9600GT

48. Non-blocking Queue Octree/8800GT

49. Non-blocking Queue - Four-in-a-row

50. Task stealing Octree/9600GT

51. Task stealing Octree/8800GT

52. Task stealing Four-in-a-row

53. Static List

54. Octree Comparison

55. Previous work Korch M., Raubert T., A comparison of task pools for dynamic load balancing of irregular algorithms, Concurrency and Computation: Practice & Experience, 16, 2003 Heirich A., Arvo J., A competetive analysis of load balancing strategies for parallel ray tracing, Journal of Supercomputing, 12, 1998 Foley T., Sugerman J., KD-tree acceleration structures for a GPU raytracer , Graphics Hardware 2005

56. Conclusion Synchronization plays a significant role in dynamic load- balancing Lock-free data structures/synchronization scales well and looks promising also in the GPU general purpose programming Locks perform poorly It is good that operations such as CAS and FAA have been introduced in the new GPUs Work stealing could outperform static load balancing

57. Thank you! http://www.cs.chalmers.se/~dcs