Direct3D 10 and Beyond: The Evolution of GPU

Explore the latest step in GPU evolution with the Direct3D 10 system, a large and complex system coming to millions of PCs. The presentation discusses motivations, prior work, and the trend toward increasing programmability: high level shading languages and more CPU-like features.

Presentation Transcript


1. Direct3D 10 and Beyond. Peter-Pike Sloan, Microsoft Corporation.

2. The Direct3D 10 System: the latest step in GPU evolution; coming to millions of PCs near you; a large, complex system. This talk gives a general overview and a few highlights, covers motivations, and discusses current post-Direct3D 10 thoughts.

3. Prior Work: a timeline of increasing programmability. Fixed-function hardware (before 2001): Direct3D 7, OpenGL 1.4. Programmable vertex processing (2001): Direct3D 8, OpenGL 1.5. Programmable fragment processing (2002-3): Direct3D 9, OpenGL 2.0. Primitive processing and unified programming (2004+): Direct3D 10. Programming model over the same period: ad hoc multipass, then assembly programming, then high level shading languages, then more CPU-like features.

4. Design Process: collaboration between the DirectX team, application developers (ISVs 1..n), and hardware developers (IHVs 1..m). Iterative process: start in spring 2003; spec in fall 2004; hardware implementations in 2006.

5. Constraints & Problems. Preserve (goal: performance/$$): data parallelism, memory system efficiency, coherence, determinism. Improve (goal: visual complexity): state change agility, implementation consistency, program expressiveness, resource limitations, CPU offload.

6. Guiding Decisions: narrow the gap between abstraction and implementation; improve overall system efficiency. Avoid undefined behavior; avoid de facto defined behavior problems. Avoid promising generality that can't be delivered: if you specify CPU generality, you will get CPU performance. No new API support for older hardware: this allows a fixed feature set and tighter behavior compliance. Cull unnecessary fixed functions: performance-per-watt and -per-$$ inform what to retain.

7. System Architecture: logical pipeline, programmer's view. [Diagram: Input Assembler -> Vertex Shader -> Geometry Shader -> Stream Out and Setup/Rasterization -> Pixel Shader -> Output Merger; memory resources: vertex buffer, index buffer, textures, stream-out buffer, depth, color.]

8. System Architecture, Input Assembler: fixed-function; canonicalizes vertex data; generates IDs (primitive, vertex, instance).

9. System Architecture, Vertex Shader: programmable; vertex transformations; 1 vertex in, 1 out; can read from memory.

10. System Architecture, Geometry Shader: new, programmable; per-primitive processing; 1 primitive in, k primitives out; can read from memory.

11. System Architecture, Stream Out: new, fixed-function; diverts primitive data to 1D buffers; 1 in, 1 out; writes to memory.

12. System Architecture, Setup/Rasterization: fixed-function; clipping, divide by w; converts primitives to fragments; 1 primitive in, m fragments out.

13. System Architecture, Pixel Shader: programmable; shades fragments; 1 fragment in, 0 or 1 out; can read from memory.

14. System Architecture, Output Merger: fixed-function; depth/stencil tests; color buffer blending; read/modify/write to memory.

15. System Architecture, common programmable core: same ISA for all shader stages; flexible memory objects, reusable at different stages; array forms of memory objects, with indexes generated in shaders.

16. Geometry Shader: entire primitive as input; adjacency is optional; outputs zero or more primitives; 1024 scalars out max.
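
The 1024-scalar output limit translates directly into a cap on how many vertices one geometry shader invocation can emit. The arithmetic can be sketched as follows; the vertex layouts used here are hypothetical examples, not layouts from the talk.

```python
# Output budget stated on the slide: at most 1024 32-bit scalars
# per geometry shader invocation.
GS_MAX_OUTPUT_SCALARS = 1024


def max_output_vertices(scalars_per_vertex: int) -> int:
    """Largest vertex count one GS invocation can emit within the budget."""
    if scalars_per_vertex <= 0:
        raise ValueError("scalars_per_vertex must be positive")
    return GS_MAX_OUTPUT_SCALARS // scalars_per_vertex


# Hypothetical layout: float4 position + float3 normal + float2 texcoord.
print(max_output_vertices(4 + 3 + 2))  # 113
# Hypothetical layout: float4 position only.
print(max_output_vertices(4))          # 256
```

Fatter vertices therefore mean less amplification per invocation, which is one reason to keep geometry shader outputs lean.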

17. Geometry Shader, programmable setup: generate barycentric coordinates and interpolate an arbitrary amount of data downstream. Example: quadratic interpolation over triangles, with data stored/computed at edge midpoints; the basis functions are simple polynomials of the barycentric coordinates; analytic gradients. [Diagram: triangle with corners (0,0), (1,0), (0,1).]
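
As a rough illustration of quadratic interpolation over a triangle, the sketch below uses the standard six-node quadratic basis (three corners plus three edge midpoints) written in barycentric coordinates. The slide does not spell out its basis, so this particular choice and the function names are assumptions.

```python
def quadratic_basis(b0, b1):
    """Six quadratic basis weights for barycentric coordinates (b0, b1, b2)."""
    b2 = 1.0 - b0 - b1
    return [
        b0 * (2.0 * b0 - 1.0),  # corner 0
        b1 * (2.0 * b1 - 1.0),  # corner 1
        b2 * (2.0 * b2 - 1.0),  # corner 2
        4.0 * b0 * b1,          # midpoint of edge 0-1
        4.0 * b1 * b2,          # midpoint of edge 1-2
        4.0 * b2 * b0,          # midpoint of edge 2-0
    ]


def interpolate(node_values, b0, b1):
    """Blend values stored at the 3 corners and 3 edge midpoints."""
    return sum(w * v for w, v in zip(quadratic_basis(b0, b1), node_values))


def f(x, y):
    """A quadratic test function; the basis should reproduce it exactly."""
    return x * x + x * y + 2.0 * y


# Triangle with corners p0=(0,0), p1=(1,0), p2=(0,1), so a point with
# barycentric coordinates (b0, b1, b2) sits at (x, y) = (b1, b2).
nodes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.5, 0.0), (0.5, 0.5), (0.0, 0.5)]
values = [f(x, y) for x, y in nodes]

b0, b1 = 0.2, 0.3
b2 = 1.0 - b0 - b1
print(interpolate(values, b0, b1), f(b1, b2))
```

Because each weight is a polynomial in the barycentric coordinates, its gradient can be written down in closed form, which is what the slide means by analytic gradients.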

18. Geometry Shader, amplify geometry: expand point sprites; extrude silhouettes; extrude prisms/tets [Hirche04].

19. Geometry Shader, generate an array index for a render target array: e.g., render to cube map. Treat the cube map as a 6-element array; emit the primitive multiple times, each with a per-cube-face transform and array index. [Diagram: GS writing to render target array slices 0 to 5.]
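
The cube map case can be modeled on the CPU as a loop that re-emits the input triangle once per face, tagged with the array slice it should land in. This is only a sketch of the data flow: `emit_to_cube_faces` and the matrix helper are hypothetical names, and the face matrices are placeholders supplied by the caller.

```python
def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def emit_to_cube_faces(triangle, face_transforms):
    """Emit one copy of the triangle per cube face.

    Returns a list of (render_target_array_index, transformed_triangle).
    """
    if len(face_transforms) != 6:
        raise ValueError("a cube map has exactly 6 faces")
    emitted = []
    for face_index, transform in enumerate(face_transforms):
        transformed = [mat_vec(transform, vertex) for vertex in triangle]
        emitted.append((face_index, transformed))
    return emitted


# Placeholder transforms: identity for every face (a real renderer would use
# six different view-projection matrices).
IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
triangle = [[0.0, 0.0, 1.0, 1.0], [1.0, 0.0, 1.0, 1.0], [0.0, 1.0, 1.0, 1.0]]
for index, tri in emit_to_cube_faces(triangle, [IDENTITY] * 6):
    print(index, tri[0])
```

One pass over the scene thus fills all six faces, instead of six passes with six render target switches.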

20. Determinism & Parallelism: allow parallel processing but preserve serial order. Buffer GS outputs (on chip); limit output to 1K 32-bit values; the application can specify less, which may allow greater parallelism. [Diagram: GS instances 1, 2, ..., n running in parallel, each expanding its input to 2 triangles.]
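
The idea of running invocations in parallel while committing their buffered outputs in input order can be sketched with a thread pool. This is a CPU analogy, not how a GPU schedules work; the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def geometry_shader(primitive_id):
    """Stand-in GS: expand each input primitive into 2 output triangles."""
    return [(primitive_id, 0), (primitive_id, 1)]


def run_preserving_order(primitive_ids, workers=4):
    """Run invocations concurrently, then commit outputs in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs, no matter
        # which invocation finishes first; each result is one buffered output.
        buffered = list(pool.map(geometry_shader, primitive_ids))
    return [triangle for outputs in buffered for triangle in outputs]


print(run_preserving_order(range(3)))
```

The per-invocation buffer is what makes this possible, which is why a smaller declared output size can leave room for more invocations in flight.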

21. Stream Out: data from the VS/GS can optionally be streamed out to a buffer, 32 bits per component (int or float). Two configurations: either a single buffer of up to 16 elements (64 scalars max) with flexible stride, or up to 4 buffers that each hold a single element with unit element stride. Data is always also sent to the rasterizer if the rasterizer is enabled.
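
The two permitted configurations can be expressed as a small validity check. This is a sketch of the rules as stated on the slide, not the runtime's actual validation; the representation of a buffer as a list of element widths (in scalars) is an assumption made for illustration.

```python
MAX_ELEMENTS_SINGLE_BUFFER = 16
MAX_SCALARS_SINGLE_BUFFER = 64
MAX_BUFFERS = 4


def is_valid_stream_out(buffers):
    """Check a stream-out layout against the limits on the slide.

    `buffers` is a list of buffers; each buffer is a list of element
    widths measured in 32-bit scalars (hypothetical representation).
    """
    if not buffers:
        return False  # nothing declared: treated here as "stream out off"
    if any(width < 1 for buffer in buffers for width in buffer):
        return False
    if len(buffers) == 1:
        elements = buffers[0]
        return (1 <= len(elements) <= MAX_ELEMENTS_SINGLE_BUFFER
                and sum(elements) <= MAX_SCALARS_SINGLE_BUFFER)
    # Multiple buffers: at most 4, each holding exactly one element.
    return (len(buffers) <= MAX_BUFFERS
            and all(len(buffer) == 1 for buffer in buffers))


print(is_valid_stream_out([[4, 3, 2]]))          # one multi-element buffer
print(is_valid_stream_out([[4], [3], [2]]))      # three single-element buffers
print(is_valid_stream_out([[4, 3], [2]]))        # mixed: not allowed
```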

22. Stream Out: generated geometry is easily redrawn using the DrawAuto() command, with no CPU intervention.

23. Multi-Stream Output: array-of-structures vs. structure-of-arrays. Array-of-structures interleaves position, color, normal, and texture per vertex; structure-of-arrays keeps one stream each for positions, colors, normals, and textures. The Input Assembler supports both types as vertex buffers. Both styles are useful: access pattern vs. memory coherency.
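
The two layouts hold the same data and differ only in byte arrangement, which the following sketch shows with Python's `struct` module. The two-attribute vertex (position and color) is a reduced, hypothetical example.

```python
import struct

# Hypothetical vertices: (position xyz, color rgb).
vertices = [
    ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
    ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0)),
    ((0.0, 1.0, 0.0), (0.0, 0.0, 1.0)),
]

AOS_STRIDE = struct.calcsize("<3f3f")  # 24 bytes per interleaved vertex
SOA_STRIDE = struct.calcsize("<3f")    # 12 bytes per element in each stream


def pack_aos(verts):
    """Array-of-structures: one buffer, attributes interleaved per vertex."""
    return b"".join(struct.pack("<3f3f", *pos, *col) for pos, col in verts)


def pack_soa(verts):
    """Structure-of-arrays: one tightly packed buffer per attribute."""
    positions = b"".join(struct.pack("<3f", *pos) for pos, _ in verts)
    colors = b"".join(struct.pack("<3f", *col) for _, col in verts)
    return positions, colors


def position_from_aos(buffer, index):
    return struct.unpack_from("<3f", buffer, index * AOS_STRIDE)


def position_from_soa(positions, index):
    return struct.unpack_from("<3f", positions, index * SOA_STRIDE)


aos = pack_aos(vertices)
soa_positions, soa_colors = pack_soa(vertices)
print(position_from_aos(aos, 1), position_from_soa(soa_positions, 1))
```

A pass that reads only positions touches half as many bytes in the structure-of-arrays layout, while a pass that reads every attribute of a vertex gets them contiguously in the array-of-structures layout.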

24. Multi-Stream Output: add multiple-stream capability. Compromise: support either 1 multi-element stream with up to 16 elements (AoS) or up to 4 single-element (SoA) streams. Room for future expansion.

25. Programmability: virtual machine model; machine-independent intermediate language (IL); just-in-time (JIT) translation in the hardware driver, performed when the shader program object is created. [Diagram: HLSL program -> HLSL compiler -> IL -> JIT in driver -> program object.]

26. The Virtual Machine. New features: integer instruction set; load instruction (no store!); IEEE-754 format and approximate IEEE accuracy; separate samplers and textures; writable private memory. Resource limits:

    Resource               Direct3D 9    Direct3D 10
    Instructions           64K/512       unlimited
    Textures               16            128
    Temporary registers    16            32
    Constants              256           4Kx16
    Interstage registers   16            32
    2D texture             4Kx4K         8Kx8K
    Render targets         4             8

27. Shading a Triangle: the inputs to the vertex and pixel shaders include static light positions, dynamic light positions, camera positions, view/projection matrices, bone matrices, LOD, material parameters, and normals, positions, and texcoords. They fall into groups by update frequency: per-level, per-frame, per-instance, per-primitive, and per-vertex data, plus textures A, B, and C.

28. Constants in Direct3D 9: per-level, per-frame, per-instance, and per-primitive data all go into one shared set of VS/PS constants, written through a long series of individual SetConstant() calls; per-vertex data and textures A, B, and C feed the vertex and pixel shaders as before.

30. Constant Buffers: split parameters into buffers (per-level, per-frame, per-instance, per-primitive data); organize by update frequency; bulk update any buffer; bind up to 16 buffers per shader. Sounds like 1D textures, but the access pattern is different: uniform vs. non-uniform index, frequent vs. infrequent access.
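
Grouping constants by update frequency means each buffer is rewritten only as often as its contents actually change. The sketch below counts bulk updates per buffer over a few frames; the `ConstantBuffer` class and the parameter names are hypothetical and only model the bookkeeping, not the Direct3D 10 API.

```python
MAX_BUFFERS_PER_SHADER = 16  # binding limit from the slide


class ConstantBuffer:
    """A named group of shader constants that is updated in bulk."""

    def __init__(self, **initial):
        self.values = dict(initial)
        self.update_count = 0

    def update(self, **new_values):
        unknown = set(new_values) - set(self.values)
        if unknown:
            raise KeyError(f"unknown constants: {sorted(unknown)}")
        self.values.update(new_values)
        self.update_count += 1  # one bulk upload, however many values changed


def bind(buffers):
    if len(buffers) > MAX_BUFFERS_PER_SHADER:
        raise ValueError("at most 16 constant buffers per shader")
    return list(buffers)


def render(frames, instances_per_frame):
    per_level = ConstantBuffer(static_light=(0.0, 10.0, 0.0))
    per_frame = ConstantBuffer(view_proj=None, dynamic_light=None)
    per_instance = ConstantBuffer(world=None)
    bound = bind([per_level, per_frame, per_instance])
    for frame in range(frames):
        per_frame.update(view_proj=f"view_proj_{frame}",
                         dynamic_light=(float(frame), 0.0, 0.0))
        for instance in range(instances_per_frame):
            per_instance.update(world=f"world_{frame}_{instance}")
            # A draw call would be issued here.
    return [buffer.update_count for buffer in bound]


print(render(frames=3, instances_per_frame=2))  # [0, 3, 6]
```

Per-level data is uploaded once at creation and never touched again, whereas the Direct3D 9 model would reissue individual constant writes alongside everything else.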

31. API/Runtime: plumbing for creating/managing objects and binding state to pipeline stages. Restructured for efficiency and flexibility: aggregate bits of state into large objects, so more real work is done per API call; group related state together (blend, raster, stencil, depth); guide hardware implementation.

32. Configuring the Pipeline. Input Assembler: IASetVertexBuffers/SetIndexBuffer, IASetPrimitiveTopology. Shader stages: {VS|GS|PS}SetShader, {VS|GS|PS}SetShaderResources, {VS|GS|PS}SetConstantBuffers, {VS|GS|PS}SetSamplers. Stream Out: SOSetTargets. Rasterizer: RSSetState, RSSetViewports/ScissorRects. Output Merger: OMSetRenderTargets, OMSetBlendState, OMSetDepthStencilState.

33. Shading Language: HLSL is the real API? Shader programs are considered part of the art assets! Support for new instructions (integer, load, ...); parameter grouping into constant buffers. Geometry shader: multiple input vertices, multiple output support (emit and reset), intrinsics for stream output. Avoid features with large run-time (CPU) cost, e.g., requiring recompilation if state changes.

34. Particle System Example: no CPU intervention; particle state lives in a 1D buffer; each pass reads one buffer and rewrites a 2nd buffer; the GS adds or destroys particles.
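
The read-one-buffer, write-the-other pattern can be modeled on the CPU as follows. This is a minimal sketch: the emitter/spark rule, the lifetime constant, and the 1D positions are made-up stand-ins for whatever a real geometry shader would implement.

```python
MAX_AGE = 1.0  # hypothetical lifetime of a spark, in seconds


def update_pass(source, dt):
    """One pass: read every particle from `source`, write survivors and
    newly spawned particles to a fresh destination buffer."""
    destination = []
    for kind, position, velocity, age in source:
        if kind == "emitter":
            destination.append((kind, position, velocity, age))
            destination.append(("spark", position, 1.0, 0.0))  # add a particle
        else:
            age += dt
            if age < MAX_AGE:  # otherwise the particle is destroyed
                destination.append((kind, position + velocity * dt, velocity, age))
    return destination


def simulate(initial, passes, dt):
    """Ping-pong between two buffers, swapping roles after each pass."""
    buffers = [list(initial), []]
    source = 0
    for _ in range(passes):
        buffers[1 - source] = update_pass(buffers[source], dt)
        source = 1 - source
    return buffers[source]


state = simulate([("emitter", 0.0, 0.0, 0.0)], passes=3, dt=0.5)
print(len(state), state)
```

On the GPU the destination buffer would be filled through stream out and then redrawn, so the particle count never has to travel back to the CPU.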

35. Displacement Map Example: the GS extrudes a prism at each face [Hirche04]; the PS ray casts against the height field; the pixel is shaded or discarded depending on the ray test.

36. Instancing Example: the GS can determine the shader; instance and primitive IDs are used to index a texture array.

37. Sparse Morph Targets: render-to-VB updates the vertices; the GS uses the stretch of each triangle to drive wrinkles.

38. Other Ideas Considered. Programmable Input Assembler: unwarranted complexity. Tessellation: complexity too high for this design (deferred). Access to the color/depth buffer from the pixel shader: prohibitive performance implications. Simultaneous read/write access to memory: unpredictable results, non-determinism. Scatter and reduction operations: performance vs. determinism issues (deferred).

39. Results. State change agility: state objects, constant buffers, instancing, array resources. Greater expressiveness and flexibility: integer, load, etc. instructions; stream out; flexible memory objects. Fewer resource constraints: huge increase in resources (hardware cost). Feature consistency: very tight behavioral specification (feature set, arithmetic tolerances); 2 optional features (multisampling, 32-bit float texture filtering). CPU offload: memory model, geometry shader, stream out, predicated rendering, ...

40. Acknowledgements: numerous software and hardware companies contributed to the design: ATI, Epic, NCsoft, Autodesk, nVidia, id, SOE, RAD, Intel, Valve, Ubisoft, XSI, S3, Blizzard, Naughty Dog, Discreet, 3Dlabs, Ritual, Lucas Arts, Alias, XGI, Crytek, Emogence, DirectX team, PowerVR, Bungie, Lionhead, GameFu, Matrox, Monolith, EA.

41. Post Direct3D 10

42. Direct3D 10.1: small improvements for important problems, limited to small hardware changes. More VS-to-GS inter-stage registers and VS inputs; cube map arrays; multi-sample control (patterns, alpha to coverage); better multi-sample color and depth access; per-render-target blending modes; API/runtime enhancements for multi-core; precision improvements.

43. Future: Addressing GPU Evolution. Where does Direct3D 10+ go next? Open directions: ray tracing? REYES? GPGPU? Physics? Multi-GPU?

44. Complexity & Balance: increase realism/fidelity in weaker areas; complexity inflection points require new techniques. [Chart: complexity/quality of each visual attribute (geometric, material, lighting, transport, animation, dynamics), normalized by importance.]

45. Problems to Solve. Content generation: create more artwork faster; 20+ GB of content to be created; preserve content investment. Better visuals: silhouette edges, transparency, antialiasing, texture filtering. Non-rendering computation: physics, animation, morphing. Programmability: fixed functions vs. programmability.

46. Content Generation: tackle two areas at inflection points. Texture maps: currently hand painted, 2Kx2K to 4Kx4K; transition to procedural methods (long term); improve texture management. Character modeling with detail and deformation: currently skinned polygonal models with normal maps; transition to deformable subdivision patches with displacement and normal maps.

47. Tessellation: the primary motivator is amplification of animation/morph targets/deformation models; everything stays on the GPU if possible; displacement mapped surfaces become first class primitives.

48. Displaced Subdivision. [Images: Fantasy Lab and Wizards of the Coast.]

49. Three-Domain Pipeline. Patches: 16 to 24 control points; low frequency phenomena (animation, vector irradiance?, indirect vector irradiance?). Triangles: 3 vertices; mid frequency phenomena. Pixel fragments: n samples per pixel; high frequency phenomena (gloss, material roughness).

50. (Logical) Pipeline Evolution (hypothetical!). [Diagram: the Direct3D 10 logical pipeline (input assembler, vertex shader, geometry shader, stream out, setup, rasterizer, pixel shader, output merger) extended with two new stages, a control point shader and a tessellator. Programmable stages each have textures, samplers, and constants; fixed stages are marked separately; memory resources are vertex buffer, index buffer, textures, stream buffer, render target, and depth/stencil.] Open question: spill patch data to memory?

51. Tessellation with Displacement: integration into the art pipeline. Surface formats (SubD, bi-cubic patches)? Approximation? How much tessellation? Adaptive? How does it fit into the logical pipeline? New stages? How many? Try to keep everything on chip? Updating the control cage: multi-pass makes sense. Conversion to another basis: multi-pass?

52. Displacement Mapping. Vertex based: how much tessellation? Interaction with fractional tessellation? Are more sophisticated tessellation schemes required? Local ray tracing: how inexpensive can you make shaders? Interaction with MSAA? Shadows? Interaction with hierarchical/early Z?

53. Improving Visual Quality: many areas to improve; some are solved with programmability plus performance, but not all. Texture filter quality; texture compression (e.g., HDR images); derivatives; order independent transparency; antialiasing quality. Global illumination: static/parameterized GI, dynamic GI, ray casting/tracing.

54. Transparency and Antialiasing, current state of the art: 4 to 8 sample multisample antialiasing, giving n + 1 levels of transparency. Transparency cases: feathered edges (foliage), windshields, particles. Techniques: sort transparent objects; alpha to coverage for alpha textures (avoids blending).
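
Why n samples yield n + 1 transparency levels follows from how alpha maps to a sample mask. The sketch below uses a deliberately simple rule (round alpha to a whole number of covered samples and set the low bits); real hardware typically also dithers which samples are chosen, so treat this mapping as an assumption.

```python
def alpha_to_coverage(alpha, num_samples=4):
    """Convert an alpha value into a multisample coverage bitmask."""
    alpha = min(max(alpha, 0.0), 1.0)
    covered = int(alpha * num_samples + 0.5)  # round to nearest sample count
    return (1 << covered) - 1                 # lowest `covered` bits set


for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(alpha, format(alpha_to_coverage(alpha), "04b"))

# Sweep alpha finely and count how many distinct masks ever appear.
distinct = {alpha_to_coverage(i / 100.0) for i in range(101)}
print(len(distinct))  # 5 levels for 4 samples
```

Because coverage is resolved per sample rather than blended, fragments can be drawn in any order, at the cost of the coarse quantization shown above.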

55. Transparency and Antialiasing, need to do better: sorting is too expensive; the solution must work with multipass algorithms (e.g., applying shadow maps). Move the sort to hardware: track individual pixel fragments in an A-buffer (cf. R-buffer, F-buffer, T-buffer).

56. A-Buffer: save all fragments and sort. Memory intensive (64 fragments/pixel): tiling to reduce the memory constraint? Discard overflow fragments? Operations on pre-resolved fragments: shadow computations, multipass layers.
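
A per-pixel A-buffer can be sketched as a bounded fragment list that is sorted by depth at resolve time and composited back to front. The overflow policy here (reject new fragments once the list is full) is only one possible answer to the question the slide leaves open, and the class name is hypothetical.

```python
class ABufferPixel:
    """Fragment list for one pixel; larger depth means farther away."""

    def __init__(self, max_fragments=64):
        self.max_fragments = max_fragments
        self.fragments = []

    def add(self, depth, rgb, alpha):
        """Store a fragment in arrival order; returns False on overflow."""
        if len(self.fragments) >= self.max_fragments:
            return False  # assumed policy: discard overflow fragments
        self.fragments.append((depth, rgb, alpha))
        return True

    def resolve(self, background):
        """Sort far-to-near and blend each fragment over the result so far."""
        color = background
        ordered = sorted(self.fragments, key=lambda frag: frag[0], reverse=True)
        for _, rgb, alpha in ordered:
            color = tuple(alpha * src + (1.0 - alpha) * dst
                          for src, dst in zip(rgb, color))
        return color


fragments = [
    (0.8, (0.0, 0.0, 1.0), 0.5),  # far, blue
    (0.2, (1.0, 0.0, 0.0), 0.5),  # near, red
    (0.5, (0.0, 1.0, 0.0), 0.5),  # middle, green
]
submitted_in_order = ABufferPixel()
submitted_reversed = ABufferPixel()
for fragment in fragments:
    submitted_in_order.add(*fragment)
for fragment in reversed(fragments):
    submitted_reversed.add(*fragment)

black = (0.0, 0.0, 0.0)
print(submitted_in_order.resolve(black), submitted_reversed.resolve(black))
```

The two pixels resolve to the same color regardless of submission order, which is the property that removes the need for the application to sort transparent objects.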

57. A-Buffer: fixed-function or programmable? Open issues: the sorting/resolve operation; overflow handling; what happens to MSAA and MRT. Fragment = attributes + coverage + depth; defer the explosion to samples until resolve time; an opportunity to do better antialiasing.

58. A-Buffer Implications: separate opaque and transparent object processing; draw opaque first to cull invisible transparent fragments. Switch to tiling (chunking) to save memory (cf. predicated tiling on Xbox 360). How much memory for fragments (100 MB?), exacerbated by larger displays; render at reduced resolution and up-sample? Opportunity for better resolve filtering: filter support larger than a pixel.

59. Non-rendering Computations: Direct3D 10 enables new computations using additional programmability (integer and load instructions) and more general data flow (render to vertex buffer). Animation + skinning: a solved problem? Particle systems, morphing.

60. GPU vs. Multicore CPU. GPU: large flops and memory bandwidth; data parallelism, streaming caches. Multi-core CPU: task parallelism, cache locality. The boundary between the two is fuzzy: matrix multiply, sparse matrix x sparse vector. Convergence?

61. Programmability: is Direct3D 10 computationally complete? Make the entire pipeline programmable? Some processing is more efficient as fixed function: set-up, rasterization, hierarchical Z, filtering (does it need 32-bit float?), clipping (do you really want to write that code?). Orthogonality: keep data types/formats independent from algorithms.

62. Programmability: every function we remove, you may need to add back in shader code. E.g., suppose we enable alpha-to-coverage in shaders: compute the coverage mask and output it in the pixel shader. Do we keep the fixed-function version? If it is removed, then all (pixel) shaders need to implement alpha-to-coverage. The developer implements a virtual pipeline in shaders; HLSL/FX provides support for implementing virtual pipelines. Can we do more?

63. Dynamic Subroutines: do dynamic subroutines simplify/solve the problem, i.e., shaders with function pointers? Call overhead must be tiny; otherwise we end up inlining and recompiling. Can I dynamically stack (append) subroutines (A, then B, then C)? Or do subroutines need to have static call sites, i.e., bind points (A0 calls B, A1 calls C, A2 calls D)?
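
The static-call-site variant can be sketched as a shader body with named bind points that are filled from tables of interchangeable routines. All names and routines below are hypothetical illustrations; the counting helpers show why dynamic binding avoids compiling one shader per combination.

```python
from math import prod

# Hypothetical subroutine tables, one per bind point.
LIGHTING = {
    "lambert": lambda n_dot_l: max(n_dot_l, 0.0),
    "half_lambert": lambda n_dot_l: (0.5 * n_dot_l + 0.5) ** 2,
}
FOG = {
    "none": lambda color, distance: color,
    "linear": lambda color, distance: color * max(0.0, 1.0 - distance),
}


def shade(n_dot_l, distance, bind_points):
    """One shader body; each call site dispatches through its bind point."""
    lit = LIGHTING[bind_points["lighting"]](n_dot_l)
    return FOG[bind_points["fog"]](lit, distance)


def static_variant_count(options_per_bind_point):
    """Shaders to precompile if every combination is inlined."""
    return prod(options_per_bind_point)


def dynamic_routine_count(options_per_bind_point):
    """Routines to compile if bind points are resolved at run time."""
    return sum(options_per_bind_point)


print(shade(0.5, 0.5, {"lighting": "lambert", "fog": "linear"}))
# With hypothetical option counts of 4, 3, and 2 per bind point:
print(static_variant_count([4, 3, 2]), dynamic_routine_count([4, 3, 2]))
```

The gap between the product and the sum grows quickly with more bind points, which is only a win if the per-call overhead stays tiny, as the slide notes.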

64. Programmability, next steps. Efficient dynamic subroutine mechanism: eliminate combinatorial explosions; allow shader composition through libraries; needs efficient dynamic binding (cf. the version 1.0 Fragment Linker in Direct3D 9). Generalized data parallel computation: neighbor communication? Scatter? Read-modify-write operations to memory.

65. Summary: lots to figure out! Better texturing; surface tessellation; transparency/anti-aliasing; general computation.

66. Acknowledgements: DirectX group: David Blythe, Michael Bunnell, Shanon Drone, Sam Glassenberg, Michael Oneppo. IHVs/ISVs [see earlier slide].

67. Questions? No dates/promises for anything post Direct3D 10.