Fundamentals of Computer Architecture and Technology Performance Improvement


This chapter covers the basics of quantitative design and analysis in computer architecture, including advances in semiconductor technology and advances in computer architecture itself, enabled by HLL compilers and UNIX and leading to RISC architectures. It also explores how these advances have enabled lightweight computers; productivity-oriented managed/interpreted programming languages; SaaS, virtualization, and the cloud; and the evolution of applications toward speech, sound, images, video, augmented/extended reality, and big data.


Presentation Transcript


1. Fundamentals of Quantitative Design and Analysis. Chapter 1 of Computer Architecture: A Quantitative Approach, Fifth Edition. Copyright 2012, Elsevier Inc. All rights reserved.

2. Computer Technology (Introduction)
Performance improvements come from two sources:
- Improvements in semiconductor technology: feature size, clock speed
- Improvements in computer architectures: enabled by HLL compilers and UNIX, leading to RISC architectures
Together these have enabled:
- Lightweight computers
- Productivity-based managed/interpreted programming languages
- SaaS, virtualization, the cloud
Applications evolution: speech, sound, images, video, augmented/extended reality, big data

3. Single Processor Performance (Introduction)
[Figure: growth of single-processor performance over time, showing the rapid improvement of the RISC era and the subsequent move to multi-processors]

4. Current Trends in Architecture (Introduction)
- Cannot continue to leverage instruction-level parallelism (ILP) alone: single-processor performance improvement ended in 2003
- New models for performance:
  - Data-level parallelism (DLP)
  - Thread-level parallelism (TLP)
  - Request-level parallelism (RLP)
- These require explicit restructuring of the application

5. Classes of Computers
- Personal Mobile Device (PMD), e.g. smartphones and tablet computers (1.8 billion sold in 2010): emphasis on energy efficiency and real-time performance
- Desktop computing (0.35 billion): emphasis on price-performance
- Servers (20 million): emphasis on availability (very costly downtime!), scalability, throughput
- Clusters / warehouse-scale computers, used for Software as a Service (SaaS), PaaS, IaaS, etc.: emphasis on availability ($6M/hour downtime at Amazon.com!) and price-performance (power = 80% of TCO!)
  - Sub-class: supercomputers; emphasis on floating-point performance, fast internal networks, and big data analytics
- Embedded computers (19 billion in 2010): emphasis on price

6. Parallelism (Classes of Computers)
Classes of parallelism in applications:
- Data-level parallelism (DLP)
- Task-level parallelism (TLP)
Classes of architectural parallelism:
- Instruction-level parallelism (ILP)
- Vector architectures / graphics processing units (GPUs)
- Thread-level parallelism
- Request-level parallelism

7. Flynn's Taxonomy (Classes of Computers)
- Single instruction stream, single data stream (SISD)
- Single instruction stream, multiple data streams (SIMD): vector architectures, multimedia extensions, graphics processing units
- Multiple instruction streams, single data stream (MISD): no commercial implementation
- Multiple instruction streams, multiple data streams (MIMD): tightly coupled MIMD, loosely coupled MIMD

8. Defining Computer Architecture
- Old view of computer architecture: instruction set architecture (ISA) design, i.e. decisions regarding registers, memory addressing, addressing modes, instruction operands, available operations, control flow instructions, instruction encoding
- Real computer architecture: meet the specific requirements of the target machine; design to maximize performance within the constraints of cost, power, and availability; includes ISA, microarchitecture, hardware

9. Trends in Technology
- Integrated circuit technology: transistor density 35%/year; die size 10-20%/year; overall integration 40-55%/year
- DRAM capacity: 25-40%/year (slowing)
- Flash capacity: 50-60%/year; 15-20X cheaper per bit than DRAM
- Magnetic disk technology: 40%/year; 15-25X cheaper per bit than Flash; 300-500X cheaper per bit than DRAM

10. Bandwidth and Latency (Trends in Technology)
- Bandwidth or throughput: total work done in a given time; 10,000-25,000X improvement for processors over the first milestone; 300-1200X improvement for memory and disks over the first milestone
- Latency or response time: time between start and completion of an event; 30-80X improvement for processors over the first milestone; 6-8X improvement for memory and disks over the first milestone

11. Bandwidth and Latency (Trends in Technology)
[Figure: log-log plot of bandwidth and latency milestones]

12. Transistors and Wires (Trends in Technology)
- Feature size: the minimum size of a transistor or wire in the x or y dimension; from 10 microns in 1971 to 0.032 microns in 2011
- Transistor performance scales linearly, but wire delay does not improve with feature size!
- Integration density scales quadratically
- Linear performance growth and quadratic density growth present both a challenge and an opportunity, creating the need for computer architects!

13. Power and Energy (Trends in Power and Energy)
- Problem: get power in, get power out
- Thermal Design Power (TDP): characterizes sustained power consumption; used as the target for the power supply and cooling system; lower than peak power, higher than average power consumption
- Clock rate can be reduced dynamically to limit power consumption
- Energy per task is often a better measurement

14. Dynamic Energy and Power (Trends in Power and Energy)
- Dynamic energy (one transistor switch from 0 -> 1 or 1 -> 0) is proportional to 1/2 x Capacitive load x Voltage^2
- Dynamic power is proportional to 1/2 x Capacitive load x Voltage^2 x Frequency switched
- Reducing the clock rate reduces power, but not the energy to complete a task
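The two proportionalities above can be checked numerically. A minimal sketch in Python, with invented values for capacitive load and voltage: halving the frequency halves dynamic power, but the energy for a fixed number of switches is unchanged.

```python
# Dynamic energy per switch: E ~ 1/2 * C * V^2
# Dynamic power:             P ~ 1/2 * C * V^2 * f
C = 1e-15   # capacitive load in farads (illustrative value)
V = 1.0     # supply voltage in volts (illustrative value)

def dynamic_energy(c, v):
    return 0.5 * c * v ** 2

def dynamic_power(c, v, f):
    return dynamic_energy(c, v) * f

full = dynamic_power(C, V, 3e9)    # 3 GHz clock
half = dynamic_power(C, V, 1.5e9)  # clock rate halved
print(half / full)                 # power halves: 0.5

# A task needing a fixed number of switches costs the same energy
# regardless of clock rate (it just takes longer at a lower rate):
switches = 1e9
print(dynamic_energy(C, V) * switches)
```

This is why the slide distinguishes power from energy: frequency scaling alone helps thermals, not battery life.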

15. Power (Trends in Power and Energy)
- The Intel 80386 consumed ~2 W; a 3.3 GHz Intel Core i7 consumes 130 W
- That heat must be dissipated from a 1.5 x 1.5 cm chip
- This is the limit of what can be cooled by air

16. Reducing Power (Trends in Power and Energy)
Techniques for reducing power:
- Do nothing well (idle circuitry should consume little power)
- Dynamic voltage-frequency scaling (DVFS)
- Low-power states for DRAM and disks
- Overclocking some cores while turning others off

17. Static Power (Trends in Power and Energy)
- Static power consumption is proportional to Current_static x Voltage
- Scales with the number of transistors
- To reduce it: power gating
- Race-to-halt: run fast, then enter a low-power mode
- The new primary evaluation metrics for design innovation: tasks per joule, performance per watt

18. Trends in Cost
- Cost is driven down by the learning curve: yield improves over time
- DRAM: price closely tracks cost
- Microprocessors: price depends on volume; roughly 10% less for each doubling of volume

19. Integrated Circuit Cost (Trends in Cost)
- Integrated circuit Bose-Einstein yield formula: Die yield = Wafer yield x 1 / (1 + Defects per unit area x Die area)^N
- Defects per unit area = 0.016-0.057 defects per square cm (2010)
- N = process-complexity factor = 11.5-15.5 (40 nm, 2010)
- The manufacturing process dictates the wafer cost, wafer yield, and defects per unit area
- The architect's design affects the die area, which in turn affects the defects and cost per die
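The Bose-Einstein yield formula on this slide is easy to evaluate. A minimal sketch in Python; the defect density and N are mid-range values from the slide's 2010 figures, while the wafer yield and die area are invented for illustration.

```python
def die_yield(wafer_yield, defects_per_cm2, die_area_cm2, n):
    """Die yield = Wafer yield x 1 / (1 + defect density x die area)^N."""
    return wafer_yield / (1 + defects_per_cm2 * die_area_cm2) ** n

# Illustrative: 1.5 cm^2 die, 0.03 defects/cm^2, N = 13.5
y = die_yield(wafer_yield=1.0, defects_per_cm2=0.03, die_area_cm2=1.5, n=13.5)
print(round(y, 3))

# Doubling the die area cuts yield sharply, which is why the
# architect's choice of die area drives cost per die:
print(round(die_yield(1.0, 0.03, 3.0, 13.5), 3))
```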

20. Dependability
- Systems alternate between two states of service with respect to an SLA/SLO:
  1. Service accomplishment, where service is delivered as specified by the SLA
  2. Service interruption, where the delivered service differs from the SLA
- Module reliability: failure (F) = transition from state 1 to state 2; repair (R) = transition from state 2 to state 1
- Mean time to failure (MTTF); mean time to repair (MTTR)
- Mean time between failures (MTBF) = MTTF + MTTR
- Availability = MTTF / MTBF = MTTF / (MTTF + MTTR)
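The availability definition translates directly into a small calculation. A sketch in Python; the MTTF and MTTR figures are invented for illustration.

```python
def availability(mttf_hours, mttr_hours):
    """Availability = MTTF / MTBF, where MTBF = MTTF + MTTR."""
    mtbf = mttf_hours + mttr_hours
    return mttf_hours / mtbf

# Illustrative: a module that fails once per 10,000 hours on average
# and takes 24 hours to repair
a = availability(10_000, 24)
print(f"{a:.4%}")
```

Note that availability improves either by raising MTTF (more reliable modules) or by lowering MTTR (faster repair), which is why redundancy and fast failover both matter.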

21. Measuring Performance
- Typical performance metrics: response time, throughput
- Speedup of X relative to Y = Execution time of Y / Execution time of X
- Execution time: wall-clock time includes all system overheads; CPU time counts only computation time
- Benchmarks: kernels (e.g. matrix multiply), toy programs (e.g. sorting), synthetic benchmarks (e.g. Dhrystone), benchmark suites (e.g. SPEC06fp, TPC-C)

22. Principles of Computer Design
- Take advantage of parallelism: e.g. multiple processors, disks, memory banks; pipelining; multiple functional units
- Principle of locality: reuse of data and instructions
- Focus on the common case: Amdahl's Law
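Amdahl's Law, the quantitative form of "focus on the common case", is easy to evaluate. A minimal sketch in Python; the fraction and speedup values are illustrative.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of execution time is enhanced:
    Speedup = 1 / ((1 - f) + f / s)."""
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

# A 10x speedup on 90% of execution time yields only ~5.3x overall:
print(round(amdahl_speedup(0.9, 10), 2))   # 5.26

# The unenhanced fraction bounds the gain: even with a near-infinite
# enhancement, overall speedup approaches 1 / (1 - f):
print(round(amdahl_speedup(0.9, 1e12), 2))
```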

23. Principles of Computer Design: The Processor Performance Equation
CPU time = Instruction count x Cycles per instruction (CPI) x Clock cycle time
         = (Instructions / Program) x (Clock cycles / Instruction) x (Seconds / Clock cycle)

24. Principles of Computer Design
When different instruction types have different CPIs:
CPU time = (sum over instruction types i of IC_i x CPI_i) x Clock cycle time
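Both forms of the processor performance equation can be checked with a short computation. A sketch in Python; the instruction counts, CPIs, and clock rate are invented for illustration.

```python
def cpu_time(instr_mix, clock_cycle_time_s):
    """instr_mix: list of (instruction_count, cpi) pairs, one per type.
    CPU time = (sum of IC_i * CPI_i) * clock cycle time."""
    total_cycles = sum(ic * cpi for ic, cpi in instr_mix)
    return total_cycles * clock_cycle_time_s

# Illustrative mix: ALU ops, loads/stores, branches on a 2 GHz clock
mix = [(50_000_000, 1.0), (30_000_000, 2.0), (20_000_000, 1.5)]
t = cpu_time(mix, clock_cycle_time_s=0.5e-9)

# The per-type form also yields the overall (average) CPI:
total_ic = sum(ic for ic, _ in mix)
overall_cpi = sum(ic * cpi for ic, cpi in mix) / total_ic
print(overall_cpi)  # 1.4
print(t)            # seconds of CPU time
```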

25. Instruction Set Architecture (ISA) (CSCE430/830)
- Serves as an interface between software and hardware
- Provides a mechanism by which the software tells the hardware what should be done
[Diagram: high-level language code (C, C++, Java, Fortran) is translated by the compiler into architecture-specific assembly language code, which the assembler turns into machine language code, the architecture-specific bit patterns executed by the hardware]

26. Instruction Set Design Issues
Instruction set design issues include:
- Where are operands stored? Registers, memory, stack, accumulator
- How many explicit operands are there? 0, 1, 2, or 3
- How is the operand location specified? Register, immediate, indirect, . . .
- What type and size of operands are supported? Byte, int, float, double, string, vector, . . .
- What operations are supported? Add, sub, mul, move, compare, . . .

27. Classifying ISAs
- Accumulator (before 1960, e.g. 68HC11): 1-address
  add A           acc <- acc + mem[A]
- Stack (1960s to 1970s): 0-address
  add             tos <- tos + next
- Memory-memory (1970s to 1980s): 2-address
  add A, B        mem[A] <- mem[A] + mem[B]
  3-address
  add A, B, C     mem[A] <- mem[B] + mem[C]
- Register-memory (1970s to present, e.g. 80x86): 2-address
  add R1, A       R1 <- R1 + mem[A]
  load R1, A      R1 <- mem[A]
- Register-register (load/store, RISC) (1960s to present, e.g. MIPS): 3-address
  add R1, R2, R3  R1 <- R2 + R3
  load R1, R2     R1 <- mem[R2]
  store R1, R2    mem[R1] <- R2

28. Operand Locations in Four ISA Classes
[Figure: operand locations for stack, accumulator, register-memory, and register-register (GPR) architectures]

29. Code Sequence C = A + B for Four Instruction Sets

Stack      Accumulator   Register (register-memory)   Register (load-store)
Push A     Load A        Load R1, A                   Load R1, A
Push B     Add B         Add R1, B                    Load R2, B
Add        Store C       Store C, R1                  Add R3, R1, R2
Pop C                                                 Store C, R3

30. Types of Addressing Modes (VAX)

Addressing mode        Example               Action
1. Register direct     Add R4, R3            R4 <- R4 + R3
2. Immediate           Add R4, #3            R4 <- R4 + 3
3. Displacement        Add R4, 100(R1)       R4 <- R4 + M[100 + R1]
4. Register indirect   Add R4, (R1)          R4 <- R4 + M[R1]
5. Indexed             Add R4, (R1 + R2)     R4 <- R4 + M[R1 + R2]
6. Direct              Add R4, (1000)        R4 <- R4 + M[1000]
7. Memory indirect     Add R4, @(R3)         R4 <- R4 + M[M[R3]]
8. Autoincrement       Add R4, (R2)+         R4 <- R4 + M[R2]; R2 <- R2 + d
9. Autodecrement       Add R4, (R2)-         R4 <- R4 + M[R2]; R2 <- R2 - d
10. Scaled             Add R4, 100(R2)[R3]   R4 <- R4 + M[100 + R2 + R3*d]

Studies by [Clark and Emer] indicate that modes 1-4 account for 93% of all operands on the VAX.
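The actions in this table can be mimicked with a toy machine model. A minimal sketch in Python, assuming a small dict-based register file and memory whose contents are entirely invented, showing displacement, register indirect, and memory indirect addressing.

```python
# Toy machine state (all register and memory values are invented)
regs = {"R1": 200, "R3": 500, "R4": 7}
mem = {200: 11, 300: 42, 500: 900, 900: 5}

def displacement(base_reg, disp):
    """100(R1)-style operand: M[disp + R[base]]."""
    return mem[disp + regs[base_reg]]

def register_indirect(reg):
    """(R1)-style operand: M[R[reg]]."""
    return mem[regs[reg]]

def memory_indirect(reg):
    """@(R3)-style operand: M[M[R[reg]]] -- one extra memory access."""
    return mem[mem[regs[reg]]]

# Add R4, 100(R1): R4 <- R4 + M[100 + R1] = 7 + mem[300]
regs["R4"] += displacement("R1", 100)
print(regs["R4"])               # 49
print(register_indirect("R1"))  # 11 (mem[200])
print(memory_indirect("R3"))    # 5  (mem[mem[500]] = mem[900])
```

The extra memory dereference in `memory_indirect` illustrates why complex modes cost more cycles, and why (per Clark and Emer) the simple modes 1-4 dominate in practice.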

31. Types of Operations
- Arithmetic and logic: AND, ADD
- Data transfer: MOVE, LOAD, STORE
- Control: BRANCH, JUMP, CALL
- System: OS CALL, VM
- Floating point: ADDF, MULF, DIVF
- Decimal: ADDD, CONVERT
- String: MOVE, COMPARE
- Graphics: (DE)COMPRESS

32. MIPS Instructions
- All instructions are exactly 32 bits wide
- Different formats for different purposes; similarities in formats ease implementation
Formats (bits 31..0):
- R-Format: op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
- I-Format: op (6 bits) | rs (5 bits) | rt (5 bits) | offset (16 bits)
- J-Format: op (6 bits) | address (26 bits)

33. MIPS Instruction Types
- Arithmetic & logical: manipulate data in registers
  add $s1, $s2, $s3    $s1 = $s2 + $s3
  or  $s3, $s4, $s5    $s3 = $s4 OR $s5
- Data transfer: move register data to/from memory (load & store)
  lw $s1, 100($s2)     $s1 = Memory[$s2 + 100]
  sw $s1, 100($s2)     Memory[$s2 + 100] = $s1
- Branch: alter program flow
  beq $s1, $s2, 25     if ($s1 == $s2) PC = PC + 4 + 4*25 else PC = PC + 4
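The PC arithmetic in the beq example follows a fixed rule: the offset counts words relative to the instruction after the branch. A minimal sketch in Python of that next-PC computation; the addresses and register values are invented for illustration.

```python
def next_pc(pc, rs_val, rt_val, offset):
    """Next PC for MIPS beq: target is PC + 4 + 4*offset when the
    registers are equal, otherwise just the next instruction at PC + 4."""
    if rs_val == rt_val:
        return pc + 4 + 4 * offset
    return pc + 4

# beq $s1, $s2, 25 located at address 0x1000:
print(hex(next_pc(0x1000, 7, 7, 25)))  # taken: 0x1000 + 4 + 100 = 0x1068
print(hex(next_pc(0x1000, 7, 8, 25)))  # not taken: 0x1004
```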

34. MIPS Arithmetic & Logical Instructions
Instruction usage (assembly):
  add dest, src1, src2    dest = src1 + src2
  sub dest, src1, src2    dest = src1 - src2
  and dest, src1, src2    dest = src1 AND src2
Instruction characteristics:
- Always 3 operands: destination + 2 sources
- Operand order is fixed
- Operands are always general-purpose registers
Design principles:
- Design Principle 1: simplicity favors regularity
- Design Principle 2: smaller is faster

35. Arithmetic & Logical Instructions: Binary Representation
Used for arithmetic, logical, and shift instructions. Also called R-Format or R-Type instructions.
Fields (bits 31..0): op (6 bits) | rs (5 bits) | rt (5 bits) | rd (5 bits) | shamt (5 bits) | funct (6 bits)
- op: basic operation of the instruction (opcode)
- rs: first register source operand
- rt: second register source operand
- rd: register destination operand
- shamt: shift amount (more about this later)
- funct: function, i.e. the specific type of operation

36. Arithmetic & Logical Instructions: Binary Representation Example
Machine language for add $8, $17, $18 (see the reference card for op and funct values):

Field    op      rs     rt     rd     shamt  funct
Decimal  0       17     18     8      0      32
Binary   000000  10001  10010  01000  00000  100000
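The field values on this slide can be verified by packing them into a 32-bit word. A minimal sketch in Python; `encode_r` is a hypothetical helper, but the field widths and the values for add $8, $17, $18 come from the slide.

```python
def encode_r(op, rs, rt, rd, shamt, funct):
    """Pack R-format fields into a 32-bit MIPS instruction word:
    op | rs | rt | rd | shamt | funct (6/5/5/5/5/6 bits)."""
    return (op << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

# add $8, $17, $18: op=0, rs=17, rt=18, rd=8, shamt=0, funct=32
word = encode_r(0, 17, 18, 8, 0, 32)
print(f"{word:032b}")  # 00000010001100100100000000100000
print(hex(word))       # 0x2324020
```

Reading the binary string in 6/5/5/5/5/6-bit groups reproduces the slide's table row by row.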

37. MIPS Data Transfer Instructions
- Transfer data between registers and memory
- Instruction format (assembly):
  lw $dest, offset($addr)    load word
  sw $src, offset($addr)     store word
- Uses: accessing a variable in main memory; accessing an array element