Impact of Compiler-based Data-Prefetching Techniques on SPEC OMP Application Performance

Impact of Compiler-based Data-Prefetching Techniques on SPEC OMP Application Performance. 2005-23523 이영준 (Lee Young Joon ) MSL, EECS, SNU 2006.06.07. Contents. Introduction Intel Compiler Overview Prefetching for Itanium 2 Processor Experimental Evaluations Concluding Remarks Reference.
Slide 1

Impact of Compiler-based Data-Prefetching Techniques on SPEC OMP Application Performance
2005-23523 이영준 (Lee Young Joon)
MSL, EECS, SNU
2006.06.07.

Slide 2

Contents
  Introduction
  Intel Compiler Overview
  Prefetching for Itanium 2 Processor
  Experimental Evaluations
  Concluding Remarks
  Reference

Slide 3

1. Introduction
The memory wall challenge: the processor-memory speed gap
Remedies
  Latency tolerance: software data prefetching
  Latency elimination: long-latency elimination techniques (locality optimizations)

Slide 4

In this paper
Examine the impact of software data prefetching on SPEC OMP applications
  OpenMP application performance on shared-memory systems
  OpenMP C++/C and Fortran 2.0 standards
  using the Intel C++ and Fortran compilers
  on an SGI Altix 32-way SMP machine built with Itanium 2 processors
Most compiler analyses and optimizations are done before the data-prefetching phase
  using the services of an advanced memory disambiguation module
  pointer analysis, address-taken analysis, array dependence analysis, language semantics, and other sources

Slide 5

2. Intel Compiler Overview
The Intel Itanium 2 processor has new architectural and micro-architectural features
The Intel Itanium compiler takes advantage of them
  EPIC (Explicitly Parallel Instruction Computing) for a high degree of ILP
  Control and data speculation: allows loads to be scheduled across branches or other memory operations
  Predication

Slide 6

Intel Compiler Features
Supports both automatic optimization and programmer-controlled methods
Advanced compiler technologies
  profile-guided multi-file inter-procedural analysis and optimizations
  memory disambiguation/optimizations
  parallelization
  data and loop transformations
  global code scheduling
  predication, speculation
Users can exploit multiprocessors by making small changes to the source code, e.g., OpenMP directives

Slide 7

Compiler Optimizations
Compiler optimizations in the Intel compiler
Multi-Level Parallelism (MLP)
  Instruction Level Parallelism + Thread Level Parallelism
Inter-Procedural Optimization (IPO)
  points-to analysis (helps memory disambiguation), mod/ref analysis
High-Level Optimization (HLO)
  loop transformations (loop fusion, loop tiling, loop unroll-and-jam, loop distribution), software data prefetching, scalar replacement, data transformations
  improve data locality and reduce memory access latency
Scalar Optimizations
  branch merging, strength reduction, constant propagation, dead code elimination, copy propagation, partial dead store elimination, and partial redundancy elimination (PRE)
Task Queuing Model
  exploits irregular parallelism effectively
  extends scope beyond the standard OpenMP programming model

Slide 8

3. Prefetching for Itanium 2 Processor
Software data prefetching
  hides memory access latency by moving referenced data closer to the CPU
  does not block the instruction stream
  does not raise an exception
Software data prefetching in the Intel compiler takes advantage of Itanium 2 architecture features
  predication
  rotating registers
  data speculation

Slide 9

Rotating Registers
Enable compact implementation of software pipelining with predication
Rotated by one register position each time one of the special loop branches is executed
  after one rotation, the content of register X will be found in register X+1
Rotating: r32-r127, f32-f127, p16-p63 (predicate regs)
Others do not rotate (static registers)
[Figure: general register file; r32-r127 rotating (selectable), r0-r31 not rotating (static)]
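The rotation rule on this slide (after one rotation, the content of register X is found in register X+1) can be modeled in a few lines. This is only a toy sketch: the class name `RotatingRegisterFile` is hypothetical, the rotating region is fixed at r32-r127, and the rotation is modeled with a single base offset rather than real hardware state.

```python
class RotatingRegisterFile:
    """Toy model of the rotating region r32-r127 (hypothetical, not an Intel API)."""

    FIRST, LAST = 32, 127  # rotating general registers, as listed on the slide

    def __init__(self):
        self.size = self.LAST - self.FIRST + 1
        self.physical = [None] * self.size
        self.base = 0  # assumption: one base offset stands in for the renaming state

    def _slot(self, reg):
        if not self.FIRST <= reg <= self.LAST:
            raise ValueError("only r32-r127 rotate in this model")
        return (reg - self.FIRST + self.base) % self.size

    def write(self, reg, value):
        self.physical[self._slot(reg)] = value

    def read(self, reg):
        return self.physical[self._slot(reg)]

    def rotate(self):
        # Effect of one special loop branch: every logical name shifts up by one,
        # so the value written as rX is now read as rX+1 (r127 wraps to r32).
        self.base = (self.base - 1) % self.size


regs = RotatingRegisterFile()
regs.write(32, "address of x(k)")
regs.rotate()
print(regs.read(33))
```

No data is copied on a rotation; only the name-to-slot mapping changes, which is why the scheme costs no extra instructions inside the loop.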

Slide 10

Prefetch Principles
Avoid prefetching data that is already loaded in the cache
Issue at the right time
  early enough that the data is available when needed
  late enough that it is not evicted before use
Prefetch distance
  estimated based on memory latency, resource requirements, and data dependence information
[Figure: timeline from prefetch request to cache eviction]
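As a rough illustration of how a prefetch distance can be estimated, the sketch below uses a common textbook simplification: the distance in iterations is the memory latency divided by the time of one loop iteration, rounded up. It ignores the resource requirements and data-dependence information mentioned on the slide, and it is not the Intel compiler's actual heuristic. The function name and the numbers in the usage line are hypothetical.

```python
import math


def prefetch_distance(memory_latency_cycles, cycles_per_iteration):
    """Iterations ahead to prefetch so data arrives just before it is used.

    Simplified model (assumption): latency and loop-body time only.
    """
    if cycles_per_iteration <= 0:
        raise ValueError("cycles_per_iteration must be positive")
    return math.ceil(memory_latency_cycles / cycles_per_iteration)


# Hypothetical numbers: 200-cycle memory latency, 25-cycle loop body.
print(prefetch_distance(200, 25))
```

A larger distance protects against late arrival but raises the risk that the line is evicted before use, which is the trade-off the slide describes.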

Slide 11

Data-locality analysis
Three types of data locality are identified by the Intel compiler
  Spatial locality: data references inside a loop access different memory locations that fall within the same cache line
  Temporal locality: a data reference accesses the same memory location multiple times
  Group locality: different data references access the same cache line
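The three locality types can be approximated with simple stride and offset checks. The sketch below is a heuristic illustration only: the function names are hypothetical, each reference is assumed to advance by a constant byte stride per iteration, and the 128-byte line size is an assumed default. It does not reproduce the compiler's analysis.

```python
def classify_reference(stride_bytes, line_size=128):
    """Classify one reference by its per-iteration byte stride (assumed constant)."""
    if stride_bytes == 0:
        return "temporal"   # same location touched on every iteration
    if abs(stride_bytes) < line_size:
        return "spatial"    # consecutive iterations share a cache line
    return "none"


def may_have_group_locality(offset_a, offset_b, line_size=128):
    """Heuristic: two references into the same array whose byte offsets
    differ by less than one line can fall in the same cache line."""
    return abs(offset_a - offset_b) < line_size


print(classify_reference(8))             # unit-stride access to 8-byte elements
print(may_have_group_locality(-8, 8))    # e.g. y(k-1) and y(k+1), 8-byte elements
```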

Slide 12

An Example of Data-Prefetching
Spatial locality: x(0), ..., x(99), and y(-1), ..., y(100)
Group locality: y(k-1), y(k+1), w.r.t. k loop iterations
The if() statement can be replaced by predication
  control dependence -> data dependence
  reduces branch misprediction penalty
If cache line size = 128B and array element size = 8B
  prefetch distance: D = 16 iterations, computed by the compiler
Assume k=0, D=8
  if the array elements x(k+D) and y(k-1+D) are prefetched, array accesses x(9:15) and y(8:14) will hit the cache
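The last point of this example can be checked with a little index arithmetic. The sketch assumes that x(0) and y(-1) each sit at the start of a 128-byte cache line; that alignment is not stated on the slide. The helper name `same_line_indices` is hypothetical.

```python
def same_line_indices(index, first_index, elem_size=8, line_size=128):
    """Array indices that share a cache line with `index`.

    Assumption: the element at `first_index` is aligned to a line boundary.
    """
    per_line = line_size // elem_size            # 16 elements per line
    line = (index - first_index) // per_line
    start = first_index + line * per_line
    return range(start, start + per_line)


k, D = 0, 8
x_line = same_line_indices(k + D, first_index=0)        # line holding x(8)
y_line = same_line_indices(k - 1 + D, first_index=-1)   # line holding y(7)

# x(9:15) and y(8:14) are covered by the two prefetched lines.
print(all(i in x_line for i in range(9, 16)))
print(all(i in y_line for i in range(8, 15)))
```

Under that alignment assumption, one prefetch of x(8) brings in x(0) through x(15), and one prefetch of y(7) brings in y(-1) through y(14).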

Slide 13

Other...
Large number of registers
  memory addresses for prefetching are kept in registers
  no need for register spill and fill inside loops
Itanium 2 architecture supports memory access hints
  e.g., if a data reference will not be reused, avoid cache pollution with the lfetch "nta" hint
These features help the compiler
  better data-reuse analysis of data movement across loop bodies
  can avoid unnecessary prefetches

Slide 14

4. Experimental Evaluation
4.1. Methodology
The SPEC OMPM2001 benchmark suite
  consists of a set of OpenMP-based application programs
  input reference data sets are derived from scientific computations on SMP systems
  11 large application programs: 8 in Fortran, 3 in C
  requires a virtual address space of 2GB
    larger than SPEC CPU2000
    can run in a 32-bit address space

Slide 15

Experimental System
SGI Altix3000 system
  a distributed shared memory (DSM) architecture
  NUMAflex (NUMA 3)
  a global-address-space, cache-coherent multiprocessor (ccNUMA)
32 Intel Itanium 2 1.5GHz processors
  each CPU has a 16KB instruction + 16KB data L1 cache, a 256KB on-chip L2 cache, and a 6MB on-chip L3 cache
  256GB memory for each 4-CPU module
OS: SGI ProPack v3
Compiler: Intel C++ and Fortran95 compiler version 8.1 beta
All experiments use 32 threads mapped onto 32 processors (one CPU per thread)

Slide 16

SGI Altix3000 Block Diagram
[Figure labels: R-brick (crossbar switch, Router ASIC); C-brick, a.k.a. compute node (Super-Bedrock ASIC); direct connection to I/O; 6.4GB/s each]

Slide 17

4.2. Impact of Software Data-Prefetching
Measured by enabling data prefetching in the Intel compiler together with optimizations such as
  parallelization, privatization
  loop transformations
  IPO
  scalar replacement
  prefetching
  software pipelining
The data-prefetching phase comes after these optimizations
  it can benefit from the earlier ones, giving more effective prefetching
The interaction between these optimizations is very complex

Slide 18

Performance gain with software data prefetching
  314.mgrid_m: almost 100% gain
  6 others: greater than 10%
  332.ammp_m: less than 1% gain
Discussed in detail in the following slides

Slide 19

4.3. Impact of Prefetching for Loads Only
For applications that are memory bound
  loads/stores dominate performance
  adding extra prefetches will increase the pressure on the memory channel
On SMP systems, avoiding resource contention on the memory system is essential
Experiment: issue prefetches only for memory references that are loads, and compare the result with the full version that prefetches for both loads and stores

Slide 20

Gain/loss with prefetching loads only
Reduced memory bandwidth pressure → performance gain
  312.swim_m and 314.mgrid_m
  memory-bandwidth-bound applications with a lot of streaming data accesses
For programs that are not memory bound
  performance loss due to the memory latency of stores
Not a generally applicable scheme for most applications
Geometric mean: 0.06%
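The slides summarize each experiment with a geometric mean over the benchmarks. A minimal sketch of that kind of summary follows, assuming each per-benchmark gain is converted to a speedup ratio before averaging; the slides do not spell out the exact procedure. The function name and the sample gains are hypothetical and are not measured results.

```python
import math


def geometric_mean_gain(gains_percent):
    """Geometric mean of per-benchmark gains, returned as a percentage.

    Assumption: a gain of g% corresponds to a speedup ratio of 1 + g/100.
    """
    ratios = [1.0 + g / 100.0 for g in gains_percent]
    log_sum = sum(math.log(r) for r in ratios)
    return (math.exp(log_sum / len(ratios)) - 1.0) * 100.0


# Hypothetical gains: one benchmark speeds up, one slows down, one is flat.
sample = [10.0, -5.0, 0.0]
print(f"{geometric_mean_gain(sample):.2f}%")
```

Because gains and losses multiply rather than add, a few large gains can be largely cancelled by small losses elsewhere, which fits a near-zero mean such as the 0.06% above.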

Slide 21

4.4. Prefetching for Spatial Locality
When a cache line is filled, it contains multiple elements of an array
For a data reference with spatial locality
  only one prefetch instruction is needed every several iterations
  no cache misses for this memory reference
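To make "one prefetch every several iterations" concrete, the sketch below lists the iterations that need a prefetch for a unit-stride reference. It reuses the 128-byte line and 8-byte element sizes from the earlier example and assumes the array starts on a line boundary. The function name is hypothetical.

```python
def prefetch_iterations(n_iterations, elem_size=8, line_size=128):
    """Iterations at which a unit-stride reference first touches a new line.

    Assumption: element 0 is aligned to a cache-line boundary.
    """
    per_line = line_size // elem_size
    return [i for i in range(n_iterations) if i % per_line == 0]


needed = prefetch_iterations(100)
print(needed)        # one prefetch per 16 iterations
print(len(needed))   # compared with 100 prefetches if issued every iteration
```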

Slide 22

Gain/loss with prefetching for references exhibiting spatial locality
Geometric mean: 21.89% performance gain
332.ammp_m: a slowdown due to performance-measurement noise (OS thread scheduling)
Compared to Section 4.2, this contributes 73.09% of the total gain
Typical loops exhibit spatial locality
  the compiler should exploit it
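The two figures on this slide imply an overall gain for full prefetching, if "contributes 73.09%" is read as the ratio of the spatial-locality gain to the total geometric-mean gain of Section 4.2. That reading is an assumption; the slide does not define the term. A quick check of the arithmetic:

```python
spatial_gain_percent = 21.89   # geometric mean gain from this slide
share_of_total = 73.09         # percent of the total gain, from this slide

# Assumption: share = spatial gain / total gain.
implied_total_percent = spatial_gain_percent / (share_of_total / 100.0)
print(round(implied_total_percent, 2))  # 29.95
```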

Slide 23

4.5. Prefetching using Rotating Registers
The Itanium 2 processor has rotating registers
Register rotation provides a hardware renaming mechanism
  helps the compiler control prefetching with minimal overhead
A clever scheme for improving software data prefetching
  reduces the number of issue slots for prefetch instructions
  avoids branch mispredict penalties from conditionals, or predicate computation
  avoids the need for loop unrolling
Note: some of the prefetches will be redundant (same cache line)

Slide 24

Gain/Loss of Prefetching using Rotating Registers
Baseline performance: without the rotating-register scheme
  uses a conditional statement inside the loop
  which may get predicated by the compiler
Geometric mean: 2.71% gain
  prefetching using the rotating-register scheme has a positive impact

Slide 25

4.6. Prefetching for Spatial References with No Predication
Almost all Itanium 2 processor instructions have a qualifying predicate
  64 predicate registers: p0-p63
Rotating predicate registers
  avoid overwriting a predicate value that is still live
  control the filling and draining of a software pipeline
To prefetch spatially-local references, the compiler minimizes redundant prefetches and avoids branch mispredict penalties
Note: the rotating-register method works only for software-pipelined loops

Slide 26

Gain/Loss of prefetching with no predication
7 out of 11 applications achieved performance gains
Itanium 2 proce
