Code Improvement I: Machine Autonomous Advancements.

Uploaded on:
Code Advancement I: Machine Autonomous Improvements. Subjects Machine-Autonomous Improvements Code movement Diminishment in quality Basic subexpression sharing Tuning Recognizing execution bottlenecks. class26.ppt. Extraordinary Reality #4.
Slide 1

Code Optimization I: Machine Independent Optimizations Topics Machine-Independent Optimizations Code movement Reduction in quality Common subexpression sharing Tuning Identifying execution bottlenecks class26.ppt

Slide 2

Great Reality #4 There’s more to execution than asymptotic intricacy Constant components matter as well! Effortlessly see 10:1 execution reach contingent upon how code is composed Must enhance at various levels: calculation, information representations, strategies, and circles Must comprehend framework to advance execution How projects are gathered and executed How to quantify program execution and distinguish bottlenecks How to enhance execution without wrecking code measured quality and sweeping statement

Slide 3

Optimizing Compilers Provide productive mapping of system to machine register portion code determination and requesting taking out minor inefficiencies Don’t (typically) enhance asymptotic effectiveness up to developer to choose best general calculation huge O reserve funds are (frequently) more essential than steady components however consistent variables additionally matter Have trouble overcoming “optimization blockers” potential memory associating potential technique symptoms.:

Slide 4

Limitations of Optimizing Compilers Operate Under Fundamental Constraint Must not bring on any adjustment in project conduct under any conceivable condition Often keeps it from making improvements when might just influence conduct under neurotic conditions. Conduct that may be clear to the software engineer can be jumbled by dialects and coding styles e.g., information extents may be more restricted than variable sorts recommend Most examination is performed just inside of strategies entire system investigation is excessively extravagant as a rule Most examination is construct just in light of static data compiler experiences issues envisioning run-time inputs When in uncertainty, the compiler must be preservationist

Slide 5

Machine-Independent Optimizations you ought to do paying little respect to processor/compiler Code Motion Reduce recurrence with which calculation performed If it will dependably deliver same result Especially moving code out of circle for (i = 0; i < n; i++) { int ni = n*i; for (j = 0; j < n; j++) a[ni + j] = b[j]; } for (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j];

Slide 6

Compiler-Generated Code Motion Most compilers benefit a vocation with cluster code + basic circle structures Code Generated by GCC for (i = 0; i < n; i++) { int ni = n*i; int *p = a+ni; for (j = 0; j < n; j++) *p++ = b[j]; } for (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j]; imull %ebx,%eax # i*n movl 8(%ebp),%edi # a leal (%edi,%eax,4),%edx # p = a+i*n (scaled by 4) # Inner Loop .L40: movl 12(%ebp),%edi # b movl (%edi,%ecx,4),%eax # b+j (scaled by 4) movl %eax,(%edx) # *p = b[j] addl $4,%edx # p++ (scaled by 4) incl %ecx # j++ jl .L40 # circle if j<n

Slide 7

Reduction in Strength Replace unreasonable operation with more straightforward one Shift, include rather than duplicate or gap 16*x - - > x << 4 Utility machine ward Depends on expense of increase or separation guideline On Pentium II or III, number reproduce just obliges 4 CPU cycles Recognize succession of items int ni = 0; for (i = 0; i < n; i++) { for (j = 0; j < n; j++) a[ni + j] = b[j]; ni += n; } for (i = 0; i < n; i++) for (j = 0; j < n; j++) a[n*i + j] = b[j];

Slide 8

Make Use of Registers Reading and composing registers much speedier than perusing/written work memory Limitation Compiler not generally ready to figure out if variable can be held in register Possibility of Aliasing See case later

Slide 9

Machine-Independent Opts. (Cont.) Share Common Subexpressions Reuse segments of expressions Compilers frequently not exceptionally advanced in abusing math properties/* Sum neighbors of i,j */up = val[(i-1)*n + j]; down = val[(i+1)*n + j]; left = val[i*n + j-1]; right = val[i*n + j+1]; aggregate = up + down + left + right; int inj = i*n + j; up = val[inj - n]; down = val[inj + n]; left = val[inj - 1]; right = val[inj + 1]; whole = up + down + left + right; 3 augmentations: i*n, (i–1)*n, (i+1)*n 1 duplication: i*n leal - 1(%edx),%ecx # i-1 imull %ebx,%ecx # (i-1)*n leal 1(%edx),%eax # i+1 imull %ebx,%eax # (i+1)*n imull %ebx,%edx # i*n

Slide 10

length 0 1 2 length–1 information  Vector ADT Procedures vec_ptr new_vec(int len) Create vector of determined length int get_vec_element(vec_ptr v, int list, int *dest) Retrieve vector component, store at *dest Return 0 if beyond the field of play, 1 if fruitful int *get_vec_start(vec_ptr v) Return pointer to begin of vector information Similar to exhibit usage in Pascal, ML, Java E.g., dependably do limits checking

Slide 11

Optimization Example void combine1(vec_ptr v, int *dest) { int i; *dest = 0; for (i = 0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } Procedure Compute entirety of all components of vector Store result at destination area

Slide 12

Time Scales Absolute Time Typically utilize nanoseconds 10 –9 seconds Time size of PC guidelines Clock Cycles Most PCs controlled by high recurrence clock signal Typical Range 100 MHz 10 8 cycles for each second Clock period = 10ns 2 GHz 2 X 10 9 cycles for each second Clock period = 0.5ns Fish machines: 550 MHz (1.8 ns clock period)

Slide 13

Cycles Per Element Convenient approach to express execution of project that administrators on vectors or records Length = n T = CPE*n + Overhead vsum1 Slope = 4.0 vsum2 Slope = 3.5

Slide 14

Optimization Example void combine1(vec_ptr v, int *dest) { int i; *dest = 0; for (i = 0; i < vec_length(v); i++) { int val; get_vec_element(v, i, &val); *dest += val; } Procedure Compute total of all components of number vector Store result at destination area Vector information structure and operations characterized by means of dynamic information sort Pentium II/III Performance: Clock Cycles/Element 42.06 (Compiled - g) 31.25 (Compiled - O2)

Slide 15

Understanding Loop void combine1-goto(vec_ptr v, int *dest) { int i = 0; int val; *dest = 0; in the event that (i >= vec_length(v)) goto done; circle: get_vec_element(v, i, &val); *dest += val; i++; on the off chance that (i < vec_length (v)) goto circle done: } Inefficiency Procedure vec_length called each emphasis Even however come about dependably the same 1 emphasis

Slide 16

Move vec_length Call Out of Loop void combine2(vec_ptr v, int *dest) { int i; int length = vec_length (v); *dest = 0; for (i = 0; i < length ; i++) { int val; get_vec_element(v, i, &val); *dest += val; } Optimization Move get to vec_length out of internal circle Value does not change from one emphasis to next Code movement CPE: 20.66 (Compiled - O2) vec_length requires just consistent time, yet critical overhead

Slide 17

Code Motion Example #2 Procedure to Convert String to Lower Case Extracted from 213 lab entries, Fall, 1998 void lower(char *s) { int i; for (i = 0; i < strlen(s); i++) if (s[i] >= "A" && s[i] <= \'Z\') s[i] - = (\'A\' - \'a\'); }

Slide 18

Lower Case Conversion Performance Time quadruples when twofold string length Quadratic execution

Slide 19

Convert Loop To Goto Form void lower(char *s) { int i = 0; on the off chance that (i >= strlen(s)) goto done; circle: if (s[i] >= "A" && s[i] <= \'Z\') s[i] - = (\'A\' - \'an\'); i++; on the off chance that (i < strlen (s)) goto circle; done: } strlen executed each emphasis strlen straight long of string Must sweep string until finds " \0 " Overall execution is quadratic

Slide 20

Improving Performance void lower(char *s) { int i; int len = strlen (s); for (i = 0; i < len ; i++) if (s[i] >= "A" && s[i] <= \'Z\') s[i] - = (\'A\' - \'a\'); } Move call to strlen outside of circle Since result does not change starting with one cycle then onto the next Form of code movement

Slide 21

Lower Case Conversion Performance Time copies when twofold string length Linear execution

Slide 22

Optimization Blocker: Procedure Calls Why couldn’t the compiler move vec_len or strlen out of the inward circle? Methodology may have reactions Alters worldwide express every time called Function may not return same worth for given contentions Depends on different parts of worldwide state Procedure lower could interface with strlen Why doesn’t compiler take a gander at code for vec_len or strlen ? Linker may over-burden with distinctive form Unless proclaimed static Interprocedural advancement is not utilized broadly because of expense Warning: Compiler regards methodology call as a discovery Weak improvements in and around them

Slide 23

Reduction in Strength void combine3(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); *dest = 0; for (i = 0; i < length; i++) { *dest += data[i]; } Optimization Avoid technique call to recover every vector component Get pointer to begin of exhibit before circle Within circle simply do pointer reference Not as spotless as far as information deliberation CPE: 6.00 (Compiled - O2) Procedure calls are costly! Limits checking is extravagant

Slide 24

Eliminate Unneeded Memory Refs void combine4(vec_ptr v, int *dest) { int i; int length = vec_length(v); int *data = get_vec_start(v); int entirety = 0; for (i = 0; i < length; i++) aggregate += data[i]; *dest = total; } Optimization Don’t need to store in destination until end Local variable total held in register Avoids 1 memory read, 1 memory compose per cycle CPE: 2.00 (Compiled - O2) Memory references are costly!

Slide 25

Detecting Unneeded Memory Refs. Combine3 Combine4 Performance Combine3 5 guidelines in 6 clock cycles addl must read and co

View more...