Best Practices to Improve Query Performance in a Data Warehouse - 1


Slide 1

Best Practices to Improve Query Performance in a Data Warehouse - 1 Calisto Zuzarte, STSM, IBM, calisto@ca.ibm.com

Slide 2

Data Warehouse Life Cycle
Database design / application design: the warehouse application designers and database administrators work together to plan the queries and schema before they put the application into production.
Database performance layer implementation: in order to meet SLAs, DBAs typically go through several iterations, augmenting the database with performance layer objects and establishing the initial configuration to get good performance.
Database tuning operations: during production, with changing requirements and changes in the data, ongoing tuning is required to keep operations running smoothly.

Slide 3

Motivation
Data warehouse environments have these characteristics: large volumes of data, with millions or billions of rows in some tables; large joins, large sorts, and large aggregations; many tables involved; large amounts of data rolled in and rolled out; and complex queries, both report queries and ad hoc queries. It is critical to focus on query performance.

Slide 4

Objective
Provide recommendations, from a DB2 optimizer point of view, to improve query performance throughout the data warehouse life cycle.

Slide 5

Agenda
SESSION 1: Best Practices – Database Design; Best Practices – Application Design; Best Practices – Configuration and Operations.
SESSION 2: Best Practices – Performance Layer.

Slide 6

Best Practices – Database Design
Parallelism: inter-partition shared-nothing parallelism (DPF); intra-query parallelism (SMP).
Partitioning: database partitioning; table (range) partitioning; UNION ALL views; multidimensional clustering.
Schema.

Slide 7

Best Practices - Parallelism
DPF or SMP or both? The Database Partitioning Feature (DPF) is generally recommended to achieve parallelism in a data warehouse; it provides scalability and query performance. SMP (intra-query parallelism) is NOT recommended in concurrent multi-user environments with heavy CPU utilization. SMP is only recommended when the CPUs are highly underutilized and when DPF is not possible.

Slide 8

Partitioning (Complementary Strategies in DB2)
Database partitioning (DPF): CREATE TABLE … DISTRIBUTE BY HASH. Key benefit: better scalability and performance through parallelism. (Terms: "Database Partitioning", "Distribution Key".)
Table (range) partitioning: CREATE TABLE … PARTITION BY RANGE. Key benefit: better data management (roll-in and roll-out of data). (Terms: "Table Partitioning", "Table Partitioning Key".)
UNION ALL views: CREATE VIEW V AS (SELECT … FROM F1 UNION ALL …). Key benefit: independent branch optimization. (Term: "UNION ALL branch partitioning".)
Multidimensional clustering (MDC): CREATE TABLE … ORGANIZE BY DIMENSIONS. Key benefit: better query performance through data clustering. (Terms: "Cells", "Blocks", "Dimensions".)
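As a sketch, the first, second, and fourth strategies can be combined on a single table. The table and column names below are illustrative (DB2 9.x syntax assumed), not from the original deck:

```sql
-- Hypothetical fact table combining three physical design strategies:
-- hash-distributed across database partitions (DPF),
-- range-partitioned by month for roll-in/roll-out,
-- and MDC-clustered by region for query performance.
CREATE TABLE sales (
    store_id   INTEGER       NOT NULL,
    sale_date  DATE          NOT NULL,
    region     CHAR(5)       NOT NULL,
    amount     DECIMAL(12,2)
)
DISTRIBUTE BY HASH (store_id)
PARTITION BY RANGE (sale_date)
    (STARTING '2009-01-01' ENDING '2009-12-31' EVERY 1 MONTH)
ORGANIZE BY DIMENSIONS (region);
```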

Slide 9

[Diagram: one table physically organized three ways at once. DISTRIBUTE BY HASH spreads rows across Database Partitions 1, 2 and 3; PARTITION BY RANGE places the Jan and Feb ranges in table spaces TS1 and TS2; ORGANIZE BY DIMENSIONS clusters each partition's data into cells for the North/South and East/West region values.]

Slide 10

Best Practices – DPF Partitioning
Collocate the fact table and the largest, most frequently joined dimension table. Choose the distribution key to avoid heavy skew on some partitions. Avoid a DATE distribution key, where active transactions for the current date all fall on one database partition (TIMESTAMP is fine). Consider workload isolation for data marts: different partition groups but common dimension tables. Recommend that dimension tables be replicated (discussed later).
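Collocation requires the joined tables to share the same distribution key (and database partition group). A minimal sketch, with illustrative table and column names:

```sql
-- Fact table and its most frequently joined dimension distributed on
-- the same key, so the join can be resolved locally on each partition.
CREATE TABLE customer_dim (
    custkey  INTEGER NOT NULL PRIMARY KEY,
    country  VARCHAR(40)
)
DISTRIBUTE BY HASH (custkey);

CREATE TABLE sales_fact (
    custkey  INTEGER NOT NULL,
    timekey  INTEGER NOT NULL,
    amount   DECIMAL(12,2)
)
DISTRIBUTE BY HASH (custkey);   -- same key: the join is collocated
```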

Slide 11

Best Practices – Table Partitioning
Table (range) partitioning: recommend partitioning the fact tables; recommend using the DATE dimension, which works better with application key predicates applied directly. Consider partitioned indexes with V9.7. Choose the partitioning based on the roll-in/roll-out granularity.
UNION ALL views: define view predicates or CHECK constraints to get branch elimination with query predicates (with constants only). Use UNION ALL views only with well-designed applications: there are dangers of materialization with ad hoc queries, and a large number of branches needs time and memory to optimize.
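A sketch of branch elimination with CHECK constraints; table names and ranges are illustrative:

```sql
-- Per-year tables with CHECK constraints, so the optimizer can prove
-- which branches a constant predicate can possibly touch.
CREATE TABLE sales_2008 (
    sale_date DATE NOT NULL,
    amount    DECIMAL(12,2),
    CONSTRAINT c2008 CHECK (sale_date BETWEEN '2008-01-01' AND '2008-12-31')
);
CREATE TABLE sales_2009 (
    sale_date DATE NOT NULL,
    amount    DECIMAL(12,2),
    CONSTRAINT c2009 CHECK (sale_date BETWEEN '2009-01-01' AND '2009-12-31')
);

CREATE VIEW sales_v AS
    SELECT * FROM sales_2008
    UNION ALL
    SELECT * FROM sales_2009;

-- With a constant predicate, only the sales_2009 branch needs scanning:
SELECT SUM(amount) FROM sales_v WHERE sale_date = '2009-10-22';
```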

Slide 12

Best Practices – Multidimensional Clustering (MDC)
Recommend defining MDC on the fact table. Benefits: guaranteed clustering (avoids the need to REORG to maintain clustering), I/O optimization, and compact block indexes (which coexist with regular indexes). Choose the dimensions based on query predicates; recommend using 1 to 4 dimensions, and ensure the dimensions are chosen such that they don't waste storage. MDC can use a finer granularity than the table partitioning range; for instance, table partitioning by month and MDC by date.
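The month/date granularity example might look like the following sketch (names and ranges are illustrative):

```sql
-- Range partitions by month (the roll-in/roll-out unit), while MDC
-- clusters the rows within each partition by individual date.
CREATE TABLE daily_fact (
    sale_date  DATE NOT NULL,
    amount     DECIMAL(12,2)
)
PARTITION BY RANGE (sale_date)
    (STARTING '2009-01-01' ENDING '2009-12-31' EVERY 1 MONTH)
ORGANIZE BY DIMENSIONS (sale_date);
```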

Slide 13

[Diagram: star schema. The SALES fact table (Product_id, Store_id, Channel_id, Date_id, Amount, Quantity, …) joins to four dimension tables: STORE (Store_id, Region_id, …), PRODUCT (Product_id, Class_id, Group_id, Family_id, Line_id, Division_id, …), TIME (Date_id, Month_id, Quarter_id, Year_id) and CHANNEL (Channel_id, …).]

Slide 14

[Diagram: dimension hierarchies above the Sales fact (level 0). Product dimension: Division (level 5) > Line (4) > Family (3) > Group (2) > Class (1) > Product (0). Time dimension: Year > Quarter > Month > Date. Store dimension: Retailer > Store. Channel dimension: Channel.]

Slide 15

Best Practices - Schema
Surrogate keys: as far as possible, use the application keys themselves; this allows predicates to be applied or transferred directly to the fact table. DATE is a good candidate (easier for roll-in/roll-out and for MDC).
Star schema / snowflake: separate tables for each level of a dimension hierarchy (snowflake) may result in a large number of joins; flattened dimensions may contain a lot of redundancy (space).
Define columns NOT NULL when appropriate: many optimizations are only done for NOT NULL columns.
Define uniqueness when appropriate: primary keys, unique constraints, unique indexes.
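A small sketch of the NOT NULL and uniqueness recommendations (table and column names are illustrative):

```sql
-- NOT NULL and declared uniqueness give the optimizer more freedom,
-- e.g. for join rewrites and avoiding NULL handling.
CREATE TABLE product_dim (
    product_id  INTEGER NOT NULL,
    class_id    INTEGER NOT NULL,
    line_id     INTEGER NOT NULL,
    CONSTRAINT pk_product PRIMARY KEY (product_id)
);
```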

Slide 16

Agenda
SESSION 1: Best Practices – Database Design; Best Practices – Application Design; Best Practices – Configuration and Operations.
SESSION 2: Best Practices – Performance Layer.

Slide 17

Application Considerations - Expressions
Use constants rather than expressions in the query. Example: for SELECT … WHERE DateCol <= CURRENT DATE - 5 DAYS, use VALUES (CURRENT DATE - 5 DAYS) to get the constant first and then use it in the query.
Avoid expressions on indexed columns. Example: instead of SELECT … WHERE DATECOL - 2 DAYS > '2009-10-22', write SELECT … WHERE DATECOL > '2009-10-22' + 2 DAYS.
The same recommendation applies to cast functions. Example: instead of SELECT … WHERE INT(CHARCOL) = 2009, write SELECT … WHERE CHARCOL = '2009'. Note that you may lose errors/warnings with this rewrite.
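The two rewrites above, laid out as before/after sketches (fact table and column names are illustrative):

```sql
-- Slow: the expression on the column hides it from index and
-- partition matching.
SELECT * FROM fact WHERE datecol - 2 DAYS > '2009-10-22';
-- Better: move the date arithmetic to the constant side.
SELECT * FROM fact WHERE datecol > DATE('2009-10-22') + 2 DAYS;

-- Slow: a cast function applied to the indexed column.
SELECT * FROM fact WHERE INT(charcol) = 2009;
-- Better: compare against a constant of the column's own type.
-- (Rows where charcol is not numeric no longer raise a conversion
-- error; that is the lost error/warning the slide mentions.)
SELECT * FROM fact WHERE charcol = '2009';
```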

Slide 18

Application Considerations – Table Partitioning/MDC
As far as possible, put local predicates directly on the table partitioning or MDC dimension columns of the fact table.
SELECT ... FROM CUSTDIM C, TIMEDIM T, FACT F WHERE C.COUNTRY = 'USA' AND C.KEYCOL = F.CUSTKEYCOL AND T.DATE = '2009-01-15' AND T.KEYCOL = F.TIMEKEYCOL
Simplify if TIMEKEYCOL is correlated to the TIME values (for instance, TIMEKEYCOL = 20090115 for the date '2009-01-15'):
SELECT ... FROM CUSTDIM C, FACT F WHERE C.COUNTRY = 'USA' AND C.KEYCOL = F.CUSTKEYCOL AND F.TIMEKEYCOL = 20090115

Slide 19

Application Considerations – Table Partitioning/MDC
Another example; consider:
SELECT ... FROM CUSTDIM C, TIMEDIM T, FACT F WHERE C.COUNTRY = 'USA' AND C.KEYCOL = F.CUSTKEYCOL AND T.YEAR = 2009 AND T.KEYCOL = F.TIMEKEYCOL
First get the values for MINKEY and MAXKEY:
SELECT MIN(KEYCOL) FROM TIMEDIM WHERE YEAR = 2009
SELECT MAX(KEYCOL) FROM TIMEDIM WHERE YEAR = 2009
Then write the SQL as follows:
SELECT ... FROM CUSTDIM C, TIMEDIM T, FACT F WHERE C.COUNTRY = 'USA' AND C.KEYCOL = F.CUSTKEYCOL AND T.YEAR = 2009 AND T.KEYCOL = F.TIMEKEYCOL AND F.TIMEKEYCOL >= MINKEY AND F.TIMEKEYCOL <= MAXKEY
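The min/max bounding technique, as a two-step sketch. The bound values shown are hypothetical, assuming a YYYYMMDD-style time key as in the slide:

```sql
-- Step 1: fetch the key bounds for the year once, up front.
SELECT MIN(keycol), MAX(keycol) FROM timedim WHERE year = 2009;
-- Suppose this returns 20090101 and 20091231.

-- Step 2: repeat the bounds as constants directly on the fact column,
-- so partition elimination / MDC block skipping can apply.
SELECT COUNT(*)                      -- select list is illustrative
FROM custdim c, timedim t, fact f
WHERE c.country = 'USA'
  AND c.keycol = f.custkeycol
  AND t.year = 2009
  AND t.keycol = f.timekeycol
  AND f.timekeycol >= 20090101
  AND f.timekeycol <= 20091231;
```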

Slide 20

Application Considerations – General Recommendations
Avoid repeating complex expressions. Use global temporary tables to split a query if it contains more than around 15 tables and compile time is an issue.

Slide 21

Agenda
SESSION 1: Best Practices – Database Design; Best Practices – Application Design; Best Practices – Configuration and Operations.
SESSION 2: Best Practices – Performance Layer.

Slide 22

Best Practices – Configuration and Operations
Configuration: database configuration, database manager configuration, registry settings.
Operations: collecting statistics.

Slide 23

Configuration
Use optimization level 5. Avoid numerous bufferpools of the same page size. Configuration rules of thumb: BUFFERPOOL ~= SHEAPTHRES; SORTHEAP ~= SHEAPTHRES / (number of concurrent sorts and hash joins).
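One way to apply the SORTHEAP rule of thumb from SQL is the ADMIN_CMD procedure; the value below is purely illustrative of the division, not a recommendation:

```sql
-- Illustrative only: with a sort heap threshold of roughly 400,000
-- pages and about 20 concurrent sorts/hash joins, size SORTHEAP near
-- the quotient (400000 / 20 = 20000 pages).
CALL SYSPROC.ADMIN_CMD('UPDATE DB CFG USING SORTHEAP 20000');
```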

Slide 24

Registry Variables
DB2_ANTIJOIN=EXTEND: set this if slow queries contain NOT EXISTS or NOT IN predicates.

Slide 25

Registry Variables
DB2_REDUCED_OPTIMIZATION=YES: set this if compile time is an issue. IBM Service may recommend a more complex setting, for example DB2_REDUCED_OPTIMIZATION=10,15,20,00011000…
First part, DB2_REDUCED_OPTIMIZATION=A,B,C: if there are more than C joins, use "fast greedy" join enumeration; else if more than B joins, use "greedy"; else if more than A joins, use a reduced "dynamic" approach. The second part is not documented (it is mainly intended to be set by IBM Service).

Slide 26

Best Practices
Optimization level 5. BUFFERPOOL ~= SHEAPTHRES. DB2_ANTIJOIN=EXTEND. DB2_REDUCED_OPTIMIZATION=YES.

Slide 27

Collecting Statistics
The DB2 query optimizer depends on reasonably accurate statistics to produce a good query plan. The user runs RUNSTATS when the data changes (as part of ETL). Statistics fabrication (problematic): DB2 keeps UPDATE/DELETE/INSERT counters, but fabrication is limited to a few statistics, which is insufficient. Automatic statistics: automatically collects statistics on tables that need it; runs in the background as a low-priority task.
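A typical post-ETL RUNSTATS invocation from SQL might look like the following sketch; the schema and table names are illustrative:

```sql
-- Collect table, distribution and detailed index statistics after a
-- load; distribution statistics help the optimizer with skewed columns.
CALL SYSPROC.ADMIN_CMD(
    'RUNSTATS ON TABLE dwh.sales_fact WITH DISTRIBUTION AND DETAILED INDEXES ALL');
```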
