Assessment of Multi-core Architectures for Image Processing Algorithms
Master's Thesis Presentation by Trupti Patil, July 22, 2009
Slide 2
Overview
Motivation; Contribution & scope; Background; Platforms; Algorithms; Experimental Results; Conclusion
Slide 3
Motivation
Fast processing response is a major requirement in many image processing applications. Image processing algorithms can be computationally expensive. Data needs to be processed in parallel and optimized for real-time execution. The recent introduction of massively parallel computer architectures promises significant acceleration. Some architectures haven't been actively explored yet.
Slide 4
Overview
Motivation; Contribution & scope; Background; Platforms; Algorithms; Experimental Results; Conclusion
Slide 5
Contribution & scope of the thesis
This thesis adapts and optimizes three image processing and computer vision algorithms for four multi-core architectures. Timings are measured for each. The obtained timings are compared against available corresponding previous work (intra-class) and across architecture types (inter-class). Appropriate inferences are drawn from the results.
Slide 6
Overview
Motivation; Contribution & scope; Background; Platforms; Algorithms; Implementation; Conclusion
Slide 7
Background
Need for parallelization. SIMD optimization. The need for faster execution time. Related work: Canny edge detection on CellBE [Gupta et al.] and on GPU [Luo et al.]; KLT tracking implementations on GPU [Sinha et al., Zach et al.].
Slide 8
Overview
Motivation; Contribution & scope; Background; Platforms; Algorithms; Implementation; Experimental Results; Conclusion
Slide 9
Hardware & Software Platforms
Slide 10
Intel NetBurst & Core Microarchitectures
Core: Improved performance-per-watt factor. SSSE3 support for effective use of the XMM registers. Supports SSE4. Scales up to quad-core.
NetBurst: Can execute legacy IA-32 and SIMD applications at a higher clock rate. HT allows simultaneous multithreading, with two logical processors on each physical processor. Supports up to SSE3.
Slide 11
Cell Broadband Engine (CBE)
[Figure: structural diagram of the Cell Broadband Engine, showing the PPE (PPU, L1 instruction cache, L1 data cache, L2 cache), the SPEs (SPU, local store (LS), memory flow controller (MFC)), the EIB, main memory, the graphics device and the I/O devices]
Slide 12
Cell processor overview
One Power-based PPE, with VMX: 32/32 kB I/D L1 and 512 kB L2; dual-issue, in-order PPU; 2 HW threads.
Eight SPEs, with up to 16x SIMD: dual-issue, in-order SPU; 128 registers (128b wide); 256 kB local store (LS); 2x 16B/cycle DMA, 16 outstanding requests.
Element Interconnect Bus (EIB): 4 rings, 16B wide (at 1:2 clock); 96B/cycle peak, 16B/cycle to memory; 2x 16B/cycle BIF and I/O.
External communication: dual XDR memory controller (MIC); two configurable bus interfaces (BIC); classical I/O interface; SMP coherent interface.
Slide 13
Graphics Processing Unit (GPU)
[Figure: data flow in the GPU, with stages Application, Vertex Processor, Assemble & Rasterize, Fragment Processor, Frame buffer Operations and Framebuffer, plus Textures]
Slide 14
Nvidia GeForce 8 Series GPU
[Figure: graphics pipeline in the NVIDIA GeForce 8 Series GPU]
Slide 15
Compute Unified Device Architecture (CUDA)
Computing engine in Nvidia GPUs. Makes the GPU a compute device that acts as a highly multithreaded coprocessor. Provides both a low-level and a higher-level API. Has several advantages over programming GPUs through graphics APIs (e.g., OpenGL).
Slide 16
Overview
Motivation; Contribution & scope; Background; Platforms; Algorithms; Experimental Results; Conclusion
Slide 17
Algorithm 1: Gaussian Smoothing
Gaussian smoothing is a filtering kernel. It removes small-scale texture and noise for a given spatial extent. 1-D Gaussian kernel written as: [equation]. 2-D Gaussian kernel: [equation]. The 2-D kernel is separable.
Slide 18
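Because the 2-D Gaussian kernel is separable, smoothing can be done as a 1-D pass over the rows followed by a 1-D pass over the columns. The following is a minimal sketch in plain Python, not the thesis implementation. It assumes the standard sampled form exp(-x^2 / (2 sigma^2)) normalized to sum to 1, an image stored as a list of rows, and edge pixels replicated at the border. The function names and the default width-5 kernel (radius 2) are illustrative choices.

```python
import math

def gaussian_kernel_1d(sigma, radius):
    """Sampled 1-D Gaussian, normalized to sum to 1 (radius=2 gives width 5)."""
    weights = [math.exp(-(x * x) / (2.0 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def convolve_rows(image, kernel):
    """Convolve every row with the kernel, replicating edge pixels."""
    radius = len(kernel) // 2
    width = len(image[0])
    out = []
    for row in image:
        new_row = []
        for x in range(width):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), width - 1)  # clamp to border
                acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_smooth(image, sigma=1.0, radius=2):
    """Separable smoothing: horizontal pass, then vertical pass."""
    kernel = gaussian_kernel_1d(sigma, radius)
    horizontal = convolve_rows(image, kernel)
    # The vertical pass is a row pass on the transposed image.
    return transpose(convolve_rows(transpose(horizontal), kernel))
```

For a kernel of width w, the two 1-D passes cost about 2w multiply-adds per pixel instead of w^2 for a direct 2-D convolution, which is why the separable form is the usual starting point for SIMD and GPU versions.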
Gaussian Smoothing (example)
Slide 19
Algorithm 2: Canny Edge Detection
Edge detection is a common operation in image processing. Edges are discontinuities in image gray levels and have strong intensity contrast. Canny edge detection is an optimal edge-detector algorithm. It is illustrated ahead with an example.
Slide 20
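As a rough illustration of how a Canny-style detector locates such discontinuities, the sketch below chains three commonly described stages: gradient estimation, non-maximum suppression, and hysteresis thresholding. It is a simplified sketch under stated assumptions, not the thesis code: the input is assumed to be already Gaussian-smoothed, the Sobel operator is used for gradients, border pixels are ignored, and the thresholds `low` and `high` (with `low` greater than 0) are caller-supplied values. All function names are hypothetical.

```python
import math

def sobel_gradients(img):
    """Return (magnitude, angle) grids; border pixels are left at zero."""
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.atan2(gy, gx)
    return mag, ang

def non_max_suppression(mag, ang):
    """Keep a pixel only if it is a local maximum along its gradient direction."""
    h, w = len(mag), len(mag[0])
    out = [[0.0] * w for _ in range(h)]
    steps = [(1, 0), (1, 1), (0, 1), (-1, 1)]  # 0, 45, 90, 135 degrees
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            sector = int(round(math.degrees(ang[y][x]) / 45.0)) % 4
            dx, dy = steps[sector]
            if (mag[y][x] >= mag[y + dy][x + dx]
                    and mag[y][x] >= mag[y - dy][x - dx]):
                out[y][x] = mag[y][x]
    return out

def hysteresis(nms, low, high):
    """Start from strong pixels (>= high) and grow through weak ones (>= low)."""
    h, w = len(nms), len(nms[0])
    edges = [[False] * w for _ in range(h)]
    stack = [(y, x) for y in range(h) for x in range(w) if nms[y][x] >= high]
    for y, x in stack:
        edges[y][x] = True
    while stack:
        y, x = stack.pop()
        for ny in range(max(y - 1, 0), min(y + 2, h)):
            for nx in range(max(x - 1, 0), min(x + 2, w)):
                if not edges[ny][nx] and nms[ny][nx] >= low:
                    edges[ny][nx] = True
                    stack.append((ny, nx))
    return edges

def canny(img, low, high):
    """Edge map (grid of booleans) for an already-smoothed grayscale image."""
    mag, ang = sobel_gradients(img)
    return hysteresis(non_max_suppression(mag, ang), low, high)
```

The gradient and suppression stages touch each pixel independently, which makes them natural candidates for data-parallel execution; the hysteresis stage follows connected pixels and is harder to parallelize.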
Canny Edge Detection (example)
Slide 21
Algorithm 3: KLT Tracking
First proposed by Lucas and Kanade. Extended by Tomasi and Kanade, and by Shi and Tomasi. First, determine which feature(s) to track through feature selection. Second, track the selected feature(s) across the image sequence. Rests on three assumptions: temporal persistence, spatial coherence and brightness constancy.
Slide 22
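Both steps of KLT can be expressed through the same 2x2 gradient matrix accumulated over a feature window. The sketch below shows one such formulation and is not the thesis implementation. It assumes a pure-translation motion model, central-difference gradients taken on the first image, and a single linearized step without iteration or image pyramids. The function name `klt_step` and its parameters are hypothetical. The smaller eigenvalue of the matrix serves as a Shi-Tomasi style selection score, and solving the 2x2 system gives the displacement estimate.

```python
import math

def klt_step(I, J, cx, cy, r):
    """One Lucas-Kanade translation step for the (2r+1)x(2r+1) window at (cx, cy).

    I and J are consecutive grayscale frames (lists of rows).
    Returns (dx, dy, min_eigenvalue).
    """
    gxx = gxy = gyy = bx = by = 0.0
    for y in range(cy - r, cy + r + 1):
        for x in range(cx - r, cx + r + 1):
            ix = (I[y][x + 1] - I[y][x - 1]) / 2.0   # horizontal gradient
            iy = (I[y + 1][x] - I[y - 1][x]) / 2.0   # vertical gradient
            diff = I[y][x] - J[y][x]                 # brightness-constancy residual
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
            bx += ix * diff
            by += iy * diff
    # Smaller eigenvalue of [[gxx, gxy], [gxy, gyy]]: large values mean the
    # window has gradient energy in two directions and is a good feature.
    min_eig = ((gxx + gyy) - math.sqrt((gxx - gyy) ** 2 + 4.0 * gxy * gxy)) / 2.0
    det = gxx * gyy - gxy * gxy
    if det == 0.0:
        raise ValueError("window has no 2-D texture; displacement is undefined")
    dx = (gyy * bx - gxy * by) / det
    dy = (gxx * by - gxy * bx) / det
    return dx, dy, min_eig
```

A single step is only accurate for small motions; practical trackers typically repeat the step and work coarse-to-fine, which this sketch omits.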
Algorithm 3: KLT Tracking
Slide 23
Overview
Motivation; Contribution & scope; Background; Platforms; Algorithms; Results; Conclusion
Slide 24
Gaussian Smoothing: Results
[Figure: results on the Lenna and Mandrill test images]
Slide 25
Results: Gaussian Smoothing
Slide 26
Canny edge detection: Results
[Figure: results on the Lenna and Mandrill test images]
Slide 27
Results: Canny edge detection
Slide 28
Results: Canny Edge Detection
Comparison with other implementations on Cell. Comparison with other implementations on GPU.
Slide 29
Results: KLT Tracking
Slide 30
Results: KLT Tracking
Comparison with other implementations on GPU. Comparison with other implementations on GPU. No known implementation yet.
Slide 31
Overview
Motivation; Contribution & scope; Background; Platforms; Algorithms; Results; Conclusion & Extension
Slide 32
Conclusion & Future work
The GPU is still ahead of the other architectures and is the most suited for image processing applications. Optimizing on the PS3 could improve timings and narrow the gap between its timings and the GPU's. We could provide: support for faster color Canny; support for kernel widths larger than 5; better handling of thread alignment on the GPU if not a multiple of 16. Include Intel Xeon & Larrabee as potential architectures.
Slide 33
Additional Slides
Slide 35
CBE Architecture
Contains a conventional microprocessor, the PowerPC Processor Element (PPE), which controls tasks. 64-bit PPC: 32 KB L1 instruction cache, 32 KB L1 data cache, and 512 KB L2 cache. The PPE controls 8 synergistic processor elements (SPEs) operating as SIMD units. Each SPE has an SPU and a memory flow controller (MFC), and handles data-intensive tasks. SPU (RISC) with 128-bit SIMD registers and a 256 KB local store (LS). PPE, SPE, MIC and BIC are connected by the Element Interconnect Bus (EIB) for data movement: a ring bus consisting of four 16-byte channels providing a sustained bandwidth of 204.8 GB/s. The MFC connection to Rambus XDR memory and the BIC interface to I/O devices connected via RapidIO provide 25.6 GB/s of data bandwidth.
Slide 36
CBE: What makes it fast?
Huge inter-SPE bandwidth: 205 GB/s sustained throughput. Fast main memory: 25.6 GB/s bandwidth for Rambus XDR memory. Predictable DMA latency and throughput. DMA traffic has negligible impact on SPE local store bandwidth. Easy to overlap data movement with computation. High-performance, low-power SPE cores.
Slide 37
Nvidia GeForce (Continued)
The GPU has K multiprocessors (MP). Each MP has L scalar processors (SP). Each MP processes blocks in batches. A block is processed by only one MP. Each block is split into SIMD groups of threads (warps). A warp is executed physically in parallel. A scheduler switches between warps. A warp contains threads of increasing, consecutive thread IDs. Currently the warp size is 32 threads.
Slide 38
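The way a block is split into warps of consecutive thread IDs can be written out as simple index arithmetic. The sketch below models it in Python purely for illustration (it is not device code). It assumes the warp size of 32 stated above and the usual row-major linearization of a thread's (x, y, z) index within its block, with x varying fastest. The helper names are hypothetical.

```python
WARP_SIZE = 32  # warp size stated above

def linear_thread_id(thread_idx, block_dim):
    """Row-major thread ID within a block: x varies fastest, then y, then z."""
    tx, ty, tz = thread_idx
    bx, by, _bz = block_dim
    return tx + ty * bx + tz * bx * by

def warp_and_lane(thread_idx, block_dim):
    """Return (warp index within the block, lane within the warp)."""
    tid = linear_thread_id(thread_idx, block_dim)
    return tid // WARP_SIZE, tid % WARP_SIZE

def warps_per_block(block_dim):
    """Number of warps needed for a block; the last warp may be partly empty."""
    bx, by, bz = block_dim
    return (bx * by * bz + WARP_SIZE - 1) // WARP_SIZE
```

For a 16x16 block, each warp covers two consecutive rows of 16 threads, so `warp_and_lane((0, 2, 0), (16, 16, 1))` falls at the start of the second warp.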
CUDA: Programming model
[Figure: grid of thread blocks, Block (0,0) through Block (3,1), with Block (2,1) expanded into threads Thread (0,0) through Thread (5,7), grouped into Warp 1 and Warp 2]
A grid consists of thread blocks. Each thread executes the kernel. Grid and block dimensions are specified by the application, with the maximum limited by GPU memory. 1-, 2- or 3-D grid layout. Thread and block IDs are unique.
Slide 39
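To make the grid and block bookkeeping concrete, the sketch below simulates in Python how an application might size a 2-D grid for an image and map each (block, thread) pair to a pixel. It is an illustrative model rather than device code, and it assumes 16x16 thread blocks, one thread per pixel, and a bounds check for threads that fall outside the image when its dimensions are not a multiple of the block size. The function names are hypothetical.

```python
def grid_dims(width, height, block=(16, 16)):
    """Number of blocks per axis, rounded up so the grid covers the image."""
    bw, bh = block
    return ((width + bw - 1) // bw, (height + bh - 1) // bh)

def simulate_launch(width, height, block=(16, 16)):
    """Return (launched, pixels): threads launched and the pixels they process."""
    bw, bh = block
    gx, gy = grid_dims(width, height, block)
    pixels = []
    for by in range(gy):              # block index, y
        for bx in range(gx):          # block index, x
            for ty in range(bh):      # thread index within the block, y
                for tx in range(bw):  # thread index within the block, x
                    x = bx * bw + tx  # global pixel column
                    y = by * bh + ty  # global pixel row
                    if x < width and y < height:  # skip out-of-image threads
                        pixels.append((x, y))
    return gx * gy * bw * bh, pixels
```

In this model a 20x10 image with 16x16 blocks needs a 2x1 grid and launches 512 threads, of which 200 map to real pixels; the rest are idle padding.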
CUDA: Memory model
Shared memory (R/W): for sharing data within a block. Texture memory: spatially cached. Constant memory: about 20K, cached. Global memory: not cached; coalesce accesses. Explicit GPU memory allocation/deallocation. Slow copying between CPU and GPU memory.