Evaluation of Multi-core Architectures for Image Processing Algorithms


Transcripts
Slide 1

Evaluation of Multi-core Architectures for Image Processing Algorithms. Master's Thesis Presentation by Trupti Patil, July 22, 2009

Slide 2

Overview Motivation Contribution & scope Background Platforms Algorithms Experimental Results Conclusion

Slide 3

Motivation Fast processing response is a major requirement in many image processing applications. Image processing algorithms can be computationally expensive. Data needs to be processed in parallel and optimized for real-time performance. The recent introduction of massively parallel computer architectures promises significant speedups. Several architectures haven't been thoroughly explored yet.

Slide 4

Overview Motivation Contribution & scope Background Platforms Algorithms Experimental Results Conclusion

Slide 5

Contribution & scope of the thesis This thesis adapts and optimizes three image processing and computer vision algorithms for four multi-core architectures. Execution timings are measured. The obtained timings are compared against available corresponding prior work (intra-class) and across architecture types (inter-class). Appropriate inferences are drawn from the results.

Slide 6

Overview Motivation Contribution & scope Background Platforms Algorithms Implementation Conclusion

Slide 7

Background Need for parallelization SIMD optimization The need for faster execution time Related work: Canny edge detection on Cell BE [Gupta et al.] and on GPU [Luo et al.]; KLT tracking implementations on GPU [Sinha et al., Zach et al.]

Slide 8

Overview Motivation Contribution & scope Background Platforms Algorithms Implementation Experimental Results Conclusion

Slide 9

Hardware & Software Platforms

Slide 10

Intel NetBurst & Core Microarchitectures Improved performance/watt. SSSE3 support for effective use of the XMM registers. Supports SSE4. Scales up to quad-core. Can execute legacy IA-32 and SIMD applications at a higher clock rate. Hyper-Threading (HT) allows simultaneous multithreading, with two logical processors on each physical processor. Support for up to SSE3.

Slide 11

Cell Broadband Engine (CBE) [Structural diagram of the Cell Broadband Engine: the PPE (PPU with L1 instruction and data caches, backed by an L2 cache) and the SPEs (each an SPU with a Local Store (LS) and a Memory Flow Controller (MFC)), connected over the EIB to main memory, I/O devices, and the graphics device]

Slide 12

Cell processor overview One Power-based PPE with VMX; 32/32 kB I/D L1 and 512 kB L2; dual-issue, in-order PPU with 2 HW threads. Eight SPEs with up to 16x SIMD; dual-issue, in-order SPU; 128 registers (128 b wide); 256 kB local store (LS); 2x 16 B/cycle DMA with 16 outstanding requests. Element Interconnect Bus (EIB): 4 rings, 16 B wide (at 1:2 clock); 96 B/cycle peak, 16 B/cycle to memory; 2x 16 B/cycle BIF and I/O. External communication: dual XDR memory controller (MIC); two configurable bus interfaces (BIC); classical I/O interface; SMP coherent interface.

Slide 13

Graphics Processing Unit (GPU) [Data flow in a GPU: Application, Vertex Processor, Assemble & Rasterize, Fragment Processor (fed by Textures), Framebuffer Operations, Framebuffer]

Slide 14

Nvidia GeForce 8 Series GPU Graphics pipeline in NVIDIA GeForce 8 Series GPU

Slide 15

Compute Unified Device Architecture (CUDA) Computing engine in Nvidia GPUs. Turns the GPU into a compute device: a highly multithreaded coprocessor. Provides both a low-level and a higher-level API. Has several advantages over GPGPU through graphics APIs (e.g. OpenGL).

Slide 16

Overview Motivation Contribution & scope Background Platforms Algorithms Experimental Results Conclusion

Slide 17

Algorithm 1: Gaussian Smoothing Gaussian smoothing is a filtering kernel. Removes small-scale texture and noise for a given spatial extent. The 1-D Gaussian kernel is written as: G(x) = (1/(σ√(2π))) exp(−x²/(2σ²)). The 2-D Gaussian kernel G(x,y) = (1/(2πσ²)) exp(−(x²+y²)/(2σ²)) is separable: G(x,y) = G(x)·G(y).
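As an illustrative sketch (plain Python, not the thesis code): separability means that smoothing with the 2-D kernel is equivalent to a 1-D pass along the rows followed by a 1-D pass along the columns, cutting the per-pixel cost from O(k²) to O(2k) for a kernel of width k. The helper names and the border clamping are assumptions made for this sketch.

```python
import math

def gaussian_kernel_1d(sigma, radius):
    """Sample exp(-x^2 / (2 sigma^2)) on [-radius, radius], normalized to sum 1."""
    k = [math.exp(-(x * x) / (2.0 * sigma * sigma)) for x in range(-radius, radius + 1)]
    s = sum(k)
    return [v / s for v in k]

def convolve_rows(img, kernel):
    """1-D convolution along each row, clamping indices at the borders."""
    r = len(kernel) // 2
    out = []
    for row in img:
        w = len(row)
        out.append([
            sum(kv * row[min(max(x + i - r, 0), w - 1)] for i, kv in enumerate(kernel))
            for x in range(w)
        ])
    return out

def smooth_separable(img, kernel):
    """2-D Gaussian smoothing as two 1-D passes: rows, then columns (via transpose)."""
    t = [list(c) for c in zip(*convolve_rows(img, kernel))]   # row pass, transpose
    return [list(c) for c in zip(*convolve_rows(t, kernel))]  # column pass, transpose back
```

Because the kernel is normalized, a constant image passes through unchanged; that makes a handy sanity check for any optimized port.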

Slide 18

Gaussian Smoothing (example)

Slide 19

Algorithm 2: Canny Edge Detection Edge detection is a common operation in image processing. Edges are discontinuities in image gray levels and have strong intensity contrast. Canny edge detection is an optimal edge-detection algorithm. Illustrated ahead with an example.
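The gradient stage of Canny (after Gaussian smoothing, before non-maximum suppression and hysteresis thresholding) can be sketched in a few lines of Python. The Sobel kernels and the step-edge test below are illustrative choices for this sketch, not taken from the thesis.

```python
import math

# Sobel kernels for horizontal and vertical intensity differences.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def gradients(image):
    """Gradient-magnitude stage of Canny: Sobel gx, gy and |g| = hypot(gx, gy).
    Border pixels are left at zero for brevity."""
    h, w = len(image), len(image[0])
    mag = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = sum(SOBEL_X[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(SOBEL_Y[j][i] * image[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            mag[y][x] = math.hypot(gx, gy)
    return mag
```

On a vertical step edge (a row like 0,0,0,0,255,255,255,255) the magnitude peaks in the two columns straddling the step and is zero in flat regions; non-maximum suppression would then thin that two-pixel ridge to a single-pixel edge.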

Slide 20

Canny Edge Detection (illustration)

Slide 21

Algorithm 3: KLT Tracking First proposed by Lucas and Kanade. Extended by Tomasi and Kanade, and by Shi and Tomasi. First, determine which feature(s) to track through feature selection. Second, track the selected feature(s) across the image sequence. Rests on three assumptions: temporal persistence, spatial coherence, and brightness constancy.
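The brightness-constancy assumption yields the classic Lucas–Kanade update. As a hypothetical 1-D sketch (not from the thesis): if the second signal is a shifted copy, J(x) = I(x − d), then linearizing gives J ≈ I − d·I′, and least squares yields d = Σ I′(I − J) / Σ I′².

```python
import math

def estimate_shift(I, J, dx):
    """One Lucas-Kanade step for a pure 1-D translation J(x) = I(x - d).
    Linearization J ≈ I - d*I' gives d = sum(I'*(I - J)) / sum(I'^2)."""
    num = den = 0.0
    for k in range(1, len(I) - 1):
        g = (I[k + 1] - I[k - 1]) / (2.0 * dx)  # central-difference gradient I'(x)
        num += g * (I[k] - J[k])
        den += g * g
    return num / den

# A sine wave shifted by 0.1 is recovered to within linearization error.
xs = [i * 0.01 for i in range(700)]
I = [math.sin(x) for x in xs]
J = [math.sin(x - 0.1) for x in xs]
d = estimate_shift(I, J, 0.01)
```

In 2-D KLT the same normal equations appear per feature window, with a 2×2 structure matrix in place of Σ I′²; features are selected where that matrix is well conditioned (Shi–Tomasi).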

Slide 22

Algorithm 3: KLT Tracking

Slide 23

Overview Motivation Contribution & scope Background Platforms Algorithms Results Conclusion

Slide 24

Gaussian Smoothing: Results Lenna Mandrill

Slide 25

Results: Gaussian Smoothing

Slide 26

Canny Edge Detection: Results Lenna Mandrill

Slide 27

Results: Canny Edge Detection

Slide 28

Results: Canny Edge Detection Comparison with other implementations on Cell Comparison with other implementations on GPU

Slide 29

Results: KLT Tracking

Slide 30

Results: KLT Tracking Comparison with other implementations on GPU No known implementations yet for the other architectures.

Slide 31

Overview Motivation Contribution & scope Background Platforms Algorithms Results Conclusion & Extension

Slide 32

Conclusion & Future work GPU still ahead of the other architectures; most suited for image processing applications. Optimizing the PS3 implementation could improve timings and narrow the gap between its timings and the GPU's. We could provide: support for faster color Canny; support for kernel widths larger than 5; better handling of GPU thread alignment when the thread count is not a multiple of 16; adding Intel Xeon & Larrabee as potential architectures.

Slide 33

Questions..

Slide 34

Additional Slides

Slide 35

CBE Architecture Contains a conventional microprocessor, the PowerPC Processor Element (PPE), which controls tasks. 64-bit PPC: 32 KB L1 instruction cache, 32 KB L1 data cache, and 512 KB L2 cache. The PPE controls 8 Synergistic Processor Elements (SPEs) operating as SIMD units for data-intensive tasks. Each SPE has an SPU and a Memory Flow Controller (MFC). The SPU (RISC) has 128-bit SIMD registers and a 256 KB local store (LS). PPE, SPEs, MIC, and BIC are connected by the Element Interconnect Bus (EIB) for data movement: a ring bus consisting of four 16-byte channels providing a sustained bandwidth of 204.8 GB/s. The MFC connection to Rambus XDR memory and the BIC interface to I/O devices via RapidIO provide 25.6 GB/s of data bandwidth.

Slide 36

CBE: What makes it fast? Enormous inter-SPE bandwidth: 205 GB/s sustained. Fast main memory: 25.6 GB/s bandwidth to Rambus XDR memory. Predictable DMA latency and throughput. DMA traffic has minimal impact on SPE local store bandwidth. Easy to overlap data movement with computation. High-performance, low-power SPE cores.

Slide 37

Nvidia GeForce (continued) The GPU has K multiprocessors (MPs). Each MP has L scalar processors (SPs). Each MP processes blocks in batches. A block is processed by only one MP. Each block is split into SIMD groups of threads (warps). A warp is executed physically in parallel. A scheduler switches between warps. A warp contains threads with increasing, consecutive thread IDs. Currently the warp size is 32 threads.
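A small arithmetic sketch (hypothetical helper, not from the thesis): since a warp is 32 consecutive thread IDs and a partially filled last warp still occupies a full scheduling slot, the number of warps per block is a ceiling division.

```python
WARP_SIZE = 32  # threads per warp on the GeForce 8 series

def warps_per_block(threads_per_block):
    """Blocks are split into warps of 32 consecutive thread IDs; a partially
    filled last warp still occupies a full scheduling slot."""
    return -(-threads_per_block // WARP_SIZE)  # ceiling division
```

For example, a 48-thread block occupies two warps, one of them half empty, which is why block sizes that are multiples of the warp size use the SIMD lanes most efficiently.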

Slide 38

CUDA: Programming model A grid consists of thread blocks. Each thread executes the kernel. Grid and block dimensions are specified by the application, limited by GPU memory. 1-/2-/3-D grid layouts. Thread and block IDs are unique. [Diagram: a grid of thread blocks, Block (0,0) through Block (3,1); one block expanded into a 2-D array of threads, Thread (0,0) through Thread (5,7), grouped into Warp 1, Warp 2]
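The one-kernel-per-thread model can be mimicked on the host. The `launch_grid` and `scale_kernel` names below are illustrative Python stand-ins (not real CUDA API) for a 1-D launch, where each thread computes its global index as `blockIdx.x * blockDim.x + threadIdx.x` and guards against running past the end of the data.

```python
def launch_grid(grid_dim, block_dim, kernel, *args):
    """Host-side stand-in for a 1-D CUDA launch: run the kernel body once
    for every (block, thread) pair in the grid (serially, for illustration)."""
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(b, block_dim, t, *args)

def scale_kernel(block_idx, block_dim, thread_idx, data, factor, n):
    """Each 'thread' scales one array element, as a CUDA kernel would."""
    i = block_idx * block_dim + thread_idx  # blockIdx.x * blockDim.x + threadIdx.x
    if i < n:                               # the grid may overshoot the array length
        data[i] *= factor

data = list(range(10))
launch_grid(3, 4, scale_kernel, data, 2, len(data))  # 3 blocks x 4 threads cover 10 elements
```

The bounds guard matters because the grid is sized in whole blocks (here 3 x 4 = 12 threads for 10 elements), so the last threads have no element to process.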

Slide 39

CUDA: Memory model Shared memory (R/W): for sharing data within a block. Texture memory: spatially cached. Constant memory: about 20K, cached. Global memory: not cached; accesses should be coalesced. Explicit GPU memory allocation/de-allocation. Slow copying between CPU and GPU memory.
