2D FFT Implementation on Cell Broadband Engine Architecture


This presentation explores the challenges of implementing a 2D FFT on the Cell Broadband Engine Architecture, with a focus on optimizing memory bandwidth. By automating implementation details, developers gain more opportunity to optimize the structure of the algorithm and approach the practical limit of processing. The Cell Broadband Engine is a good platform for studying memory bandwidth optimization; other processor clusters face similar challenges.



Presentation Transcript


1. Implementation of 2-D FFT on the Cell Broadband Engine Architecture
William Lundgren (wlundgren@gedae.com, Gedae), Kerry Barnes (Gedae), James Steed (Gedae)
HPEC 2009

2. Introduction
Processing is limited by either memory or CPU bandwidth; the challenge is to achieve the practical limit of processing. Large 2-D FFTs are limited by memory bandwidth. Automating the details of implementation gives the developer more opportunity to optimize the structure of the algorithm. The Cell Broadband Engine is a good platform for studying efficient use of memory bandwidth: data movements are exposed and can be controlled, whereas cache managers hide the data movement. Intel x86 and IBM Power processor clusters, Larrabee, Tilera, etc. face similar challenges.

3. Cell/B.E. Memory Hierarchy
Each SPE core has a 256 kB local storage (LS). Each Cell/B.E. chip has a large system memory.
[Diagram: two Cell/B.E. chips (duplicate or heterogeneous subsystems), each with a PPE, eight SPEs with local stores, an EIB, system memory (SYSMEM), and a bridge between the chips.]

4. Effect of Tile Size on Throughput
[Chart: throughput versus number of processors (# Procs); times are measured within Gedae.]

5. Limiting Resource is Memory Bandwidth
Simple algorithm: FFT, transpose, FFT. It scales well to large matrices, but the data must be moved into and out of memory 6 times, for a total of 2*4*512*512*6 = 12.6e6 bytes.
- 12.6e6 / 25.6e9 = 0.492 ms
- Total flops = 5*512*log2(512) * 2 * 512 = 23.6e6
- 23.6e6 / 204.8e9 = 0.115 ms
- Clearly the limiting resource is memory IO.
For matrices up to 512x512, a faster algorithm is possible. It reduces the memory IO to 2 transfers into and 2 out of system memory. The expected performance, based on the practical memory IO bandwidth shown on the previous chart, is 62 gflops.
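These estimates can be reproduced with a short, self-contained C sketch. It is only a back-of-the-envelope check: the 25.6 GB/s and 204.8 gflops figures are the ones quoted above, and the slide's 62 gflops estimate additionally uses the measured practical bandwidth from the previous chart rather than the theoretical number.

    /* Back-of-the-envelope check of the memory-IO vs. compute times quoted above.
       Uses the slide's figures: 25.6 GB/s memory bandwidth and 204.8 gflops
       single-precision peak for the SPEs. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double N = 512.0;                  /* matrix is N x N            */
        const double bytes_per_elem = 2.0 * 4.0; /* split complex, float       */
        const double mem_bw = 25.6e9;            /* bytes per second           */
        const double peak   = 204.8e9;           /* flops per second           */

        /* Simple FFT, transpose, FFT: data crosses the memory interface 6 times. */
        double t_mem = bytes_per_elem * N * N * 6.0 / mem_bw;       /* ~0.49 ms  */

        /* 5*N*log2(N) flops per 1-D FFT, 2*N FFTs (rows plus columns). */
        double flops = 5.0 * N * log2(N) * 2.0 * N;                 /* ~23.6e6   */
        double t_cpu = flops / peak;                                /* ~0.115 ms */

        printf("memory time %.3f ms, compute time %.3f ms\n",
               t_mem * 1e3, t_cpu * 1e3);

        /* 4-phase algorithm: only 2 transfers in and 2 out (4 crossings).
           With the measured practical bandwidth from the previous chart this
           works out to the ~62 gflops limit quoted on the slide. */
        double t_mem4 = bytes_per_elem * N * N * 4.0 / mem_bw;
        printf("4-phase memory-bound time %.3f ms\n", t_mem4 * 1e3);
        return 0;
    }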

6. Overview of 4 Phase Algorithm
Repeat C1 times:
- Move C2/Procs columns to local storage
- FFT
- Transpose C2/Procs * R2/Procs matrix tiles and move to system memory
Repeat R1 times:
- Move R2/Procs * C2/Procs tiles to local storage
- FFT
- Move R2/Procs columns to system memory
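For illustration only, the control flow of the four phases on one SPE can be sketched in plain C; the move/fft/transpose functions below are placeholder stubs (assumptions for this sketch), not Gedae or Cell SDK calls.

    /* Schematic per-SPE control flow of the 4-phase algorithm.  The move/fft/
       transpose calls are placeholders for DMA transfers and SPE kernels. */
    enum { PROCS = 8, R1 = 4, R2 = 8, R3 = 2, C1 = 4, C2 = 8, C3 = 2 };

    static void move_columns_to_ls(int c1)        { (void)c1; /* DMA in      */ }
    static void fft_rows(void)                    {           /* row FFTs    */ }
    static void transpose_tiles_to_sysmem(int c1) { (void)c1; /* DMA out     */ }
    static void move_tiles_to_ls(int r1)          { (void)r1; /* DMA in      */ }
    static void fft_columns(void)                 {           /* column FFTs */ }
    static void move_columns_to_sysmem(int r1)    { (void)r1; /* DMA out     */ }

    void two_d_fft_per_spe(void)
    {
        /* Phases 1-2: row FFTs, then transpose tiles out to system memory.  */
        for (int c1 = 0; c1 < C1; c1++) {
            move_columns_to_ls(c1);            /* C2/Procs columns           */
            fft_rows();
            transpose_tiles_to_sysmem(c1);     /* C2/Procs * R2/Procs tiles  */
        }
        /* Phases 3-4: gather tiles, column FFTs, stream results back out.   */
        for (int r1 = 0; r1 < R1; r1++) {
            move_tiles_to_ls(r1);              /* R2/Procs * C2/Procs tiles  */
            fft_columns();
            move_columns_to_sysmem(r1);        /* R2/Procs columns           */
        }
    }

    int main(void) { two_d_fft_per_spe(); return 0; }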

7. Optimization of Data Movement
We know that to achieve optimal performance we must use buffers with a row length >= 2048 bytes. This complexity is beyond the reach of many programmers.
Approach:
- Use the Idea Language (Copyright Gedae, Inc.) to design the data organization and movement among memories
- Implement on the Cell processor using Gedae DF
- A future implementation will be fully automated from the Idea design
The following charts show the algorithm design using the Idea Language, the implementation, and diagrams.
Multicores require the introduction of fundamentally new automation.

8. Row and Column Partitioning
Procs = number of processors (8); it is the slowest moving row and column partition index.
Row decomposition:
- R1 = slow moving row partition index (4)
- R2 = middle moving row partition index (8)
- R3 = fast moving row partition index (2)
- Row size = Procs * R1 * R2 * R3 = 512
Column decomposition:
- C1 = slow moving column partition index (4)
- C2 = middle moving column partition index (8)
- C3 = fast moving column partition index (2)
- Column size = Procs * C1 * C2 * C3 = 512

9. Notation: Ranges and Comma Notation
range r1 = R1;
is an iterator that takes on the values 0, 1, ..., R1-1. #r1 equals R1, the size of the range variable.
We define the ranges as:
range rp = Procs; range r1 = R1; range r2 = R2; range r3 = R3;
range cp = Procs; range c1 = C1; range c2 = C2; range c3 = C3;
range iq = 2; /* real, imag */
We define comma notation as:
x[rp,r1]
is a vector of size #rp * #r1 and is equivalent to
x[rp * #r1 + r1]
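In C terms, comma notation is ordinary row-major index flattening. A minimal sketch (plain C, not Idea syntax; the values are illustrative):

    /* Comma notation x[rp,r1] as row-major index flattening, in plain C. */
    #include <assert.h>
    #include <stdio.h>

    enum { PROCS = 8, R1 = 4, R2 = 8, R3 = 2 };   /* row ranges from slide 8 */

    /* x[rp,r1] is a vector of size #rp * #r1, equivalent to x[rp * R1 + r1]. */
    static float get(const float *x, int rp, int r1) { return x[rp * R1 + r1]; }

    int main(void)
    {
        assert(PROCS * R1 * R2 * R3 == 512);      /* row size check          */

        float x[PROCS * R1];
        for (int rp = 0; rp < PROCS; rp++)
            for (int r1 = 0; r1 < R1; r1++)
                x[rp * R1 + r1] = 100.0f * rp + r1;

        printf("x[3,2] = %g\n", get(x, 3, 2));    /* prints 302              */
        return 0;
    }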

10. Input Matrix
The input matrix is 512 by 512, split complex (re, im):
range r = R; range c = C;
r decomposes as rp,r1,r2,r3
c decomposes as cp,c1,c2,c3
range iq = 2
x[iq][r][c]
[Figure: 512x512 input matrix.]

11. Distribution of Data to SPEs
Decompose the input matrix by row for processing on 8 SPEs:
[rp]x1[iq][r1,r2,r3][c] = x[iq][rp,r1,r2,r3][c]
[Figure: the matrix in system memory divided into row blocks, one block per SPE (SPE 0 through SPE 7).]

12. Stream Submatrices from System Memory into SPE
Consider 1 SPE. Stream submatrices with R3 (2) rows from system memory into the local store of the SPE. Gedae will use list DMA to move the data.
[rp]x2[iq][r3][c](r1,r2) = [rp]x1[iq][r1,r2,r3][c]
[Figure: submatrices streamed from system memory into local storage.]
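A plain-C sketch of the data movement this expression describes; memcpy stands in for the list DMA that Gedae would issue, and the buffer names and layouts are assumptions for illustration:

    /* Sketch of streaming one R3-row submatrix of an SPE's row block from
       system memory into local store.  memcpy stands in for list DMA. */
    #include <string.h>

    enum { R1 = 4, R2 = 8, R3 = 2, C = 512, IQ = 2 };

    /* One SPE's row block in system memory: x1[iq][r1,r2,r3][c] (split complex). */
    /* Local-store destination:              x2[iq][r3][c], streamed over (r1,r2). */
    void stream_submatrix(const float x1[IQ][R1 * R2 * R3][C],
                          float x2[IQ][R3][C],
                          int r1, int r2)
    {
        for (int iq = 0; iq < IQ; iq++)
            for (int r3 = 0; r3 < R3; r3++)
                /* each DMA-list element would cover one contiguous row of C floats */
                memcpy(x2[iq][r3], x1[iq][(r1 * R2 + r2) * R3 + r3],
                       C * sizeof(float));
    }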

13. FFT Processing
Process submatrices by computing an FFT on each row:
[rp]x3[r3][iq][c] = fft([rp]x2[iq][r3])
Since the [c] dimension is dropped from the argument to the fft function, it is passed a complete row (vector). The stream indices are dropped; the stream processing is automatically implemented by Gedae.
Notice that the real and imaginary data have been interleaved to keep each submatrix in contiguous memory. Each submatrix contains 2 rows of real and 2 rows of imaginary data.
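A layout-only sketch of this step in plain C: fft_row_1d below is a placeholder (here just a copy) for an SPE-optimized FFT; the point is the x3[r3][iq][c] output ordering, which interleaves real and imaginary rows per submatrix.

    /* x2[iq][r3][c]  ->  x3[r3][iq][c], with a placeholder row FFT. */
    enum { R3 = 2, C = 512, IQ = 2 };

    static void fft_row_1d(const float *re_in, const float *im_in,
                           float *re_out, float *im_out, int n)
    {
        /* placeholder: a real implementation computes a split-complex FFT */
        for (int i = 0; i < n; i++) { re_out[i] = re_in[i]; im_out[i] = im_in[i]; }
    }

    void fft_submatrix(float x2[IQ][R3][C], float x3[R3][IQ][C])
    {
        for (int r3 = 0; r3 < R3; r3++)
            /* whole rows are passed because the [c] dimension is dropped */
            fft_row_1d(x2[0][r3], x2[1][r3], x3[r3][0], x3[r3][1], C);
    }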

14. Create Buffer for Streaming Tiles to System Memory - Phase 1
Collect R2 submatrices into a larger buffer with R2*R3 (16) rows:
[rp]x4[r2,r3][iq][c] = [rp]x3[r3][iq][c](r2)
This process is completed r1 (4) times to process the full 64 rows.
There are now R2*R3 (16) rows in memory. [Figure: stream data into buffer.]

15. Transpose Tiles to Contiguous Memory
A tile of real data and a tile of imaginary data are extracted and transposed into contiguous memory:
[rp]x5[iq][c2,c3][r2,r3](cp,c1) = [rp]x4[r2,r3][iq][cp,c1,c2,c3]
This process is completed r1 (4) times to process the full 64 rows.
Now there is a stream of Cp*C1 (32) tiles. Each tile is IQ by R2*R3 by C2*C3 (2x16x16) data elements. [Figure: transpose data into a stream of tiles.]
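A plain-C sketch of extracting and transposing one (cp, c1) tile into contiguous memory; the nested loops stand in for whatever hand-tuned or generated kernel actually performs this on the SPE.

    /* x4[r2,r3][iq][c]  ->  one contiguous tile  x5[iq][c2,c3][r2,r3]. */
    enum { PROCS = 8, C1 = 4, C2 = 8, C3 = 2, R2 = 8, R3 = 2, IQ = 2, C = 512 };

    void transpose_tile(const float x4[R2 * R3][IQ][C],
                        float x5[IQ][C2 * C3][R2 * R3],
                        int cp, int c1)
    {
        int col0 = (cp * C1 + c1) * C2 * C3;              /* first column of tile */
        for (int iq = 0; iq < IQ; iq++)
            for (int cc = 0; cc < C2 * C3; cc++)          /* tile column index    */
                for (int rr = 0; rr < R2 * R3; rr++)      /* tile row index       */
                    x5[iq][cc][rr] = x4[rr][iq][col0 + cc];
    }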

16. Stream Tiles into System Memory - Phase 1
Stream tiles into a buffer in system memory. Each row contains all the tiles assigned to that processor.
[rp]x6[r1,cp,c1][iq][c2,c3][r2,r3] = [rp]x5[iq][c2,c3][r2,r3](r1,cp,c1)
The r1 iterations were created on the initial streaming of the r1,r2 submatrices.
Now there is a buffer of R1*Cp*C1 (128) tiles, each IQ by R2*R3 by C2*C3. [Figure: stream of tiles into a larger buffer in system memory, Tile 1 through Tile R1*Cp*C1-1.]

17. Stream Tiles into System Memory - Phase 2
Collect the buffers from each SPE into a full-sized buffer in system memory:
x7[rp,r1,cp,c1][iq][c2,c3][r2,r3] = [rp]x6[r1,cp,c1][iq][c2,c3][r2,r3]
The larger matrix is created by Gedae and the pointers are passed back to the box on the SPE that is DMAing the data into system memory.
The buffer is now Rp*R1*Cp*C1 (1024) tiles. Each tile is IQ by R2*R3 by C2*C3. [Figure: tiles collected from SPE 0 through SPE 7 into a larger buffer in system memory.]

18. Stream Tiles into Local Store
Extract tiles from system memory to create 16 full-sized columns (r index) in local store:
[cp]x8[iq][c2,c3][r2,r3](c1,rp,r1) = x7[rp,r1,cp,c1][iq][c2,c3][r2,r3]
All SPEs have access to the full buffer and extract data in a regular but scattered pattern.
The buffer in local store is now Rp*R1 (32) tiles. Each tile is IQ by C2*C3 by R2*R3 (2x16x16). This scheme is repeated C1 (4) times on each SPE. [Figure: tiles collected into local store from regular but scattered locations in system memory (SPE 0 through SPE 7).]
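The "regular but scattered" access pattern amounts to a strided tile gather. A plain-C sketch, with memcpy standing in for the DMA transfers and the buffer layout taken from the x7 indexing above:

    /* For its column group (cp, c1), each SPE pulls Procs*R1 (32) tiles out of
       the big x7 buffer in system memory.  memcpy stands in for DMA. */
    #include <string.h>

    enum { PROCS = 8, R1 = 4, R2 = 8, R3 = 2, C1 = 4, C2 = 8, C3 = 2, IQ = 2 };
    enum { TILE = IQ * C2 * C3 * R2 * R3 };       /* floats per tile (2x16x16) */

    /* x7 holds Procs*R1*Procs*C1 (1024) tiles, ordered [rp,r1,cp,c1].           */
    /* x8 receives the Procs*R1 (32) tiles belonging to SPE cp for iteration c1. */
    void gather_tiles(const float *x7, float *x8, int cp, int c1)
    {
        for (int rp = 0; rp < PROCS; rp++)
            for (int r1 = 0; r1 < R1; r1++) {
                int src_tile = ((rp * R1 + r1) * PROCS + cp) * C1 + c1;
                int dst_tile = rp * R1 + r1;
                memcpy(x8 + (size_t)dst_tile * TILE,
                       x7 + (size_t)src_tile * TILE,
                       TILE * sizeof(float));
            }
    }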

19. Stream Tiles into Buffer
Copy the stream of tiles into a full-length column (r index) buffer in local store:
[cp]x9[iq][c2,c3][rp,r1,r2,r3](c1) = [cp]x8[iq][c2,c3][r2,r3](c1,rp,r1)
[Figure: stream of tiles into a full-length column (r index) buffer with a tile copy.]

20. Process Columns with FFT
Stream 2 rows into an FFT function that places the real and imaginary data into separate buffers. This allows reconstructing a split-complex matrix in system memory.
[cp]x10[iq][c3][r](c2) = fft([cp]x9[iq][c2,c3]);
[Figure: stream 2 rows of real and imaginary data into the FFT function; place the data into separate buffers on output.]

21. Create Buffer for Streaming Tiles to System Memory
Collect C2 submatrices into a larger buffer with C2*C3 (16) rows:
[cp]x11[iq][c2,c3][r] = [cp]x10[iq][c3][r](c2)
This process is completed c1 (4) times to process the full 64 rows.
There are now C2*C3 (16) rows in memory. [Figure: stream data into buffer.]

22. Create Buffer for Streaming Tiles to System Memory
Collect C1 of these buffers into a larger buffer with C1*C2*C3 (128) rows:
[cp]x12[iq][c1,c2,c3][r] = [cp]x11[iq][c2,c3][r](c1)
There are now 2 buffers of C1*C2*C3 (128) rows of length R (512) in system memory. [Figure: stream submatrices into buffer.]

23. Distribution of Data to SPEs
Recompose the matrix by row from the 8 SPEs into system memory:
x13[iq][cp,c1,c2,c3][r] = [cp]x12[iq][c1,c2,c3][r]
Only the real plane is represented in the picture. Each SPE will produce C1*C2*C3 = 64 rows. [Figure: rows from SPE 0 through SPE 7 collected into system memory.]

24. Output Matrix
The output matrix is 512 by 512:
y[iq][c][r]
[Figure: 512x512 output matrix.]

25. Results
Two days to design the algorithm using the Idea language: 66 lines of code. One day to implement on the Cell: 20 kernels, each with 20-50 lines of code, plus 14 parameter expressions. This is already a large savings in the amount of code over difficult hand coding, and future automation will eliminate coding at the kernel level altogether.
Achieved 57* gflops out of the theoretical maximum of 62 gflops.
* Measured on a CAB board at 2.8 GHz and adjusted to 3.2 GHz. The measured algorithm is slightly different and the result is expected to increase to 60 gflops.

26. Gedae Status
12 months into an 18/20 month repackaging of Gedae. Introduced the algebraic features of the Idea language used to design this algorithm. Completed support for hierarchical memory. Embarking now on code overlays and large scale heterogeneous systems. Will reduce the memory footprint to enable a 1024 x 1024 2D FFT.