AMD Rome review


Martin Cuma, CHPC

In this article we look at the performance of the AMD second generation EPYC CPU, code named Rome, released in August 2019, and compare it to the current Intel competitor, Cascade Lake Xeon, released in April 2019, along with the previous generations of the AMD and Intel CPUs.

The 2nd gen EPYC is a refresh of the older EPYC architecture introduced in 2017. While many of the CPU core specifics are similar or the same, there have been significant modifications to both the core and the whole CPU design which have fixed the deficiencies of the 1st gen EPYC and provide considerable speed and memory bandwidth improvements for technical computing.

The Rome chips, similar to the previous generation Naples, consist of several multi-core chiplets, as opposed to traditional monolithic CPU designs, as shown in Figure 1. The Rome design consists of 7 nm process CPU chiplets and a 14 nm process I/O die, which connects to all the chiplets and creates a more uniform core hierarchy than on Naples (Figure 2). Smaller chiplets are easier and cheaper to produce than a larger monolithic CPU.

Figure 1. Monolithic, Naples and Rome CPU layouts, from https://www.nextplatform.com/2019/08/07/amd-doubles-down-and-up-with-rome-epyc-server-chips/

The chiplets include eight CPU cores and are called Core Complex Dies (CCDs). The CCDs communicate with the I/O Die (IOD) via high speed Infinity Fabric links (see Figure 1). The IOD connects to DRAM, PCIe or other CCDs. Each CCD consists of two four-core Core Complexes (CCX), each of which has 16 MB of L3 cache.

There are two possible NUMA modes on this CPU, most likely selected through a BIOS setting. One treats all cores in each CPU as monolithic (single level NUMA, Figure 3a), while the other has an extra NUMA level at the CCD level (dual level NUMA, Figure 3b).
On a 32 core CPU (4 CCDs, 8 CCXs), the "numactl -H" output for the first case would then look like:

node distances:
node   0   1
  0:  10  32
  1:  32  10

while for the other case it would be:

node distances:
node   0   1   2   3   4   5   6   7
  0:  10  12  12  12  32  32  32  32
  1:  12  10  12  12  32  32  32  32
  2:  12  12  10  12  32  32  32  32
  3:  12  12  12  10  32  32  32  32
  4:  32  32  32  32  10  12  12  12
  5:  32  32  32  32  12  10  12  12
  6:  32  32  32  32  12  12  10  12
  7:  32  32  32  32  12  12  12  10

Figure 2. Naples and Rome NUMA layout, previously at https://www.anandtech.com/show/14694/amd-rome-epyc-2nd-gen/2

With respect to the CPU core architecture improvements, the most significant for our purposes is the inclusion of two AVX2 vector units, making the core capable of up to 16 double precision FLOPs per cycle, double what was available on Naples. Other architectural changes are well described at https://www.nextplatform.com/2019/08/15/a-deep-dive-into-amds-rome-epyc-architecture/. Other features of note include PCI-Express generation 4 support with up to 128 lanes, an eight-channel memory controller on each CPU socket, and DDR4 memory speed up to 3200 MHz.

Looking at other notable differences from the Intel Cascade Lake architecture, the AMD Rome includes 8 memory channels (compared to 6 in Cascade Lake) and faster memory (3200 MHz vs. 2933 MHz), which results in higher memory bandwidth. The Rome also integrates 128 PCIe Gen 4 lanes, while the Cascade Lake CPUs, like Skylake, have 48 Gen 3 lanes, which should be beneficial for connecting peripherals like GPUs or network cards. The Intel CPU has two AVX512 vector units per core, as compared to two AVX2 vector units in the AMD CPU. However, the Intel CPU scales down the core frequency considerably when more cores use the vector units, while AMD claims not to do so, or at least not as aggressively. Microway has a great article describing Cascade Lake and its frequency scaling at https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-cascade-lake-sp-intel-xeon-processor-scalable-family-cpus/. A similar article describing the AMD Rome is at https://www.microway.com/knowledge-center-articles/detailed-specifications-of-the-amd-epyc-rome-cpus/. Our testing below confirms that the AMD CPUs do not scale down the clock speed as much when running at full core utilization, though some frequency scaling is apparent for the higher core count CPU (7702 with 64 cores).

Figure 3a. NUMA layout of a 2x32 core (2x EPYC 7452) Rome server in the single NUMA level mode. The two CPUs are shown one above the other. Each L3 cache is shared by 4 cores, marking the CCX.
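The two "numactl -H" distance matrices quoted earlier have a simple block structure: 10 on the diagonal, 12 between NUMA nodes on the same socket, and 32 between sockets. The following is a minimal sketch that rebuilds both matrices from that rule. It does not query the hardware; the function names are made up for this illustration and the three distance values are simply copied from the output shown in this article.

```python
def numa_distances(sockets, nodes_per_socket, local=10, same_socket=12, remote=32):
    """Build a numactl-style distance matrix for identical sockets.

    The distance values are assumptions taken from the article's output,
    not values read from a real machine.
    """
    total = sockets * nodes_per_socket
    return [
        [
            local if i == j
            else same_socket if i // nodes_per_socket == j // nodes_per_socket
            else remote
            for j in range(total)
        ]
        for i in range(total)
    ]


def format_distances(matrix):
    """Format the matrix roughly the way `numactl -H` prints it."""
    header = "node " + " ".join(f"{j:3d}" for j in range(len(matrix)))
    rows = [
        f"{i:3d}: " + " ".join(f"{d:3d}" for d in row)
        for i, row in enumerate(matrix)
    ]
    return "\n".join(["node distances:", header] + rows)


# Single level NUMA: one NUMA node per socket.
single_level = numa_distances(sockets=2, nodes_per_socket=1)
# Dual level NUMA on 2x 32 core CPUs: one NUMA node per CCD, 4 CCDs per socket.
dual_level = numa_distances(sockets=2, nodes_per_socket=4)

print(format_distances(single_level))
print(format_distances(dual_level))
```

The same helper could be called with a different number of nodes per socket to sketch what the matrix would look like on a CPU with more CCDs, assuming the same three distance values apply there.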

Figure 3b. NUMA layout of a 2x32 core (2x EPYC 7452) Rome server in the dual NUMA level mode. Package is the CPU, NUMANode are the CCDs. Each L3 cache is shared by 4 cores, marking the CCX.

We got access to a Dell test cluster that included several different Rome CPUs on the PowerEdge C6525 platform. We focused on the 32 core, 2.3 GHz 7452 CPUs in a dual socket configuration, and on the 64 core, 2.0 GHz 7702 CPU, which was also in a dual socket configuration but where we looked only at single socket performance. The MSRP of the 7452 is $2025 and of the 7702 is $6450; however, a single socket node is supposed to cost considerably less than a dual socket node. Each AMD node had 256 GB of RAM. The Intel Cascade Lake nodes are the current standard nodes at CHPC: dual socket 20 core Xeon Gold 6230 running at 2.1 GHz, at a list price of $1894. In the comparisons, we also include older timings from the AMD Naples and Skylake Xeon Gold 6130 CPUs.

External benchmarks

Of the benchmark resources on the Internet, good ones include https://www.anandtech.com/show/14694/amd-rome-epyc-2nd-gen and https://www.phoronix.com/scan.php?page=article&item=amd-epyc-7642, but each only has a handful of HPC-like applications. Dell has published their initial benchmarks at https://www.dell.com/support/article/us/en/04/sln319015/amd-rome-is-it-for-real-architecture-and-initial-hpc-performance?lang=en. This document gives the best performance estimates for our purposes, and we worked with Dell engineers during our tests to match their performance in the tests that we both ran.

Raw and synthetic performance benchmarks

STREAM benchmark

The STREAM benchmark tests the bandwidth from the CPU to the main memory by performing four different operations on large sequential data arrays. We compiled STREAM using the Intel 2019.5 compiler on the Rome and Cascade Lake, and use older data for Skylake and Naples that was built with Intel 2017.4 and gcc 6.3.0, respectively. STREAM is thread parallelized using OpenMP, and we look at the memory throughput from one thread up to a number of threads equal to the number of physical cores. As all the machines have NUMA CPUs, we also look at the effect of thread locality to the CPU core. The chiplet design of the AMD CPUs allows for several different thread placements, as compared to two on the Intel monolithic CPUs: sequential (called compact by Intel OpenMP), where first all the cores on CPU 0 get filled, followed by CPU 1, and spread (called scatter by Intel OpenMP), where the threads get placed on the two CPU sockets in a round robin fashion.
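The compact and scatter placements named above are selected through the Intel OpenMP runtime environment variable KMP_AFFINITY. As a rough illustration, and not the launch scripts used for this article, the sketch below only builds the environment for one run. The helper name `stream_env` and the `./stream` binary mentioned afterwards are hypothetical; OMP_NUM_THREADS is the standard OpenMP thread count variable.

```python
import os

# Placement name used in this article -> KMP_AFFINITY value for Intel OpenMP.
PLACEMENTS = {
    "sequential": "granularity=fine,compact",
    "spread": "granularity=fine,scatter",
}


def stream_env(placement, threads, base_env=None):
    """Return a copy of the environment with thread count and placement set.

    base_env defaults to the current process environment; passing an empty
    dict is handy for inspecting only the variables added here.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_NUM_THREADS"] = str(threads)
    env["KMP_AFFINITY"] = PLACEMENTS[placement]
    return env


if __name__ == "__main__":
    # Show the settings for a fully subscribed 40 core Cascade Lake node.
    for name in PLACEMENTS:
        print(name, stream_env(name, threads=40, base_env={}))
```

A run could then be started with something like `subprocess.run(["./stream"], env=stream_env("spread", 40))`, assuming a STREAM binary of that name built with the Intel compiler.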
On the Rome CPU, we have looked at five different thread placements listed below, with the core placements corresponding to dual socket 7702 CPUs:

- sequential, core placement 0,1,2,3,…; Intel OpenMP options KMP_TOPOLOGY_METHOD=hwloc KMP_AFFINITY=verbose,granularity=fine,compact
- L3sequential, core placement 0,4,8,12,…, distributing threads across the 4-core shared L3 cache CCXs; Intel OpenMP options KMP_TOPOLOGY_METHOD=hwloc KMP_AFFINITY=verbose,granularity=fine,compact,1
- chipletsequential, core placement 0,16,32,48,…, distributing threads across the CCD chiplets; Intel OpenMP options KMP_TOPOLOGY_METHOD=hwloc KMP_AFFINITY=verbose,granularity=fine,compact,2
- socketspread, core placement 0,64,1,65,…, distributing threads across the sockets, but sequentially on each socket; Intel OpenMP options KMP_TOPOLOGY_METHOD=cpuinfo KMP_AFFINITY=verbose,granularity=fine,scatter
- socksequentchipspread, core placement 0,64,16,90,…, distributing threads across sockets and the CCD chiplets; Intel OpenMP options KMP_TOPOLOGY_METHOD=hwloc KMP_AFFINITY=verbose,granularity=fine,compact,3

The Intel OpenMP environment variables that control the thread placement are also listed above. More details on these options are at https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-thread-affinity-interface-linux-and-windows.

In Figure 4 we look at these five thread placements, along with using no thread placement at all (allowing the threads to migrate across all the cores on the dual socket machine, which is the OpenMP runtime default), for the Copy benchmark, which is one of the four that STREAM contains. All four show similar trends. From this figure, we can see that distributing the threads over the CCXs and chiplets yields the best memory bandwidth on the undersubscribed system. In contrast, no thread placement and sequential placement do not achieve the top bandwidth on the undersubscribed system. Interestingly, the socket spread (round robin) placement is not optimal either, because it packs the threads tightly on the chiplets rather than distributing them over the chiplets and utilizing each chiplet's memory controller. Dell reports the same observation for the Triad benchmark, achieving about 3% better bandwidth than we have observed.

[Figure 4 plot: "Effect of thread placement to the core on the dual socket 7702, Stream copy"; memory bandwidth vs. number of threads (1 to 128); series: sequential 0,1,2,3,…; L3sequential 0,4,8,12,…; chipletsequential 0,8,16,24,…; socketspread 0,64,1,65,…; socksequentialchipspread 0,64,8,72,…; none]
Figure 4. Stream Copy on the dual socket Rome 7702 with different thread to core mappings

Next, in Figure 5, we compare the STREAM Copy memory bandwidth between the different AMD and Intel CPUs. There are a few things to note. First, both the older and the new AMD CPUs have higher bandwidth than any of the Intel offerings. Second, the Cascade Lake has only a nominal bandwidth increase over the Skylake at full subscription (40 threads). Third, the Rome 7702's undersubscribed peak bandwidth is higher than that of the 7452, presumably due to more chiplets being spread over the memory controllers. Rome's theoretical peak bandwidth is 204.8 GB/s per socket, that is 409.6 GB/s per node, though we are only achieving about 330 GB/s on the 7702 and 290 GB/s on the 7452. Supposedly the NUMA clustering has some effect on this, but there seems to be a dependence on the CPU as well. The Cascade Lake maximum of ~190 GB/s is also lower than the theoretical 128 GB/s per socket, that is 256 GB/s per node, and also lower than a few other published numbers that went up to 220 GB/s (https://www.dell.com/support/article/us/en/04/sln316864/bios-characterization-for-hpc-with-intel-cascade-lake-processors). We have investigated this further and concluded that this difference is due to a drop in the STREAM bandwidth with increasing array size.
The Dell published benchmark used about 5 GB of RAM, while ours used about 96 GB. In Figure 6 we show the dependence of the STREAM bandwidth on the problem array size on four dual socket nodes, each with a different processor. Notice that all the shown processors, except for the Cascade Lake, have a flat bandwidth curve with increasing array size. High bandwidth values below ca. 100 MB are due to CPU caching.

Lastly, we simulate the single socket 7702 bandwidth based on the thread placements in the different pinnings we did and observe half the bandwidth of the 2 socket node, as expected, roughly equivalent to the Skylake bandwidth.

[Figure 5 plot: "Stream Copy"; memory bandwidth [MB/s] vs. thread count (1, 2, 4, 8, 16/14, 32/28/20, 64/40, 128); series: Naples, Rome 7452, Rome 7702, Rome 7702 1P, Broadwell, Skylake, Cascade Lake]
Figure 5. STREAM Copy maximum bandwidth per thread at optimal thread to core distribution

[Figure 6 plot: "STREAM Triad"; memory bandwidth [MB/s] vs. array size [MB] (1 to 1000000); series: Cascade Lake 6230, Rome 7452, Nehalem E5-2670, Naples 7601]
Figure 6. Dependence of STREAM Triad on the array size.
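The theoretical Rome bandwidth quoted in the STREAM discussion (204.8 GB/s per socket, 409.6 GB/s per node) follows from the memory configuration: 8 channels of DDR4-3200, each moving 8 bytes per transfer. The short sketch below reproduces that arithmetic and relates it to the approximate measured values from the text. The 8 byte channel width is the standard 64 bit DDR4 data bus, stated here as an assumption, and the function name is made up for this example.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical peak memory bandwidth of one socket in GB/s.

    Assumes a 64 bit (8 byte) wide DDR4 channel unless told otherwise.
    """
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000.0


rome_socket = peak_bandwidth_gbs(channels=8, mega_transfers_per_s=3200)
rome_node = 2 * rome_socket  # dual socket node

print(f"Rome peak: {rome_socket:.1f} GB/s per socket, {rome_node:.1f} GB/s per node")

# Approximate measured STREAM Copy maxima quoted in the text, in GB/s.
measured = {"2x 7702": 330.0, "2x 7452": 290.0}
for node, bandwidth in measured.items():
    print(f"{node}: about {bandwidth / rome_node:.0%} of the theoretical peak")
```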

[Figure 7 plot: "HPL score and cost per MFLOP"; series: Cost per MFLOP, TFLOPs; bars for Rome 2x7452, Rome 2x7702, Casc. Lk. 6230, Rome 7702, Casc. Lk. 6230 MKL HPL]
Figure 7. HPL score and cost per MFLOP

High Performance Linpack (HPL) benchmarks

HPL is a part of the HPCC suite detailed below, but we also ran it separately since we observed a lower than expected HPL value for the Cascade Lake in the HPCC benchmark. After communicating with Dell about their HPL score of about 2 TFLOPs, we nearly replicated this value by running the Intel optimized HPL binary that ships with the Intel MKL library, obtaining 1.862 TFLOPs running 4 MPI tasks with 10 threads each on the 6230's 40 cores. This is about 73% of the 2.56 TFLOPs theoretical peak. On the AMDs, the 7452s really shine, getting 2.43 TFLOPs running 64 tasks per node, which is 103% of the 2.355 TFLOPs theoretical peak (the peak is computed from the 2.3 GHz base clock, so boost clocks can push the result above 100%). Two 7702s get 3.355 TFLOPs running 32 tasks with 4 threads each, 82% of the 4.096 TFLOPs theoretical peak. Based on these numbers it seems that the clock is throttled more due to heat on the higher core count 7702. Similarly, a single 7702 achieved 1.633 TFLOPs, 80% of its peak.

Based on these observations, we conclude that the Intel CPU scales down its frequency due to heat the most, followed by the high core count AMD 7702. The AMD 7452 stays effectively cooled even when all the cores are in use. This supports AMD's unwritten claim that the CPU clock speed is not throttled significantly when using vectorization at high core count. The explanation for the slower 7702 performance may be twofold: more CPU clock speed throttling, as cooling the 64 core CPU must be more difficult than cooling the 32 core one, and limited memory bandwidth, since the 64 cores of the 7702 share the same memory bandwidth as the 32 cores of the 7452.
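The theoretical peaks quoted for the Rome nodes can be reproduced as cores x clock x FLOPs per cycle, with 16 double precision FLOPs per cycle per core from the two AVX2 units and the base clock of each CPU. The sketch below is only a worked version of that arithmetic using the HPL results from this section; the function and variable names are illustrative.

```python
# Double precision FLOPs per cycle per Rome core (two AVX2 units, per the article).
ROME_FLOPS_PER_CYCLE = 16


def peak_tflops(cores, base_ghz, flops_per_cycle=ROME_FLOPS_PER_CYCLE):
    """Theoretical peak in TFLOPs at the base clock (boost clocks are ignored)."""
    return cores * base_ghz * flops_per_cycle / 1000.0


# Node -> (theoretical peak in TFLOPs, measured HPL in TFLOPs from the text).
runs = {
    "2x 7452": (peak_tflops(cores=64, base_ghz=2.3), 2.43),
    "2x 7702": (peak_tflops(cores=128, base_ghz=2.0), 3.355),
    "1x 7702": (peak_tflops(cores=64, base_ghz=2.0), 1.633),
}

for node, (peak, measured) in runs.items():
    print(f"{node}: peak {peak:.3f} TFLOPs, HPL {measured:.3f} TFLOPs, "
          f"efficiency {measured / peak:.0%}")
```

Because the peak uses the base clock, an efficiency above 100%, as for the dual 7452 node, indicates that the cores ran above the base clock during the HPL run.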
The Cascade Lake continues the trend set by the previous Intel processor generations, where the CPU clock speed gets throttled significantly with the use of vectorization at high core count.

In Figure 7 we summarize the HPL scores and normalize them to cost per MFLOP, based on the list price of the CPU. Note that the price does not include other node infrastructure costs, which would likely favor the single socket AMD 7702P more. So, this chart needs to be taken with a grain of salt, or adjusted for the real node costs. In either case, we can see that the AMD 2x7452 comes out as the best, followed by the Intel 2x6230, followed by the AMD 7702P. The dual socket AMD 7702 is much more expensive, and is better compared to the high end Intel SKUs, which are overpriced with respect to performance as well.

High Performance Computing Challenge (HPCC) benchmark

The HPCC benchmark is a synthetic benchmark suite geared at assessing HPC performance from different angles. It consists of seven main benchmarks that stress various computer subsystems, such as raw performance, memory access and communication. For a detailed description of the benchmark see http://icl.cs.utk.edu/hpcc/. We use version 1.5.0.

On the latest AMD and Intel CPUs, we built HPCC with the Intel 2019.5 compiler and the corresponding MKL and Intel MPI. On the Intel platforms, we used flags -O3 -ansi-alias -ip -axCORE-AVX512,CORE-AVX2,AVX -restrict and on the AMD Rome flags -O3 -ansi-alias -ip -march=core-avx2 -restrict. For the older Skylake and Broadwell, we built HPCC 1.5.0 with the Intel 2017.4 compilers and the corresponding Intel MKL and MPI, using the same compiler flags as on the Cascade Lake. On the older Epyc (Naples), we used gcc 6.3.0 with BLIS and -O3 -fomit-frame-pointer -funroll-loops -march=native. Also of note is that we had to use the undocumented MKL environment variable MKL_DEBUG_CPU_TYPE=5, which turns on specific vectorization instructions in the MKL and results in about 20% speedup in HPL and other codes that use MKL BLAS.

CPU generation              Rome 64  Rome 2x32  Casc. Lk.  Naples 64    Skylake  Broadwell    Haswell   SandyBr.   Westmere
Year                           2019       2019       2019       2017       2017       2016       2014       2012       2010
Core count                     1x64       2x32       2x20       2x32       2x16       2x14       2x12        2x8        2x6
Frequency_GHz                     2        2.3        2.1        2.0        2.1        2.4        2.5        2.2        2.8
HPL_Tflops                     1.63       2.43       1.66       1.03       1.64       0.85       0.73       0.27       0.12
StarDGEMM_Gflops              27.80      40.91      48.36      17.57      54.04      31.98      31.83      17.08      10.46
SingleDGEMM_Gflops            49.20      51.11      60.11      18.48      56.09      41.41      41.72      20.30      10.71
PTRANS_GBs                    11.56      14.79      14.92      11.72      13.94      10.84       7.39       4.62       3.05
MPIRandomAccess_GUPs           0.23       0.34       0.15      0.092     0.0026     0.0037     0.0266     0.0171     0.0427
StarRandomAccess_GUPs          0.00       0.01       0.02      0.028     0.0397     0.0304     0.0256     0.0292     0.0196
SingleRandomAccess_GUPs        0.12       0.12       0.04      0.093     0.0787     0.0825     0.0778     0.0611     0.0366
StarSTREAM_Triad               1.62       2.84       3.98       2.90       4.55       3.26       2.55       3.42       2.48
SingleSTREAM_Triad            20.27      20.62      14.80      19.65      12.57      10.55      12.93      12.50      10.25
StarFFT_Gflops                 0.95       1.39       1.79       0.96       2.06       1.67       1.53       1.51       1.22
SingleFFT_Gflops               1.59       1.69       2.49       1.23       2.75       2.31       2.38       2.03       1.95
MPIFFT_Gflops                 24.97      37.79      28.79      20.38      29.88      11.93       8.53       7.90       4.64

Table 1. HPCC results, the higher the value the better. The best values shown in bold.

In Table 1 we show the results of select HPCC metrics for fully loaded nodes with Intel Xeon CPUs since 2010 and with the two generations of the AMD EPYC CPUs. The dual socket Rome 7452 node has taken the lead in the HPL, at a fairly impressive ratio over the Cascade Lake node. A single socket 7702 node has roughly the same performance as the Cascade Lake. Note that the Cascade Lake performance is obtained with our own binary, not the optimized one from Intel, running one task per CPU core, which is why we are getting a roughly 10% lower score.

HPL_Tflops               High Performance Linpack benchmark - the one that's used for Top500 - measures the floating point rate of execution for solving a linear system of equations.
StarDGEMM_Gflops         Parallel DGEMM - measures the floating point rate of execution of double precision real matrix-matrix multiplication.
SingleDGEMM_Gflops       Serial DGEMM - on single processor
PTRANS_GBs               Parallel Matrix Transpose - exercises the communications where pairs of processors communicate with each other simultaneously. It is a useful test of the total communications capacity of the network.
MPIRandomAccess_GUPs     MPI Parallel Random Access
StarRandomAccess_GUPs    UPC Parallel Random Access - measures the rate of integer random updates of memory (GUPS).
SingleRandomAccess_GUPs  Serial Random Access
StarSTREAM_Triad         Parallel STREAM - a simple synthetic benchmark program that measures sustainable memory bandwidth (in GB/s) and the corresponding computation rate for simple vector kernel.
SingleSTREAM_Triad       Serial STREAM
StarFFT_Gflops           Parallel FFT - measures the floating point rate of execution of double precision complex one-dimensional Discrete Fourier Transform (DFT).
SingleFFT_Gflops         Serial FFT
MPIFFT_Gflops            MPI FFT

Table 2. HPCC benchmarks explanations.

To visualize the improvement in floating point performance, in Figure 8 we show the High Performance Linpack (HPL) performance of the current and previous AMD and Intel CPU generations, which exemplifies the change in the floating point (FP) vectorization units. The 2010 Westmere CPU had the SSE4.2 vector instruction set, capable of 2 double precision operations (DPO) per cycle. This doubled to 4 DPO/cycle in the 2012 SandyBridge with the AVX instruction set. The 2014 Haswell's AVX2 added the Fused Multiply Add (FMA) instruction, which, along with the increase in core count and clock speed as compared to our benchmarked SandyBridge, more than doubled the floating point output. The Broadwell CPU was a process shrink of Haswell, so the extra performance came mainly from the increased core count. Going to Skylake, we see another doubling of FP performance with the AVX512 instruction set, whose vectors are 8 double precision values long.
The Cascade Lake performance improvement over Skylake is minimal, likely due to the similar memory bandwidth and CPU clock speed throttling. Again, this is with the HPL built from source using the Skylake/Cascade Lake compiler optimizations. Dell Labs published results on the same Cascade Lake processors reaching up to 2 TFLOPs (https://www.dell.com/support/article/us/en/04/sln316864/bios-characterization-for-hpc-with-intel-cascade-lake-processors?lang=en). We achieved 7.5% less with the Intel optimized HPL binary running in multi-threaded mode, which is still 10% more than what we show in Figure 8. Based on discussion with Dell, the 7.5% lower performance with the Intel optimized HPL binary is within the roughly 15% range of performance differences they have noticed during their tests. Also, the Skylake performs better than the Cascade Lake in a few tests. My guess here would be potential effects of the power governors in the system or the BIOS; on the Cascade Lake we saw about 8% HPL difference between them, and the HPCC runs used the less aggressive ones.

On the AMD side, the first generation Epyc had only one AVX2 unit capable of 8 FLOPs per cycle, and its 1.03 TFLOPs is close to the theoretical peak. Both Rome chips pulled ahead significantly, with the 2x7452 more than doubling the HPL throughput as compared to the first CPU generation.

[Figure 8 plot: HPL per node [TFlops] by CPU generation; bars for Rome 64, Epyc 64, Broadwell, SandyBr., Rome 2x32, Casc. Lk., Skylake, Haswell, Westmere]
Figure 8. Top HPL performance for the Epyc and select Intel CPU generations. Higher value is better.

The other HPCC benchmarks reveal further interesting points on the AMD vs. Intel performance. Single core dense linear algebra is a strong point of the Intel thanks to its wider vector unit, as evident from the DGEMM values. The same goes, somewhat surprisingly, for the FFT. Memory bandwidth is the strong point of the AMD, as seen from the STREAM numbers. The MPI benchmarks probably benefit the most from the higher AMD core count, so the AMD is better there. The single socket 7702 is not a winner anywhere, but its performance is within the range of the others.

NAS Parallel Benchmarks

NAS Parallel Benchmarks are a set of programs derived from computational fluid dynamics (CFD) applications. Some basic information about the benchmarks is at https://en.wikipedia.org/wiki/NAS_Parallel_Benchmarks. Each of these benchmarks can be run with different problem sizes. Class A is a small problem, Class B is medium size, Class C is a large problem, and Class D is a very large problem (needing about 12 GB of RAM). There are also even larger classes E and F. We have run Classes A-D and present results for Class C.

We compiled the codes with the Intel 2019 and 2017 compilers, using the "-O3 -ipo -axCORE-AVX512 -qopenmp" options on the Cascade Lake and Skylake, respectively, and the "-O3 -ipo -axCORE-AVX2 -qopenmp" options on the Haswell. On the AMD, for the older Naples chip we used gcc 6.3.0 with the "-O3 -fopenmp -mcmodel=medium -mavx2 -mfma4" flags. On the Rome, we used the Intel compiler 2019.5 with options "-O3 -ipo -march=core-avx2 -qopenmp".

[Figure 9a plot: Mops/sec for the ua, is and ep benchmarks; series: Rome 7452 1th, Skylake 1th, Rome 7702 1th, Naples 1th, Casc.L. 1th, Broadwell 1th]
Figure 9a. Single core (one thread) NAS UA, IS and EP benchmarks for size C

All the NAS benchmark plots compare the performance in Mops/sec or Mops/sec/thread. As we are comparing maximum performance on the whole multi-core machine, and also evaluating the SMP capabilities, below we look at the Mops/sec. The higher the Mops/sec count, the better. We present the benchmarks in four graphs, split by single thread and whole node performance, and grouped by similar values of Mops/sec for better comparison.

The NAS parallel benchmarks cover a wide variety of algorithms, and as such their performance varies both with the CPU generations and across the different CPU manufacturers. Benchmarks like the UA (Unstructured Adaptive) or MG (MultiGrid) do not vectorize as much, and therefore their single core performance stays similar across the CPU generations. Other benchmarks, such as the FT (Fast Fourier Transform), EP (Embarrassingly Parallel random numbers), or even IS (Integer Sort), improve significantly with newer CPU generations, benefiting either from increased memory bandwidth or from vectorization.

On the single core basis, the two AMD Rome CPUs are more or less comparable. They also beat the Cascade Lake CPU on all but 3 benchmarks (EP, CG, BT).

Moving to the whole node graphs, in most cases we can see the effects of the increasing core count. Comparing the AMD Rome to Intel Cascade Lake, the AMD is a winner in eight of the benchmarks while Intel wins two (BT, EP). The single Rome 7702 CPU is better than the dual CPU Cascade Lake node in 6 out of the 9 benchmarks. The 7702 takes a large performance hit in memory intense benchmarks like the FT or IS, while for the more computational ones like LU or SP the difference from the dual socket 7452 system is smaller.

[Figure 9b plot: Mops/sec for the ua, is and ep benchmarks; series: Rome 7452 64th, Naples 64th, Rome 7702 64th, Skylake 32th, Casc.L. 40th, Broadwell 28th]
Figure 9b. Whole node NAS UA, IS and EP benchmarks for size C

[Figure 10a plot: Mops/sec for the cg, sp, ft, lu, bt and mg benchmarks; series: Rome 7452 1th, Naples 1th, Rome 7702 1th, Skylake 1th, Casc.L. 1th, Broadwell 1th]
Figure 10a. Single core (one thread) NAS CG, FT, SP, LU, BT and MG benchmarks for size C

[Figure 10b plot: Mops/sec for the cg, sp, ft, lu, bt and mg benchmarks; series: Rome 7452 64th, Naples 64th, Rome 7702 64th, Skylake 32th, Casc.L. 40th, Broadwell 28th]
Figure 10b. Whole node NAS CG, FT, SP, LU, BT and MG benchmarks for size C

Real applications benchmarks

LAMMPS

LAMMPS is a popular molecular dynamics simulation program developed at Sandia National Laboratories. It is a good representative of many-body simulations that use internally coded computational kernels rather than relying on vendor accelerated libraries. We built the 31Mar17 version using the Intel 2019.5 or 2017 compilers, MPI and MKL (using MKL's FFTW wrappers), with optimization flags "-axCORE-AVX512,CORE-AVX2,AVX,SSE4.2 -O3 -prec-div -fp-model precise". On the AMD Naples, we used the same fat Intel-built binary as on the Skylake. On the AMD Rome, we used Intel 2019.5 with flags "-march=core-avx2 -ip -prec-div -fp-model precise". The rest of the flags were taken from the USER-INTEL package makefile.

We ran three LAMMPS benchmarks from http://lammps.sandia.gov/bench.html:

- LJ = atomic fluid, Lennard-Jones potential with 2.5 sigma cutoff (55 neighbors per atom), NVE integration
- Chain = bead-spring polymer melt of 100-mer chains, FENE bonds and LJ pairwise interactions with a 2^(1/6) sigma cutoff (5 neighbors per atom), NVE integration
- EAM = metallic solid, Cu EAM potential with 4.95 Angstrom cutoff (45 neighbors per atom), NVE integration

Each problem was scaled 2x in each dimension, resulting in 256,000 atoms, and was run for 1,000 time steps.

In Table 2 we show the benchmark results for the last two generations of the AMD and Intel CPUs, with the bold number being the whole node runtime in seconds, which is what a user would typically run. All runs used MPI tasks only, with a single OpenMP thread. LAMMPS benefits from high core count, and as such the Rome 7452 gives an almost 2x advantage over the Intel Cascade Lake node. Comparing the dual socket 7452 with the single socket 7702P, the performance hit is 12-15%.

One thing to keep in mind is that we built LAMMPS in a fairly standard way, without additional packages. We looked at the USER-OMP and KOKKOS packages, which provide thread based parallelism in LAMMPS on top of the default MPI parallelism, but we did not get better performance compared to the pure MPI runs. This is reasonable, as molecular dynamics codes generally are not as communication heavy, so the thread based overhead sticks out more.

Procs            1       2       4       8      16   32/20   64/40
NP           95.24   48.49   23.07   10.91    5.45    3.32    1.82
RM1 7452     80.50   36.55   17.91    8.83    4.64    2.38    1.30
RM2 7702     78.71   42.08   23.93   10.63    5.19    2.45    1.47
SKL          74.26   35.04   17.74    9.37    4.80    2.89
CL           71.21   33.23   17.38    9.02    4.65    3.92    2.32
RM1/CL        1.13    1.10    1.03    0.98    1.00    0.61    0.56
RM2/CL        1.11    1.27    1.38    1.18    1.11    0.62    0.64

Table 2a. LAMMPS chain benchmark performance (in seconds, lower is better) and speedup ratio of the Rome 7452 node with respect to the Cascade Lake node and the 7702P single socket node.

Procs            1       2       4       8      16   32/20   64/40
NP          437.37  223.54  113.86   58.13   29.35   17.33    9.00
RM1 7452    357.19  182.78   92.34   46.90   24.06   12.22    6.32
RM2 7702    354.38  181.91   95.47   49.37   25.56   12.57    7.25
SKL         305.00  155.34   80.93   43.73   22.75   13.87
CL          294.20  149.19   80.49   42.79   23.08   19.81   11.44
RM1/CL        1.21    1.23    1.15    1.10    1.04    0.62    0.55
RM2/CL        1.20    1.22    1.19    1.15    1.11    0.63    0.63

Table 2b. LAMMPS eam benchmark performance (in seconds, lower is better) and speedup ratio of the Rome 7452 node with respect to the Cascade Lake node and the 7702P single socket node.
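The RM1/CL and RM2/CL rows in the LAMMPS tables are ratios of runtimes at the same table column, so values below 1 mean that the Rome node finished faster. As a small worked example, the sketch below recomputes the RM1/CL row of the chain benchmark from the runtimes in Table 2a; the variable and function names are only illustrative.

```python
# Runtimes in seconds from Table 2a (chain benchmark), columns
# 1, 2, 4, 8, 16, 32/20 and 64/40 processes (Rome count / Cascade Lake count).
cascade_lake = [71.21, 33.23, 17.38, 9.02, 4.65, 3.92, 2.32]
rome_7452 = [80.50, 36.55, 17.91, 8.83, 4.64, 2.38, 1.30]


def runtime_ratios(candidate, reference):
    """Runtime of the candidate divided by the reference, column by column."""
    return [round(c / r, 2) for c, r in zip(candidate, reference)]


ratios = runtime_ratios(rome_7452, cascade_lake)
print("RM1/CL:", ratios)

# Whole node comparison: how many times faster the dual 7452 node is.
whole_node_speedup = round(cascade_lake[-1] / rome_7452[-1], 2)
print("whole node speedup of 2x 7452 over 2x 6230:", whole_node_speedup)
```

Up to 16 processes the ratio compares equal core counts, while the last two columns compare 32 vs. 20 and 64 vs. 40 cores, which is where the higher Rome core count shows.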
Procs            1       2       4       8      16   32/20   64/40
NP          166.38   85.96   44.05   22.22   11.03    6.48    3.36
RM1 7452    135.34   69.12   34.47   17.59    8.93    4.57    2.36
RM2 7702    134.81   69.91   37.43   19.11    9.78    4.84    2.82
SKL         117.24   59.08   31.01   16.76    8.66    5.22
CL          115.33   58.78   31.54   16.75    8.77    7.31    4.39
RM1/CL        1.17    1.18    1.09    1.05    1.02    0.62    0.54
RM2/CL        1.17    1.19    1.19    1.14    1.12    0.66    0.64

Table 2c. LAMMPS lj benchmark performance (in seconds, lower is better) and speedup ratio of the Rome 7452 node with respect to the Cascade Lake node and the 7702P single socket node.

VASP

VASP is a plane wave electronic structure program that is widely used in solid state physics and materials science. As with many quantum simulation codes, VASP uses dense linear algebra heavily

  16. through the BLAS – LAPACK – ScaLAPACK libraries. It thus provides a convenient benchmarking tool for vendor supplied accelerated libraries like the MKL. We have compiled VASP 5.4.4 with Intel 2017 or 2019 (for the Cascade Lake) compilers, MKL and MPI, and "-O2 -axCORE-AVX512,CORE-AVX2,AVX,SSE4.2" compiler flags on the Intel machines and with “-O2 -march=core-avx2” on the AMD machines. In all cases we started with the VASP supplied makefile.include.linux_intel make flags, and used MKL for all the external libraries, including FFTW. We present two benchmarks of semiconductor based systems, Si and SiO, the SiO being several times larger. The smallest system is slowly becoming less relevant as both the hardware and the software improve, so, in our explanations we focus on the larger problem. As with the HPCC, we include results we obtained on previous generation of processors in Table 3, though, beware that the older CPUs were run with older VASP version which was potentially less optimized. The results are runtime in seconds, the smaller the number the better. (Si 12 layer, 24 at., 16 kpts, 60 bnds) CPUs Westmere-EP 2.8 12c Sandybridge 2.2 16c Haswell 2.5 20c Broadwell 2.4 28c Skylake 2.1 32c Cascade Lake 2.1 40c Naples 2.0 64c Rome 7452 2.3 64c Rome 7702 2.0 64c Rome 7452 vs. CasL Rome 7702 vs. CasL 1 2 4 8/12 47.13 36.17 22.13 19.25 15.50 10.10 18.62 10.01 19.56 0.99 1.20 16 24 28/32 40/64 233.49 195.83 118.02 108.46 80.41 46.87 118.58 60.89 61.22 1.30 1.47 123.05 102.24 56.70 55.31 41.60 23.64 61.88 32.53 36.42 1.38 1.49 68.79 56.15 34.58 30.06 22.78 14.09 32.35 17.25 25.00 1.22 1.42 36.71 15.74 12.84 11.33 7.33 11.03 6.30 16.56 0.86 0.97 27.06 13.52 11.08 13.85 9.30 7.90 8.99 4.94 9.81 0.63 0.97 8.54 8.22 5.08 6.69 0.59 0.88 12.19 (Si192+O, 4 kpts, 484 bnds) CPUs Westmere-EP 2.8 12c Sandybridge 2.2 16c Haswell 2.5 20c Broadwell 2.4 28c Skylake 2.1 32c Cascade Lake 2.1 40c Naples 2.0 64c Rome 7452 2.3 64c Rome 7702 2.0 64c Rome 7452 vs. CasL Rome 7702 vs. 
CasL 1 2 4 8/12 175.22 128.79 76.69 55.61 45.29 44.37 75.16 36.78 104.09 0.83 2.35 16 24 28/32 40/64 999.36 771.53 424.72 395.01 278.25 266.72 549.13 277.34 275.89 1.04 1.03 514.66 396.33 187.93 163.62 144.49 148.63 269.80 147.55 176.02 0.99 1.18 330.20 215.07 116.83 91.65 75.63 78.75 138.39 73.21 128.42 0.93 1.63 120.68 57.79 41.63 32.17 30.85 42.80 26.34 58.39 0.85 1.89 41.52 34.36 26.25 35.09 27.92 27.26 35.23 22.08 23.80 0.81 0.87 23.76 33.55 20.94 21.11 0.88 0.89 38.54 Table 3. VASP performance in seconds (lower is better) With respect to the per core performance, the larger system benchmark runtime is comparable between the Rome and Cascade Lake CPUs, while for the smaller one the Cascade Lake runs faster. This suggests that more computation in the larger system helps the AMD processor – as VASP is heavy in dense linear algebra, provided by the MKL library. Looking at the whole system performance, both Rome CPUs get 11-12% advantage over the Cascade Lake – though note that even the larger system is

getting small for the 64 CPU cores to scale. We should note that we used the undocumented MKL_DEBUG_CPU_TYPE=5 environment variable to get better MKL performance; without it the performance was 10-20% worse. We also tried the multi-threaded mode on the dual socket 7452, but the performance using more than one thread per MPI task, filling up the node with tasks/threads, was slightly worse than using 64 single threaded MPI tasks.

Conclusions

The AMD Rome CPU promised to shake up the CPU industry, and it did. Two comparably priced Rome 7452 CPUs perform better than their Intel 6230 counterparts in many benchmarks, sometimes, as in the LAMMPS example, by almost 2x. The situation is not as clear-cut with the single socket 64 core Rome 7702P, which we tried to simulate by running on a single socket of a dual socket 7702 node. Some benchmarks like the HPL get hit by the lower memory bandwidth that the single socket solution provides, and likely also by more frequency throttling due to the poorer heat dissipation. Nevertheless, in both real applications we tested, the single AMD 7702 processor is faster than an Intel equipped machine with two 6230 CPUs.

Finally, the price point at which we can obtain machines with these CPUs will be an important factor as well. At the time of writing, CHPC can obtain roughly the same price for the single CPU AMD 7702P node with 256 GB of RAM as for the dual CPU Intel 6230 node with 192 GB of RAM, while the cost of the dual CPU AMD 7452 node with 256 GB of RAM is 50% higher. With this pricing in mind, the single CPU AMD 7702P node comes out as the best choice.