
AMD Ryzen 7040 Series Phoenix APUs Surprisingly Performant with AVX-512 Workloads

T0@st

News Editor
Intel decided to drop the relatively new AVX-512 instruction set from its laptop/mobile platforms when it was discovered that it would not work in conjunction with their E-core designs. Alder Lake was the last generation to (semi) support the instruction set: the P-cores could run it, but only with the E-cores disabled via BIOS settings. Intel chose to fuse off AVX-512 support in production circa early 2022, while AMD went on to integrate AVX-512 into its Zen 4 CPU architecture. The Ryzen 7040 series is the only current generation mobile platform that offers AVX-512 support. Phoronix decided to benchmark a Ryzen 7 7840U against older Intel i7-1165G7 (Tiger Lake) and i7-1065G7 (Ice Lake) SoCs in AVX-512-based workloads.

Team Red's debut foray into AVX-512 was surprisingly performant according to Phoronix's test results: the Ryzen 7 7840U outperformed the i7-1165G7 by 46%, and the older i7-1065G7 by an impressive 63%. The Ryzen 7 APU also saw the largest gain from AVX-512 itself, running 54% faster with AVX-512 enabled than with it disabled. In comparison Phoronix found that: "the i7-1165G7 Tiger Lake impact came in at 34% with these AVX-512-heavy benchmarks or 35% with the i7-1065G7 Ice Lake SoC for that generation where AVX-512 on Intel laptops became common."




Phoronix concluded: "Overall the AVX-512 usage across the AMD Zen 4 product spectrum has been great. The efficient AVX-512 usage on the mobile/laptop processors is great for those developers wanting to work and test code from their device, if wanting to use any AI / deep learning software for edge computing or related use-cases, or just enjoying other AVX-512 optimized software from CPU-based renderers to other creator software packages. Those wishing to go through this round of data I collected for Phoenix / Tiger Lake / Ice Lake can find it here with all the individual per-test metrics."
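For anyone wanting to check whether their own laptop exposes AVX-512 before testing code on it, here is a minimal sketch. It assumes a Linux system and the usual `/proc/cpuinfo` layout with a `flags` line; the function name `avx512_flags` is made up for illustration and has nothing to do with how Phoronix ran its tests.

```python
# Sketch: list the AVX-512 feature flags a Linux system reports.
# Assumes the Linux /proc/cpuinfo format; avx512_flags is a hypothetical helper.
from pathlib import Path


def avx512_flags(cpuinfo_text: str) -> list[str]:
    """Return the sorted AVX-512 flag names from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Flag names such as avx512f or avx512vl all share this prefix.
            return sorted(f for f in value.split() if f.startswith("avx512"))
    return []


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    found = avx512_flags(text)
    print("AVX-512 flags:", ", ".join(found) if found else "none reported")
```

An empty result only means the operating system is not reporting the flags; on other platforms a different detection method would be needed.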

 
Do the Intel processors use dual-pumped 256-bit AVX-512? I know the desktop Zen 4 uses split-consecutive 256-bit, I assume that Phoenix does as well. Or is it mostly clock speed advantage giving the benefit?
 
Zen 4 and Tiger Lake have similar throughput for AVX-512. It's the greater number of cores coupled with much better energy efficiency that leads to the win. The numbers look even better when you notice that the 7840U draws little more than half the power of the i7-1165G7: 15.88 W vs 28.73 W.

Type of Operation                     Zen 4 IPC   Tiger Lake IPC   Cascade Lake IPC
256-bit FMA                           1.90        1.99             1.94
512-bit FMA                           1.00        0.94             1.82
512-bit Vector Integer Add            1.78        1.89             1.94
1:1 Mixed 256-bit and 512-bit FMA     1.34        0.94             1.82
2:1 Mixed 256-bit and 512-bit FMA     1.50        0.94             1.82
Execution throughput for various operations (Chips and Cheese)
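To put the power numbers next to the performance lead, here is a rough back-of-the-envelope sketch. It assumes the 46% lead from the article and the two wattage figures in this post describe the same benchmark runs, which the thread does not confirm; the helper name is made up for illustration.

```python
# Figures quoted in this thread; combining them is an assumption.
PERF_LEAD_7840U = 1.46   # 7840U vs i7-1165G7 (46% faster, per the article)
POWER_7840U_W = 15.88    # power draw quoted in this post
POWER_1165G7_W = 28.73   # power draw quoted in this post


def perf_per_watt_ratio(perf_ratio, power_a_w, power_b_w):
    """Perf-per-watt of chip A relative to chip B, given A's perf ratio over B."""
    return perf_ratio * (power_b_w / power_a_w)


power_fraction = POWER_7840U_W / POWER_1165G7_W
efficiency = perf_per_watt_ratio(PERF_LEAD_7840U, POWER_7840U_W, POWER_1165G7_W)

print(f"7840U power as a fraction of the 1165G7: {power_fraction:.2f}")
print(f"7840U perf-per-watt advantage: {efficiency:.2f}x")
```

Under those assumptions the 7840U comes out at about 55% of the power and roughly 2.6 times the performance per watt.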
 
Intel's 13th gen (and 12th gen, if I'm not mistaken) has no AVX-512 at all, since Intel took it away:

[Attachment: 1689621507790.png]


This is why if one is into certain emulation this time around, it would be best to go with AMD CPUs.
 
Yeah, Raptor Lake never had it and only early Alder Lake samples could enable it.

I was mostly curious about the implementation as AMD was trying to save space wherever possible, and I was curious about the performance delta. Most Intel desktop processors with AVX-512 had a full-width implementation.
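As a rough way to see the implementation difference in numbers, the Chips and Cheese IPC figures quoted earlier in the thread can be converted to peak FMA throughput per cycle. This is only a sketch: it assumes FP32 elements and counts each FMA lane as two floating-point operations, neither of which the quoted table states.

```python
# FMA instructions-per-cycle figures quoted earlier in the thread
# (Chips and Cheese). FP32 lanes are an assumption for illustration.
IPC = {
    "Zen 4":        {256: 1.90, 512: 1.00},
    "Tiger Lake":   {256: 1.99, 512: 0.94},
    "Cascade Lake": {256: 1.94, 512: 1.82},
}


def fp32_flops_per_cycle(vector_bits, fma_ipc):
    """Peak FP32 FLOPs per cycle: lanes x 2 (multiply + add) x FMA IPC."""
    lanes = vector_bits // 32
    return lanes * 2 * fma_ipc


for chip, by_width in IPC.items():
    f256 = fp32_flops_per_cycle(256, by_width[256])
    f512 = fp32_flops_per_cycle(512, by_width[512])
    print(f"{chip:<13} 256-bit: {f256:5.1f}  512-bit: {f512:5.1f}  "
          f"512/256 ratio: {f512 / f256:.2f}")
```

With these inputs, Zen 4 and Tiger Lake get about the same peak FMA throughput at either width, while Cascade Lake nearly doubles at 512-bit, which fits the full-width description above.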
 
When are we going to acknowledge that the 7000 series APUs are all that and a stick of gum too? I am just watching the ETA Prime video on the newest handheld: it comes with a 120 Hz 1080p screen, gaming is super sweet, and it has USB4 ports as well. I can't wait for these to come to desktop, if they can spare any.
 
Zen 4 doesn't need to lower TDP/clocks to run AVX-512; that's the main advantage.
 
Tiger Lake also solved that for Intel, but it was their 4th attempt.
 
There is one reliable explanation for why Zen 4 gets a solid boost from AVX-512: the x86 architecture itself. x86 is fossil trash.
x86 decoders cannot saturate the execution units with operations, so using SIMD with longer vector registers gives a boost because fewer instructions need to be decoded.
 
Even though that's a very jaundiced take on x86, you're right about one thing. The idea that longer vectors are more energy efficient is correct, because the energy associated with instruction fetch and decode is halved for the same amount of work.
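The "halved" part is easy to sanity-check with a quick count. This sketch assumes FP32 elements and one vector instruction per full register of data, ignoring loop overhead, loads and stores; the function name is made up for illustration.

```python
import math


def vector_instructions(n_elements, vector_bits, element_bits=32):
    """Vector instructions needed to touch n_elements once each."""
    lanes = vector_bits // element_bits
    return math.ceil(n_elements / lanes)


n = 1_000_000  # arbitrary example size
i256 = vector_instructions(n, 256)
i512 = vector_instructions(n, 512)

print(f"256-bit: {i256} instructions")
print(f"512-bit: {i512} instructions ({i512 / i256:.0%} of the 256-bit count)")
```

Half as many instructions go through fetch and decode for the same number of elements, which is where the energy argument comes from.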
 
Well glued together it seems.
 