
First Signs of AMD Zen 3 "Vermeer" CPUs Surface, Ryzen 7 5800X Tested

Joined
Oct 18, 2019
Messages
13 (0.01/day)
Location
Germany
System Name Megaporto
Processor i7-8700
Motherboard MSI H310M Pro-D
Cooling Xilence I250PWM
Memory 2 x 8 GB Crucial DDR4-2400 MHz @ 2666 MHz / ID: CT8G4DFS824A.C8FE
Video Card(s) KFA2 RTX 2080 Super EX ("Galax" being named KFA2 in EU) @+70MHz Core / +1400 MHz mem
Storage SSD_1: ADATA SU800 256 GB / SSD_2: Samsung 860 QVO 1 TB / HDD_1: Toshiba DT01ACA100 1 TB
Display(s) BenQ EL2870U / 4K / 60 Hz / TN
Case Intertech Q2 Illuminator Blue (modified by Megaport: +branding / +red instead of blue fans)
Audio Device(s) Bose Companion Series III
Power Supply Corsair CX750M
Mouse Hama Mirano (Black)
Keyboard VicTsing Model PC116A
Benchmark Scores 3D Mark Time Spy: 10827 total / 11885 GPU / 7198 CPU Cinebench R20 (multi / single) : 3405 / 456
*RDNA 1 and Zen 3 naming scheme consumer confusion incoming*
But by the time those cpus come out RDNA 1 is gonna be mostly irrelevant anyway.

If that turns out to be true, an all-AMD truly high-end gaming rig might become a reality by the end of this year, for the first time in, like... decades.
 
Joined
Oct 10, 2018
Messages
943 (0.46/day)
Just waiting for an 8-core, 16-thread CPU that can boost to a minimum of 5 GHz on all cores and stay there on air cooling without overclocking. I would prefer it to be AMD, since they have the performance-per-clock advantage. Till then, I see no point in upgrading, at least for my main system.
 
Joined
May 31, 2017
Messages
877 (0.35/day)
Location
Home
System Name Blackbox
Processor AMD Ryzen 7 3700X
Motherboard Asus TUF B550-Plus WiFi
Cooling Scythe Fuma 2
Memory 2x8GB DDR4 G.Skill FlareX 3200Mhz CL16
Video Card(s) MSI RTX 3060 Ti Gaming Z
Storage Kingston KC3000 1TB + WD SN550 1TB + Samsung 860 QVO 1TB
Display(s) LG 27GP850-B
Case Lian Li O11 Air Mini
Audio Device(s) Logitech Z200
Power Supply Seasonic Focus+ Gold 750W
Mouse Logitech G305
Keyboard MasterKeys Pro S White (MX Brown)
Software Windows 10
Benchmark Scores It plays games.
I remember the debate, as no one uses 720 unless we are comparing low power CPUs in tablets.

So feel free to use a resolution that's unused for a comparison if you feel better about it. But it's like comparing which jet fighter is better at being a submarine. Or which sports car does best off-road. Or which network switch makes the best cricket bat.
If you remember the debate, then surely you remember the reason why tests at lower resolutions are relevant. And again, I said 1080p and 1440p.

Again, you addressed one point, that is meaningless. Any thoughts on core counts, memory latency, power consumption? Typically AMD gets you more overall performance for the dollar.
I addressed the only point relevant in the article. It's a gaming benchmark. At 4K. Which says very little of CPU performance and that was my point. It doesn't say anything about Zen 3's memory latency, power consumption or prices, so I have no opinion on those.
 
Joined
Feb 11, 2009
Messages
5,404 (0.97/day)
System Name Cyberline
Processor Intel Core i7 2600k -> 12600k
Motherboard Asus P8P67 LE Rev 3.0 -> Gigabyte Z690 Auros Elite DDR4
Cooling Tuniq Tower 120 -> Custom Watercoolingloop
Memory Corsair (4x2) 8gb 1600mhz -> Crucial (8x2) 16gb 3600mhz
Video Card(s) AMD RX480 -> RX7800XT
Storage Samsung 750 Evo 250gb SSD + WD 1tb x 2 + WD 2tb -> 2tb MVMe SSD
Display(s) Philips 32inch LPF5605H (television) -> Dell S3220DGF
Case antec 600 -> Thermaltake Tenor HTCP case
Audio Device(s) Focusrite 2i4 (USB)
Power Supply Seasonic 620watt 80+ Platinum
Mouse Elecom EX-G
Keyboard Rapoo V700
Software Windows 10 Pro 64bit
*RDNA 1 and Zen 3 naming scheme consumer confusion incoming*
But by the time those cpus come out RDNA 1 is gonna be mostly irrelevant anyway.

If that turns out to be true, an all-AMD truly high-end gaming rig might become a reality by the end of this year, for the first time in, like... decades.

Well, in fairness, an all-Intel or all-Nvidia truly high-end gaming rig has never been a reality.
 
Joined
Jan 27, 2015
Messages
1,649 (0.49/day)
System Name Legion
Processor i7-12700KF
Motherboard Asus Z690-Plus TUF Gaming WiFi D5
Cooling Arctic Liquid Freezer 2 240mm AIO
Memory PNY MAKO DDR5-6000 C36-36-36-76
Video Card(s) PowerColor Hellhound 6700 XT 12GB
Storage WD SN770 512GB m.2, Samsung 980 Pro m.2 2TB
Display(s) Acer K272HUL 1440p / 34" MSI MAG341CQ 3440x1440
Case Montech Air X
Power Supply Corsair CX750M
Mouse Logitech MX Anywhere 25
Keyboard Logitech MX Keys
Software Lots
The fact that the benchmark was run at 4K rather than 1440p or 1080p is a little suspicious. And the fact that while having much higher cpu frames, it was still marginally behind in actual framerate in 2 out of 3.

That - why run a game at 4k as a CPU test as that gets GPU limited - and also that Ashes was developed in partnership with AMD, and originally ran much better on R9 290s than on any of that generation's Nvidia cards.

Then you have that the 10900k only lost in synthetic 'cpu framerate', it won in 2 out of 3 on actual framerate (which is what you'd actually see)...

This really looks more like a planned marketing stunt than an objective benchmark to me. We will know in a few weeks either way.
 
Joined
Mar 21, 2016
Messages
2,198 (0.74/day)
monolithic die



Lower latency



None of it seems to do anything for Ryzen.
Yeah, it's an odd discrepancy at first glance, 16GB vs 32GB. It would seem that TUM_APISAK might've chosen that comparison to show AMD's performance with higher-density modules in play, not only to highlight the higher performance of the AMD chip but also to hint at memory latency playing a role in it. The highest-density RAM modules often require looser timings, which could be what is being represented here. If the performance advantage shown for the new Ryzen chip is coming from the larger RAM capacity, that would be the worst-case scenario, and it's a bit unlikely, but with only a limited number of benchmarks available for both chips paired with that GPU, it could perhaps be the case. This could simply be the closest comparison the leaker could make at present. Tough to say.

That - why run a game at 4k as a CPU test as that gets GPU limited - and also that Ashes was developed in partnership with AMD, and originally ran much better on R9 290s than on any of that generation's Nvidia cards.

Then you have that the 10900k only lost in synthetic 'cpu framerate', it won in 2 out of 3 on actual framerate (which is what you'd actually see)...

This really looks more like a planned marketing stunt than an objective benchmark to me. We will know in a few weeks either way.
My take on it is this: 4K is actually more CPU-computational than 1080p, but it's harder and less exciting to benchmark and account for. It would be interesting to set a 30FPS/45FPS/60FPS GPU limit, do some PhysX testing assigned to the CPU across 1080p up through 8K, and see what the scaling ends up like and whether it's linear or non-linear. I don't see how it could be linear; it seems it would vary and fluctuate a lot depending on the type of scene. It would also be rather insightful to see which things present more bottlenecks in the CPU design for PhysX. Seeing just how much multi-core performance impacts PhysX would be cool as well; that might show an upside to AMD's design if heavy use of PhysX can be exploited by developers. If there are advantages to the multi-core approach for stuff like PhysX, it just goes to show that AMD's approach should only continue to blossom further in those areas moving forward, especially since Intel has followed suit in order to try to keep pace with it. If anything, that's a clear indicator that Intel knows the vital importance of the multi-core design approach, and if they had simply stuck with a quad core they'd already be left in the dust. In fact, I want to see how Intel's chips perform limited to 4c/8t versus AMD's latest Ryzen chips. Let's just see where Intel would be if they didn't grudgingly glue sh*t together at 14nm+++++++++++++++ today because of AMD.
 
Joined
Apr 19, 2013
Messages
296 (0.07/day)
System Name Darkside
Processor R7 3700X
Motherboard Aorus Elite X570
Cooling Deepcool Gammaxx l240
Memory Thermaltake Toughram DDR4 3600MHz CL18
Video Card(s) Gigabyte RX Vega 64 Gaming OC
Storage ADATA & WD 500GB NVME PCIe 3.0, many WD Black 1-3TB HD
Display(s) Samsung C27JG5x
Case Thermaltake Level 20 XL
Audio Device(s) iFi xDSD / micro iTube2 / micro iCAN SE
Power Supply EVGA 750W G2
Mouse Corsair M65
Keyboard Corsair K70 LUX RGB
Benchmark Scores Not sure, don't care
Watching this intently.



Did a build for a friend recently, they wanted to go Intel and the 10850K OCing very easily @5.3 All core on 10 cores was a hell of an incentive to switch back to blue.

Does your friend pay their own power bill? Because at that clock speed the CPU is pulling well over 300W! And at that speed, what does it REALLY do for his gaming experience? Gaming @ 1440P/4K 60Hz, I saw little to no performance difference between my old i7 3770K and my new 3700X, despite 4x the benchmark scores.
 
Joined
Jun 13, 2012
Messages
1,328 (0.31/day)
Processor i7-13700k
Motherboard Asus Tuf Gaming z790-plus
Cooling Coolermaster Hyper 212 RGB
Memory Corsair Vengeance RGB 32GB DDR5 7000mhz
Video Card(s) Asus Dual Geforce RTX 4070 Super ( 2800mhz @ 1.0volt, ~60mhz overlock -.1volts. 180-190watt draw)
Storage 1x Samsung 980 Pro PCIe4 NVme, 2x Samsung 1tb 850evo SSD, 3x WD drives, 2 seagate
Display(s) Acer Predator XB273u 27inch IPS G-Sync 165hz
Power Supply Corsair RMx Series RM850x (OCZ Z series PSU retired after 13 years of service)
Mouse Logitech G502 hero
Keyboard Logitech G710+
Yeah, it's an odd discrepancy at first glance, 16GB vs 32GB. It would seem that TUM_APISAK might've chosen that comparison to show AMD's performance with higher-density modules in play, not only to highlight the higher performance of the AMD chip but also to hint at memory latency playing a role in it. The highest-density RAM modules often require looser timings, which could be what is being represented here. If the performance advantage shown for the new Ryzen chip is coming from the larger RAM capacity, that would be the worst-case scenario, and it's a bit unlikely, but with only a limited number of benchmarks available for both chips paired with that GPU, it could perhaps be the case. This could simply be the closest comparison the leaker could make at present. Tough to say.
That assumes they wouldn't use the most expensive RAM for their side and the cheapest brand for the other. AMD has pulled shenanigans with their benchmark releases in the past, so I would say this isn't outside the realm of possibility. The benchmark doesn't tell us what timings were used or what MHz the RAM is running at.
 
Joined
Mar 21, 2016
Messages
2,198 (0.74/day)
That assumes they wouldn't use the most expensive RAM for their side and the cheapest brand for the other. AMD has pulled shenanigans with their benchmark releases in the past, so I would say this isn't outside the realm of possibility. The benchmark doesn't tell us what timings were used or what MHz the RAM is running at.
It's an unofficial benchmark comparison, so it really doesn't matter at this point, and pricing for both could change at any point between now and launch. I get what you're alluding to, and yeah, obviously memory latency and density can skew perceptions, and AMD has pulled shenanigans, as have Intel and Nvidia. It's a common industry trend; they all do it. Wait until things are verified and the dust settles. I'm sure I'll be satisfied with Zen 3, to be honest; it certainly can't be any worse than Zen 2, which itself isn't bad.
 
Deleted member 185088

Guest
Just because a piece of software runs better on one CPU doesn't mean it's optimized for it; it could be that the hardware just handles the workload better due to resource balancing and advantages of that architecture, advantages which usually are hard or impossible to exploit directly from software.
You can target the strengths of one architecture and the program will run faster on it.
This guy made different workloads and ran them on a Phenom and an 8th-gen i7; even though the Phenom is so old, it's still faster in some:
I find it hard to believe that game engines don't do that at least to some extent.
 
Joined
Jan 27, 2015
Messages
1,649 (0.49/day)
You can target the strengths of one architecture and the program will run faster on it.
This guy made different workloads and ran them on a Phenom and an 8th-gen i7; even though the Phenom is so old, it's still faster in some:
I find it hard to believe that game engines don't do that at least to some extent.

Yep, and it's also possible to do the reverse: design hardware to run specific instructions or even a specific sequence of instructions very quickly. You could target your CPU to a use case where you have multiple threads doing the exact same thing to different parts of a large data set, where said threads do not need to interact with each other's data much.

For example, Cinebench.
 
Joined
Jun 13, 2012
Messages
1,328 (0.31/day)
Yep, and it's also possible to do the reverse: design hardware to run specific instructions or even a specific sequence of instructions very quickly. You could target your CPU to a use case where you have multiple threads doing the exact same thing to different parts of a large data set, where said threads do not need to interact with each other's data much.

For example, Cinebench.
The game they used had direct AMD funding for a lot of it, and when you look at player charts that game only gets 60-70 players on average, so a game no one plays is really not a good metric to use. As the other guy said, you can code things for a certain CPU and get great results. Apple used to do that same thing back when they used PowerPC processors, to make them look better than x86 PCs.
 
Joined
Apr 30, 2020
Messages
855 (0.59/day)
System Name S.L.I + RTX research rig
Processor Ryzen 7 5800X 3D.
Motherboard MSI MEG ACE X570
Cooling Corsair H150i Cappellx
Memory Corsair Vengeance pro RGB 3200mhz 16Gbs
Video Card(s) 2x Dell RTX 2080 Ti in S.L.I
Storage Western digital Sata 6.0 SDD 500gb + fanxiang S660 4TB PCIe 4.0 NVMe M.2
Display(s) HP X24i
Case Corsair 7000D Airflow
Power Supply EVGA G+1600watts
Mouse Corsair Scimitar
Keyboard Cosair K55 Pro RGB
Joined
Jan 27, 2015
Messages
1,649 (0.49/day)
The game they used had direct AMD funding for a lot of it, and when you look at player charts that game only gets 60-70 players on average, so a game no one plays is really not a good metric to use. As the other guy said, you can code things for a certain CPU and get great results. Apple used to do that same thing back when they used PowerPC processors, to make them look better than x86 PCs.

I know, I agree 100%. For people who know the history of Ashes (I was one of the pre-release buyers) it is one of the most suspect benchmarks. What was particularly embarrassing for AMD in regards to Ashes was how despite their partnership in creating the game and the use of the AMD Vulkan API, when Pascal (10xx series) came out they got obliterated in Ashes anyway.

Looking beyond the surface and clickbait article titles of this "leak" - if Zen 3 is so good, why is an AMD co-sponsored title being used at 4k for pre-release hype and still losing in actual FPS 2/3 of the time? And why are both recent leaks - one on 5700U a week ago and now this one on 5800X - for that *same* AMD sponsored title which very few play regularly? Why not use something a bit more mainstream at settings that don't go GPU limited? Hmmm.....
 
Joined
Mar 21, 2016
Messages
2,198 (0.74/day)
Parallel single threading seems entirely plausible: phase the clock skew peaks and dips on two chips, and synchronize the oscillation by switching between one and the other. In theory you should get a 100% increase in performance with two chips like that, but clock skew frequency oscillation is always in constant motion, so you move from peaks to dips; with the switching in mind to maximize both, you end up at 50% in the best-case scenario, though the synchronizing and sequencing might not be 100% perfect, so it could be closer to 48%. I don't know if they can execute it perfectly in practice, but in theory it's definitely within the scope of possibilities. You can actually mimic that with a pair of music sequencers; it's functionally possible.

I mentioned the concept in the Intel bigLITTLE TPU thread not that far back: in theory you can manipulate clock skew or duty cycles in a clever manner to get more performance, in a similar fashion to what MOS Technology did with the SID chip, where arpeggios simulated playing chords with polyphony. It was a clever hardware trick at the time. It seems far-fetched and somewhat unimaginable to actually be applied, but innovation always is; you have to think outside the box or you'll always be stuck in a box.


This is a quadruple LFO; what is allegedly being done is a twin LFO. If you look at the intersection points, that's half a duty cycle of rising and falling voltages/frequencies. If you look at the blue and green or the yellow and purple, they intersect perfectly. What's being done is switching at the intersection, so you've got two peaks closer together, and the base of the mountain, so to speak, isn't as far down. That's assuming this is in fact being done and put into practice by AMD. I see it within the oscilloscope of possibilities for certain. That's basically what DDR memory did in practice. The big question is whether they can pull it off within the dynamic complexity of software. Then again, why can't they!!? I can't see why they can't divert it like a railroad track at that intersection point. That nets you a roughly 50% performance gain. With 4 chips, the valley dips would be reduced more and the peaks would happen more routinely, and you'd end up with 100% more performance. I think that's what DDR5 is supposed to do on the data rate, actually, hence the phrase quad data rate.
[Attached image: 1601437374709.png]


Thinking about it further, I really don't see a problem with the I/O die managing that type of load switching quickly in real time, and the data would already be present in the CPU memory; it's not like it gets flushed instantly. Yeah, maybe it could become a bit of a materialized reality. If not now, certainly later. I have to think AMD will incorporate an I/O die for the GPU soon as well if they want to pursue multi-chip GPUs.
 
Joined
Jul 4, 2018
Messages
245 (0.12/day)
The game they used had direct AMD funding for a lot of it, and when you look at player charts that game only gets 60-70 players on average, so a game no one plays is really not a good metric to use. As the other guy said, you can code things for a certain CPU and get great results. Apple used to do that same thing back when they used PowerPC processors, to make them look better than x86 PCs.
Yes but if you compare the 5800X with the "old" 3800X, it is still a big improvement...
Crazy 4K Batch   Ryzen 7 5800X   Ryzen 7 3800X   Core i9-10900K
Normal           167 fps         125 fps         136 fps
Medium           135 fps         111 fps         119 fps
Heavy            110 fps         87 fps          96 fps
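
To put a number on "big improvement", here is a minimal sketch of the arithmetic using only the figures in the table above. The function names `uplift_percent` and `print_uplifts` are made up for illustration, and as pointed out elsewhere in the thread these are the benchmark's "CPU framerate" values, not delivered FPS. Worked by hand, the 5800X comes out roughly 34% / 22% / 26% ahead of the 3800X and roughly 23% / 13% / 15% ahead of the 10900K across Normal / Medium / Heavy.

```cpp
#include <cassert>
#include <cstdio>

// Percentage by which `newer` exceeds `older` (both in frames per second).
double uplift_percent(double newer, double older) {
    assert(older > 0.0);  // a zero baseline would make the ratio meaningless
    return (newer / older - 1.0) * 100.0;
}

// Prints the 5800X uplift over the 3800X and the 10900K for each batch,
// using the "CPU framerate" figures quoted in the table.
void print_uplifts() {
    const char*  batch[]   = {"Normal", "Medium", "Heavy"};
    const double r5800x[]  = {167.0, 135.0, 110.0};
    const double r3800x[]  = {125.0, 111.0, 87.0};
    const double i10900k[] = {136.0, 119.0, 96.0};
    for (int i = 0; i < 3; ++i) {
        std::printf("%-6s  vs 3800X: %+5.1f%%   vs 10900K: %+5.1f%%\n",
                    batch[i],
                    uplift_percent(r5800x[i], r3800x[i]),
                    uplift_percent(r5800x[i], i10900k[i]));
    }
}
```

Calling `print_uplifts()` from `main` prints one line per batch.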
 
Joined
Jan 27, 2015
Messages
1,649 (0.49/day)
Yes but if you compare the 5800X with the "old" 3800X, it is still a big improvement...
Crazy 4K Batch   Ryzen 7 5800X   Ryzen 7 3800X   Core i9-10900K
Normal           167 fps         125 fps         136 fps
Medium           135 fps         111 fps         119 fps
Heavy            110 fps         87 fps          96 fps


That should be "CPU Framerate" not "FPS".

If this were a car, what you are doing would be like calculating 0-60 time based on engine HP and car weight, while ignoring *actual* 0-60 time. No one does that. In real FPS Ryzen 3800X loses all 3 and 5800X loses 2 out of 3.
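
A rough way to see why "CPU framerate" and delivered FPS can disagree is the usual simplified bottleneck model: each frame needs both the CPU and the GPU to finish their share, so the slower side sets the pace. The sketch below assumes that model; `delivered_fps` is a hypothetical name, and the GPU figure in the usage note is invented purely for illustration, not taken from the leak.

```cpp
#include <algorithm>
#include <cassert>

// Simplified bottleneck model (an assumption, not how the benchmark computes
// its numbers): the delivered frame rate is capped by whichever of the CPU
// or the GPU can prepare fewer frames per second.
double delivered_fps(double cpu_fps, double gpu_fps) {
    assert(cpu_fps >= 0.0 && gpu_fps >= 0.0);
    return std::min(cpu_fps, gpu_fps);
}
```

Under this model, if the GPU can only manage a made-up 70 fps at 4K, `delivered_fps(167.0, 70.0)` and `delivered_fps(125.0, 70.0)` both come out at 70, so a large gap in CPU framerate never shows up on screen. Real engines overlap CPU and GPU work, so treat this as a first approximation only.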
 
Joined
Jun 13, 2012
Messages
1,328 (0.31/day)
That should be "CPU Framerate" not "FPS".

If this were a car, what you are doing would be like calculating 0-60 time based on engine HP and car weight, while ignoring *actual* 0-60 time. No one does that. In real FPS Ryzen 3800X loses all 3 and 5800X loses 2 out of 3.
Even if all those numbers say AMD is faster, when you look at "avg (all batches)" Intel still wins the CPU frame rate and has a 5900 score vs AMD's 5800, IN a benchmark that is known to favor AMD. So to me those numbers in general mean NOTHING. They need to get a benchmark that isn't slanted instead of a game that is pretty much a glorified tech demo for their hardware.
 
Joined
Apr 30, 2020
Messages
855 (0.59/day)
That should be "CPU Framerate" not "FPS".

If this were a car, what you are doing would be like calculating 0-60 time based on engine HP and car weight, while ignoring *actual* 0-60 time. No one does that. In real FPS Ryzen 3800X loses all 3 and 5800X loses 2 out of 3.

They already do that for cars when they build them. It's called estimated 0-60 times, and they do it with computer simulations.

Hell, some cars are so fast they don't even do 0-60 mph anymore; they do 0-100 mph.

Even if all those numbers say AMD is faster, when you look at "avg (all batches)" Intel still wins the CPU frame rate and has a 5900 score vs AMD's 5800, IN a benchmark that is known to favor AMD. So to me those numbers in general mean NOTHING. They need to get a benchmark that isn't slanted instead of a game that is pretty much a glorified tech demo for their hardware.

How is it AMD glorified if intel is winning ?

Where do you see 5900x ?, this 5800X 8 core 16 thread vs 10 core 20 thread.

The game is supposed to be really good at using multiple threads; it even shows the Threadripper 3960X is quite good on it.
 
Joined
Jun 10, 2014
Messages
2,902 (0.80/day)
Processor AMD Ryzen 9 5900X ||| Intel Core i7-3930K
Motherboard ASUS ProArt B550-CREATOR ||| Asus P9X79 WS
Cooling Noctua NH-U14S ||| Be Quiet Pure Rock
Memory Crucial 2 x 16 GB 3200 MHz ||| Corsair 8 x 8 GB 1333 MHz
Video Card(s) MSI GTX 1060 3GB ||| MSI GTX 680 4GB
Storage Samsung 970 PRO 512 GB + 1 TB ||| Intel 545s 512 GB + 256 GB
Display(s) Asus ROG Swift PG278QR 27" ||| Eizo EV2416W 24"
Case Fractal Design Define 7 XL x 2
Audio Device(s) Cambridge Audio DacMagic Plus
Power Supply Seasonic Focus PX-850 x 2
Mouse Razer Abyssus
Keyboard CM Storm QuickFire XT
Software Ubuntu
You can target the strengths of one architecture and the program will run faster on it.
This guy made different workloads and ran them on a Phenom and an 8th-gen i7; even though the Phenom is so old, it's still faster in some:
I find it hard to believe that game engines don't do that at least to some extent.
Sure, even single instructions can be slightly faster or slower on various architectures. In my tests, I've seen some cases where Haswell is slower than Sandy Bridge, but in most cases it's faster. The problem here is that this is a benchmark of a single operation in a loop; it is a synthetic test case which will exaggerate the real-world difference. The reason why he runs the loop 1.000.000.000.000 times is to get a measurable difference. Also, it's not like these operations are different alternatives to solve the same problem. It's not unlikely that you can find older architectures which can do certain simple operations like this faster, while modern architectures are optimized for saturating several execution ports and doing a mix of various types of operations. This is why such benchmarks can be very misleading.

When doing real optimization of code, it's common to benchmark whole algorithms or larger pieces of code to see the real world difference of different approaches. It's very rare that you'll find a larger piece of code that performs much better on Skylake and a competing alternative which performs much better on let's say Zen 2. Any difference that you'll find for single instructions will be less important than the overall improvements of the architecture. And it's not like there will be an "Intel optimization", Intel has changed the resource balancing for every new architecture, so has AMD.

Interestingly the sample code in that video scales poorly with many cores, but should be able to scale nearly linearly if the work queue is implemented smarter.

Parallel single threading seems entirely plausible phase the clock skew peaks and dips on two chips and synchronize oscillation switching between one and the other. <snip>
Instruction level parallelism is already heavily used, there is no need to spread the ALUs, FPUs, etc. across several cores, the distance would make a synchronization nightmare. We should expect future architectures to continue to scale their superscalar abilities. But I don't doubt that someone will find a clever way to utilize "idle transistors" in some of these by manipulating clock cycles etc.

The problem with superscalar scaling is keeping execution units fed. Both Intel and AMD currently have four integer pipelines. Integer pipelines are cheap (both in transistors and power usage), so why not double or quadruple them? Because they would struggle to utilize them properly. Both of them have been increasing instruction windows with every generation to try to exploit more parallelism, and Intel's next-gen Sapphire Rapids/Golden Cove is allegedly featuring a massive 800-entry instruction window (Skylake has 224, Sunny Cove 352 for comparison). And even with these massive CPU front-ends, execution units are generally under-utilized due to branch mispredictions and cache misses. Sooner or later the ISA needs to improve to help the CPU, which should be theoretically possible, as the compiler has much more context than is passed on through the x86 ISA, as well as eliminating more branching.
 
Joined
Apr 24, 2020
Messages
2,563 (1.75/day)
Sooner or later the ISA needs to improve to help the CPU, which should be theoretically possible, as the compiler has much more context than is passed on through the x86 ISA, as well as eliminating more branching.

I'm not sure how much a compiler can help:

Code:
if(blah()){
    foo();
} else {
    bar();
}

The above is the easy case. There's lots of pattern matching and heuristics that help the pipelines figure out if foo() needs to be shoved into the pipelines, or if bar() needs to be shoved into the pipelines (while calculating blah() in parallel).

Now consider the following instead:

Code:
for(int i=0; i<array.size(); i++){
    array[i]->virtualFunctionCall();
}

You simply can't "branch predict" the virtualFunctionCall() much better than what we're doing today. Today, there are ~4 or 5 histories stored into the Branch Target Buffer (BTB), so the most common 3 or 4 classes will have their virtualFunctionCall() successfully branch-predicted without much issue. There are also 3 levels of branch predictor pattern-matchers running in parallel, giving the CPU three different branch targets (L1 branch predictor is fastest but least accurate. L3 branch predictor is most accurate but almost the slowest: only slightly faster than a mispredicted branch).

This demonstrates the superiority of runtime information (if there's only 2 or 3 classes in the array[], the CPU will branch predict the virtualFunctionCall() pretty well). The compiler cannot make any assumptions about the contents of array.
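
For anyone who wants to poke at this, here is a minimal self-contained version of that loop. The class names (`Unit`, `Tank`, `Plane`) and the helpers `make_units` and `update_all` are made up for illustration. The point is only that the target of `u->update()` depends on what was put into the container at runtime, which the compiler generally cannot see.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical class hierarchy; only the virtual call matters here.
struct Unit {
    virtual ~Unit() = default;
    virtual int update() const = 0;
};
struct Tank : Unit {
    int update() const override { return 1; }
};
struct Plane : Unit {
    int update() const override { return 2; }
};

// Builds a container that alternates between the two concrete classes.
std::vector<std::unique_ptr<Unit>> make_units(std::size_t n) {
    std::vector<std::unique_ptr<Unit>> units;
    units.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        if (i % 2 == 0) {
            units.push_back(std::make_unique<Tank>());
        } else {
            units.push_back(std::make_unique<Plane>());
        }
    }
    return units;
}

// Each iteration is an indirect call through the vtable; which function it
// lands in is decided by the dynamic type of the element, i.e. at runtime.
int update_all(const std::vector<std::unique_ptr<Unit>>& units) {
    int total = 0;
    for (const auto& u : units) {
        total += u->update();
    }
    return total;
}
```

With only two classes in the mix, as here, this is the friendly case for the predictor described above; the more distinct classes and the more irregular their order, the harder it gets.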

---------


By the way: most "small branches" are compiled into CMOV sequences on x86, no branch at all.
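
As a hedged illustration of what a "small branch" looks like: a simple select such as the one below is the kind of code optimizing x86 compilers commonly turn into a compare plus CMOV rather than a jump. Whether that actually happens depends on the compiler, the optimization level and the surrounding code, so check the generated assembly rather than taking it on faith. The function name `clamp_to_limit` is made up.

```cpp
// Returns `value`, but never more than `limit`.
// Both results are already available, so the compiler is free to pick one
// with a conditional move instead of branching (not guaranteed).
int clamp_to_limit(int value, int limit) {
    return value > limit ? limit : value;
}
```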

--------------

The only things being done grossly different seem to be the GPU architectures, which favor no branch prediction at all, and instead just focus on wider-and-wider SMT to fill their pipelines (and non-uniform branches are very, very inefficient because of thread divergence. Uniform branches are efficient on both CPUs and GPUs, because CPUs will branch-predict a uniform branch while GPUs will not have any divergence). Throughput vs Latency strikes again: GPUs can optimize throughput but CPUs must optimize latency to be competitive.
 
Joined
Jun 10, 2014
Messages
2,902 (0.80/day)
You simply can't "branch predict" the virtualFunctionCall() much better than what we're doing today.
Of course not, you will never be able to do that, that's not what I meant.
I was thinking of branching logic inside a single scope, like a lot of ifs in a loop. Compilers already turn some of these into branchless alternatives, but I'm sure there is more potential here, especially if the ISA could express dependencies so the CPU could do things out of order more efficiently and hopefully some day limit the stalls in the CPU. As you know, with ever more superscalar CPUs, the relative cost of a cache miss or branch misprediction is growing.
Ideally code should be free of unnecessary branching, and there are a lot of clever tricks with and without AVX, which I believe we have discussed previously.
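
As a rough sketch of the "ifs in a loop" case, here is a branchy loop next to a hand-written branchless variant. The function names and the threshold example are assumptions made up for illustration.

```cpp
#include <cstdint>
#include <vector>

// Branchy version: one data-dependent branch per element.
std::int64_t sumAboveBranchy(const std::vector<std::int32_t>& data,
                             std::int32_t threshold) {
    std::int64_t sum = 0;
    for (std::int32_t x : data) {
        if (x > threshold) {
            sum += x;
        }
    }
    return sum;
}

// Branchless version: the comparison becomes a mask that is either
// all zero bits (0) or all one bits (-1), and the AND keeps or drops x.
std::int64_t sumAboveBranchless(const std::vector<std::int32_t>& data,
                                std::int32_t threshold) {
    std::int64_t sum = 0;
    for (std::int32_t x : data) {
        const std::int32_t mask = -static_cast<std::int32_t>(x > threshold);
        sum += x & mask;
    }
    return sum;
}
```

A modern compiler may well turn the first version into the second (or vectorize it) on its own, which is the point about compilers already doing some of this.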

But about your virtual function calls: if your critical path is filled with virtual function calls and multiple levels of inheritance, you're pretty much screwed performance-wise; no compiler will be able to untangle this at compile time. And in most cases (at least how most programmers use OOP), these function calls can't be statically analysed, inlined or resolved (devirtualized) at compile time.
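
For contrast, a small sketch of a case where the compiler can resolve the call and one where it generally can't. Shape, Triangle and both functions are hypothetical names for illustration.

```cpp
// Hypothetical hierarchy, invented for this illustration.
struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const = 0;
};
struct Triangle final : Shape {
    int sides() const override { return 3; }
};

// Dynamic type unknown at compile time: generally an indirect call
// through the vtable, which cannot be inlined.
int sidesDynamic(const Shape& s) { return s.sides(); }

// Static type is a final class: the compiler is permitted to call
// Triangle::sides directly and may inline it.
int sidesStatic(const Triangle& t) { return t.sides(); }
```

With deep inheritance and objects reached through base pointers, most calls look like sidesDynamic, which is the situation described above.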
 
Joined
Jun 13, 2012
Messages
1,328 (0.31/day)
Processor i7-13700k
Motherboard Asus Tuf Gaming z790-plus
Cooling Coolermaster Hyper 212 RGB
Memory Corsair Vengeance RGB 32GB DDR5 7000mhz
Video Card(s) Asus Dual Geforce RTX 4070 Super ( 2800mhz @ 1.0volt, ~60mhz overlock -.1volts. 180-190watt draw)
Storage 1x Samsung 980 Pro PCIe4 NVme, 2x Samsung 1tb 850evo SSD, 3x WD drives, 2 seagate
Display(s) Acer Predator XB273u 27inch IPS G-Sync 165hz
Power Supply Corsair RMx Series RM850x (OCZ Z series PSU retired after 13 years of service)
Mouse Logitech G502 hero
Keyboard Logitech G710+
How is it AMD-glorified if Intel is winning?
Look up the history of the game; it was funded by AMD. That means it will overperform on AMD hardware compared with other games that aren't coded for one side.
Where do you see a 5900X? This is a 5800X (8 cores, 16 threads) vs. a 10-core, 20-thread CPU.
Read what I said; I never said 5900X. Go back to the OP images, where the two CPUs are shown on the right side with a summary. There are two numbers labeled Score: the Intel CPU scored 5900 points and the AMD CPU scored 5800. How could AMD win with higher fps but a lower score?
 
Joined
Apr 30, 2020
Messages
855 (0.59/day)
System Name S.L.I + RTX research rig
Processor Ryzen 7 5800X 3D.
Motherboard MSI MEG ACE X570
Cooling Corsair H150i Cappellx
Memory Corsair Vengeance pro RGB 3200mhz 16Gbs
Video Card(s) 2x Dell RTX 2080 Ti in S.L.I
Storage Western digital Sata 6.0 SDD 500gb + fanxiang S660 4TB PCIe 4.0 NVMe M.2
Display(s) HP X24i
Case Corsair 7000D Airflow
Power Supply EVGA G+1600watts
Mouse Corsair Scimitar
Keyboard Cosair K55 Pro RGB
@arbiter Oh, I missed that, because everyone was comparing CPU frame rates.

@efikkan I kept hearing him talk about switching in that video. I remember something about that being why AMD's multithreading always ended up feeling more responsive than Intel's. It was something about hitting Alt+Tab in Windows while gaming; it just seems to be quicker at odd stuff like that.

@dragontammer5877 There are some benches that show there is some bottleneck with Zen 2. Everyone says it's the Infinity Fabric. The best way to get around the Infinity Fabric bottleneck would be to add another link, if it's only one link, because sometimes you get that lowly 3300X landing in between things like the 3900X and 3950X. We know that is usually because it's a single CCX. Then again, if the 3900X is ahead, that would put it down to it having a larger cache-to-core ratio.
 
Joined
Mar 21, 2016
Messages
2,198 (0.74/day)
Sure, individual instructions can be slightly faster or slower on various architectures. In my tests, I've seen some cases where Haswell is slower than Sandy Bridge, but in most cases it's faster. The problem here is that this is a benchmark of a single operation in a loop: a synthetic test case which will exaggerate the real-world difference. The reason why he runs the loop 1.000.000.000.000 times is to get a measurable difference. Also, it's not like these operations are different alternatives to solve the same problem. It's not unlikely that you can find older architectures which can do certain simple operations like this faster, while modern architectures are optimized for saturating several execution ports and doing a mix of various types of operations. This is why such benchmarks can be very misleading.

When doing real optimization of code, it's common to benchmark whole algorithms or larger pieces of code to see the real-world difference between approaches. It's very rare that you'll find a larger piece of code that performs much better on Skylake and a competing alternative which performs much better on, let's say, Zen 2. Any difference that you'll find for single instructions will be less important than the overall improvements of the architecture. And it's not like there will be an "Intel optimization"; Intel has changed the resource balancing for every new architecture, and so has AMD.

Interestingly the sample code in that video scales poorly with many cores, but should be able to scale nearly linearly if the work queue is implemented smarter.


Instruction level parallelism is already heavily used, there is no need to spread the ALUs, FPUs, etc. across several cores, the distance would make a synchronization nightmare. We should expect future architectures to continue to scale their superscalar abilities. But I don't doubt that someone will find a clever way to utilize "idle transistors" in some of these by manipulating clock cycles etc.

The problem with superscalar scaling is keeping execution units fed. Both Intel and AMD currently have four integer pipelines. Integer pipelines are cheap (both in transistors and power usage), so why not double or quadruple them? Because they would struggle to utilize them properly. Both of them have been increasing instruction windows with every generation to try to exploit more parallelism, and Intel's next-gen Sapphire Rapids/Golden Cove is allegedly featuring a massive 800-entry instruction window (Skylake has 224 and Sunny Cove 352, for comparison). And even with these massive CPU front-ends, execution units are generally under-utilized due to branch mispredictions and cache misses. Sooner or later the ISA needs to improve to help the CPU, which should be theoretically possible, as the compiler has much more context than is passed on through the x86 ISA, and it could also eliminate more branching.
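
A small sketch of what "keeping execution units fed" means at the source level: the same sum written as one dependency chain and as four independent chains. The function names are made up, and unsigned integers are used so both versions give identical results.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One accumulator: every add depends on the previous one, so the loop
// forms a single dependency chain regardless of how many ALUs exist.
std::uint64_t sumSingleChain(const std::vector<std::uint64_t>& v) {
    std::uint64_t sum = 0;
    for (std::uint64_t x : v) {
        sum += x;
    }
    return sum;
}

// Four accumulators: the four adds in each iteration are independent,
// so an out-of-order core can issue them to different integer pipelines.
std::uint64_t sumFourChains(const std::vector<std::uint64_t>& v) {
    std::uint64_t a = 0, b = 0, c = 0, d = 0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a += v[i];
        b += v[i + 1];
        c += v[i + 2];
        d += v[i + 3];
    }
    for (; i < n; ++i) {  // leftover elements when n is not a multiple of 4
        a += v[i];
    }
    return a + b + c + d;
}
```

Compilers often do this kind of unrolling or vectorization themselves for integer code, so whether the second version is actually faster depends on the compiler, the flags and the CPU.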
Couldn't AMD take chip dies and use the I/O die to modulate them, much like system memory does for double data rate or quadruple data rate, to speed up single-thread performance? They'd each retain their own cache, so that itself is a perk of modulating between them in a synchronized way, controlled through the I/O die, to complete a single-thread task load. For all intents and purposes the CPU would behave as if it's a single faster chip. It could basically fill the L1 cache on one die, then swap to the next die, and the same with the L2 and L3 caches. In fact, they could be synchronized much like the numerous latency timings. On top of that, if you need multi-thread performance, it could have some type of first-served access priority, possibly based on condition criteria. It could be a bit like the Windows setting for foreground/background tasks, with time slices between single-thread and multi-threaded performance that the I/O die manages and takes advantage of when it really needs the multi-threaded performance.

The cache misses definitely are harsh when they happen, but wouldn't automatically cycling between the individual L1/L2/L3 caches in different chip dies through the I/O die get around that? Cycle between the ones available, basically. Perhaps they'd only do it with the larger L2/L3 caches, though; maybe it doesn't make enough practical sense with the L1 cache being so small, given switch times and such. Perhaps in a future design at some level or another, I don't know.

Something else: with the I/O die doing modulation switching between cores or dies, at the core level in particular, they could base it on polling the chips. Whichever one can Precision Boost the highest, select that one for single-thread performance, then poll again after a set period, select whichever core gave the best results, and keep repeating that approach. Basically, no matter what, it could always try to select the highest boost speed to optimize single-thread performance. Perhaps it does that between cores and dies as well, so if one gets a little hot, let it cool off while making use of the coolest die, though switching between those might be less frequent.
 