This, sadly, is what I've been trying to explain to people about how current Intel systems work.
They flip-flop between PL1 and PL2 and struggle to sustain multi-threaded performance unless you have a lot of cooling - you get the performance, but unless you meet certain criteria (unlocked power settings and very high-end cooling) you can't sustain it.
There are definitely workarounds. Your main choices are lower PL1 and PL2 limits, so the cooler has a chance to dissipate heat before it saturates completely, or an all-core overclock at more efficient settings.
(Ex: my 5800X runs at 4.4GHz (all-core) to 5.05GHz (1-2T) at 75C, but a static 4.6GHz runs at 50C)
Also, since Intel platforms often have AVX offsets, make sure you test with both AVX and non-AVX programs and that both sets of clocks are stable.
There's no use testing a system only with non-AVX workloads, since games use AVX instructions too - ignore that and you'll get either lower clocks and performance than you expect, or instability.
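To make the "both sets of clocks" point concrete, here is a rough Python sketch of the arithmetic. It assumes the usual convention that an AVX offset is subtracted from the core ratio and that BCLK is 100MHz; the function names and the 50x / offset 2 figures are made up for illustration, not taken from any particular board.

```python
def effective_clock_mhz(core_ratio, avx_offset=0, bclk_mhz=100.0):
    """Clock the CPU should run at once the AVX offset is applied.

    Assumes the offset is a number of ratio bins subtracted from the core ratio.
    """
    if avx_offset < 0 or avx_offset >= core_ratio:
        raise ValueError("avx_offset must be between 0 and core_ratio - 1")
    return (core_ratio - avx_offset) * bclk_mhz


def clocks_to_test(core_ratio, avx_offset, bclk_mhz=100.0):
    """Both operating points that need their own stability run."""
    return {
        "non_avx_mhz": effective_clock_mhz(core_ratio, 0, bclk_mhz),
        "avx_mhz": effective_clock_mhz(core_ratio, avx_offset, bclk_mhz),
    }


if __name__ == "__main__":
    # Hypothetical settings: 50x core ratio with an AVX offset of 2.
    print(clocks_to_test(50, 2))
```

With those example numbers you end up with two separate clocks to validate, one for non-AVX loads and a lower one for AVX loads.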
I like your method mixed with mine.
"If you don't know already, find out which of your cores are weakest (using prime95 at a lower voltage and clock speed, one that doesn't raise temps above what your real world loads reach). Then, run Prime95 on those cores, adding load to the remaining cores so that your weak cores running Prime95 reach the maximum temperature they do in real-world scenarios. Whatever voltage is required to prevent errors, add an additional 0.02 to 0.03V to ensure stability.
edit: To add heat, google "CPU burn in" and download the 20kB file from the first result. Each instance is one thread, and you can assign them to cores as needed using Task Manager."
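If you'd rather script the core assignment than tick boxes in Task Manager, the thing you need is an affinity bitmask, where bit N stands for logical CPU N. Below is a small Python sketch that only builds the mask. The CPU numbers in the example are hypothetical, and how logical CPUs map to physical cores (especially with SMT/Hyper-Threading) depends on your system, so check that first.

```python
def affinity_mask(logical_cpus):
    """Return an integer bitmask with one bit set per logical CPU index."""
    mask = 0
    for cpu in logical_cpus:
        if cpu < 0:
            raise ValueError("logical CPU indices must be non-negative")
        mask |= 1 << cpu
    return mask


def affinity_hex(logical_cpus):
    """Same mask as an uppercase hexadecimal string."""
    return format(affinity_mask(logical_cpus), "X")


if __name__ == "__main__":
    # Hypothetical example: pin one load to logical CPUs 4 and 5.
    print(affinity_hex([4, 5]))
```

The hex form is what tools that take a processor affinity mask generally expect, for example the `/affinity` switch of the Windows `start` command.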
Then, set a CPU power limit in the BIOS 3-5 watts above the power reached during the above stability test (the power dissipated in what is usually the most demanding program). Then, if something out of the ordinary happens, your system doesn't hang! (The extra 0.02-0.03V easily allows for the extra 3-5 watts of dissipation.)
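The margins above are just addition, but a quick Python sketch keeps the numbers straight. The function name and the 1.200V / 142W inputs are invented for the example; plug in whatever your own test run showed.

```python
def margin_ranges(min_stable_vcore, peak_test_watts):
    """Apply the +0.02 to +0.03V and +3 to +5W margins to measured values.

    min_stable_vcore: lowest voltage (V) that ran the weak-core test without errors.
    peak_test_watts: highest package power (W) seen during that test.
    """
    return {
        "vcore_v": (round(min_stable_vcore + 0.02, 3),
                    round(min_stable_vcore + 0.03, 3)),
        "power_limit_w": (peak_test_watts + 3, peak_test_watts + 5),
    }


if __name__ == "__main__":
    # Hypothetical measurements: 1.200V minimum stable, 142W peak during the test.
    print(margin_ranges(1.200, 142))
```

Anything inside the two ranges follows the method; picking the low or high end is a judgment call between temperatures and headroom.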
edit: Like a lot of people, I do think running P95 on all cores is a stupid way to ensure stability for systems that don't run loads anywhere near as demanding, but P95 applied as above should be the standard method.