Update Nov 19: We posted a follow-up to this article.
In this article, we share a custom power plan for Windows that should significantly improve your 3rd generation Ryzen processor's boosting behavior. It lets the processor leverage favored cores even without the yet-to-be-released Windows 10 19H2, yields boost frequencies beyond even what AGESA 1.0.0.3ABBA achieves, and lowers the latency with which the processor responds to workloads.
AMD's 3rd generation Ryzen processors are the most advanced desktop processors in the market today, and even without any tweaks or fixes the fastest mainstream desktop chips you can buy. The 3rd generation Ryzen processors come with an advanced and elaborate on-die power-management solution that can interact with the operating system far more frequently than older generation processors, to figure out application load and accordingly adjust the power and clock speeds of its cores.
I have written other articles for TechPowerUp in the past, related to Ryzen memory architecture and optimization. Today, I bring you the "1usmus Custom Power Plan" for 3rd generation Ryzen processors. It modifies how CPPC (short for Collaborative Processor Performance Control), the mechanism through which the processor and Windows talk to each other about performance demands, is configured. The result is what I believe to be Precision Boost the way AMD intended: snappy and responsive, but with the ability to tell legitimate workloads from background noise. This power plan should especially benefit users of Ryzen 9 series chips, such as the Ryzen 9 3900X and upcoming Ryzen 9 3950X, and 3rd generation Ryzen Threadripper processors on the sTRX4 socket.
I'd also like to thank Oleg Kasumov and @Kromaatikse for their help with these discoveries.
Technical Background
Unlike benchmarking applications, which spawn a bunch of identical threads running identical code on various pieces of data, modern games are very heterogeneous. Each thread runs its own code that is completely different from other threads and works with data in varying quantities, generating loads that vary between threads. Data produced by one thread is often consumed by another, which introduces delays, and that thread might in turn pass its results on to yet another waiting thread. The concept of a "thread pool" also exists, where each worker thread picks up any job, of any type, working on any data, as long as it is ready to run. This means the flow of data is downright chaotic, which generates a lot of inter-CCX traffic if some threads are on one CCX and some on the other.
This behavior is further amplified by modern graphics APIs such as DirectX 12 and Vulkan, which encourage submission of rendering commands in a multi-threaded manner. Perhaps you've noticed how some titles see a performance penalty on Ryzen (compared to Intel) when the newer Vulkan or DX12 API is used. Windows loves to balance the CPU load across multiple CPU cores, moving threads from busy cores to idle ones. This is normal, expected behavior for a modern SMP-aware process scheduler, but Windows is actually pretty dumb about it.
Windows considers a core as "busy" even if there is only a single thread using it and moves that same thread to an idle core if one is available! Furthermore, the Windows process scheduler makes no distinction whatsoever between physical and virtual cores, nor between CCXes with their separate caches. In comparatively recent versions of Windows (at least since Windows 7), this tendency towards migration is tamed by a "core parking" system. If a core is parked, the process scheduler doesn't migrate threads to it, allowing it to go into a deep idle state to save power. Additionally, the core-parking algorithm is responsible for keeping the second virtual core of each HT/SMT-capable physical core shut down unless needed, which maximizes performance-per-thread in light multi-threading scenarios.
Just to clarify: the Windows scheduler is not SMT aware; only the Windows core-parking algorithm is SMT aware. Why does this matter? Because in High Performance mode, the core-parking system is disabled. Every single core is unparked and therefore, the process scheduler merrily migrates threads across every single physical and virtual core on the system (unless all cores are busy anyway with, for example, a multi-threaded productivity workload). That means even a single-threaded workload ends up moving between CCXes, or even CCDs, and has to drag all the data that it's working on behind it, roughly every 40 milliseconds on average. In a game, multiply that by the number of effective threads the game runs. Not only that, but threads end up sharing a physical core much more often.
Linux handles this rather better: it actively prefers to keep threads on the same core for as long as there are no scheduling conflicts on that core. So, a single-threaded workload on Linux will usually stay on the same core for several seconds at a time, if not longer. This not only avoids the overhead of migrating the thread, but also avoids cache misses and inter-CCX traffic that would result from such a migration. This behavior is not Ryzen-specific, but has been standard on all SMP/SMT/CMT machines running Linux for several years.
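To make the cost of this migration concrete, here is a minimal Python sketch that counts how often a single thread changes physical core and how often it crosses a CCX boundary. It is only an illustration: the trace is made up, the function names are hypothetical, and it assumes that SMT siblings have adjacent logical CPU numbers (0 and 1 share a core, 2 and 3 share the next) and that each CCX has three cores, as on the Ryzen 9 3900X.

```python
from typing import List, Tuple


def ccx_of(logical_cpu: int, cores_per_ccx: int = 3, smt_threads: int = 2) -> int:
    """Map a logical CPU index to its CCX index.

    Assumption: SMT siblings are numbered adjacently and CCXes are
    numbered in order of physical core index.
    """
    physical_core = logical_cpu // smt_threads
    return physical_core // cores_per_ccx


def count_migrations(trace: List[int], cores_per_ccx: int = 3,
                     smt_threads: int = 2) -> Tuple[int, int]:
    """Return (core_migrations, cross_ccx_migrations) for one thread.

    `trace` lists the logical CPU the thread ran on in each time slice.
    """
    core_moves = 0
    ccx_moves = 0
    for prev, cur in zip(trace, trace[1:]):
        if prev // smt_threads != cur // smt_threads:
            core_moves += 1
        if ccx_of(prev, cores_per_ccx, smt_threads) != ccx_of(cur, cores_per_ccx, smt_threads):
            ccx_moves += 1
    return core_moves, ccx_moves


def ccx_affinity_mask(ccx: int, cores_per_ccx: int = 3, smt_threads: int = 2) -> int:
    """Bitmask selecting every logical CPU of one CCX (bit i = logical CPU i)."""
    width = cores_per_ccx * smt_threads
    return ((1 << width) - 1) << (ccx * width)


# Made-up trace of one thread bouncing around the chip.
trace = [0, 2, 7, 6, 13, 13, 1]
core_moves, ccx_moves = count_migrations(trace)
```

Every cross-CCX move in such a trace means the thread leaves its cached data behind in another CCX's cache. A mask like the one from `ccx_affinity_mask` is the kind of value one would use to pin a process to a single CCX by hand.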
In the near future, Microsoft will release the Windows 10 19H2 update (Windows 10 1909), which gives the OS scheduler the ability to prioritize threads. I tested a pre-release build of this version and didn't notice significant improvements. Quite often, the scheduler used a higher priority for background processes. I think you can imagine what will happen if Windows gives precedence to such a process, and not your currently running game.
My approach to addressing this deficiency in the Windows scheduler is to use a customized power profile that provides better guidance to the scheduler on how to distribute loads among cores. This should put load on the best cores (which clock higher) and result in higher and smoother FPS because workloads will no longer bounce between cores.
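The idea of steering load to the best cores can be sketched in a few lines of Python. This is not how Windows or the power plan implements it; it only shows the selection logic under the assumption that each core exposes some numeric performance ranking where higher is better. The ranking values and the function name are hypothetical.

```python
from typing import Dict, List


def pick_favored_cores(perf_rank: Dict[int, int], n_threads: int) -> List[int]:
    """Return the n_threads best cores, highest ranking first.

    perf_rank maps a physical core index to a hypothetical performance
    ranking (higher = clocks higher). Ties go to the lower core index so
    the result is deterministic.
    """
    ordered = sorted(perf_rank, key=lambda core: (-perf_rank[core], core))
    return ordered[:n_threads]


# Hypothetical rankings for a four-core example.
ranks = {0: 140, 1: 150, 2: 145, 3: 150}
best_two = pick_favored_cores(ranks, 2)
best_three = pick_favored_cores(ranks, 3)
```

With a stable ordering like this, a lightly threaded workload keeps landing on the same few cores instead of being spread across the whole chip.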
About Collaborative Power Performance Control 2 (CPPC2)
The underlying issue is the interval at which "Zen 2" processors and the operating system talk to each other about required processor power: a staggeringly short 1 ms, compared to 15 ms on most older processors. When you have a large number of low-priority applications requesting CPU time, or certain kinds of monitoring utilities trying to measure processor stats, a "Zen 2" processor gets a false impression of the performance demand and perceives these utilities and software as load that warrants waking up the cores and raising the frequencies.
With "Zen 2", AMD introduced a platform feature called UEFI CPPC2 (UEFI collaborative power and performance control). This capability puts the processor's on-die firmware in complete control of clock-speed/voltage selection at all times. The processor can make these changes at breakneck speed, once every 1 ms. Out of the box, Windows 10 cannot keep up with this speed as its stock "Balanced" power scheme messages the processor with performance-requirement requests only once every 15 ms. AMD hence installs a custom power scheme via the Chipset Driver called "Ryzen Balanced Power Plan". This makes the OS and processor talk to each other every 1 ms.
When not gaming or working, the average PC enthusiast's machine is running a web browser, chat application, peripherals application, and a monitoring app which polls hardware. The processor could interpret their load as a need for boost if the processor and OS are talking to each other at 1 ms. This gives users the appearance of a processor that's stuck at a higher state than idle.
AMD's solution to the problem is an updated Chipset Driver with an updated "Ryzen Balanced" profile that lengthens the interval at which the processor receives performance requests from the OS to 15 ms when the processor is idling, and shortens it back to 1 ms when the processor is faced with a "real" workload. The drivers also change the way the processor behaves at low workloads. Any core that isn't power-gated will sit at 99% of the processor's nominal clock speed. This 99% value keeps the active core on the verge of boosting, so that any "serious" workload can easily trigger CPU boost and send the processor back to the 1 ms polling rate.
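The two-speed behavior described above can be sketched as follows. The 15 ms and 1 ms intervals and the 99% figure come from the text; the utilization threshold that separates idle from a "real" workload is purely an assumption for illustration, as are the function names. The 3800 MHz nominal clock is just an example value.

```python
IDLE_INTERVAL_MS = 15    # interval while the processor is idling
ACTIVE_INTERVAL_MS = 1   # interval under a "real" workload


def next_interval_ms(utilization: float, threshold: float = 0.5) -> int:
    """Pick the OS-to-CPU request interval for the next period.

    `threshold` is a hypothetical cut-off; the actual criterion the
    driver uses to recognize a real workload is not documented here.
    """
    return ACTIVE_INTERVAL_MS if utilization >= threshold else IDLE_INTERVAL_MS


def requests_per_second(interval_ms: int) -> float:
    """How many performance requests fit into one second."""
    return 1000 / interval_ms


def idle_clock_mhz(nominal_mhz: int) -> int:
    """Clock of a non-gated core at low load: 99% of the nominal clock."""
    return nominal_mhz * 99 // 100


idle_rate = requests_per_second(next_interval_ms(0.05))
busy_rate = requests_per_second(next_interval_ms(0.90))
resting_clock = idle_clock_mhz(3800)
```

At idle, the processor sees roughly 67 requests per second instead of 1000, which leaves background utilities far fewer chances to be mistaken for load.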
AMD implemented this change, and more, using the new Chipset Driver version 1.07.29 available for download on the AMD website.
Hidden Power Plan Settings
While you are probably familiar with the power control panel in Microsoft Windows and have seen the options there before, there are several additional settings that can be configured through a power plan, but are hidden from user view.
- "Processor performance autonomous mode" controls whether autonomous mode is enabled on systems that implement CPPC2 and determines whether performance states are selected by the operating system, or whether the CPU cores can determine their target performance state themselves. On systems without CPPC2, this setting has no effect.
- "Processor autonomous activity window" specifies the value to program in the autonomous activity window register on systems that implement CPPC2 with autonomous mode. Longer values tell the platform that it should be less sensitive to short duration spikes/dips in processor utilization.
- "Processor energy performance preference policy for Processor Power Efficiency Class 1" specifies a fine-grained bias level (between 0 and 255) that controls how aggressively the processor should try to optimize for low power, instead of trying to keep performance as high as possible.
- "Heterogeneous short running thread scheduling policy" specifies what thread scheduling policy to use for short running threads on heterogeneous systems. It determines how cores are chosen for single-threaded loads (quality or marking).
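Hidden settings like these can be made visible and changed with the `powercfg` command-line tool. The Python sketch below only builds the command lines rather than running them. The `-attributes ... -ATTRIB_HIDE`, `/setacvalueindex` and `/setactive` forms are standard `powercfg` usage, but the setting aliases in the dictionary are assumptions; check them against the output of `powercfg /aliases` on your own system before using them. The meaning of the value 1 for autonomous mode is likewise assumed.

```python
from typing import Dict, List

# Assumed aliases for the settings listed above. Verify with `powercfg /aliases`.
ASSUMED_ALIASES: Dict[str, str] = {
    "autonomous_mode": "PERFAUTONOMOUS",
    "autonomous_window": "PERFAUTONOMOUSWINDOW",
    "epp_class1": "PERFEPP1",
    "short_thread_policy": "SHORTSCHEDPOLICY",
}


def unhide_command(setting: str) -> List[str]:
    """Command that makes a hidden processor setting show up in the UI."""
    return ["powercfg", "-attributes", "SUB_PROCESSOR", setting, "-ATTRIB_HIDE"]


def set_ac_value_commands(setting: str, value: int) -> List[List[str]]:
    """Commands that set a value for the current plan on AC power and apply it."""
    return [
        ["powercfg", "/setacvalueindex", "SCHEME_CURRENT", "SUB_PROCESSOR",
         setting, str(value)],
        ["powercfg", "/setactive", "SCHEME_CURRENT"],
    ]


# Example: reveal autonomous mode and (assumed meaning) enable it with value 1.
alias = ASSUMED_ALIASES["autonomous_mode"]
commands = [unhide_command(alias)] + set_ac_value_commands(alias, 1)
for cmd in commands:
    print(" ".join(cmd))
```

Printing the commands first makes it easy to review them before pasting them into an elevated command prompt.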
P-States and C-States
Two control mechanisms exist to decrease the power consumption of a processor.
- Power down subsystems: C-States
- Voltage/Frequency reduction: P-States
C-states power down subsystems that are idle. P-states, on the other hand, switch a subsystem to a specific (power-saving) operating point: the subsystem is still running but does not require full performance, so the voltage and/or frequency it operates at is reduced. P-state x (or Px) means the subsystem it refers to (e.g., a CPU core) is operating at a specific (frequency, voltage) pair.
Because most modern CPUs have multiple cores in a single package, C-states are further divided into core C-states (CC-states) and package C-states (PC-states). The reason for PC-states is that there are other (shared) components in the processor that can also be powered down after all cores using them are powered down (e.g., the shared cache). However, as a user or programmer, we cannot control these since we do not interact with the package directly, but, rather, with the individual cores. We can then only affect the CC-states directly; PC-states are indirectly affected based on the CC-states of the cores.
The states are numbered starting from zero: C0, C1, and so on for C-states, and P0, P1, and so on for P-states. The higher the number, the more power is saved. C0 means nothing is shut down to save power, so everything is powered on. P0 means maximum performance, and thus maximum frequency, voltage, and power use.
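The numbering rules and the relationship between core and package C-states can be written down as a small Python model. It is a simplification: the P-state table holds hypothetical values, the function names are made up, and real hardware has more conditions for entering a package C-state than the single one modeled here.

```python
from typing import List, Sequence, Tuple


def package_can_power_down(core_c_states: Sequence[int]) -> bool:
    """Shared components may only power down once no core is in C0.

    core_c_states holds the current C-state number of each core
    (0 = fully powered on, higher = deeper sleep).
    """
    return all(state > 0 for state in core_c_states)


def is_valid_p_state_table(table: List[Tuple[int, float]]) -> bool:
    """Check that a higher P-state number never costs more power.

    table[i] is the (frequency_mhz, voltage) pair for P-state i, so P0
    must have the highest frequency and each later entry a lower one.
    """
    return all(a[0] > b[0] and a[1] >= b[1] for a, b in zip(table, table[1:]))


# Hypothetical P-state table: P0, P1, P2.
p_states = [(3800, 1.10), (2800, 0.95), (2200, 0.85)]

all_asleep = package_can_power_down([1, 6, 6, 1])
one_awake = package_can_power_down([0, 6, 6, 6])
table_ok = is_valid_p_state_table(p_states)
```

A single core sitting in C0 is enough to keep the shared parts of the package powered, which is why stray background activity matters for idle power.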