
Core 2 Quad NUMA affinity and scheduling

In the beginning of the Core 2 line there were 2 cores on the same die, and communication between the 2 cores was internal. In the Core 2 Quad there are 2 duo dies glued together, completely separate from each other, and their communication goes through the FSB - that reduces performance and increases latency. In addition, using the FSB for inter-core communication takes FSB bandwidth away from other FSB traffic (PCI, RAM, southbridge etc.). The OS (in this case Linux) is not aware of this, so it cannot deal with the problem automatically, and neither does Windows. This requires user intervention, to tell the system how to run its processes in the most efficient way.
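for reference, on Linux you can check which logical CPUs actually sit on the same die by looking at which ones share an L2 cache in sysfs. a quick C sketch of that check (the sysfs paths are the standard layout; I haven't run this on the quad yet, so treat it as an assumption):

```c
/* Print, for each CPU, which CPUs share its L2 cache.
 * On a Core 2 Quad the two cores of one die share an L2,
 * so this reveals which cores belong to which die.
 * Sketch only; assumes the usual sysfs layout. */
#include <stdio.h>

int main(void)
{
    char path[128], buf[64];

    for (int cpu = 0; cpu < 4; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cache/index2/shared_cpu_list", cpu);
        FILE *f = fopen(path, "r");
        if (!f) {
            perror(path);
            continue;
        }
        if (fgets(buf, sizeof(buf), f))
            printf("cpu%d shares L2 with: %s", cpu, buf);
        fclose(f);
    }
    return 0;
}
```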

This means I need to control the affinity of processes manually. What I am asking is whether anyone has experience with this, and advice on which strategy is best to run as fast as possible (a rough affinity sketch follows the list below), such as:

1. run light loads on one core and heavy loads on the others
2. run light loads spread across all cores and the heavy loads with affinity
3. run light loads spread across one die and the heavy loads with affinity
4. group all the processes that interact with each other onto the same die and separate the processes that don't
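here is the rough sketch I mentioned above of pinning a process to one die with sched_setaffinity, assuming purely for illustration that CPUs 0/1 are one die and 2/3 the other (the sysfs check above should confirm the real numbering); taskset -c 0,1 <command> from the shell does the same thing:

```c
/* Pin the calling process (and anything it later forks/execs) to CPUs 0 and 1.
 * The 0/1 = "die 0" pairing is an assumption for illustration only.
 * Compile with: gcc -o pin pin.c */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);   /* first core of the assumed die 0 */
    CPU_SET(1, &set);   /* second core of the assumed die 0 */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* pid 0 = this process */
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to CPUs 0-1\n");
    /* ...child processes started from here inherit the affinity mask... */
    return 0;
}
```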
 
In the Core 2 Quad there are 2 duo dies glued together, completely separate from each other, and their communication goes through the FSB - that reduces performance and increases latency. In addition, using the FSB for inter-core communication takes FSB bandwidth away from other FSB traffic (PCI, RAM, southbridge etc.).
I don't believe that is correct.

First, they are not glued together. They are on the same die with a communications path measured in micrometers or even nanometers.

Second, communications between the two pairs does NOT occur over the same bus as communications with PCIe, RAM etc. That bus is on the motherboard and can be measured in inches.
this means I need to control affinity of processes manually.
Ummm, no it doesn't.

The problem with your scenario is that it is no where near completely defined.

Setting CPU affinity manually forces Windows to use only the core or cores designated by that setting for that specific application. Windows will run that app only on those cores (which may be busy) even if other cores are doing nothing.

If you just leave it alone, the OS will assign the application to run on the least busy core. That's a good thing. And modern operating systems know how to optimize those settings quite well.

I am not saying there is no reason to manually dink with these settings, but it is application specific and typically only advantageous with specific older programs not designed for multi-core (or multi-CPU) systems.
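For the record, that is all "setting affinity" amounts to programmatically on Windows - one call restricting which logical processors the scheduler may use for the process. A minimal sketch (the 0x3 mask selecting CPUs 0 and 1 is purely illustrative):

```c
/* Restrict the current process to logical CPUs 0 and 1 on Windows.
 * The 0x3 mask (binary 11) is illustrative, not a recommendation. */
#include <windows.h>
#include <stdio.h>

int main(void)
{
    DWORD_PTR mask = 0x3;   /* bit 0 and bit 1 -> logical processors 0 and 1 */

    if (!SetProcessAffinityMask(GetCurrentProcess(), mask)) {
        fprintf(stderr, "SetProcessAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    puts("process restricted to CPUs 0 and 1");
    return 0;
}
```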
 
(...) and neither does Windows. This requires user intervention, to tell the system how to run its processes in the most efficient way.
Could you provide any sources proving that Windows has no knowledge of how to handle C2Q CPUs?
 
If I could provide sources proving OS mis-scheduling, I wouldn't be asking the question, because then and there I would be informed about the solution to the problem. In addition, you should know that Windows is closed source: nobody can prove whether the Windows scheduler does or doesn't know about the hardware relationships and architecture of the Core 2 Quad (and if it does know, how it handles it). My searches on the internet show no results with information about either the problem or the solution.
 
I don't believe that is correct.

First, they are not glued together. They are on the same die with a communications path measured in micrometers or even nanometers.

Core 2 Quads were definitely made out of two separate dies and they most definitely communicate through the FSB as well:

Remember how AMD used to boast about the fact that their quad cores are true "quad cores" on just one die?

To OP: don't bother, it doesn't make that much of a difference.
 
ok, thanks :)
I will just enable multicore processor support in the kernel options and be done with it
 
Then please decide. Either:
the OS (in this case Linux) is not aware of this, so it cannot deal with the problem automatically, and neither does Windows.
or
"I don't know".

Assuming you find some kind of "solution" - how are you going to measure the improvement?
How are you going to prove that "solution" actually works?
What would be the baseline for that measurement, without knowing how the OS handles the hardware?
 
I would bet Windows already has some sort of scheduling optimization for Core 2 processors.

Modern Ryzen CPUs are made in the exact same way using two dies and they run just fine.
 
https://bitsum.com/

You might want to play with process lasso. I think this is what you're after.
 
Modern Ryzen CPUs are not made the same way. Yes, they have two different dies, but they have a native interconnect between the dies; the Core 2 Quad made the cores talk to each other through the FSB. I thought this was common knowledge, at least among the "old" veterans. @Vya Domus is right, though, that Windows (7+ at least, afaik) already handles this as best it can. Setting affinity can help in certain circumstances, but if you're using this processor for everyday use and don't have a very specific application in mind, there is no setting that you can just set and have it fix the issue. It has to be handled on an application-specific basis.
 
Modern Ryzen CPUs are not made the same way. Yes, they have two different dies, but they have a native interconnect between the dies; the Core 2 Quad made the cores talk to each other through the FSB. I thought this was common knowledge, at least among the "old" veterans.

It's the same in the sense that there are two dies with limited communication capability between them, just like the Core 2 Quad had, and thus it's susceptible to the same theoretical problems/limitations.
 
@Papahyooie it is going to be the E5450 Core 2 Quad modded with the 771-to-775 adapter, and I'm going to run avisynth with x264 on it along with a couple of other light apps, all on Linux.
My plan is to start 2 separate encoding/filtering instances of avs2yuv + x264, each instance running 2 deinterlacing threads. My thinking was that I should keep each instance pinned to a specific die (one instance per die), since they are independent tasks: there should be no communication between them, and it would avoid migrations between dies. On top of that I planned to distribute the other 2 light loads evenly across the 2 dies.
Right now this box is running 2 instances with 1 deinterlacing thread each on an E8600 at 3800 MHz, and it gives 0.06 fps.
I made this thread to see if I can maximize utilization of the quad when I get it, so I can get to 0.12 fps.
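something like this is what I had in mind for launching the two instances, one pinned to each die - just a sketch, the CPU numbering and the avs2yuv/x264 command lines are placeholders and not my real settings:

```c
/* Launch two independent encode pipelines, each pinned to one (assumed) die.
 * CPU numbering and the command strings are illustrative placeholders. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void launch_pinned(const char *cmd, int cpu_a, int cpu_b)
{
    pid_t pid = fork();
    if (pid == 0) {                        /* child */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu_a, &set);
        CPU_SET(cpu_b, &set);
        sched_setaffinity(0, sizeof(set), &set);   /* pin before exec */
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        perror("execl");                   /* only reached if exec fails */
        _exit(127);
    }
}

int main(void)
{
    /* placeholder command lines for the two instances */
    launch_pinned("avs2yuv clip1.avs - | x264 --demuxer y4m -o out1.mkv -", 0, 1);
    launch_pinned("avs2yuv clip2.avs - | x264 --demuxer y4m -o out2.mkv -", 2, 3);

    while (wait(NULL) > 0)                 /* wait for both children */
        ;
    return 0;
}
```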
 
The only case where it (affinity) would do anything is when an app can use two cores max and scales with cache.
Then pinning it to two cores on separate dies would be a good idea.
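A quick sketch of that idea, assuming CPUs 0 and 2 sit on different dies (check sysfs first): pin each of the two worker threads to its own die so each gets a whole L2 to itself.

```c
/* Two worker threads, each pinned to a core on a different (assumed) die,
 * so neither has to share its die's L2 with the other. Sketch only.
 * Compile with: gcc -pthread threads.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *worker(void *arg)
{
    int cpu = *(int *)arg;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    printf("worker pinned to cpu %d\n", cpu);
    /* ...do the cache-heavy work here... */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    int cpu_a = 0, cpu_b = 2;   /* assumed: one core on each die */

    pthread_create(&t1, NULL, worker, &cpu_a);
    pthread_create(&t2, NULL, worker, &cpu_b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```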
 
@agent_x007 I could do that, i.e. tell avisynth to do 4-thread deinterlacing and run only 1 instance, but there is a problem: the CPU is never at 100%, so I lose speed. The ideal situation for avisynth is to run 1:1, one instance per CPU rather than one thread per CPU; that is the problem. On the other hand I can't do 1:1 because instances use 1.5 GB of RAM each, unlike threads, so...
 
Umm, people are getting their facts mixed up here.

-Core2Quads WERE 2x Core2Duos 'glued' together.
-They DO communicate over the FSB if cache data needs to be shared.
-Windows DOES know about Core2Quads and knows how to do processor affinity properly.
-You claim that Windows doesn't know how to handle affinity for the Core2Quads, yet you said that Windows is closed source so "nobody can prove whether the windows scheduler does or doesn't know", yet you are trying to fix this issue because you "know" that it doesn't work properly with Core2Quads...
 
The lack of understanding of the hardware abstraction layer, multi-socket vs. multi-core, and NUMA in this thread is baffling. Also, chasing high performance with a 771-to-775 mod :wtf:
Just run the tasks and let the OS do its thing. Cutting through the management the OS should be doing with its thread scheduler is probably why your performance is so low... that, and the fact that it's a Wolfdale/Yorkfield core. It is old.
 
@Gasaraki all of Windows by principle "doesn't work properly", just so you know - it's the worst OS in the world, and people use it only because they are forced to. So how would you know what it does or doesn't do? You are not its developer and neither is anyone else. I'm not here to judge you or anyone else for using Windows; if you want to believe you know things that nobody could possibly know, that's fine. But here is the thing you are missing: this isn't about who knows what, this is about trust. An OS that sucks in so many departments cannot be trusted to do anything right. Just go and look at what the Windows 10 scheduler is doing to your processes, and since some people already mentioned Ryzen - look at how awful Ryzen's performance is on Windows 10. How dare you claim to prove what Windows does or doesn't know, when everything it does is plain and simple bullshit.

I'm not running Windows, if you hadn't noticed, and now you know why.
 
The FSB is part of the motherboard and is used to establish communications between the CPU, RAM and graphics solution via PCIe. You are suggesting the two halves of this CPU communicate over that bus. I'm just saying they communicate directly over an on-die bus.
 
I don't have it yet, @_JP_, but I was thinking to plan ahead. So you also say it's best to leave it to the scheduler to deal with.
 
@Gasaraki all of Windows by principle "doesn't work properly", just so you know - it's the worst OS in the world

Hold on mate, that can't be true, there are OSes out there that come with KDE as the default DE :laugh:

I'd be interested in before and after benches if you can
 
UI preference is a subject in itself; I was referring more to the kernel side of the operating systems :)

Yes, I will do some testing when I get it, but so far my impression from people here is that it doesn't have as much of an impact on performance as I was afraid of.
 
The FSB is part of the motherboard and is used to establish communications between the CPU, RAM and graphics solution via PCIe. You are suggesting the two halves of this CPU communicate over that bus. I'm just saying they communicate directly over an on-die bus.

I can't provide a source at the moment, but I'm pretty sure that's not true. I have always been under the impression that the C2Q dies talked over the FSB, which was one of the architecture's major drawbacks for multithreaded applications. QPI, the on-die interconnect, didn't come until the i7's first iteration. Like I said... I thought this was common knowledge, so I haven't really researched it in forever. I may need to dig up some old articles and verify.

AM4 ryzen CPUs have only one die ;)
Perhaps I'm thinking of the Epyc line. Intel had made the accusation that they were "glued together"... which was especially juicy, considering that Intel had essentially done the same thing with the C2Q a few years back. Regardless, the huge difference is that the C2Q's multiple dies didn't talk to each other over an on-package interconnect (if the above is indeed true... I'll see if I can verify), whereas AMD's modern lines do have an on-die interconnect (as does modern Intel).
 
Core 2 Quads were definitely made out of two separate dies and they most definitely communicate through the FSB as well:

View attachment 95314

Remember how AMD used to boast about the fact that their quad cores are true "quad cores" on just one die?

To OP: don't bother, it doesn't make that much of a difference.

It actually makes no difference.

Why?

NUMA is for CPUs that have separate memory buses, to indicate which CPU core has direct access to which portion of memory. The C2Q does not have NUMA; both dies share the same FSB link to the same northbridge to the same memory controller. NUMA helps nothing here, as memory access is uniform.

Ryzen is in a similar boat. EPYC and Threadripper are not.
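If you want to see it for yourself on Linux, libnuma will report either no NUMA support at all or a single node on a C2Q, because every core reaches RAM through the same FSB/northbridge. A minimal sketch, assuming libnuma is installed (compile with: gcc numa_check.c -lnuma):

```c
/* Report how many NUMA nodes the kernel sees.
 * On a Core 2 Quad this is "no NUMA support" or a single node,
 * since all cores reach RAM through the same FSB/northbridge. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        puts("kernel reports no NUMA support on this system");
        return 0;
    }
    printf("NUMA nodes configured: %d (max node id %d)\n",
           numa_num_configured_nodes(), numa_max_node());
    return 0;
}
```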
 
I was not afraid of a memory access penalty, which as you say is uniform, but of penalties associated with inter-die communication, such as: process migration from one die to the other, and inter-process communication between processes running on separate dies (which in my case would be avs2yuv.exe sending its finished frames from one die while the x264 encoder process runs on the other).
 