New multithreaded CPU benchmark: "Eight queens puzzle"

Aquinus · Jun 11, 2016

BiggieShady said:
Not really possible because OpenMP handles thread scheduling by itself: you just put a #pragma omp parallel for construct before your for loop and OpenMP distributes the iterations across threads.
What you can do is choose the schedule kind: static, dynamic, guided or runtime. Guided is a variant of dynamic with shrinking chunk sizes, and runtime just defers the choice to the OMP_SCHEDULE environment variable.
Basically static has the least locking: it hands out chunks round robin and relies on the iteration count being known when the loop starts, so the chunk assignment is fixed up front instead of negotiated while the loop runs.
Dynamic hands out chunks at runtime as threads become free, which requires more locking.
Here the for loop that gets parallelized has 18 iterations, so static scheduling could be used, but each iteration is heavy and long-running, so the granularity is too coarse to gain efficiency by changing the schedule. This is why scaling is off, and true scaling would only be seen on Xeons with 18+ cores.
Additionally, this code could not be parallelized with finer granularity, because only the scenarios for the first queen's position (each with its brute-force search down the hierarchy) are independent of each other.
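
To make the static vs. dynamic difference concrete, here is a small simulation sketch in Python. It is not the benchmark's code and does not use OpenMP; the per-iteration costs are made-up numbers chosen only to show what happens when a few of the 18 iterations are much heavier than the rest.

```python
import heapq

def static_makespan(durations, threads):
    """Round-robin assignment fixed up front, similar in spirit to schedule(static, 1)."""
    load = [0] * threads
    for i, d in enumerate(durations):
        load[i % threads] += d
    return max(load)

def dynamic_makespan(durations, threads):
    """Each thread grabs the next iteration when it becomes free, like schedule(dynamic, 1)."""
    finish = [0] * threads            # min-heap of per-thread finish times
    heapq.heapify(finish)
    for d in durations:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + d)
    return max(finish)

# Hypothetical costs for 18 iterations (illustrative only, not measured).
durations = [9, 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 1, 1, 1, 9, 1]
print("static :", static_makespan(durations, 8))
print("dynamic:", dynamic_makespan(durations, 8))
```

With these made-up costs the round-robin split happens to stack all three heavy iterations on the same thread, while the dynamic hand-out spreads them. When every iteration costs the same, both functions give the same answer, which matches the point above that the schedule kind alone cannot fix coarse granularity.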

Thanks for the explanation. My C-fu is pretty weak since I haven't done it since I was in school working on my undergrad, but that's what it was starting to look like to me as well. There are some things in the code that bother me, such as the "goto" statements. It feels like it was written by a dev who writes drivers or low-level OS code. It appears that practically no heuristics are used, which also drives me nuts.

With that said, I've started writing another version that is written in Clojure that focuses on using set logic and sets of positions to find solutions instead. We'll see how it goes.

FordGT90Concept said:
Explains why it falls to 2-3 (n=18 or 19 in the case of large) threads on my system. It knocks out the first 8 (100% CPU), then the second 8 (100% falling off), leaving the remaining 2-3 (32.5% falling off). Without major reworking of the algorithm, it does not make a good benchmark because of that bias.

This happens because some jobs finish faster than others, so one thread might finish before another. OpenMP is a very old way of going about multi-threading and doesn't work well if each loop iteration doesn't take a consistent amount of time to execute. It's a bad way to build an application in my opinion; the dev needs more flexibility to express the problem at hand, not to work around the shortcomings of the language. If I finish my version, I'll post it. I'm pretty determined to finish it right now, but my determination might waver as the day goes on. I'm having fun thinking about it, that much is certain.
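
The 8 / 8 / 2 pattern described in the quote falls out of simple arithmetic if one assumes 18 independent tasks of roughly equal cost (an idealization; the real tasks are uneven). A rough Python sketch, with function names invented for illustration:

```python
import math

def waves(tasks, threads):
    """Sizes of the successive batches when equal-cost tasks are handed out."""
    full, rest = divmod(tasks, threads)
    return [threads] * full + ([rest] if rest else [])

def best_speedup(tasks, threads):
    """Upper bound on speedup over one thread when every task costs the same."""
    return tasks / math.ceil(tasks / threads)

print(waves(18, 8))
print(best_speedup(18, 8))
```

Under that assumption 8 threads need three rounds, so the best case is 18/3 = 6x rather than 8x, and the last round keeps only 2 threads busy. Only a thread count that divides 18 evenly avoids the idle tail.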

FordGT90Concept · Jun 11, 2016

I think I would do either n^2 tasks or n^2 * the-next-move tasks. That would sufficiently spread out the load to prevent major bias.
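
One possible reading of the "n^2 tasks" idea, sketched in Python: treat every non-attacking placement of the queens in the first two rows as its own task, then finish each task with an ordinary backtracking search. The names `count_from`, `two_queen_tasks` and `count_solutions` are made up for this sketch, and the thread pool only demonstrates the task split (CPython's GIL prevents a real speedup here).

```python
from concurrent.futures import ThreadPoolExecutor

def count_from(n, row, cols, ld, rd):
    """Count completions of a partial placement using bitmask backtracking."""
    if row == n:
        return 1
    free = ((1 << n) - 1) & ~(cols | ld | rd)   # squares not attacked in this row
    total = 0
    while free:
        bit = free & -free                      # lowest free square
        free ^= bit
        total += count_from(n, row + 1, cols | bit, (ld | bit) << 1, (rd | bit) >> 1)
    return total

def two_queen_tasks(n):
    """One task per legal placement of the queens in rows 0 and 1."""
    tasks = []
    for c0 in range(n):
        b0 = 1 << c0
        cols, ld, rd = b0, b0 << 1, b0 >> 1
        for c1 in range(n):
            b1 = 1 << c1
            if b1 & (cols | ld | rd):
                continue                        # second queen would be attacked
            tasks.append((cols | b1, (ld | b1) << 1, (rd | b1) >> 1))
    return tasks

def count_solutions(n, workers=4):
    """Sum the per-task counts; the tasks are independent of each other."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda t: count_from(n, 2, *t), two_queen_tasks(n)))

print(len(two_queen_tasks(8)), count_solutions(8))
```

For an 8x8 board this yields a few dozen small tasks instead of 8 large ones, which is the kind of finer split that lets a scheduler even out the load.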

The C code is clearly designed for performance which is why it is written like a driver.

I'm really not in the mood for coding right now so go for it @Aquinus! :toast:

Improving the multithreading of takaken 2011's code would be the fastest performance-wise though.

BiggieShady · Jun 11, 2016

Aquinus said:
There are some things in the code that bother me, such as the "goto" statements.

Yeah it gets weird when one implements recursive algorithms without actually using recursion :laugh:

... Clojure, as a functional programming language emphasizing recursive algorithms, could be a better fit for this specific problem, but I don't know about performance since it compiles to bytecode for the JVM. Nevertheless I'm curious, so I say go for it
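
For anyone wondering what "recursive algorithm without recursion" looks like, here is a generic Python sketch that replaces the call stack with an explicit stack of frames. It is not takaken's code and makes no claim about how that code is structured; it only illustrates the general technique.

```python
def count_iterative(n):
    """Count n-queens solutions with an explicit stack instead of recursion."""
    full = (1 << n) - 1
    # Each frame is (cols, ld, rd, free): the attack masks for one row plus
    # the candidate squares in that row that have not been tried yet.
    stack = [(0, 0, 0, full)]
    count = 0
    while stack:
        cols, ld, rd, free = stack.pop()
        if not free:
            continue                                # row exhausted: backtrack
        bit = free & -free                          # next untried square
        stack.append((cols, ld, rd, free ^ bit))    # resume this row later
        cols, ld, rd = cols | bit, (ld | bit) << 1, (rd | bit) >> 1
        if cols == full:
            count += 1                              # all n columns are occupied
        else:
            stack.append((cols, ld, rd, full & ~(cols | ld | rd)))
    return count

print(count_iterative(8))
```

The frame tuple plays the role that local variables and the return address play in the recursive version, which is roughly what goto-based C code has to manage by hand.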

stealth83 · Jun 11, 2016

newtekie1 · Jun 11, 2016

i7-4790K@4.6GHz

FX-8350@4.6GHz

JrockTech · Jun 11, 2016

newtekie1 said:
i7-4790K@4.6GHz

FX-8350@4.6GHz

The 8350 HT link is supposed to sit around 2600, unless power saving features are dropping the value.

Aquinus · Jun 11, 2016

BiggieShady said:
Yeah it gets weird when one implements recursive algorithms without actually using recursion ... Clojure, as a functional programming language emphasizing recursive algorithms, could be a better fit for this specific problem, but I don't know about performance since it compiles to bytecode for the JVM. Nevertheless I'm curious, so I say go for it

loop-recur does offer a non-stack-consuming, TCO-like way of solving the same problems, but that's not the main benefit. Immutable data = shared data under the hood, which means less duplication of data and less consumption of system memory. For a problem like this, keeping the memory footprint down is important because as you use larger boards, more memory is required to manage each "possibility". I'm using heuristics to find only the positions it can move forward to, so I'm not purely brute-forcing it either, but I think a good language to express a problem is sometimes worth the loss in performance. C is important if you need response times under a millisecond, but I would argue that a good expressive language can do the same thing, better, faster, and with less code.

Now that I've said that, I'm going to keep working on it because, I feel like I've set the bar kind of high for myself. :laugh:

Edit: Functional languages are nice because I can write my code like a math problem. I think that OO and traditional imperative code don't express these kinds of problems well.

TheHunter · Jun 11, 2016

Using 1 thread(s).
Elapsed time (hh:mm:ss:cs): 13.62

Using 2 thread(s).
Elapsed time (hh:mm:ss:cs): 9.94

Using 4 thread(s).
Elapsed time (hh:mm:ss:cs): 6.01

Using 8 thread(s).
Elapsed time (hh:mm:ss:cs): 5.14
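
A quick way to read these numbers is to convert them into speedup and per-thread efficiency. The Python sketch below assumes the posted values are seconds; the `scaling` helper is invented for illustration.

```python
def scaling(times):
    """Map thread count -> (speedup over 1 thread, efficiency per thread)."""
    base = times[1]
    return {n: (base / t, base / t / n) for n, t in times.items()}

# Elapsed times as posted above, assumed to be in seconds.
posted = {1: 13.62, 2: 9.94, 4: 6.01, 8: 5.14}
for n, (speedup, efficiency) in scaling(posted).items():
    print(f"{n} thread(s): {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

The speedup at 8 threads comes out well below 8x, which is consistent with the coarse 18-iteration split discussed earlier in the thread.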


The slightly older Fritz chess benchmark is also an interesting benchmark, but it gets the CPU a bit hotter. Something like 3DMark Vantage physics hot.
https://www.chess.com/download/view/fritz-12-benchmark

JrRacinFan · Jun 11, 2016

IRQ Conflict · Jun 12, 2016

No old school results yet? Q9550 @ 3.5GHz

Using 1 thread(s).
Elapsed time (hh:mm:ss:cs): 22.05

Using 2 thread(s).
Elapsed time (hh:mm:ss:cs): 16.10

Using 3 thread(s).
Elapsed time (hh:mm:ss:cs): 10.28

Using 4 thread(s).
Elapsed time (hh:mm:ss:cs): 9.71

FordGT90Concept · Jun 12, 2016

Aquinus said:
Edit: Functional languages are nice because I can write my code like a math problem. I think that OO and traditional imperative code don't express these kinds of problems well.

Both of my programs define Cell and Grid (aka board) objects. All of the logic is in the cell. Because of that, the recursion code is literally only a few lines. The problem is I can't do the goto...label statements like takaken 2011's code does (or at least I think I can't) so it becomes a memory monster.

Aquinus · Jun 12, 2016

FordGT90Concept said:
Both of my programs define Cell and Grid (aka board) objects. All of the logic is in the cell. Because of that, the recursion code is literally only a few lines. The problem is I can't do the goto...label statements like takaken 2011's code does (or at least I think I can't) so it becomes a memory monster.

That's exactly why OO languages are bad at expressing this kind of problem IMHO. The reality is that objects are a bad way to conceptualize a problem, and they force us to make some bad decisions about how we write our code. I have an initial version done, but it requires some optimization. It's nowhere near as fast, but it keeps track of all of the unique solutions and implements several stages of queues which the workers pull from (always taking from the queue closest to completion that has items to process). I also have several ideas to improve performance (outside of multi-threading; I've already done that enough that I'm eating up my 3820's compute capability).

Just a quick overview of how I attempted to tackle the problem:

Each task cascades from one queue to the next, carrying the spots still available for a given combination of queens. The available spots are calculated using set logic: they are a set, and I have a function that calculates the set of invalid moves for any given position. The difference between the available positions and the new invalid positions tells us whether we can make another move (if we're not at our target and there are no more moves, the job is done). If we've hit our target, the finished data is put on a "valid" queue, where another thread takes those items, converts the list of queens into a set, and adds it to a final set. Right now the invalid-move function is called every time, and it takes about 5-6ms to run on my machine. Since none of these results change, I'm considering doing the calculations ahead of time, so that when the hard work is being done I'm merely doing a hash map lookup, which should be significantly faster.
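
The set-difference idea can be sketched in a few lines of Python. This is not the Clojure code linked below and leaves out the queues and worker threads; the function names and the choice to place one queen per row (to keep the fan-out small) are assumptions made for this sketch.

```python
def attacked(n, pos):
    """Squares a queen on pos=(row, col) makes invalid, including its own square."""
    r, c = pos
    return {(i, j) for i in range(n) for j in range(n)
            if i == r or j == c or i - j == r - c or i + j == r + c}

def solutions(n):
    """All n-queens solutions as frozensets of (row, col) positions."""
    found = set()
    every_square = frozenset((i, j) for i in range(n) for j in range(n))
    # Work items: (queens placed so far, squares still available).
    work = [(frozenset(), every_square)]
    while work:
        queens, available = work.pop()
        if len(queens) == n:
            found.add(queens)                 # target reached: record the set
            continue
        row = len(queens)                     # assumption: fill rows in order
        for pos in available:
            if pos[0] == row:
                work.append((queens | {pos}, available - attacked(n, pos)))
    return found

print(len(solutions(8)))
```

A work item with no available square in the next row simply produces no children, which corresponds to the "no more moves, the job is done" case described above.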

Doing it this way is most definitely heavier-weight and most definitely isn't as fast. For what it's worth, an 8x8 board doesn't consume more than 1.2GB (so far) using my method, and increasing the board size should mostly require more compute, not much more memory in comparison.

Either way, I have to finish it up and I have some optimization to do before it's ready for public testing.

This is what I have so far though: https://github.com/jrdoane/queens/blob/master/src/queens/core.clj

Edit: I just noticed that I can further reduce how many items "flow" up the series of queues by checking to see if I've encountered the set of queens before. The performance overhead of keeping track of that might be worth the speed up as it could be reducing the number of possible computations upstream by a significant number but, I'm not exactly certain yet.
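
To get a feel for how much duplicate work such a check could remove, here is a small Python experiment. It assumes queens may be placed in any order (so the same set of queens can be reached along several paths) and expands the search level by level; the names are invented and this is not the queue-based implementation from the linked repository.

```python
def attacked(n, pos):
    """Squares a queen on pos=(row, col) makes invalid, including its own square."""
    r, c = pos
    return {(i, j) for i in range(n) for j in range(n)
            if i == r or j == c or i - j == r - c or i + j == r + c}

def expand(n, dedupe):
    """Return (work items processed, solutions found) after placing n queens."""
    every_square = frozenset((i, j) for i in range(n) for j in range(n))
    level = [(frozenset(), every_square)]
    processed = 0
    for _ in range(n):
        next_level, seen = [], set()
        for queens, available in level:
            processed += 1
            for pos in available:
                new_queens = queens | {pos}
                if dedupe:
                    if new_queens in seen:
                        continue              # this set of queens was already queued
                    seen.add(new_queens)
                next_level.append((new_queens, available - attacked(n, pos)))
        level = next_level
    return processed, {queens for queens, _ in level}

plain_count, plain_solutions = expand(4, dedupe=False)
dedupe_count, dedupe_solutions = expand(4, dedupe=True)
print(plain_count, dedupe_count, len(dedupe_solutions))
```

Skipping a repeated set is safe in this sketch because the available squares depend only on which queens are on the board, not on the order they were placed in. Whether the bookkeeping pays off in the real implementation is a separate question that only measurement can answer.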

Edit 2: Using a pre-calculated two-level nested map improves calculation time for invalid spots from ~5-6ms to ~0.05-0.125ms. That change is now a no-brainer because the required storage and work ahead of time is minimal for such a huge gain.
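
A pre-calculated two-level map of this kind could look roughly like the following Python sketch. The row -> column -> set layout is an assumption for illustration; the actual shape used in the Clojure code is not shown in this thread.

```python
def invalid_moves(n, row, col):
    """Squares ruled out by a queen at (row, col), including its own square."""
    return frozenset((i, j) for i in range(n) for j in range(n)
                     if i == row or j == col
                     or i - j == row - col or i + j == row + col)

def build_table(n):
    """Nested map: table[row][col] -> precomputed set of invalid squares."""
    return {r: {c: invalid_moves(n, r, c) for c in range(n)} for r in range(n)}

table = build_table(8)          # n * n entries, built once before the search
print(len(table[0][0]), len(table[3][3]))
```

The table for an 8x8 board has only 64 entries, so the up-front cost is small, and each lookup during the search replaces a full scan of the board with two dictionary accesses.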

RealNeil · Jun 12, 2016

i7-4770K box.

Ahhzz · Jun 12, 2016

pretty neat. /tag

Enterprise24 · Jun 12, 2016

TheHunter said:
Using 1 thread(s).
Elapsed time (hh:mm:ss:cs): 13.62

Using 2 thread(s).
Elapsed time (hh:mm:ss:cs): 9.94

Using 4 thread(s).
Elapsed time (hh:mm:ss:cs): 6.01

Using 8 thread(s).
Elapsed time (hh:mm:ss:cs): 5.14


The slightly older Fritz chess benchmark is also an interesting benchmark, but it gets the CPU a bit hotter. Something like 3DMark Vantage physics hot.
https://www.chess.com/download/view/fritz-12-benchmark

The Fritz chess benchmark is not a very reliable benchmark if you use a chess engine in the real world. Chess engines can scale very well with multiple cores (for example, the strongest chess engine in the world, Stockfish 7 x64 BMI2, can use 128 cores), but they can't utilize Hyper-Threading properly.
HT will improve kilonodes per second by 30-40%, but the strength of the engine (measured in Elo rating) will suffer: the search goes wider but less deep.
That is why chess engine manuals all say that you should turn off HT.

I love overclocking and also love chess. :clap:

broken pixel · Jun 12, 2016

https://www.chess.com/forum/view/general/best-cpu-for-chess-engine-game-analysis

