Aquinus
Resident Wat-man
Thanks for the explanation. My C-foo is pretty weak since I haven't used it since working on my undergrad, but that's what it was starting to look like to me as well. There are some things in the code that bother me, such as the "goto" statements. It feels like it was written by a dev who writes drivers or low-level OS code. It also appears that practically no heuristics are used, which drives me nuts.

Not really possible, because OpenMP handles thread scheduling by itself: you just put the `#pragma omp parallel for` construct before your for loop, and OpenMP distributes the iterations to different threads.
What you can do is choose one of four scheduling modes: static, dynamic, guided, or runtime. Guided is a variant of dynamic with shrinking chunk sizes, and runtime defers the choice until the program runs (via the OMP_SCHEDULE environment variable).
Basically, static uses the least locking: it does a simple round-robin assignment and assumes the loop's iteration count doesn't change while it runs, so the chunks can all be computed up front when the loop is entered.
Dynamic hands out chunks at runtime as threads become free, which requires more locking.
Here the iteration count of the loop that gets parallelized is 18, so static scheduling could be used, but each iteration is heavy and long-running; the granularity is too coarse to gain any more efficiency by changing the thread scheduling. This is why scaling falls off, and why true scaling would only be seen on Xeons with 18+ cores.
Additionally, this code could not be parallelized at a finer granularity, because only the calculation of each scenario of the first queen position (and the subsequent brute-force search down the hierarchy) is independent of the others.
With that said, I've started writing another version that is written in Clojure that focuses on using set logic and sets of positions to find solutions instead. We'll see how it goes.
This happens because some jobs finish faster than others, so one thread might finish before another. OpenMP is a very old way of going about multi-threading and doesn't work well when loop iterations don't take a consistent amount of time to execute. It's a bad way to build an application, in my opinion; the dev needs more flexibility to express the problem at hand, not work around the shortcomings of the language. If I finish my version, I'll post it. I'm pretty determined to finish it right now, but my determination might waver as the day goes on. I'm having fun thinking about it, that much is certain.

That explains why it falls to 2-3 threads (n = 18 or 19 in the case of large) on my system. It knocks out the first 8 (100% CPU), then the second 8 (100% falling off), leaving the remaining 2-3 (32.5% falling off). Without major reworking of the algorithm, it doesn't make a good benchmark because of that bias.