• Welcome to TechPowerUp Forums, Guest! Please check out our forum guidelines for info related to our community.
  • The forums have been upgraded with support for dark mode. By default it will follow the setting on your system/browser. You may override it by scrolling to the end of the page and clicking the gears icon.

New multithreaded CPU benchmark: "Eight queens puzzle"

Not really possible because openmp handles thread scheduling by itself, you just use #pragma omp parallel for construct before your for loop and the openmp distributes iterations to different threads.
What you can do is choose from 4 modes for scheduler: static, dynamic, guided or runtime. Last two are special cases of dynamic.
Basically static is with least locking, does simple round robin and expects that calculated iteration count in the for loop never changes so the chunks can be calculated at compile time.
Dynamic calculates all chunks in runtime and requires more locking.
Here the iteration count of the for loop that get parallelized is 18 and static scheduling could be used but each iteration is heavy and long running, so the granularity is too coarse to harvest more efficiency by modifying thread scheduling. This is why scaling is off and true scaling would be seen on 18+ core xeons.
Additionally this code could not be parallelized with finer granularity because only the calculation of each scenario of the first queen position (and the subsequent brute force search down the hierarchy) is independent of each other.
Thanks for the explanation. My C-foo is pretty weak since I haven't done it since being in school working on my undergrad but, that's what it was starting to look like to me as well. There are some things in the code that bother me, such as the "goto" statements. It feels like it was written by a dev that writes drivers or low-level OS code. It appears that practically no heuristics are used which also drives me nuts.

With that said, I've started writing another version that is written in Clojure that focuses on using set logic and sets of positions to find solutions instead. We'll see how it goes.
Explains why it falls to 2-3 (n=18 or 19 in the case of large) threads on my system. It knocks out the first 8 (100% CPU), then the second 8 (100% falling off), leaving the remaining 2-3 (32.5% falling off). Without major reworking of the algorithm, it does not make a good benchmark because of that bias.
This happens because some jobs finish faster than others so one thread might finish before another. OpenMP is a very old way of going about multi-threading and doesn't work well if each loop iteration doesn't take a consistent amount of time to execute. It's a bad way to build an application in my opinion, the dev needs more flexibility to express the problem at hand, not work around the shortcomings of the language. If I finish my version, I'll post it. I'm pretty determined to finish it right now but, my determination might waiver as the day goes on. I'm having fun thinking about it, that much is certain.
 
I think I would do either n^2 tasks or n^2 * the-next-move tasks. That would sufficiently spread out the load to prevent major bias.

The C code is clearly designed for performance which is why it is written like a driver.


I'm really not in the mood for coding right now so go for it @Aquinus! :toast:

Improving the multithreading of takaken 2011's code would be the fastest performance-wise though.
 
Last edited:
There are some things in the code that bother me, such as the "goto" statements.
Yeah it gets weird when one implements recursive algorithms without actually using recursion :laugh: ... clojure as a functional programming language emphasizing recursive algorithms could be a better fit for this specific problem, but I don't know about performance since it's compiles as bytecode for JVM. Nevertheless I'm curious so I say go for it :D
 
i7-4790K@4.6GHz
4790k@46GHz.png


FX-8350@4.6GHz
8350@46GHz.png
 
i7-4790K@4.6GHz


FX-8350@4.6GHz

The 8350 HT link is supposed to sit around 2600, unless power saving features are dropping the value.
 
Yeah it gets weird when one implements recursive algorithms without actually using recursion :laugh: ... clojure as a functional programming language emphasizing recursive algorithms could be a better fit for this specific problem, but I don't know about performance since it's compiles as bytecode for JVM. Nevertheless I'm curious so I say go for it :D
loop-recur does offer a non-stack consuming TCO-like way of solving the same problems but, that's not the main benefit. Immutable data = shared data under the hood which means less duplication of data and less consumption of system memory. For a problem like this, keeping the memory footprint down is important because as you use larger boards, more memory is required to manage the of each "possibility". I'm using heurisitics to find only the positions that it can move forward so, I'm not purely brute-forcing it either but, I think a good language to express a problem is sometimes worth the loss in performance. C is important if you need response times under a millisecond but, I would argue that a good expressive language can do the same thing, better, faster, and with less code.

Now that I've said that, I'm going to keep working on it because, I feel like I've set the bar kind of high for myself. :laugh:

Edit: Functional languages are nice because I can write my code like a math problem. I think that OO and traditional imperative code doesn't express these kinds of problems well.
 
Last edited:
Using 1 thread(s).
Elapsed time (hh:mm:ss:cs): 13.62

Using 2 thread(s).
Elapsed time (hh:mm:ss:cs): 9.94

Using 4 thread(s).
Elapsed time (hh:mm:ss:cs): 6.01

Using 8 thread(s).
Elapsed time (hh:mm:ss:cs): 5.14

@
cpuid.png


A little older Fritz chess benchmark is also a interesting benchmark, but gets cpu a bit hotter.. Something like 3dmark Vantage physics hot.
https://www.chess.com/download/view/fritz-12-benchmark
 
No old school results yet? Q9550 @3.5Ghz


Using 1 thread(s).
Elapsed time (hh:mm:ss:cs): 22.05

Using 2 thread(s).
Elapsed time (hh:mm:ss:cs): 16.10

Using 3 thread(s).
Elapsed time (hh:mm:ss:cs): 10.28

Using 4 thread(s).
Elapsed time (hh:mm:ss:cs): 9.71
 
Edit: Functional languages are nice because I can write my code like a math problem. I think that OO and traditional imperative code doesn't express these kinds of problems well.
Both of my programs define Cell and Grid (aka board) objects. All of the logic is in the cell. Because of that, the recursion code is literally only a few lines. The problem is I can't do the goto...label statements like takaken 2011's code does (or at least I think I can't) so it becomes a memory monster.
 
Both of my programs define Cell and Grid (aka board) objects. All of the logic is in the cell. Because of that, the recursion code is literally only a few lines. The problem is I can't do the goto...label statements like takaken 2011's code does (or at least I think I can't) so it becomes a memory monster.
That's exactly why OO languages are bad at expressing this kind of problem IMHO. The reality is that objects are a bad way to conceptualize a problem and it forces us to make some bad decisions about how we write our code. I have an initial version done but, it requires some optimization. It's nowhere near as fast but, it keeps track of all of the unique solutions and implements several stages of queues which the workers pull from (always taking from the queue closest to completion that has items to process.) I also have several ideas to improve performance (outside of multi-threading, already done that enough where I'm eating up my 3820's compute capability.)

Just a quick overview of how I attempted to tackle the problem:

Each task cascades from one queue to the next with the available spots that could be made with a given combination of queens. The available spots are calculated using set logic. The available spots are a set and I have a function that calculates the set of invalid moves for any given position. The difference between the available positions and the new invalid positions tells us if we can make another move (if we're not at our target and there are no more moves, the job is done,) and if we've hit our target, the finished data is put on a "valid" queue where another thread takes those items and converts the list of queens into a set and adds those positions to a final set. Right now, the invalid move function is being called every time and it takes about ~5-6ms to run on my machine. Since none of these are changing, I'm considering do these calculations ahead of time so when the hard work is done, I'm merely doing a hash map lookup which should be significantly faster.

Doing it this way is most definitely heavier-weight and most definitely isn't as fast. For what it's worth, doing an 8x8 doesn't consume more than 1.2GB (so far,) using my method and increasing the board size should require more compute with my method, not too much more memory in comparison.

Either way, I have to finish it up and I have some optimization to do before it's ready for public testing.

This is what I have so far though: https://github.com/jrdoane/queens/blob/master/src/queens/core.clj

Edit: I just noticed that I can further reduce how many items "flow" up the series of queues by checking to see if I've encountered the set of queens before. The performance overhead of keeping track of that might be worth the speed up as it could be reducing the number of possible computations upstream by a significant number but, I'm not exactly certain yet.

Edit 2: Using a pre-calculated two level nested map improves calculation time for invalid spots from ~5-6ms to ~ 0.05-0.125ms. That change is now a no-brainer because the required storage and work ahead of time is minimal for such a huge gain.
 
Last edited:
i7-4770K box.

q score1.JPG
 
pretty neat. /tag :)
 
Using 1 thread(s).
Elapsed time (hh:mm:ss:cs): 13.62

Using 2 thread(s).
Elapsed time (hh:mm:ss:cs): 9.94

Using 4 thread(s).
Elapsed time (hh:mm:ss:cs): 6.01

Using 8 thread(s).
Elapsed time (hh:mm:ss:cs): 5.14

@
View attachment 75390


A little older Fritz chess benchmark is also a interesting benchmark, but gets cpu a bit hotter.. Something like 3dmark Vantage physics hot.
https://www.chess.com/download/view/fritz-12-benchmark

Fritz Chess is not so reliable benchmark if you use chess engine for real world. All chess engine although can scale very well with multiple cores (for example the strongest chess engine in the world Stockfish 7 x64 BMI2 can use 128 cores) but it can't utilize Hyperthreading properly.
HT will improve kilo node per sec by 30-40% but the strength of engine will suffer (measure in ELO rating) (go wider but less deeper).
That is why all chess engine manual said that you should turn off HT.

I love overclocking and also love chess. :clap:
 
Back
Top