
New multithreaded CPU benchmark: "Eight queens puzzle"

Program does not work for me. It says it's testing 12 threads, then the window closes after a few seconds. ;*(


Open a command prompt as admin, then drag and drop q.exe into it and hit Enter.
 
I think I tried that; I will try again when I get back to my office.
 
When you unzip the file, hold down the Shift key while right-clicking on the folder, and choose "Open Command Window Here."
 
Forgot to make a screenshot, but the results are a bit off:
Xeon X5650 running @3.3GHz
~24 sec for 1 thread
~6.5 sec for 12 threads
and everything in-between for 2,4,6,8 threads

The CPU is barely breaking a sweat - 9-12% load regardless of the number of threads, and it does not even go to turbo (the workload is not intense enough).
I have not looked at the code yet, but I suspect something is holding back the multi-threaded part. I did mess with OpenMP and OpenMPI (some simple image processing stuff) a few years ago and never saw such abnormal scaling...
 
The only way I can get it to work is to hold Shift, right-click, open a command window, then drag and drop, and it only runs a single pass @ 8 threads.
I'm sure I could modify that command line at the top of the console, but I have no time right now.
Intel Xeon E3 1231V3 @ 3.4-3.8GHz
Picked up the Xeon, which was supposed to come via Newegg/FedEx but wasn't delivered, so I got it locally. Loving the threads. Sadly I'll need to give it back to its owner, though :(
(results screenshot attached: 5EWS2zw.png)
 

Yeah, for something that resolves so fast, I would expect the 12 threads to have micromanagement issues before being able to really squeak out any speeds below 3 s. Continuous runs seem to have a cache reliance as well.
 
I've run it several times and it peaked at 61% CPU. It climbs up, hits a peak, then climbs down.

For comparison, I ran Chess repeatedly, and it took about 20 runs to find one that took longer than four seconds to execute (so Task Manager could register it); the peak was 87%. Chess is async multithreaded, so as long as the stack of work is sufficiently large, it will load the CPU to 100%.

Q often doesn't even exceed 50% over 7 seconds.

Comparing the two programs in Task Manager, I think it is possible that a 12-core CPU, finishing in ~3 seconds, may never be able to reach 100% load.

First thing I'd try is making the board larger. 12 threads should take at least 5-10 seconds to finish.
 
Mine are

Using 1 thread(s).
Elapsed time (hh:mm:ss:cs): 13.69
Using 4 thread(s).
Elapsed time (hh:mm:ss:cs): 6.10
Using 8 thread(s).
Elapsed time (hh:mm:ss:cs): 4.94
Using 12 thread(s).
Elapsed time (hh:mm:ss:cs): 3.48
 
6700K @ 4.7

8 threads = 5.04
 
Okay, here's a few things.

1.) I put the results we have so far into a table and sorted it by the one-thread computing times.

2.) I made a few changes to the program:
2a) Not entering a number of threads (or entering an invalid one) will now make the program iterate through all available thread settings. That is, if you have, say, four cores and just run the program, it will compute the results with four threads, then three, then two, then one.
2b) For those of you with rather fast CPUs, I added a command line option "large". This switches n from 18 to 19. Lots more computations to do; it will take a minute even on the fastest of CPUs.
2c) In either case, the program will now wait for the user to press the Enter key after it's done. This will prevent the window from closing.

For the larger chessboard, I figured that the launched parallel threads may have very different computation times. For the "large" chessboard, I see a clear behavior on my quad-core: first, the CPU load hits 100%. After a while, it declines to 75%, showing that one of the four threads is done, with the other three still working. After a few seconds, the load drops to 50%, then 25%, and then the program is done. Can you confirm this behavior?
 

Attachments

  • Results.png
    Results.png
    18.7 KB · Views: 228
  • Q.zip
    Q.zip
    64.7 KB · Views: 138
It works funny with many cores indeed.

Using 12 thread(s).
Elapsed time (hh:mm:ss:cs): 3.40
Using 11 thread(s).
Elapsed time (hh:mm:ss:cs): 3.35
Using 10 thread(s).
Elapsed time (hh:mm:ss:cs): 3.52
Using 9 thread(s).
Elapsed time (hh:mm:ss:cs): 3.54
Using 8 thread(s).
Elapsed time (hh:mm:ss:cs): 4.95
Using 7 thread(s).
Elapsed time (hh:mm:ss:cs): 4.91
Using 6 thread(s).
Elapsed time (hh:mm:ss:cs): 4.86
Using 5 thread(s).
Elapsed time (hh:mm:ss:cs): 5.72
Using 4 thread(s).
Elapsed time (hh:mm:ss:cs): 6.01
Using 3 thread(s).
Elapsed time (hh:mm:ss:cs): 6.39
Using 2 thread(s).
Elapsed time (hh:mm:ss:cs): 10.00
Using 1 thread(s).

And with large

Using 12 thread(s).
Elapsed time (hh:mm:ss:cs): 22.61
Using 11 thread(s).
Elapsed time (hh:mm:ss:cs): 23.65
Using 10 thread(s).
Elapsed time (hh:mm:ss:cs): 23.36
Using 9 thread(s).
Elapsed time (hh:mm:ss:cs): 22.07
Using 8 thread(s).
Elapsed time (hh:mm:ss:cs): 32.80
Using 7 thread(s).
Elapsed time (hh:mm:ss:cs): 32.79
Using 6 thread(s).
Elapsed time (hh:mm:ss:cs): 31.13
Using 5 thread(s).
Elapsed time (hh:mm:ss:cs): 40.04
Using 4 thread(s).
Elapsed time (hh:mm:ss:cs): 43.58
Using 3 thread(s).
Elapsed time (hh:mm:ss:cs): 52.90
Using 2 thread(s).
Elapsed time (hh:mm:ss:cs): 1:16.15
Using 1 thread(s).
Elapsed time (hh:mm:ss:cs): 1:39.01
 
C:\Users\n0tiert\Downloads\Q>q 8
Using 8 thread(s).
Elapsed time (hh:mm:ss:cs): 8.33

FX-8150 @ 4GHz

If I add "q.exe N", it only runs once; is that correct?
 
It went from 100% for 2-3 seconds, then plummeted to 25% in under a second or two and stayed there for a while. It presumably then went down to 12.5% and finished. It does not fully utilize the CPU for long.
(Task Manager screenshot attached: q8large.png)

In the picture, where you see it shoot up from <50% back to 100%, that's WCG taking back the idle clocks, so disregard that...

I believe the steps are:
8 cores -> 6 cores -> 5 cores -> 4 cores -> 3 cores -> 2 cores/BOINC 8 cores
100% -> 75% -> 62.5% -> 50% -> 37.5% -> 25% (doesn't reach it before BOINC takes over)

Over half of the time the program runs, it's using 25% or less of the resources available to it.

Might I suggest using a queue on the main thread, with each core pulling a job off it? That's how I usually do it.


Also, why run on fewer than the maximum cores by default?
 
The screenshot of Task Manager was with the command line option "large." It's a capture of the 8-thread then the 7-thread run.
(Task Manager screenshots attached: 657cso.jpg, 120s8lu.jpg, k00a4w.jpg)
 
@FordGT90Concept and @biffzinker, depending on how the OP designed the benchmark, you may not see 100% usage even though up to 8 threads are being used.

As Ford said:
Might I suggest using a queue on the main thread, with each core pulling a job off it? That's how I usually do it.
This is useful if and only if the algorithm is brute-forcing the solutions. Using a queue to divvy out tasks is the most basic way to accelerate purely parallel workloads (such as dispatching at the job level), but it's not the most efficient way to solve the problem. A divide-and-conquer algorithm will eventually exhibit the behavior that you two are describing. That is, as the task is broken apart, it can utilize more cores, and a lot of applications benefit from this up to a point, but it requires threads joining up with each other when their "slice" of the calculation is complete; it doesn't let the computer put an idle thread to work elsewhere, since there is still a significant amount of serial work to be completed.

Depending on how @BAGZZlash implemented it, a little bit of heuristics, queueing, or a deeper level of concurrency beyond the brute-force approach might improve performance significantly. I did notice that the OP used OpenMP, which means the application is probably written in C or C++, where a big hurdle is actually creating the multi-threaded part. I would argue a more dynamic language with richer data structures would probably help, at the expense of some computational power.

I did take a peek at the link the OP provided, and it appears that the algorithm is most definitely a brute-force method with minimal heuristics. I'm tempted to create my own version, but it probably won't be in C/C++ but rather a language like Clojure.

Would anyone be interested? If so, it might be an incentive for me to do it.
 
My 1100T

6 Threads 6:38
1 Thread 18:11
 
i5-6500 @ 5GHz

(results screenshot attached: 5-45.png)
 
still pass...

(results screenshot attached: 5-36.png)
 
I did notice that the OP used OpenMP, which means the application is probably written in C or C++, where a big hurdle is actually creating the multi-threaded part. I would argue a more dynamic language with richer data structures would probably help, at the expense of some computational power.
It is C, and the source is in the ZIP ("Q.c"). I don't know enough C to sort through it to see if queuing is possible.
 
FX-8320 @ 4.95GHz

(results screenshot attached: 9Sha2uj.png)
 
It is C, and the source is in the ZIP ("Q.c"). I don't know enough C to sort through it to see if queuing is possible.
Not really possible, because OpenMP handles thread scheduling by itself: you just put the #pragma omp parallel for construct before your for loop, and OpenMP distributes the iterations to different threads.
What you can do is choose from four scheduler modes: static, dynamic, guided, or runtime. The last two are special cases of dynamic.
Basically, static involves the least locking; it does a simple round robin and expects the iteration count of the for loop never to change, so the chunks can be calculated at compile time.
Dynamic calculates all chunks at runtime and requires more locking.
Here, the iteration count of the parallelized for loop is 18, so static scheduling could be used, but each iteration is heavy and long-running, so the granularity is too coarse to harvest more efficiency by modifying the thread scheduling. This is why the scaling is off; true scaling would only be seen on 18+ core Xeons.
Additionally, this code cannot be parallelized at a finer granularity, because only the calculation for each first-queen position (and the subsequent brute-force search down the hierarchy) is independent of the others.
 
That explains why it falls to 2-3 threads at the end on my system (n=18, or 19 in the case of large). It knocks out the first 8 (100% CPU), then the second 8 (100% falling off), leaving the remaining 2-3 (25-37.5% falling off). Without a major reworking of the algorithm, it does not make a good benchmark because of that bias.
 