Thanks for the advice on keeping it cool!
I didn't attempt to download and install ThrottleStop 2.99.6, much as I was tempted, because I felt it would not solve my problem and would complicate matters (and matters are complicated enough already).
I set up some real-time performance monitoring (a la MS) and ran the sim that often, but not always, caused the shut-down. I confess it was the "not always" bit that got me, and I think I see what's happening now.
Two of my four CPUs (0 and 1) never change temperature under normal circumstances (not even when the sim crashes out). The other two CPUs (2 and 3) tend to be up at about 69 or 70degC when it crashes (but sometimes they sit up at 70degC for hours). I suppose you could say I am halfway through the RealTemp calibration to assess the TJmax. However, the differences between the pairs made the whole process all but impossible.
I am now convinced that the sensors for CPU0/1 are "stuck at low temperatures", as I believe the RealTemp documentation put it. When I'm stress testing, the temperatures (even for 0 & 1) do rise and (amazingly) the system continues to operate (maybe the prime numbers generated by 0 and 1 are all over the shop, though). So we know the processor's sensors are producing (some sort of) data for each CPU.
So, here's what I think is happening.
The linearisation is not being performed on the raw data from within the processor (for 0 & 1). [Wouldn't that linearisation be done by a separate sensor (management) chip in the Intel® P43 + ICH10 Chipset? Couldn't I just replace that chip?]
Two things, at least, in my opinion, depend on good temperature data from the processor. One is that some part of the software must be making decisions on which processes to put onto which CPU (core), and one imagines that a high temperature would be a reason for that software to choose another target. If the temperature appears lower, then that CPU is targeted, so in our case, sometimes a hot CPU is loaded with more work (making it even hotter, of course). In fact, worse still, the hotter the other CPUs get, the more the apparently cooler CPU will be loaded, leading to a thermal shutdown.
Another thing happened when I was reading up about this MS Performance Monitoring: I was reminded that in MS Task Manager you can "set affinity" for processes to certain CPUs, so I set the affinity of the sim's main process to use CPU2 & 3. The thing ran for hours and hours. However, to my total amazement, the temperature on the busy CPUs was lower than expected. What, when they're busier!
Of course, the processor's fan takes an average of the four core temperatures, and that average always looks coolish, so it doesn't run up to 100%. That's the second of the two things that, I'm suggesting, depend on good temperature data from the processor. The two together solve my quandary nicely.
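To put rough numbers on the averaging idea, here is a minimal sketch in Python. The 70degC figure is roughly what CPUs 2 and 3 report near a crash; the 40degC for the stuck sensors is a made-up placeholder, and whether the fan control really averages the cores is only my guess.

```python
from statistics import mean

# Hypothetical readings in degC. 70 is roughly what cores 2 and 3 report
# near a crash; 40 is an invented stand-in for the "stuck" sensors on 0 and 1.
STUCK_READING = 40
HOT_READING = 70
core_temps = [STUCK_READING, STUCK_READING, HOT_READING, HOT_READING]

average_temp = mean(core_temps)  # what an averaging fan control would see
hottest_temp = max(core_temps)   # what matters for a thermal shutdown

print(f"average: {average_temp} degC, hottest: {hottest_temp} degC")
print(f"the average understates the hottest core by {hottest_temp - average_temp} degC")
```

With two sensors stuck low, the average sits halfway between the stuck value and the real one, so a fan driven by it would never see the full temperature of the hot pair.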
So, do I have to keep tampering with each process to keep things going, or is there some way I can just shut down two of my four CPUs, so they're not even considered? Does someone know the offending sensor management chip I could replace (from the chipset mentioned above)?
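On the "tampering with each process" point, the affinity can at least be worked out once and reused. As far as I know, the Windows `start` command accepts `/affinity` followed by a hexadecimal mask with one bit per CPU. Below is a minimal Python sketch of building that mask; the helper name `affinity_mask` and the program name `sim.exe` are made up for illustration.

```python
def affinity_mask(cores):
    """Return a bitmask with one bit set per core index (core 0 is bit 0)."""
    mask = 0
    for core in cores:
        if core < 0:
            raise ValueError("core indices must be non-negative")
        mask |= 1 << core
    return mask


# Restrict to CPU2 and CPU3, as I did by hand in Task Manager.
mask = affinity_mask([2, 3])

# "sim.exe" is a hypothetical program name, not the real one.
command = f"start /affinity {mask:X} sim.exe"
print(command)
```

This only builds the command line for a batch file or shortcut; it does not change the affinity of a process that is already running.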
Also, does anyone know where I can get a more user-friendly way of displaying the real-time Performance Monitoring data (.blg files) graphically? Doing it through IE, the MS way, is OK the first time, but you lose your settings for line colour and line type (and the choice is poor), and it would be nice to tick (or check) the ones you want to view in the current session, rather than having to "redo from start" each time.
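One possible route, as far as I understand it: Windows ships a `relog` tool that can convert a .blg log to CSV (something like `relog log.blg -f csv -o log.csv`), after which picking out just the counters wanted for a session is easy to script. This is a minimal Python sketch, assuming the CSV has a timestamp in the first column and one column per counter. The sample text, the simplified counter names and the `select_counters` helper are all invented for illustration.

```python
import csv
import io


def select_counters(csv_text, wanted):
    """Return {column name: [values]} for columns whose name contains any wanted substring."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    picked = {}
    for index, name in enumerate(header):
        if index == 0:
            continue  # first column is assumed to be the timestamp
        if any(fragment in name for fragment in wanted):
            values = []
            for row in data:
                cell = row[index].strip() if index < len(row) else ""
                if cell:  # skip blank samples
                    values.append(float(cell))
            picked[name] = values
    return picked


# Invented sample with simplified counter names; a real log will differ.
sample = (
    '"Time","Processor(0) % Processor Time","Processor(2) % Processor Time"\n'
    '"10:00:00","5.0","90.5"\n'
    '"10:00:01","6.0","95.0"\n'
)

selected = select_counters(sample, ["Processor(2)"])
for name, values in selected.items():
    print(name, values)
```

The `wanted` list plays the part of the tick boxes: change it per session and the rest of the log is left alone. Plotting the selected columns would still need a charting tool or a spreadsheet.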
Sorry this is such a long posting, but it's such a relief because I've been fighting this problem for a year now [with some things like fans, reapplication of thermal gel and a host of temperature measurement and fan control programs (together with their bugs and workarounds) helping, but nothing actually solving the problem].
Anyway, maybe my pain will benefit someone else.
All the best,
Peter.