LONG READ - I suggest you read thru it & "drink it in & digest it" as to what apps gain, where, and HOW (i.e. what developers must do to leverage L2 cache - and yes, that means multithreaded applications). I cannot make it any shorter, not w/out omitting critical details from the quotes (which come from RELIABLE sources):
"Alec, most people dont build a system to run seti@home"
That's just a single example. Folding@Home's another, & there are more. Especially for multithreaded applications - here is what is required for a gain to occur, from AMD:
Living in a Multi-Core World: Tips for Developers
http://developer.amd.com/articlex.jsp?id=28
"
Benefits of a Separate L2 Cache
One of the potential bottlenecks in a dual-core configuration comes from the L2 cache. In a single-cache configuration, you get a performance hit when multiple threads are competing over the same data cache.
Having a separate L2 cache for each core gives you twice the cache benefit. And of course, if you have four cores, each with its own cache, that's four times the benefit.
Having these dual L2 caches gives AMD's 64-bit architecture, also known as Direct Connect Architecture (DCA), one of its key distinctions over its competitors. But simply having this architecture in place only takes you so far.
To really benefit from L2 cache separation, developers need to implement threading techniques that allow separate cores to process separate data sets, limiting cache contention and coherency problems.
For example, consider "functional threading": the first thread handles one distinct process, then passes the operation to the second thread in the pipeline, which is dependent on that data... then to the third, the fourth, and so on. While these operations can try to run in parallel, ultimately gains are limited because of contention over the same data cache.
But with "data parallel threading," you would create threads that rely on independent data sets, for example dividing a video frame into two halves. This allows concurrent threads to make full use of an individuated cache-core configuration. Also, coding your apps with an emphasis on parallel threading allows you to automatically scale up as processors begin to add even more cores to the die.
What Is NUMA?
Along the lines of an independent L2 cache, the AMD64 architecture also employs NUMA, Non-Uniform Memory Access (or Architecture, depending on who you ask). In this scenario, each processor socket has its own memory controller, shared by all cores on that processor, which is typically populated by the system with actual physical memory.
For AMD, this especially becomes important when a configuration consists of multiple multi-core processors.
For each core, some memory is directly attached, yielding a lower latency, while some is not directly attached and has a resulting higher latency. When a given thread begins processing, the OS looks at which core is running the thread and allocates physical memory to that process. This way, the data stays close to the thread that needs it, a process called "memory affinity."
The OS considers this core to be the "home processor" for the thread and tries to keep the thread running on it. This "thread affinity" or "process affinity" contributes to performance by keeping the thread from unnecessarily getting moved. Each time the thread moves over to another core, performance takes a slight hit.
--------------------------------------------------------------
* APK EDIT - this is where Win32 API function calls like SetProcessAffinityMask (better for single-threaded apps imo) OR SetThreadAffinityMask (better for multithreaded apps on this account) help - see the code sketch just below this note!
Especially when SPECIFICALLY TRYING TO FULLY "HAND-OPTIMIZE" AN APPLICATION FOR MULTIPLE THREAD DESIGN, EXPLICITLY (not just letting the OS handle the multiple threadwork, implicitly).
* More on this, in detail, below, as regards dataset cache blocking...
(It helps to stop a phenomenon known as "cache pollution" - I did a thread on that here before, look it up if necessary)
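Here's a minimal Win32 sketch of that (MY illustration; the Worker() routine & the mask value are assumptions, error checking omitted):

#include <windows.h>

// Hypothetical worker routine - the real per-core work would go here.
DWORD WINAPI Worker(LPVOID)
{
    /* ... crunch this thread's private data set ... */
    return 0;
}

int main()
{
    HANDLE h = CreateThread(NULL, 0, Worker, NULL, 0, NULL);

    // Pin the worker to core #1 (bit 1 of the affinity mask) so the OS
    // won't migrate it - & its L2-resident data - to another core. For a
    // single-threaded app, SetProcessAffinityMask on GetCurrentProcess()
    // does the equivalent at the process level.
    SetThreadAffinityMask(h, 1 << 1);

    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);
    return 0;
}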
--------------------------------------------------------------
Considering that memory is the source of the most data traffic on the computer, even more than IO, this setup increases memory bandwidth at a ratio effectively the same as the number of cores. So a 4-socket server will have 4 times the memory bandwidth.
What This Means for Developers
If writing native code, do a memory allocation request for each thread. The OS will see this and handle the allocation by assigning memory in the physical bank attached to the processor on which that thread is running."
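--------------------------------------------------------------
* APK EDIT - a minimal sketch of that per-thread allocation advice (MY illustration, NOT AMD's code; the buffer size & worker() are assumptions). Allocating & touching the buffer INSIDE each thread lets a NUMA-aware OS place the physical pages in the memory bank local to the core running that thread:

#include <cstring>
#include <thread>
#include <vector>

static void worker()
{
    // Allocate INSIDE the thread: on a NUMA-aware OS the physical pages
    // typically come from the memory bank attached to this thread's
    // processor, keeping the data close to the thread that needs it.
    std::vector<char> local(64 * 1024 * 1024);
    std::memset(local.data(), 1, local.size()); // touch the pages to commit them locally
    /* ... work on 'local' here ... */
}

int main()
{
    std::thread a(worker), b(worker);
    a.join();
    b.join();
    return 0;
}
--------------------------------------------------------------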
CACHE BLOCKING:
ALSO, this can help - a technique known as "Cache Blocking", see here:
http://www3.intel.com/cd/software/products/asmo-na/eng/20461.htm
Cache Blocking Technique
"There are many factors that impact cache performance.
Effective use of data cache locality is one such significant factor. And the well known data cache blocking technique is used to take advantage of data cache locality. The cache blocking technique restructures loops with frequent iterations over large data arrays by sub-dividing the large array into smaller blocks, or tiles. Each data element in the array is reused within the data block, such that the block of data fits within the data cache, before operating on the next block or tile.
Depending on the application, a cache data blocking technique is very effective.
It is widely used in linear algebra and is a common transformation applied by compilers and application programmers. Since the 2nd level unified cache contains instructions as well as data, compilers often try to take advantage of instruction locality by grouping related blocks of instructions close together as well. Typical applications benefiting from cache data blocking are image or video applications where the image can be processed on smaller portions of the total image or video frame. But the effectiveness of the technique is highly dependent on the data block size, the processor's cache size, and the number of times the data is reused.
By way of example, a sample application is provided to demonstrate the performance impact of this technique (see Appendix A). Figure 2 shows the results of cache blocking with varying block sizes on the sample application. At the sweet spot, around 450-460 KB, the tile size matches very closely with the unified L2 cache size, and the application almost doubles in performance. This is only an example, and the block size sweet spot for any given application will vary based on how much of the L2 cache is used by other cached data within the application as well as cached instructions from the application. Typically, an application should target the block size to be approximately one-half to three-quarters of the cache size. In general, it's better to err on the side of having too small a block size than too large.
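--------------------------------------------------------------
* APK EDIT - a minimal sketch of the loop tiling Intel describes (MY illustration, NOT Intel's sample app): run ALL passes over one cache-sized tile before moving to the next, so each element is reused while still resident in L2. The smoothing kernel, the ~384 KB tile size (~3/4 of a 512 KB L2, per Intel's rule of thumb above) & the simplified tile-boundary handling are assumptions:

#include <algorithm>
#include <cstddef>

void smooth_blocked(float* data, std::size_t n, int passes)
{
    const std::size_t BLOCK = 96 * 1024; // 96K floats = 384 KB per tile; tune per CPU
    for (std::size_t start = 0; start < n; start += BLOCK)
    {
        std::size_t end = std::min(start + BLOCK, n);
        for (int p = 0; p < passes; ++p)            // reuse the tile while it is cached
            for (std::size_t i = start + 1; i < end; ++i)
                data[i] = 0.5f * (data[i] + data[i - 1]);
    }
}

A naive version would run each pass over the WHOLE array instead, evicting it from L2 between passes - that's exactly the memory traffic cache blocking eliminates.
--------------------------------------------------------------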
Additionally, the data cache blocking technique performance scales well with multiple processors if the algorithm is threaded for data decomposition. Fortunately, the fact that each block of data can be processed independently with respect to other blocks lends itself to being decomposed into separate blocks which can be processed in separate threads of execution. Figure 2.0 also shows the performance improvement of the cache blocking algorithm for two threads running on a dual processor system with two physical processors. The performance curve for two threads matches very closely the performance curve for a single processor system with the sweet spot for the block size at around 450-460 KB per thread but at approximately twice the performance. Assuming there is very little synchronization necessary between the two threads as in this example, it's reasonable to expect that the block size sweet spot would not vary significantly. Both processors have independent cache of equal size. In this case, both processors have 512KB of L2 cache available.
Since the threaded data cache blocking technique can provide significant performance opportunities on multi-processor systems, applications should detect the number of processors in order to dynamically create a thread pool so that the number of threads available for the cache blocking technique matches the number of processors. On Hyper-Threading Technology-enabled Intel processors, the application should detect the number of logical processors supported across all physical processors in the system. Be aware that a minimum block size should be established such that the overhead of threading and synchronization does not exceed the benefit from threading your algorithm.
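--------------------------------------------------------------
* APK EDIT - & the threaded version of the above (MY illustration): detect the processor count, then hand each thread its own independent slice to tile ("data decomposition"). Builds on smooth_blocked() from my sketch above; the even slicing scheme is an assumption:

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

void smooth_blocked(float* data, std::size_t n, int passes); // from the sketch above

void smooth_parallel(std::vector<float>& data, int passes)
{
    // One thread per logical processor (this count includes Hyper-Threading siblings).
    unsigned n = std::max(1u, std::thread::hardware_concurrency());
    std::size_t chunk = data.size() / n;

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < n; ++t)
    {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == n) ? data.size() : begin + chunk;
        // Disjoint slices: each core's L2 holds only its own tiles, & the
        // threads need no synchronization until the final join.
        pool.emplace_back(smooth_blocked, data.data() + begin, end - begin, passes);
    }
    for (auto& th : pool)
        th.join();
}
--------------------------------------------------------------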
Actual performance for a given application depends on the relative impact of L2 cache misses and their associated memory latencies induced without cache blocking. For an application that has significant execution time relative to memory latencies, the performance impact will also be reduced."
If you're asking what that is, it is my initials.
APK
P.S.=> Using multithreaded apps w/ a larger L2 cache can show gains, per the AMD notes above, by easing cache coherency & contention between threads (related to, but NOT the same thing as, "race conditions" - which I have noted here before on these forums, in fact). Multithreaded apps are the more prevalent application type out there today, & you can even check this yourself on YOUR system.
(AND, a good 90-100% of what you're running nowadays will show you this - visible via taskmgr.exe & its PROCESSES tab with the THREADS column enabled)...
"games/office/photoshop and the like dont need large cache, infact 256k gives them PLENTY with an a64 chip."
Incorrect on 1 of them (they ALL get gains - gaming shows the least). Photoshop's gains may not be as large, but they can be made larger via EXPLICIT multithreaded design techniques for L2 cache usage, as noted above!
Photoshop, iirc, is already designed with EXPLICIT SMP OPTIMIZATIONS (not just implicit multiple thread use driven via the OS only)...
However, business apps tend to get gains... they are part of an ENTIRE CLASS of apps that gain via larger L2 cache levels: anything that repetitively uses the same instructions over & over on data... and so does what I do: CODING.
During Linux kernel recompiles, for example (I like to put up sources, not just my own words), you can see it here:
http://www.linuxhardware.org/article.pl?sid=01/06/11/1847213&mode=thread
"It would seem that compilation is very L2 heavy. Further proof of this is that the overclocked Duron was unable to beat the 850 Athlon. Still, the difference is not large enough to be excessive."
Apps like photo processing won't gain as much (UNLESS things like splitting frames in 1/2 are used, as noted above from AMD & also from Intel), but they still do gain.
This website gives a good overview of Level 2 cache:
http://www.karbosguide.com/books/pcarchitecture/chapter11.htm
"Level 2 cache is most important for processor intensive applications
such as distributed computing. Video editing, 3d studio max or sound conversions are also very processor intensive applications."
"L1 and L2 cache are important components in modern processor design. The cache is crucial for the utilisation of the high clock frequencies which modern process technology allows. Modern L1 caches are extremely effective: in about 96-98% of cases, the processor can find the data and instructions it needs in the cache. In the future, we can expect to keep seeing CPUs with larger L2 caches and more advanced memory management, as this is the way forward if we want to achieve more effective utilisation of the CPU’s clock ticks. Here is a concrete example:
In January 2002 Intel released a new version of their top processor, the Pentium 4 (with the codename, “Northwood”). The clock frequency had been increased by 10%, so one might expect a 10% improvement in performance.
But because the integrated L2 cache was also doubled from 256 to 512 KB, the gain was found to be all of 30%.
[Fig. 79 - Because of the larger L2 cache, performance increased significantly.]
In 2002 AMD updated the Athlon processor with the new ”Barton” core. Here the L2 cache was also doubled from 256 to 512 KB in some models. In 2004 Intel came with the “Prescott” core with 1024 KB L2 cache, which is the same size as in AMD’s Athlon 64 processors. Some Extreme Editions of Pentium 4 even use 2 MB of L2 cache."
MORE REINFORCING EXAMPLES & CASES OF WHERE L2 CACHE SHOWS GAINS/BENEFITS:
http://www.anandtech.com/cpuchipsets/showdoc.aspx?i=2795&p=4
A small paragraph from their page:
"The 4MB L2 cache can increase performance by as much as 10% in some situations. Such a performance improvement is definitely tangible, and as applications grow larger in their working data sets then the advantage of a larger cache will only become more visible.
If you're the type to upgrade often, then the extra cache is not worth it as you're not getting enough of a present day increase in performance to justify the added cost. However, if this processor will be the basis for your system for the next several years, we'd strongly recommend picking a 4MB flavor of Core 2."
I believe though that if you follow these rules of thumb & things to think about, you should be ok:
1. Can you afford and justify the extra cost of the CPU?
2. What will you be doing with the PC? Office, solitaire, email, web browsing and other non-CPU-intensive applications - E6400. Gaming, sound editing, picture editing, video encoding and other CPU-intensive applications - E6600 (you have a longer shelf life with this CPU too).
Apps with only TINY gains? Typically games (but gains nonetheless result here too)... so, again: it really ALL boils down to what YOU DO on a PC... as I stated initially.
A lot to read, no doubt, but it all shows when/where/how applications can be coded to leverage added L2 cache... apk