Cache on Hand

Edit: Link no longer works.

Originally posted September 25, 2012 on AIXchange

Chris Gibson tweeted a link to a great read that will help you get your head around the inner workings of your Power hardware.

Here’s a snippet from the article, “Under the Hood: Of POWER7 Processor Caches.”

“Most of us have a mental image of modern computer systems as consisting of many processors all accessing the system’s memory. The truth is, though, that processors are way too fast to wait for each memory access. In the time it takes to access memory just once, these processors can execute hundreds, if not thousands, of instructions. If you need to speed up an application, it is often far easier to remove just one slow memory access than it is to find and remove many hundreds of instructions.

“To keep that processor busy — to execute your application rapidly — something faster than the system’s memory needs to temporarily hold those gigabytes of data and programs accessed by the processors AND provide the needed rapid access. That’s the job of the Cache, or really caches. Your server’s processor cores only access cache; [servers] do not access memory directly. Cache is small compared to main storage, but also very fast. The outrageously fast speed at which instructions are executed on these processors occurs only when the data or instruction stream is held in the processor’s cache. When the needed data is not in the cache, the processor makes a request for that data from elsewhere, while it continues on, often executing the instruction stream of other tasks. It follows that the cache design within the processor complex is critical, and as a result, its design can also get quite complex.”
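
To put rough numbers on why a single slow memory access matters so much, here is a minimal sketch of the usual weighted-average cost calculation. The latencies below are hypothetical values I picked for illustration, not POWER7 measurements, and `average_access_cycles` is just a name I made up for this example.

```python
def average_access_cycles(hit_rate, cache_cycles, memory_cycles):
    """Weighted average cost of one access, in processor cycles."""
    return hit_rate * cache_cycles + (1.0 - hit_rate) * memory_cycles


# Hypothetical latencies, chosen only for illustration (not POWER7 figures).
CACHE_CYCLES = 2
MEMORY_CYCLES = 400

for hit_rate in (0.90, 0.99, 1.00):
    cycles = average_access_cycles(hit_rate, CACHE_CYCLES, MEMORY_CYCLES)
    print(f"hit rate {hit_rate:.0%}: about {cycles:.1f} cycles per access")
```

Under these assumed numbers, even a small fraction of accesses that go all the way to memory dominates the average, which is the point the article is making.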

The author goes on to describe the cache array, the store-back cache, the L3 cast-out cache and, finally, cache coherence:

* “Processor cache holds sets of the most recently accessed 128-byte blocks. You can sort of think of each cache as just a bucket of these storage blocks, but actually it is organized as an array, typically a two dimension array.”

* “So far we’ve outlined the notion of a block of storage being ‘cache filled’ into a cache line of a cache. Clearly, when doing store instructions, there is a need to write the contents of some cache lines back to memory as well.”

* “For POWER7 processors, a storage access fills a cache line of an L2 cache (and often an L1 cache line). And from there the needed data can be very quickly accessed. But the L1/L2 cache(s) are actually relatively small. [Technical Note: The L2 of each POWER7 core only has about 2000 cache lines.] And we’d rather like to keep such blocks residing close to the core as long as possible. So as blocks are filled into the L2 cache, replacing blocks already there, the contents of the replaced L2 are ‘cast-out’ from there into the L3. It takes a bit longer to subsequently re-access the blocks from the L3, but it is still much faster than having to re-access the block from main storage.”

* “This is a Symmetric Multi-processor (SMP). Within such multi-core and multi-chip systems, all memory is accessible from all of the cores, no matter the location of the core or memory. In addition, all cache is what is called ‘coherent’; a cache fill from any core in the whole of the system is able to find the most recent changed block of storage, even if the block exists in another core’s cache. The cache exists, but the hardware maintains the illusion for the software that all accesses are from and to main storage.”
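
The array organization and the L2-to-L3 cast-out described in the excerpts above can be sketched as a toy simulation. This is only an illustration under stated assumptions: the 128-byte block size comes from the article, but the `ToyCache` class, the `access` function, the tiny set and way counts, the least-recently-used replacement policy, and the choice to move a block out of the L3 when it is refilled into the L2 are all my own simplifications. Store-back to memory and cache coherence are not modeled.

```python
from collections import OrderedDict

LINE_BYTES = 128  # block size quoted in the article


class ToyCache:
    """Set-associative cache with LRU replacement (illustrative only)."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set: keys are block numbers, oldest first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, block):
        # The low-order bits of the block number select the set (the "row").
        return self.sets[block % self.num_sets]

    def lookup(self, block):
        """Return True on a hit and mark the block most recently used."""
        s = self._set_for(block)
        if block in s:
            s.move_to_end(block)
            return True
        return False

    def remove(self, block):
        self._set_for(block).pop(block, None)

    def fill(self, block):
        """Insert a block; return the evicted block number, or None."""
        s = self._set_for(block)
        victim = None
        if block not in s and len(s) >= self.ways:
            victim, _ = s.popitem(last=False)  # evict least recently used
        s[block] = True
        s.move_to_end(block)
        return victim


def access(addr, l2, l3):
    """Return where the access was satisfied: 'L2', 'L3', or 'memory'."""
    block = addr // LINE_BYTES
    if l2.lookup(block):
        return "L2"
    source = "memory"
    if l3.lookup(block):
        l3.remove(block)  # simplification: the block moves back up to the L2
        source = "L3"
    victim = l2.fill(block)
    if victim is not None:
        l3.fill(victim)  # cast-out: the replaced L2 line lands in the L3
    return source


# Toy sizes (assumptions): a 2-line L2 and a 4-line L3, one set each.
l2 = ToyCache(num_sets=1, ways=2)
l3 = ToyCache(num_sets=1, ways=4)
trace = [access(addr, l2, l3) for addr in (0, 128, 256, 0, 64)]
print(trace)  # ['memory', 'memory', 'memory', 'L3', 'L2']
```

The third access pushes block 0 out of the tiny L2 and into the L3, so the fourth access finds it there instead of going back to memory. The last access, to address 64, falls inside the same 128-byte block as address 0 and hits in the L2.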

Much more is covered in this article, including tips you may want to consider as a Power programmer. I encourage you to read the whole thing.