A cache is a small, fast memory used to store blocks of data that are likely to be used again and again.
REASON:
The reason for having a cache is that faster memories are expensive, yet to keep up with the speed of the CPU we need fast memory. So instead of building the entire memory out of fast (and therefore costly) parts, we use a small, fast memory and bring data into it from the larger, slower memory whenever it is required. This reduces the average access time for data.
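As a rough illustration, the average memory access time can be estimated as hit time + miss rate x miss penalty. The sketch below uses purely hypothetical numbers (1-cycle hit, 100-cycle fetch from slow memory, 95% hit rate); they are not figures for any particular processor.

    /* Hypothetical average-access-time calculation; the numbers are illustrative only. */
    #include <stdio.h>

    int main(void)
    {
        double hit_time     = 1.0;    /* cycles to access the cache on a hit      */
        double miss_penalty = 100.0;  /* cycles to fetch the data from slow memory */
        double hit_rate     = 0.95;   /* fraction of accesses that hit             */

        double amat = hit_time + (1.0 - hit_rate) * miss_penalty;
        printf("Average access time: %.1f cycles (memory alone would be %.1f cycles)\n",
               amat, miss_penalty);
        return 0;
    }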
ASSUMPTIONS:
Caching is based on the Principle of Locality:
- Spatial Locality - data near the currently accessed data has a high probability of being accessed soon
- Temporal Locality - the same data tends to be accessed again and again (temporary variables, instructions being executed, etc.)
CPU <=> Level 1 <=> Level 2 <=> Level 3
-------- Memory size increases -------->
<-------- Speed increases --------------

MECHANISM:
Initially all the data is in the outermost level (Level 3 cache / main memory). Whenever the CPU needs to access a piece of data, it first checks for its availability in Level 1 (if found, it is a cache hit). If the data is not available (a cache miss), the request goes to the next higher level cache (Level 2), and so on until the data is found. Once obtained, the data, along with some of its neighbours, is brought into each of the cache levels between the memory and the CPU.
So the first access incurs the full latency. Subsequent accesses of the same data are faster because it is now available in the cache - Temporal Locality. Accesses to data near the first accessed data are also faster - Spatial Locality. Thus, on average, the access time is reduced.
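A minimal sketch of this probe-then-fill flow is given below. It is a toy model, not the controller logic of any real processor: each level is reduced to a "present" flag per block, and access_block() is an invented helper.

    /* Toy model of a 3-level lookup: each level is just a "present" flag per
       block. It only shows the probe-then-fill flow described above. */
    #include <stdio.h>

    #define NUM_LEVELS 3
    #define NUM_BLOCKS 16

    static int present[NUM_LEVELS][NUM_BLOCKS];   /* 1 if block is cached at this level */

    static void access_block(int block)
    {
        int level;

        /* Probe the levels from fastest to slowest. */
        for (level = 0; level < NUM_LEVELS; level++) {
            if (present[level][block]) {
                printf("block %d: hit in level %d\n", block, level + 1);
                break;
            }
        }
        if (level == NUM_LEVELS)
            printf("block %d: miss in all levels, fetched from memory\n", block);

        /* Bring the block into every faster level on the way back. */
        while (--level >= 0)
            present[level][block] = 1;
    }

    int main(void)
    {
        access_block(3);    /* first access: miss everywhere              */
        access_block(3);    /* second access: hit in Level 1 (temporal locality) */
        return 0;
    }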
TERMS:
Cache Hit: Data requested by the CPU is present in that particular level
Cache Miss: Data requested is not present in the particular level
Associativity: The way in which the cache is organized - e.g. Direct Mapping, Set Associativity
Cache Line: The unit (block) of data brought from the next higher level into the current level in a single read
Associativity:
How main memory addresses map to locations in the cache.
Direct Mapping:
Caches can be either "direct-mapped" or "set-associative." To explain these terms, let's look at the L1P cache of the C64x shown in Figure 4. This cache consists of 512 lines of 32 bytes each (16 KB in total). Each line always maps to the same fixed addresses in memory. For instance:
- Addresses 0000h to 001Fh will always be cached in line 0
- Addresses 0020h to 003Fh will always be cached in line 1
- Addresses 3FE0h to 3FFFh will always be cached in line 511.
In addition to holding copies of data, each line of the L1P cache contains:
- A valid bit, which indicates whether or not the cache line contains useful data.
- A tag, which is equivalent to the upper 18 bits of the address. This is needed because a given line can hold data from different addresses. For example, line 0 can hold data from addresses 0000h to 001Fh, addresses 4000h to 401Fh, etc.
- An (implied) set number. The set number is equivalent to bits 5 to 13 of the address. For a direct-mapped cache, the set number is the same as the line number. Things get more complicated with a set-associative cache, as we will see later in this article.
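To make these bit fields concrete, here is a minimal sketch that decodes an address into offset, set and tag for the direct-mapped L1P layout just described (5 offset bits for the 32-byte line, 9 set bits for the 512 lines, the rest as tag). The decode() function and the two example addresses are only illustrative.

    /* Decode an address for the direct-mapped L1P layout described above:
       5 offset bits (32-byte line), 9 set bits (512 lines), rest is the tag. */
    #include <stdio.h>

    #define OFFSET_BITS 5
    #define SET_BITS    9

    static void decode(unsigned int addr)
    {
        unsigned int offset = addr & ((1u << OFFSET_BITS) - 1);
        unsigned int set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1);
        unsigned int tag    = addr >> (OFFSET_BITS + SET_BITS);

        printf("addr %05Xh -> tag %Xh, set %u, offset %u\n",
               addr, tag, set, offset);
    }

    int main(void)
    {
        decode(0x0000);   /* line/set 0, tag 0          */
        decode(0x4000);   /* also line/set 0, but tag 1 */
        return 0;
    }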
Suppose address 0020h is accessed for the first time. The cache controller splits the address into its set and tag portions and looks at line (set) 1; since the valid bit is still 0, the access is a miss. The miss causes the controller to fetch the line (0020h-003Fh) from memory and store the data in set 1. The tag portion of the address is stored in the tag RAM, and the valid bit is set to one. The fetched data is also forwarded to the CPU, and the access is complete.
If address 0020h is accessed again, the cache controller once again takes the address, inspects its set and tag portions, and compares the tag portion of the address against the value stored in tag RAM. In this case the tag comparison is positive. The valid bit is now one so the controller will register a hit and forward the data in the cache line to the CPU, completing the access.
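The hit/miss decision just described can be written down in a few lines of code. The following is a toy direct-mapped lookup (index by set, then check the valid bit and compare tags); it is not the actual C64x controller, and the structure and function names are invented for illustration.

    /* Toy direct-mapped lookup: index by set, then compare valid bit and tag. */
    #include <stdio.h>

    #define OFFSET_BITS 5
    #define SET_BITS    9
    #define NUM_SETS    (1 << SET_BITS)

    struct line {
        int          valid;
        unsigned int tag;
    };

    static struct line cache[NUM_SETS];

    /* Returns 1 on a hit; on a miss, "fetches" the line and returns 0. */
    static int lookup(unsigned int addr)
    {
        unsigned int set = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
        unsigned int tag = addr >> (OFFSET_BITS + SET_BITS);

        if (cache[set].valid && cache[set].tag == tag)
            return 1;                 /* valid bit set and tags match: hit */

        cache[set].valid = 1;         /* miss: allocate the line           */
        cache[set].tag   = tag;
        return 0;
    }

    int main(void)
    {
        printf("0020h: %s\n", lookup(0x0020) ? "hit" : "miss");  /* miss               */
        printf("0020h: %s\n", lookup(0x0020) ? "hit" : "miss");  /* hit                */
        printf("4020h: %s\n", lookup(0x4020) ? "hit" : "miss");  /* miss, evicts 0020h */
        return 0;
    }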
Set Associative Mapping:
Set-associative caches are an extension of direct-mapped caches. In a direct-mapped cache, each set contains only one line. In a set-associative cache, each set consists of multiple lines, which are referred to as "ways." Figure 5 illustrates a set-associative cache, the C64x DSP's L1D. This is a two-way set-associative cache with 64-byte lines and a total capacity of 16 KB.
In addition to holding copies of data, each line of the L1D cache contains:
- An LRU bit that indicates which way was the least recently used. (This is not found in L1P.)
- A dirty bit, which indicates whether or not the cache line matches the contents of main memory. (This is not found in L1P.)
- A valid bit, which indicates whether or not the cache line contains useful data.
- A tag, which is equivalent to the upper 19 bits of the address.
- A set number, which is equivalent to bits 6 to 12 of the address (64-byte lines and two ways of 128 sets each).
If both ways miss, the data needs to be allocated from memory. The LRU bit determines how the data is allocated and can be thought of as a switch. If the LRU bit is 0 the line is allocated in way 0, and if it is 1 the line is allocated in way 1. The state of the LRU bit changes whenever an access (read or write) is made to a cache line. For instance, if the line in way 0 is read, the LRU bit switches to 1. Note that the LRU bit is only consulted on a miss, but its status is updated every time a line is accessed, regardless of whether it was a hit, miss, read or write.
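Below is a minimal sketch of a two-way lookup with one LRU bit per set, following the allocation rule just described. The geometry matches the L1D example (64-byte lines, 128 sets), but the code itself, including the lookup() function and the example addresses, is only an illustration, not the hardware's logic.

    /* Toy two-way set-associative lookup with one LRU bit per set. */
    #include <stdio.h>

    #define OFFSET_BITS 6
    #define SET_BITS    7
    #define NUM_SETS    (1 << SET_BITS)
    #define NUM_WAYS    2

    struct line {
        int          valid;
        unsigned int tag;
    };

    static struct line cache[NUM_SETS][NUM_WAYS];
    static int lru[NUM_SETS];             /* which way was least recently used */

    static int lookup(unsigned int addr)
    {
        unsigned int set = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
        unsigned int tag = addr >> (OFFSET_BITS + SET_BITS);
        int way;

        /* Check both ways for a valid line with a matching tag. */
        for (way = 0; way < NUM_WAYS; way++) {
            if (cache[set][way].valid && cache[set][way].tag == tag) {
                lru[set] = 1 - way;       /* the other way is now the LRU one  */
                return 1;                 /* hit                               */
            }
        }

        /* Miss: allocate in the way the LRU bit points at. */
        way = lru[set];
        cache[set][way].valid = 1;
        cache[set][way].tag   = tag;
        lru[set] = 1 - way;               /* accessed way is now the most recent */
        return 0;
    }

    int main(void)
    {
        printf("%s\n", lookup(0x0000) ? "hit" : "miss");   /* miss, goes to way 0           */
        printf("%s\n", lookup(0x2000) ? "hit" : "miss");   /* same set, miss, goes to way 1 */
        printf("%s\n", lookup(0x0000) ? "hit" : "miss");   /* hit - both lines fit          */
        return 0;
    }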
Like the L1P, the L1D is a read-allocated cache: new data is allocated from memory on a read miss only. On a write miss, the data goes through a write buffer to memory, bypassing the L1D cache. On a write hit, the data is written to the cache but not immediately to memory. This type of cache is referred to as a write-back cache, since data that was modified by a CPU write access is written back to memory at a later time.
The dirty bit indicates that a cache line has been modified by a write but the new data has not yet been written out to main memory. Initially the dirty bit is zero. As soon as the CPU writes to a line, the corresponding dirty bit is set to 1. The data is then written back to main memory when the line is evicted from the cache, which happens when a read miss causes new data to be allocated to the dirty line. A write-back can also be initiated in software by sending a write-back command to the cache controller, although this is not usually required.

Tag, cache line, cache line size and cache size should be clear from the above diagram.
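To summarize the read-allocate / write-back behaviour, here is a toy model of a single cache line that follows the rules above. write_to_memory() stands in for the write buffer / memory path; the whole thing is a sketch, not the C64x implementation.

    /* Toy read-allocated, write-back policy for a single cache line. */
    #include <stdio.h>

    struct line {
        int          valid;
        int          dirty;
        unsigned int tag;
    };

    static struct line l;

    static void write_to_memory(unsigned int tag)
    {
        printf("writing back line with tag %Xh to memory\n", tag);
    }

    static void read_access(unsigned int tag)
    {
        if (l.valid && l.tag == tag)
            return;                        /* read hit: nothing to do            */
        if (l.valid && l.dirty)
            write_to_memory(l.tag);        /* evicting a dirty line: write back  */
        l.valid = 1;                       /* read miss: allocate the line       */
        l.dirty = 0;
        l.tag   = tag;
    }

    static void write_access(unsigned int tag)
    {
        if (l.valid && l.tag == tag)
            l.dirty = 1;                   /* write hit: mark line dirty only    */
        else
            write_to_memory(tag);          /* write miss: bypass the cache       */
    }

    int main(void)
    {
        read_access(0x1);                  /* miss: allocate                          */
        write_access(0x1);                 /* hit: line becomes dirty                 */
        read_access(0x2);                  /* miss: evicts the dirty line, write-back */
        return 0;
    }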
Optimizing Cache Performance
There are three types of misses:
- Compulsory miss (also called first-reference miss) - a miss that occurs when data is brought into the cache for the first time. In contrast to the other two types of misses, compulsory misses usually cannot be avoided.
- Conflict miss - a miss that occurs because the line is replaced before it is re-used.
- Capacity miss - a miss that occurs when the capacity is exhausted. Capacity misses are a subset of conflict misses.
Cache performance is high when a program accesses memory sequentially, because this access pattern makes many accesses to the same line before moving on to the next line. For example, pseudo-code like the following has a very high hit rate:
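The original listing is not reproduced in this post, so here is a plausible stand-in: a simple sequential array sum (the array a and the size N are arbitrary).

    /* Sequential access: consecutive elements share a cache line, so most
       accesses after the first one in each line are hits. */
    #define N 1024

    int a[N];

    int sum_sequential(void)
    {
        int i, sum = 0;

        for (i = 0; i < N; i++)     /* stride of one element */
            sum += a[i];
        return sum;
    }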
The performance of the following kind of code is not good, because accesses to memory are made in large strides, which means fewer accesses are made to a given line before moving on to the next line:
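Again the original listing is missing; the sketch below assumes the same array summed with a large stride (STRIDE here equals one 64-byte line of ints, purely for illustration).

    /* Strided access: each step jumps a whole cache line, so nearly every
       access touches a new line and the hit rate is low. */
    #define N      1024
    #define STRIDE 16

    int a[N];

    int sum_strided(void)
    {
        int i, sum = 0;

        for (i = 0; i < N; i += STRIDE)   /* jump a whole cache line each time */
            sum += a[i];
        return sum;
    }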
If a line is evicted from the cache and then accessed again, it must be brought into the cache again. Therefore, it is important to avoid eviction for as long as possible. Identifying the cause of a miss helps prevent future misses.
As noted above, capacity misses occur because the cache is smaller than main memory. If there are capacity misses, the simplest solution is to increase the size of the cache. For example, the L2 cache on the C64x DSP can be configured as a mix of SRAM and cache. If there are excessive capacity misses, the programmer can allocate more L2 memory to cache. Another solution is to reduce the amount of data that is needed at any given time.
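One common way to reduce the amount of data that is needed at any given time is to process the data in cache-sized blocks. The sketch below is a generic example of this idea; the sizes, the data array and the per-block work are all hypothetical.

    /* Processing data in cache-sized blocks so the working set fits in cache. */
    #include <stdio.h>

    #define N     65536
    #define BLOCK 4096                    /* chosen to fit comfortably in cache */

    static int data[N];

    /* Hypothetical per-block work: touch the block several times so that the
       repeated passes hit in cache instead of missing. */
    static long process(const int *chunk, int len)
    {
        long sum = 0;
        int pass, i;

        for (pass = 0; pass < 4; pass++)
            for (i = 0; i < len; i++)
                sum += chunk[i];
        return sum;
    }

    int main(void)
    {
        long total = 0;
        int start;

        /* Work on one cache-sized block at a time instead of streaming over
           the whole array on every pass. */
        for (start = 0; start < N; start += BLOCK)
            total += process(&data[start], BLOCK);

        printf("%ld\n", total);
        return 0;
    }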
If there are conflict misses, the key is to reorganize data so that temporally proximate data sets map to different sets. (With a direct-mapped cache, this is the same as making sure the data sets map to different lines.) Changing the memory layout allows the data accessed to be located at addresses in memory that do not conflict (i.e. map to the same set) in cache. Alternatively, from a hardware design point of view, sets that can hold two or more lines can be created. Thus, two lines from the memory that map to the same set can both be allocated in cache without having one evict the other.
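A typical way to change the memory layout is to pad one of the data sets so that data used together no longer maps to the same sets. The sketch below assumes a hypothetical 8 KB direct-mapped cache with 64-byte lines; without the pad, a[i] and b[i] would always map to the same set and evict each other.

    /* Avoiding conflict misses by padding: a[] occupies exactly one cache
       image, so b[] would otherwise start a whole cache size later and every
       a[i]/b[i] pair would map to the same set. The pad shifts b[] by one line. */
    #include <stdio.h>

    #define CACHE_SIZE 8192
    #define LINE_SIZE  64
    #define N          (CACHE_SIZE / sizeof(int))

    struct layout {
        int  a[N];
        char pad[LINE_SIZE];   /* offsets b[] by one line relative to a[] */
        int  b[N];
    };

    static struct layout s;

    static int dot(void)
    {
        int i, sum = 0;

        /* a[i] and b[i] are used together (temporally proximate); thanks to
           the pad they no longer map to the same cache set. */
        for (i = 0; i < (int)N; i++)
            sum += s.a[i] * s.b[i];
        return sum;
    }

    int main(void)
    {
        printf("%d\n", dot());
        return 0;
    }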