Rollbackable L1 Data Cache Design?
Posted: 2021-03-12, 16:52:05
Looking at the problem of how to build fast CPU cores, I've come to the conclusion that the key component that differentiates the big boys is an L1 data cache that supports rolling back operations that haven't graduated/committed. In simpler CPUs with this kind of L1 cache (such as the MIPS R10000, which issues memory operations in order), the pipeline looks something like this:
Stage 1: Address generation. This is simply adding an address register to some offset (typically an immediate or a scaled index register).
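To make the stages concrete I'll sprinkle in some rough C++ sketches; every name, size and structure in them is invented for illustration, not taken from any real core. Stage 1 is just:

```cpp
#include <cstdint>

// Toy model of the address-generation stage: effective address is
// base + (index << scale) + displacement. Field names are illustrative.
struct AddrGenOp {
    uint64_t base;   // value of the address register
    uint64_t index;  // value of the index register (0 if unused)
    uint8_t  scale;  // log2 scale applied to the index (0, 1, 2 or 3)
    int64_t  imm;    // immediate displacement
};

uint64_t generate_address(const AddrGenOp& op) {
    return op.base + (op.index << op.scale) + static_cast<uint64_t>(op.imm);
}
```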
Stage 2: Data cache lookup & store locking. Load operations attempt to read their data from the L1 cache. For CPUs whose L1 is physically tagged, the TLB lookup also happens here, in parallel with the cache index. For store operations, the calculated address is added to a lock buffer so that any load operation that tries to hit the same address gets stalled.
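Roughly what stage 2 checks, as a toy VIPT lookup plus a scan of the store-lock buffer (the sizes, the 8-entry lock buffer and all field names are made up):

```cpp
#include <array>
#include <cstdint>

// Toy VIPT L1: index with virtual-address bits, compare physical tags.
constexpr int kWays = 4, kSets = 64, kLineBytes = 64;

struct CacheLine {
    bool     valid = false;
    bool     dirty = false;
    uint64_t ptag  = 0;                          // physical tag
    std::array<uint8_t, kLineBytes> data{};
};

struct LockEntry { bool valid = false; uint64_t paddr = 0; };

struct L1D {
    std::array<std::array<CacheLine, kWays>, kSets> sets{};
    std::array<LockEntry, 8> locks{};            // one entry per in-flight store

    // Returns the hitting line, or nullptr if we missed or a pending store
    // has the address locked (either way, it's off to the refill/stall path).
    const CacheLine* lookup(uint64_t vaddr, uint64_t paddr) const {
        for (const auto& l : locks)
            if (l.valid && l.paddr / kLineBytes == paddr / kLineBytes)
                return nullptr;                              // blocked by a pending store
        const auto& set = sets[(vaddr / kLineBytes) % kSets]; // virtual index
        for (const auto& way : set)
            if (way.valid && way.ptag == paddr / (kLineBytes * kSets))
                return &way;                                 // physical tag matched
        return nullptr;                                      // plain L1 miss
    }

    // Called when a store reaches this stage: lock its address.
    void lock_store(uint64_t paddr) {
        for (auto& l : locks)
            if (!l.valid) { l.valid = true; l.paddr = paddr; return; }
    }
};
```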
Stage 3: Data massage. 16-bit and 8-bit load operations (and 32-bit load operations on 64-bit cores) have their data shifted down by the appropriate number of bytes, then zero- or sign-extended depending on whether the load is unsigned or signed. On x86, where loads can start at any byte offset, this shifting applies to all load operations and can select data from anywhere within the cache line. Some cores (AMD's, for example) zero the loaded data when unprivileged code tries to load privileged data, which prevents Meltdown-style exploits at this stage. If the data is in L1, there are no alignment problems, there is no TLB miss, and the load address isn't locked by a pending store operation, then the loaded data is valid: it can be used by the rest of the CPU and written to the register file.
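The massage step itself is just a shift plus an extend; something like:

```cpp
#include <cstdint>

// Toy "data massage" step: take the aligned 8-byte chunk read from the L1
// array, shift the wanted bytes down, then zero- or sign-extend to 64 bits.
// size_log2 is 0/1/2/3 for 8/16/32/64-bit loads.
uint64_t massage(uint64_t chunk, unsigned byte_offset, unsigned size_log2,
                 bool sign_extend) {
    const unsigned bits = 8u << size_log2;        // 8, 16, 32 or 64
    uint64_t v = chunk >> (byte_offset * 8);      // shift the selected bytes down
    if (bits < 64) {
        v &= (uint64_t{1} << bits) - 1;           // zero-extend by default
        if (sign_extend && (v >> (bits - 1)))     // top bit set: sign-extend instead
            v |= ~((uint64_t{1} << bits) - 1);
    }
    return v;
}
```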
Stage X: Refill. For load operations where something goes wrong, the L2 and memory controllers spring into action. This covers TLB walks on cores with a hardware-filled TLB (potentially including writing back evicted TLB entries, since the "accessed" flag can have changed), splitting unaligned accesses into two parts (possibly stealing some L1 access cycles?), evicting L1 cache lines and writing their contents out if modified, streaming new data in from L2/memory into the L1, and stalling until a pending memory write resolves (for loads that fall on the same address as a pending store). It can also cover store operations that have L1, alignment or TLB misses (or address conflicts on some cores). Once the refill finishes, the value becomes valid and can be written to the register file.
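The slow path is really a priority decision about what to do next. A toy version (the enum names and priorities here are a guess on my part; real cores track this in the load/store queue entries and overlap these steps):

```cpp
// Invented status flags for the slow path.
enum class RefillAction {
    TlbWalk,        // hardware page-table walk, may write back accessed/dirty bits
    WaitForStore,   // load aliases a pending store: stall until it resolves
    SplitAccess,    // unaligned access crossing a line/page: two L1 accesses
    EvictAndFill,   // choose a victim, write it back if dirty, fetch from L2
    Ready           // nothing left to do, the value/store can proceed
};

struct MemOpState {
    bool tlb_miss, aliases_pending_store, crosses_line, l1_miss;
};

// Pick the next thing the refill machinery has to do for this operation.
RefillAction next_action(const MemOpState& s) {
    if (s.tlb_miss)              return RefillAction::TlbWalk;
    if (s.aliases_pending_store) return RefillAction::WaitForStore;
    if (s.crosses_line)          return RefillAction::SplitAccess;
    if (s.l1_miss)               return RefillAction::EvictAndFill;
    return RefillAction::Ready;
}
```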
Stage Y: Write value. For store operations, the value to be written is read from the register file. Some ISAs have conditional (predicated) stores in the spirit of CMOV, and the condition bit is presumably also read in this stage.
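A toy version of this stage, with the predicate modelled as an optional flag read (the register file layout is made up):

```cpp
#include <cstdint>
#include <optional>

// Illustrative stage Y: read the store data (and, for predicated stores,
// the condition) from the register file.
struct RegFile { uint64_t r[64]; bool flags[8]; };

// Returns the value to write, or nullopt if the predicate says "don't store".
std::optional<uint64_t> read_store_value(const RegFile& rf, unsigned data_reg,
                                         int pred_flag /* -1 = unconditional */) {
    if (pred_flag >= 0 && !rf.flags[pred_flag])
        return std::nullopt;        // predicated-off store becomes a no-op
    return rf.r[data_reg];
}
```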
Stage Z: Graduation/Commit. The CPU's back end determines that all previous operations have completed successfully. The memory operation either stalls (if still pending), succeeds, or raises an exception (generally a page fault), which decides whether later operations are allowed to run. For a load, the temporary register file value becomes the permanent value for its register, and the previous permanent register can be reused for new renamed values. For a store, the L1 cache is updated with the new value, the cache line is flagged as "dirty", and the lock on the memory address is removed. Some CPUs also check that the store hasn't touched any memory that's in the instruction cache or prefetch queue (which triggers a rollback and an instruction cache purge).
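Sketching just the register-rename and store-commit parts of this (the cache interfaces are stand-ins, not real APIs):

```cpp
#include <cstdint>
#include <vector>

// Invented structures for the commit step; real ROB/LSQ entries carry more state.
struct RenameMap { std::vector<int> arch_to_phys; std::vector<int> free_list; };

struct CommitLoad  { int arch_reg, new_phys, old_phys; };
struct CommitStore { uint64_t paddr, value; };

// Committing a load: the renamed register becomes the architectural value,
// and the previous physical register can be recycled.
void commit_load(RenameMap& map, const CommitLoad& ld) {
    map.arch_to_phys[ld.arch_reg] = ld.new_phys;
    map.free_list.push_back(ld.old_phys);
}

// Committing a store: only now does the value reach the L1 array; the line
// is marked dirty and the address lock taken in stage 2 is released.
void commit_store(const CommitStore& st,
                  void (*write_l1)(uint64_t, uint64_t),
                  void (*mark_dirty)(uint64_t),
                  void (*release_lock)(uint64_t)) {
    write_l1(st.paddr, st.value);
    mark_dirty(st.paddr);
    release_lock(st.paddr);
}
```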
Stage W: Rollback. This happens when a branch was mispredicted, and also on exceptions such as page faults. All pending operations are purged, register assignments are reset to the permanent register values, and any store-operation memory locks are removed.
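A toy squash, assuming ROB ids give the age ordering of pending operations (again, all names invented):

```cpp
#include <vector>

// Squash everything younger than the mispredicted branch (or faulting op),
// restore the rename map from the last committed state, and drop the store
// locks that the squashed stores were holding.
struct PendingOp { int rob_id; bool is_store; int lock_slot; };

struct Core {
    std::vector<PendingOp> pending;          // speculative, not yet committed
    std::vector<int>       committed_map;    // architectural register map
    std::vector<int>       speculative_map;  // map including renamed results
    std::vector<bool>      lock_valid;       // store-lock buffer occupancy
};

void rollback(Core& c, int bad_rob_id) {
    for (auto it = c.pending.begin(); it != c.pending.end();) {
        if (it->rob_id >= bad_rob_id) {
            if (it->is_store) c.lock_valid[it->lock_slot] = false;  // free its lock
            it = c.pending.erase(it);                               // squash the op
        } else {
            ++it;
        }
    }
    c.speculative_map = c.committed_map;     // registers fall back to committed values
}
```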
---
All of this gets quite complex. My question here is: is there anything in the instruction set that could be designed to make this whole thing simpler, since it's one of the major sources of complexity in fast CPU cores? I guess ForwardCom's simplification of the TLB falls into this category, and another thing that has been done is restricting unaligned memory accesses (on the DEC Alpha). Itanium tried to do this in software with a variety of specialized load operations (checked loads, advanced loads, checked advanced loads and a couple of others), but that seems to have failed in practice (the compiler needed a crystal ball to guess how to do this correctly, and it ended up being more complex in the end). The Transmeta Crusoe also did similar software speculation, but in a simpler form. Any inspiration on how to tackle this problem?