Rollbackable L1 Data Cache Design?
Posted: 2021-03-12, 16:52:05
Looking at the problem of how to build fast CPU cores, I've come to the conclusion that the key component that differentiates the big boys is an L1 data cache that supports rolling back operations that haven't graduated/committed. In simpler CPUs with this kind of L1 cache (such as the MIPS R10000, which issues memory operations in order), the pipeline looks something like this:
Stage 1: Address generation. This is simply adding an address register to some offset (typically an immediate or a scaled index register).
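To make the stages concrete I'll sprinkle in some rough C++ sketches; every name, size and structure in them is invented for illustration, not taken from any real core. Stage 1 is just:

```cpp
#include <cstdint>

// Toy model of the address-generation stage: effective address is
// base + (index << scale) + displacement. Field names are illustrative.
struct AddrGenOp {
    uint64_t base;   // value of the address register
    uint64_t index;  // value of the index register (0 if unused)
    uint8_t  scale;  // log2 scale applied to the index (0, 1, 2 or 3)
    int64_t  imm;    // immediate displacement
};

uint64_t generate_address(const AddrGenOp& op) {
    return op.base + (op.index << op.scale) + static_cast<uint64_t>(op.imm);
}
```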
Stage 2: Data cache lookup & store locking. Load operations attempt to read their data from the L1 cache. For CPUs whose L1 is physically tagged, the TLB lookup also happens here, in parallel with the cache index. For store operations, the calculated address is added to a lock buffer so that any load operation that tries to hit the same address gets stalled.
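Roughly what stage 2 checks, as a toy VIPT lookup plus a scan of the store-lock buffer (the sizes, the 8-entry lock buffer and all field names are made up):

```cpp
#include <array>
#include <cstdint>

// Toy VIPT L1: index with virtual-address bits, compare physical tags.
constexpr int kWays = 4, kSets = 64, kLineBytes = 64;

struct CacheLine {
    bool     valid = false;
    bool     dirty = false;
    uint64_t ptag  = 0;                          // physical tag
    std::array<uint8_t, kLineBytes> data{};
};

struct LockEntry { bool valid = false; uint64_t paddr = 0; };

struct L1D {
    std::array<std::array<CacheLine, kWays>, kSets> sets{};
    std::array<LockEntry, 8> locks{};            // one entry per in-flight store

    // Returns the hitting line, or nullptr if we missed or a pending store
    // has the address locked (either way, it's off to the refill/stall path).
    const CacheLine* lookup(uint64_t vaddr, uint64_t paddr) const {
        for (const auto& l : locks)
            if (l.valid && l.paddr / kLineBytes == paddr / kLineBytes)
                return nullptr;                              // blocked by a pending store
        const auto& set = sets[(vaddr / kLineBytes) % kSets]; // virtual index
        for (const auto& way : set)
            if (way.valid && way.ptag == paddr / (kLineBytes * kSets))
                return &way;                                 // physical tag matched
        return nullptr;                                      // plain L1 miss
    }

    // Called when a store reaches this stage: lock its address.
    void lock_store(uint64_t paddr) {
        for (auto& l : locks)
            if (!l.valid) { l.valid = true; l.paddr = paddr; return; }
    }
};
```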
Stage 3: Data massage. 16-bit and 8-bit load operations (and 32-bit load operations on 64-bit cores) have their data shifted down by the appropriate number of bytes, then zero- or sign-extended depending on whether the load is unsigned or signed. On x86, where loads can start at any byte offset, this shifting applies to all load operations and can select data from anywhere within the cache line. Some cores (AMD's, for example) zero the loaded data when unprivileged code tries to load privileged data, which prevents Meltdown-style exploits at this stage. If the data is in L1, there are no alignment problems, there is no TLB miss, and the load address isn't locked by a pending store operation, then the loaded data is valid: it can be used by the rest of the CPU and written to the register file.
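The massage step itself is just a shift plus an extend; something like:

```cpp
#include <cstdint>

// Toy "data massage" step: take the aligned 8-byte chunk read from the L1
// array, shift the wanted bytes down, then zero- or sign-extend to 64 bits.
// size_log2 is 0/1/2/3 for 8/16/32/64-bit loads.
uint64_t massage(uint64_t chunk, unsigned byte_offset, unsigned size_log2,
                 bool sign_extend) {
    const unsigned bits = 8u << size_log2;        // 8, 16, 32 or 64
    uint64_t v = chunk >> (byte_offset * 8);      // shift the selected bytes down
    if (bits < 64) {
        v &= (uint64_t{1} << bits) - 1;           // zero-extend by default
        if (sign_extend && (v >> (bits - 1)))     // top bit set: sign-extend instead
            v |= ~((uint64_t{1} << bits) - 1);
    }
    return v;
}
```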
Stage X: Refill. For load operations where something goes wrong, the L2 and memory controllers spring into action. This covers TLB walks on cores with a hardware-filled TLB (potentially including writing back evicted TLB entries, since the "accessed" flag can have changed), splitting unaligned accesses into two parts (possibly stealing some L1 access cycles?), evicting L1 cache lines and writing their contents out if modified, streaming new data in from L2/memory into the L1, and stalling until a pending memory write resolves (for loads that fall on the same address as a pending store). It can also cover store operations that have L1, alignment or TLB misses (or address conflicts on some cores). Once the refill finishes, the value becomes valid and can be written to the register file.
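The slow path is really a priority decision about what to do next. A toy version (the enum names and priorities here are a guess on my part; real cores track this in the load/store queue entries and overlap these steps):

```cpp
// Invented status flags for the slow path.
enum class RefillAction {
    TlbWalk,        // hardware page-table walk, may write back accessed/dirty bits
    WaitForStore,   // load aliases a pending store: stall until it resolves
    SplitAccess,    // unaligned access crossing a line/page: two L1 accesses
    EvictAndFill,   // choose a victim, write it back if dirty, fetch from L2
    Ready           // nothing left to do, the value/store can proceed
};

struct MemOpState {
    bool tlb_miss, aliases_pending_store, crosses_line, l1_miss;
};

// Pick the next thing the refill machinery has to do for this operation.
RefillAction next_action(const MemOpState& s) {
    if (s.tlb_miss)              return RefillAction::TlbWalk;
    if (s.aliases_pending_store) return RefillAction::WaitForStore;
    if (s.crosses_line)          return RefillAction::SplitAccess;
    if (s.l1_miss)               return RefillAction::EvictAndFill;
    return RefillAction::Ready;
}
```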
Stage Y: Write value. For store operations, the value to be written is read from the register file. Some ISAs have conditional (predicated) stores in the spirit of CMOV, and the condition bit is presumably also read in this stage.
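A toy version of this stage, with the predicate modelled as an optional flag read (the register file layout is made up):

```cpp
#include <cstdint>
#include <optional>

// Illustrative stage Y: read the store data (and, for predicated stores,
// the condition) from the register file.
struct RegFile { uint64_t r[64]; bool flags[8]; };

// Returns the value to write, or nullopt if the predicate says "don't store".
std::optional<uint64_t> read_store_value(const RegFile& rf, unsigned data_reg,
                                         int pred_flag /* -1 = unconditional */) {
    if (pred_flag >= 0 && !rf.flags[pred_flag])
        return std::nullopt;        // predicated-off store becomes a no-op
    return rf.r[data_reg];
}
```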
Stage Z: Graduation/Commit. The CPU's back end determines that all previous operations have completed successfully. The memory operation either stalls (if still pending), succeeds, or raises an exception (generally a page fault), which decides whether later operations are allowed to run. For a load, the temporary register file value becomes the permanent value for its register, and the previous permanent register can be reused for new renamed values. For a store, the L1 cache is updated with the new value, the cache line is flagged as "dirty", and the lock on the memory address is removed. Some CPUs also check that the store hasn't touched any memory that's in the instruction cache or prefetch queue (which triggers a rollback and an instruction cache purge).
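Sketching just the register-rename and store-commit parts of this (the cache interfaces are stand-ins, not real APIs):

```cpp
#include <cstdint>
#include <vector>

// Invented structures for the commit step; real ROB/LSQ entries carry more state.
struct RenameMap { std::vector<int> arch_to_phys; std::vector<int> free_list; };

struct CommitLoad  { int arch_reg, new_phys, old_phys; };
struct CommitStore { uint64_t paddr, value; };

// Committing a load: the renamed register becomes the architectural value,
// and the previous physical register can be recycled.
void commit_load(RenameMap& map, const CommitLoad& ld) {
    map.arch_to_phys[ld.arch_reg] = ld.new_phys;
    map.free_list.push_back(ld.old_phys);
}

// Committing a store: only now does the value reach the L1 array; the line
// is marked dirty and the address lock taken in stage 2 is released.
void commit_store(const CommitStore& st,
                  void (*write_l1)(uint64_t, uint64_t),
                  void (*mark_dirty)(uint64_t),
                  void (*release_lock)(uint64_t)) {
    write_l1(st.paddr, st.value);
    mark_dirty(st.paddr);
    release_lock(st.paddr);
}
```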
Stage W: Rollback. This happens when a branch was mispredicted, and also on exceptions such as page faults. All pending operations are purged, register assignments are reset to the permanent register values, and any store-operation memory locks are removed.
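A toy squash, assuming ROB ids give the age ordering of pending operations (again, all names invented):

```cpp
#include <vector>

// Squash everything younger than the mispredicted branch (or faulting op),
// restore the rename map from the last committed state, and drop the store
// locks that the squashed stores were holding.
struct PendingOp { int rob_id; bool is_store; int lock_slot; };

struct Core {
    std::vector<PendingOp> pending;          // speculative, not yet committed
    std::vector<int>       committed_map;    // architectural register map
    std::vector<int>       speculative_map;  // map including renamed results
    std::vector<bool>      lock_valid;       // store-lock buffer occupancy
};

void rollback(Core& c, int bad_rob_id) {
    for (auto it = c.pending.begin(); it != c.pending.end();) {
        if (it->rob_id >= bad_rob_id) {
            if (it->is_store) c.lock_valid[it->lock_slot] = false;  // free its lock
            it = c.pending.erase(it);                               // squash the op
        } else {
            ++it;
        }
    }
    c.speculative_map = c.committed_map;     // registers fall back to committed values
}
```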
---
All of this gets quite complex. My question here is: is there anything in the instruction set that could be designed to make this whole thing simpler, since it's one of the major sources of complexity in fast CPU cores? I guess ForwardCom's simplification of the TLB falls into this category, and another thing that has been done is restricting unaligned memory accesses (on the DEC Alpha). Itanium tried to do this in software with a variety of specialized load operations (checked loads, advanced loads, checked advanced loads and a couple of others), but that seems to have failed in practice (the compiler needed a crystal ball to guess how to do this correctly, and it ended up being more complex in the end). The Transmeta Crusoe also did similar software speculation, but in a simpler form. Any inspiration on how to tackle this problem?