Kulasko wrote: ↑2020-04-06, 15:26:44
Thank you for your replies.
HubertLamontagne wrote: ↑2020-04-02, 17:22:53
Having instructions that require multiple micro-ops doesn't necessarily mean you need microcode.
You are right, I didn't think about that.
I still have some difficulty dealing with this, as it complicates the decoding stage. Presumably, decoders don't have a constant latency and output just zero or one operation per clock, but up to n operations. There is then some logic that takes the outputs of all decoders, keeps them in program order, and arranges for the first m operations to be pushed further along the pipeline.
Am I understanding this correctly?
Yes. But what counts as a micro-op really depends on your type of pipeline. For in-order pipelines, it doesn't really matter what is or isn't a micro-op. You just push the next instruction into the pipeline as soon as it doesn't stall, and it doesn't really matter whether the instruction is split into multiple parts internally or not, since the execution of the different parts of the instruction is tightly coupled. You're probably only going to run 2 instructions per cycle maximum anyway (afaik), and probably only have a maximum of 1 memory load per cycle, so your front end doesn't really have to handle the really crazy cases (the back-end is going to stall anyway).
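To make the arrangement you describe concrete, here's a rough software model of a decode stage whose decoders each emit a variable number of micro-ops, with a queue in front of issue that forwards at most m per cycle. Everything here (the MicroOp fields, the queue) is invented for illustration; real hardware does this with wide control words and muxes, not structs:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Invented micro-op record; real hardware uses wide control words.
struct MicroOp {
    uint64_t pc;    // address of the parent instruction
    int      kind;  // load / store / ALU / branch, etc.
};

// One decode cycle: each decoder may have emitted 0..n micro-ops.
// Concatenating the outputs in decoder order preserves program order,
// because decoder 0 always holds the oldest instruction. At most m
// micro-ops per cycle get pushed further down the pipeline; the rest
// wait in the queue (which back-pressures decode when it fills up).
std::vector<MicroOp> decode_cycle(
        const std::vector<std::vector<MicroOp>>& decoder_outputs,
        std::deque<MicroOp>& decode_queue,
        std::size_t m) {
    for (const auto& ops : decoder_outputs)
        for (const MicroOp& op : ops)
            decode_queue.push_back(op);

    std::vector<MicroOp> issued;
    while (!decode_queue.empty() && issued.size() < m) {
        issued.push_back(decode_queue.front());
        decode_queue.pop_front();
    }
    return issued;
}
```

In hardware this "queue" is typically just a small collapsing buffer of a few entries, but the ordering property is the same.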
Micro-ops really make much more sense for out-of-order pipelines:
- Memory operations and ALU operations are scheduled completely differently, so generally, any operation that contains both parts has to be split into 2 micro-ops (the cracking sketch after this list shows the split). A lot of the smaller out-of-order processors start all the memory operations in order (but they can complete out of order), while conversely, ALU operations can easily be wildly reordered. The memory load can easily run tens of cycles earlier than the math operation (if the math operation is waiting for its second operand), so the result of the memory load has to be placed in a register file of some kind anyway.
- Some x86 processors have a kind of 2-in-1 micro-op where the load and the ALU part of an instruction are still coupled together instead of being 2 totally separate micro-ops (the AMD Athlon even had a 3-in-1 load-ALU-store micro-op). Afaik, the only difference between the 2-in-1 micro-op and 2 fully separate micro-ops is that in the 2-in-1 case, the temporary inner value can be placed in a different register pool than the normal virtual registers, so it doesn't have to compete for the same register retirement ports (the second sketch below shows one way to represent this). Not all x86 CPUs have the 2-in-1 and 3-in-1 micro-ops. Out-of-order RISC CPUs don't have these 2-in-1 micro-ops either and simply have more capacity for micro-ops internally (recent ARMs can run 6 micro-ops per cycle).
- Any instruction that writes 2 registers at the same time needs to issue 2 micro-ops, because register file write ports tend to be allocated per micro-op (although there are special cases, such as the flags-register engine on x86); crack_load_pair in the sketch below is an example.
- Breaking instructions into micro-ops really helps deal with the latency of each sub-part of the operation. For instance, by splitting the RETURN instruction into a memory load, a jump, and an ALU micro-op to update the stack pointer, you can run each part with optimal latency (if the branch is correctly predicted); see crack_return in the sketch after this list. The stack pointer update runs really fast (since it depends on nothing but the stack pointer itself), which makes the updated stack pointer available very quickly to downstream instructions. Likewise, the memory load gets started early, so the CPU can quickly lock the load address against any intervening store operation. The jump part of the instruction can easily be delayed dozens of cycles (in the case where you get cache misses). This kind of flexible scheduling is impossible to do with state machines.
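Here's a rough sketch of what that cracking could look like. None of this is any real core's micro-op format; the kinds, register numbers, and the TMP pool are all made up to illustrate the splits described above:

```cpp
#include <vector>

// Invented micro-op format and cracking routines; the kinds, register
// numbers, and the temporary pool are made up for illustration.
enum class UopKind { Load, Store, Alu, Branch };

struct Uop {
    UopKind kind;
    int dst;   // destination register, -1 if none
    int src1;  // first source register, -1 if none
    int src2;  // second source register, -1 if none
};

constexpr int SP  = 30;  // stack pointer
constexpr int TMP = 40;  // temporary from a separate internal pool

// "add rax, [rbx]" style: the memory part and the ALU part are scheduled
// by different queues, so the instruction cracks into 2 micro-ops.
std::vector<Uop> crack_load_add(int dst, int addr) {
    return {
        {UopKind::Load, TMP, addr, -1},   // starts as soon as addr is ready
        {UopKind::Alu,  dst, dst,  TMP},  // waits for the load result
    };
}

// Load-pair style: two register writes means two micro-ops, since each
// micro-op gets one register-file write port.
std::vector<Uop> crack_load_pair(int dst1, int dst2, int addr) {
    return {
        {UopKind::Load, dst1, addr, -1},
        {UopKind::Load, dst2, addr, -1},
    };
}

// RETURN: once cracked, each part runs with its own latency.
std::vector<Uop> crack_return() {
    return {
        {UopKind::Load,   TMP, SP,  -1},  // fetch return address from [SP]
        {UopKind::Alu,    SP,  SP,  -1},  // SP += 8: depends only on SP, so
                                          // downstream SP users unblock fast
        {UopKind::Branch, -1,  TMP, -1},  // jump to TMP; can trail by dozens
                                          // of cycles on a cache miss
    };
}
```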
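And for contrast, a fused 2-in-1 load-ALU micro-op (field names again invented) mostly just means the two halves travel through the machine as one record, with the intermediate value kept in a dedicated temporary pool:

```cpp
// A fused 2-in-1 load-ALU micro-op: both halves travel as one record,
// and the intermediate value lives in a dedicated temp-pool slot instead
// of a normal virtual register, so it never competes for retirement ports.
struct FusedLoadAluUop {
    int addr_reg;   // address source for the load half
    int alu_src;    // second operand for the ALU half
    int dst;        // architectural destination register
    int temp_slot;  // slot in the separate temporary pool
};
```

The scheduler can still wake the two halves independently; the fusion mostly saves rename and retirement bandwidth, as described above.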
It's true that this can easily add a stage or two to your pipeline, since you don't really know in advance how many micro-ops are going to come out of a block of instructions. In particular, this means you don't know your micro-op mix: maybe you get 1 load, 1 store, 2 ALU, and 2 vector ops (in which case they all go to separate instruction queues, so it's likely you can issue them all together), or maybe you get 6 loads (in which case you're going to get a stall no matter what!). I guess you could add a specialized cache to deal with it, similar to how the Athlon caches instruction lengths.
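The mix check itself is cheap, for what it's worth. Something like this per-queue counting (issue widths invented for the example) is all it takes to decide whether a group of micro-ops can be dispatched in one cycle:

```cpp
#include <array>
#include <vector>

// Invented per-cycle issue limits for four micro-op queues.
enum QueueId { Q_LOAD, Q_STORE, Q_ALU, Q_VEC, Q_COUNT };
constexpr std::array<int, Q_COUNT> kIssueWidth = {1, 1, 2, 2};

// Can this cycle's micro-op mix be dispatched in one go? A group of
// 6 loads fails no matter how wide the front end is, because they all
// target the same queue.
bool mix_fits(const std::vector<QueueId>& uops) {
    std::array<int, Q_COUNT> counts{};
    for (QueueId q : uops) ++counts[q];
    for (int i = 0; i < Q_COUNT; ++i)
        if (counts[i] > kIssueWidth[i]) return false;
    return true;
}
```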