Handshaking between CPU pipeline stages

Post by **agner** » 2020-05-16, 12:54:49

I am thinking about how to coordinate the processing between the different stages in the pipeline. I want to know if I am doing the right thing or if there is a better way of doing this.

Here is my idea: Each stage in the pipeline is a module with the following inputs and outputs:

stall_in: Tells the module for this stage to keep its output data for another clock cycle because the rest of the pipeline is unable to receive it in the next clock cycle.
stall_out: This stage is unable to receive a new instruction in the next clock cycle because it is waiting for an operand or it is busy with a task that takes multiple clock cycles in the same stage.
idle_out: This stage contains a pipeline bubble or an instruction with a zero mask or anything else that requires no action.
ready_out: This stage has an instruction ready for the next pipeline stage.
ready_in: Connected to the ready_out of the preceding pipeline stage.
and of course inputs and outputs of an instruction and its data.

The stall_out signal is not registered, unlike other outputs, because it is difficult to predict a stall in a future clock cycle. The stall_out must be implemented with simple logic (1-2 LUTs?) so that it is ready early in the clock cycle.

It would be logical to propagate the stall signal backwards through the pipeline, but this would be too slow for stalling the entire pipeline in a single clock cycle. Instead, I will make a look-ahead logic in the top level module. This logic will give each pipeline stage a stall_in signal if any later stage in the pipeline generates a stall_out signal and no stage in-between has an idle_out.

Is this a good way of doing it? Will it work?

Kulasko · Post by **Kulasko** » 2020-05-18, 2:45:03

As I don't have any experience concerning multi-stage pipelines in HDLs, I can't really say if your proposal is a good way of doing it. Looking at it, it should work, I might have some concerns with timing/routing, as every stage has to be able to send a stall_out signal to a central module and receive a stall_in after calculation, all in the same clock cycle.

As I had to implement stalling support into my CPU simulator, I want to explain an HDL-adapted version of it as an alternative, maybe you can take some inspiration from this:

Modules have

inputs and outputs for instruction data
a stall_in signal that tells them not to commit their result to the next stage in the current clock cycle
a stall_out signal for them to tell if they are unable to use their inputs in the current cycle

The actual stall logic is implemented in the registers at the stage-boundaries. A stage boundary consists of

the inputs from the previous stage and the outputs to the next stage
a stall_in signal it receives from the stall_out signal of the next stage
a stall_out signal it sends to the stall_in of the previous stage
the main registers, they forward their contents to the next stage, but only clock if stall_in is low
a flip-flop that takes stall_in, delays it by a clock cycle and sends it to stall_out
a second set of buffer registers, they forward their content to the main registers and clock only one time when stall_in changes to high
a multiplexer that forwards either the data input to the main registers if the delayed stall signal is low, or the contents of the buffer registers

When a stall occurs, the current output from the previous stage is written to the buffer registers instead of the main ones. Because of this, the previous stage can continue normal operation for one more clock cycle. During the clock the stall is resolved, the main registers receive the content from the buffer instead of the previous stage, as it is stalled for one more cycle.

The clock condition of the buffer registers can be simplified to always clocking if the delayed signal is low, at the cost of increased power draw.
This design also always incurs a stall penalty of one clock cycle if the previous stage sends an instruction for the first time since the next stage stalled, and in the same clock the stall is resolved. This can be mitigated by gating off the delayed clock signal in the boundary until the buffered registers get data.

The information if there is any data in the registers can be obtained either by checking if certain bits are zero (all zeroes in a instruction word is a nop in ForwardCom) or by saving a seperate signal in the registers.

One main advantage of this approach is the relaxed timing requirements. Except the time needed to calculate the boundary controls, stall calculation can take the entire clock cycle. The other big advantage is having only one extra signal between modules, and only between stages that are connected anyway.
One of the disadvantages is that a multiplexer is needed in what is likely to be the critical path, thus decreasing the maximum clock speed. The other is potentially needing a lot of gates as be boundary registers are doubled. I'm not sure if this applies to FPGAs as well, as registers might be implemented much denser than logic. Also, if disabling/gating a clock for a register is not supported, workaround logic is needed.

Webenson Diolen · Post by **Webenson Diolen** » 2020-05-19, 19:50:43

Hello Agner, I have never build a pipelined CPU nor a FPGA-Implementation of it, but to me it looks like the solution you are working on might
put you into a situation where you end up with far too much synchronisation of the pipeline stages. In the end, you are prepared for many
cornercases, but I don't know if your concept scales if your pipeline had more than a few pipeline stages.

I would try to keep the synchronisation of the pipeline stages as trivial as possible. As far as I am informed, you have some ambitions
to work into the direction of an Out-of-order machine. This means that dispatching of instructions could get quiet complex.

Why not dispatch only those instructions that are really "ready"??? If all stages in the pipeline are time constant, and this independent of their input,
there are very little cornercases. Now the only cornercase left is that you simply have no ready instruction to dispatch.

I have made a little drawing of my idea, but I am not able to insert the image into my post here. Nothing great, but enough to get the
intended control flow.

Now for a moment please consider this situation: There are essentially only two signals between the pipeline stages and the only way they
travel is upward from the dispatching unit. One big data-field that the dispatcher pushes upward. It contains everything for each
pipeline stage to do its task. And one bubble signal. If the dispatcher pushes a bubble, thats a bubble. Else the data in the data-field will
be processed in the stages. Of course, the dispatcher must be able to encode a signal which of the stages are to process on each travelling block.

You can make this data-field very wide, lets say the size of 192bits for the content of 3 registers of 64bits each and some more data to
specify the instruction that travels through the pipeline. If you want to go superscalar, expand this values and schedule on specific
function units.

From my perspective its much easier to bundle a set of data and pass it through a pipeline only if you are really ready to process it.
Because the dispatcher might know what he dispatched a cycle earlier, he can dispatch the following instructions with a reference to
the result that will be calculated ealier, but isn't yet through the full pipeline lenght.

This involves shadow-registers in some constallations, like two addition instructions on the same arithmetic unit in direct sequence, but in my
opinion this drawback is no bad, since you get rid of some special features of register renaming in parallel. Maybe this will be an issue
some time later.

The cornercase of jumps, branches and calls was not forgotten. But it is still the same problem. If a branch takes effect, this is the very
moment a signal must be sent into the opposite direction to the dispatching unit. So the pipeline stage that checks the condition for jumps,
needs to have a wire back to the dispatcher. If your branch predictor was wrong, you flush the pipeline now.

I hope that my written input is helpful to you and that you can find a solution that you can scale to your bandwidth needs.

Post by **agner** » 2020-05-20, 5:04:43

Thank you for your responses.

The problem I am trying to tackle is if some stage in the pipeline is stalled for whatever reason, it has to stop all preceding stages in the pipeline immediately. Kulaskos idea of a buffer between each stage would work, but it would require a lot of extra flip flops and multiplexers. Each instruction may contain up to three 64-bit operand values, as Webenson correctly observes, as well as mask and operation type.

Webensons idea presupposes that we have multiple execution pipelines and a dispatcher that knows which pipeline is vacant. The problem is if an unexpected stall occurs at a late stage in the pipeline, for example a memory write contention. We need a method for predicting such stalls and stop feeding the pipeline. I think we may have to invest some extra logic in predicting such stalls one clock cycle before they occur. But would such logic be bulletproof?

I have actually thought about the idea of multiple pipelines. We may have one pipeline for vector register instructions and two pipelines for integer instructions. One of these could be a fast track pipeline for simple integer instructions with no memory operands. This would include most loops and branches and incrementing loop counters. The short fast track pipeline would reduce the delay of branches by making it known in a few clock cycles whether it will branch or not.

Webenson suggests that we can service, for example, consecutive additions without register renaming. My idea for this is that the decoder assigns a tag to each instruction and puts the same tag in the register file on the register that is going to receive the result of this instruction. For example, if we have R3 = R1+R2; R5 = R3 + R4 we will have the tag of the first instruction on R3. The second instruction will have the tag for R3 rather than the value of R3 while it is traveling through the pipeline. It will watch the result bus and sample the value of R3 when it sees a matching tag. The register file will store the new value of R3 at the same time when the matching tag appears on the result bus.

Webenson Diolen · Post by **Webenson Diolen** » 2020-05-20, 12:04:52

The big problem that might occur is that there will be memory violations. But in that event, you will want to stop the processing of the program
anyway. If the reads are checked for their privilege before dispatching the instructions and the writes are checked for their privilege before
leaving the write queue, that I assume will be there anyway, everything will be fine.

With contention you mean, that the dispatcher puts your machine into a state where you get confronted with insufficient read/write-ports
to resolve a current machine state? Something like this?

I am very confident this can be made "bulletproof" because the idea behind it is, that you minimize loads after dispatching, because
essentially you have already read all data before dispatching and sending it upward the pipeline. If there is a defined point in the
pipeline every instruction does its final writeback, everything should be under control.

Register renaming, forwarding of register contents, or the reference tag scheme I mentioned will make it more complex.
But after all there is lots of hope to proof a specific design will not run in invalid states it can not recover from.

Post by **agner** » 2020-05-20, 13:35:56

There are many things that could possibly stall the pipeline:

Read and write to the same address, or the CPU is unable to check whether memory addresses are overlapping.
No vacant write buffer
Cache misses
No vacant register read port
No vacant result bus
An instruction is waiting for an operand from another pipeline or unit
No vacant bus for transferring data between different units or between different SIMD lanes
An instruction takes more than one clock cycle in the same pipeline stage, e.g. division
Hardware interrupt

It may be possible to predict some of these events one clock cycle in advance, but all of them?

Webenson Diolen · Post by **Webenson Diolen** » 2020-05-20, 16:36:22

You can work around all these things if your dispatcher has a very good "bookkeeping". The challenge is to keep it simple while still dispatching as
aggressive as possible. Some details will go hand in hand with your distinct definition of your ISA. The pipeline should be tailored for your special
instruction set and how it resolves in micro-ops.

There will be some bloat to keep track of the conditions that must be met to dispatch an instruction or not. This means control logic with small
counters. The question is if it is more economic to have one place to verrify that dispatching is safe or to relax this checking and offer
stall recovery when it goes wrong.

When instructions are not timely predictable on machine register level, my concept is not useful. If there are cases a division needs less time,
you will have no benefit of it. To check if dispatching is safe is only possible if instructions always take the same time.

I am going to turn this concept into a proof-of-concept.

forwardcom forum

Handshaking between CPU pipeline stages

Handshaking between CPU pipeline stages

Re: Handshaking between CPU pipeline stages

Re: Handshaking between CPU pipeline stages

Re: Handshaking between CPU pipeline stages

Re: Handshaking between CPU pipeline stages

Re: Handshaking between CPU pipeline stages

Re: Handshaking between CPU pipeline stages