As I don't have any experience with multi-stage pipelines in HDLs, I can't really say whether your proposal is a good way of doing it. Looking at it, it should work, but I have some concerns about timing/routing: every stage has to be able to send a stall_out signal to a central module and receive a stall_in after calculation, all in the same clock cycle.
As I had to implement stalling support in my CPU simulator, I want to explain an HDL-adapted version of it as an alternative. Maybe you can take some inspiration from it:
Modules have
- inputs and outputs for instruction data
- a stall_in signal that tells them not to commit their result to the next stage in the current clock cycle
- a stall_out signal for them to tell if they are unable to use their inputs in the current cycle
The actual stall logic is implemented in the registers at the stage-boundaries. A stage boundary consists of
- the inputs from the previous stage and the outputs to the next stage
- a stall_in signal it receives from the stall_out signal of the next stage
- a stall_out signal it sends to the stall_in of the previous stage
- the main registers, which forward their contents to the next stage but only clock if stall_in is low
- a flip-flop that takes stall_in, delays it by a clock cycle and sends it to stall_out
- a second set of buffer registers, which forward their contents to the main registers and clock only once, when stall_in changes to high
- a multiplexer in front of the main registers that forwards the data input if the delayed stall signal is low, or the contents of the buffer registers otherwise
When a stall occurs, the current output from the previous stage is written to the buffer registers instead of the main ones. Because of this, the previous stage can continue normal operation for one more clock cycle. In the clock cycle in which the stall is resolved, the main registers receive the contents of the buffer instead of the output of the previous stage, because the previous stage stays stalled for one more cycle.
The clock condition of the buffer registers can be simplified to always clocking if the delayed signal is low, at the cost of increased power draw.
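To make the behaviour easier to check, here is a rough cycle-level sketch of one stage boundary in Python (not HDL, and not taken from my simulator). It uses the simplified buffer clock condition described above. The names StageBoundary and simulate, the use of None as a nop, and the stall pattern are all just assumptions for illustration.

```python
class StageBoundary:
    """Cycle-level model of one stage boundary (illustrative sketch only)."""

    def __init__(self):
        self.main = None            # main registers, visible to the next stage
        self.buffer = None          # buffer registers
        self.stall_delayed = False  # flip-flop holding last cycle's stall_in

    @property
    def stall_out(self):
        # Sent to the stall_in of the previous stage.
        return self.stall_delayed

    def clock(self, data_in, stall_in):
        """Apply one rising clock edge using the values seen during the cycle."""
        if not stall_in:
            # Multiplexer: take the buffer if the delayed stall is high,
            # otherwise take the data input from the previous stage.
            self.main = self.buffer if self.stall_delayed else data_in
        if not self.stall_delayed:
            # Simplified condition: always clock while the delayed stall is low.
            self.buffer = data_in
        self.stall_delayed = stall_in


def simulate(items, stall_pattern):
    """Feed items through one boundary; stall_pattern is the next stage's stall_out."""
    boundary = StageBoundary()
    next_index = 0
    consumed = []
    for stall_in in stall_pattern:
        # Previous stage drives its current item, or a nop (None) when done.
        data_in = items[next_index] if next_index < len(items) else None
        producer_stalled = boundary.stall_out
        # Next stage uses the main registers only if it is not stalling.
        if not stall_in and boundary.main is not None:
            consumed.append(boundary.main)
        boundary.clock(data_in, stall_in)
        # Previous stage advances unless it was told to stall this cycle.
        if not producer_stalled:
            next_index += 1
    return consumed


pattern = [False, True, True, False, False, True, False, False, False, False]
print(simulate([1, 2, 3, 4], pattern))
```

With this stall pattern every item should arrive at the next stage exactly once and in order, even though the previous stage only learns about each stall one cycle late.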
This design also incurs a stall penalty of one clock cycle whenever the previous stage sends an instruction for the first time since the next stage stalled, and does so in the same clock cycle in which the stall is resolved. This can be mitigated by gating off the delayed stall signal in the boundary until the buffer registers get data.
Whether there is any data in the registers can be determined either by checking if certain bits are zero (an all-zero instruction word is a nop in ForwardCom) or by storing a separate signal in the registers.
One main advantage of this approach is the relaxed timing requirements. Except for the time needed to calculate the boundary controls, stall calculation can take the entire clock cycle. The other big advantage is having only one extra signal between modules, and only between stages that are connected anyway.
One of the disadvantages is that a multiplexer is needed in what is likely to be the critical path, thus decreasing the maximum clock speed. The other is potentially needing a lot of gates, as the boundary registers are doubled. I'm not sure if this applies to FPGAs as well, as registers might be implemented much more densely than logic. Also, if disabling/gating a clock for a register is not supported, workaround logic is needed.