Separate call stack and data stack
Posted: 2020-05-13, 12:32:27
I have written in the specifications that it is recommended to have two separate stacks - a call stack for return addresses, and a data stack for local data. This was for security reasons to prevent buffer overflow attacks. Now I have found another advantage of separate stacks.
While making the instruction fetch unit for a softcore, I discovered that it is possible to execute unconditional jumps, calls, and returns directly in the fetch unit without sending these instructions through the rest of the pipeline. As long as the fetch unit can fetch instructions from the code cache faster than the rest of the pipeline can handle them, you get jumps, calls, and returns virtually for free. But this is only possible if there is a separate call stack with its own stack pointer. This trick is not possible with a shared stack and a shared stack pointer because the stack pointer may be modified by all kinds of instructions. A return instruction would have to go all the way through the pipeline in order to check whether the stack pointer is modified by other instructions further down the pipeline. The cost is a delay of several clock cycles for every return instruction unless you have a complicated branch prediction mechanism. A link register for return addresses would certainly not be any better because it has to be pushed on the stack in all non-leaf functions.
I have made a fetch unit in an FPGA. It is fetching instructions at a rate of two 32-bit words per clock from the code cache into a buffer of 6 words. The fetch unit is able to identify unconditional jumps, calls, and return instructions in the first three instructions in the buffer. It will push or pop addresses on the call stack, and load the target address from the code cache as early as possible. It works!
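To illustrate the idea, here is a minimal software sketch of a fetch unit that resolves jumps, calls, and returns by itself and forwards only the remaining instructions to the pipeline. It is a toy model, not the FPGA design: the instruction format (tuples of an opcode name and an argument), the names `fetch` and `PROGRAM`, and the convention that a return with an empty call stack ends the program are all assumptions made for this example. It also ignores the 6-word buffer and the timing.

```python
# Toy model of a fetch unit with a separate call stack (hypothetical format).
# Each instruction is (opcode, argument); addresses are indexes into the list.
PROGRAM = [
    ("call", 3),     # 0: call the function at address 3
    ("add", "r1"),   # 1: ordinary instruction, goes to the pipeline
    ("ret", None),   # 2: return with empty call stack ends the program
    ("mul", "r2"),   # 3: start of the function
    ("jump", 5),     # 4: unconditional jump
    ("sub", "r3"),   # 5: ordinary instruction
    ("ret", None),   # 6: return to the caller
]

def fetch(code, entry=0, max_steps=1000):
    """Resolve jump/call/ret in the fetch unit; forward everything else."""
    call_stack = []   # separate call stack, touched only by call and ret
    forwarded = []    # instructions handed on to the rest of the pipeline
    pc = entry
    for _ in range(max_steps):
        op, arg = code[pc]
        if op == "jump":
            pc = arg                    # redirect fetch immediately
        elif op == "call":
            call_stack.append(pc + 1)   # push the return address
            pc = arg
        elif op == "ret":
            if not call_stack:
                break                   # assumption: this ends the program
            pc = call_stack.pop()       # pop the return address
        else:
            forwarded.append((op, arg))
            pc += 1
    return forwarded

print(fetch(PROGRAM))
```

The pipeline only ever sees the `mul`, `sub`, and `add` instructions. Because nothing except call and return touches `call_stack`, the fetch unit never has to wait for the pipeline to find out where a return goes.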
This requires that the call stack pointer is not touched by any instructions other than call and return. We may need special instructions to manipulate the call stack in the case that an exception handler needs to unwind the stack. But this is such a rare event that we can afford to flush the pipeline in this case. ForwardCom will not trap numerical exceptions anyway, as I have argued elsewhere.
The FPGA chip has a lot of on-chip RAM blocks. A single RAM block is big enough to hold a call stack of 1023 return addresses. I cannot think of any application that may require a call stack deeper than this, even with recursive functions. So we will probably never need to spill an overflowing call stack to main memory.
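As a rough sketch of what a fixed-size call stack implies, the model below holds 1023 return addresses and reports overflow and underflow instead of spilling. The class name `CallStack` and the choice to raise an error on overflow are assumptions for this example; what the hardware should actually do in that case is not decided here.

```python
class CallStack:
    """Fixed-capacity call stack, modeled on one on-chip RAM block."""

    def __init__(self, capacity=1023):
        self.capacity = capacity
        self.slots = [0] * capacity   # storage for return addresses
        self.sp = 0                   # call stack pointer (next free slot)

    def push(self, return_address):
        if self.sp == self.capacity:
            # Assumption: overflow is reported rather than spilled to memory.
            raise OverflowError("call stack full")
        self.slots[self.sp] = return_address
        self.sp += 1

    def pop(self):
        if self.sp == 0:
            raise IndexError("call stack empty")
        self.sp -= 1
        return self.slots[self.sp]

stack = CallStack()
for depth in range(1023):       # 1023 nested calls fit
    stack.push(depth)
try:
    stack.push(1023)            # one more nested call does not fit
except OverflowError as err:
    print("overflow:", err)
```

A recursive function that nests more than 1023 calls deep would hit this limit, so the case needs some defined behavior even if it is very rare.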
The advantage of early handling of call and return is so big that I think we should make separate stacks required rather than optional. Can you think of any applications where a combined stack would be necessary?