I have written in the specifications that it is recommended to have two separate stacks - a call stack for return addresses, and a data stack for local data. This was for security reasons to prevent buffer overflow attacks. Now I have found another advantage of separate stacks.
While making the instruction fetch unit for a softcore, I discovered that it is possible to execute unconditional jumps, calls, and returns directly in the fetch unit without sending these instructions through the rest of the pipeline. As long as the fetch unit can fetch instructions from the code cache faster than the rest of the pipeline can handle them, you get jumps, calls, and returns virtually for free. But this is only possible if there is a separate call stack with its own stack pointer. This trick is not possible with a shared stack and a shared stack pointer because the stack pointer may be modified by all kinds of instructions. A return instruction would have to go all the way through the pipeline in order to check whether the stack pointer is modified by other instructions further down the pipeline. The cost is a delay of several clock cycles for every return instruction unless you have a complicated branch prediction mechanism. A link register for return addresses would certainly not be any better because it has to be pushed on the stack in all non-leaf functions.
I have made a fetch unit in an FPGA. It is fetching instructions at a rate of two 32-bit words per clock from the code cache into a buffer of 6 words. The fetch unit is able to identify unconditional jumps, calls, and return instructions in the first three instructions in the buffer. It will push or pop addresses on the call stack, and load the target address from the code cache as early as possible. It works!
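As a rough software illustration of the behavior described above, the sketch below models a fetch unit that resolves unconditional jumps, calls, and returns on its own dedicated call stack, so that only other instructions ever reach the pipeline. The instruction encoding and mnemonics are invented for the example and are not ForwardCom definitions:

```python
# Minimal model of a fetch unit that resolves unconditional control flow
# locally, using a dedicated call stack. Opcodes are hypothetical.

class FetchUnit:
    def __init__(self, code):
        self.code = code          # list of (opcode, operand) pairs
        self.pc = 0               # fetch program counter
        self.call_stack = []      # dedicated return-address stack

    def fetch_next(self):
        """Return the next instruction for the pipeline, handling
        jumps, calls, and returns inside the fetch unit itself."""
        while True:
            op, arg = self.code[self.pc]
            if op == "jmp":               # unconditional jump: redirect fetch
                self.pc = arg
            elif op == "call":            # push return address, jump to target
                self.call_stack.append(self.pc + 1)
                self.pc = arg
            elif op == "ret":             # pop return address from own stack
                self.pc = self.call_stack.pop()
            else:                         # everything else goes down the pipe
                self.pc += 1
                return (op, arg)

# Tiny program: main calls a leaf function, then halts.
code = [("call", 3), ("add", 1), ("halt", 0),   # 0..2: main
        ("mul", 2), ("ret", 0)]                 # 3..4: leaf function
fu = FetchUnit(code)
issued = []
while True:
    ins = fu.fetch_next()
    issued.append(ins)
    if ins[0] == "halt":
        break
print(issued)   # only non-control instructions reach the pipeline
```

Note that the call and return never appear in the issued stream; the rest of the pipeline simply sees a straight sequence of ordinary instructions.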
This requires that the call stack pointer is not touched by any instructions other than call and return. We may need special instructions to manipulate the call stack in case an exception handler needs to unwind the stack. But this is such a rare event that we can afford to flush the pipeline in this case. ForwardCom will not trap numerical exceptions anyway, as I have argued elsewhere.
The FPGA chip has a lot of on-chip RAM blocks. A single RAM block is big enough to hold a call stack of 1023 return addresses. I cannot think of any application that would require a call stack deeper than this, even with recursive functions. So we will probably never need to spill an overflowing call stack to RAM.
The advantage of early handling of call and return is so big that I think we should make separate stacks required rather than optional. Can you think of any applications where a combined stack would be necessary?
Separate call stack and data stack
Moderator: agner
Re: Separate call stack and data stack
I can't think of an application that needs a combined stack. I am however opposed to using implementation details that don't change functionality as an argument for such a requirement. Of course, the found advantages can be noted in the ForwardCom manual and security is still a valid argument. I am not opposed to the requirement itself.
The branch prediction needed to avoid the penalty on returns shouldn't be that expensive. Returns could be identified via address or in the fetch/pre-decode/decode stage. A return address buffer would then yield the same effect as the dual-stack version, provided the core supports speculative execution and the code is well-behaved; otherwise the pipeline would have to be flushed, of course.
I agree that 1023 return addresses should be enough for just about anything, but you can blow through this pretty easily if you write an algorithm whose recursion depth scales at least linearly with problem size. Then again, I don't know if that could be considered well-behaved code, and I believe I read elsewhere in the forum or on your blog that ForwardCom emphasizes accelerating good code more than accelerating bad code.
All in all, I think that many parts of the pipeline being easy to optimize is one of the greatest strengths of ForwardCom, and I am happy to see that this trend seems to continue. I am excited to see what your softcore can do when it is finished.
- Posts: 80
- Joined: 2017-11-17, 21:39:51
Re: Separate call stack and data stack
I'm very much reminded of the register-window stack on SPARC which has similar semantics, and stores return addresses as part of its mechanism: http://icps.u-strasbg.fr/people/loechne ... stack.html
Any local variable that is stored on the stack and that isn't accessible by a pointer can be put in a second alternate stack - including register spills, most non-array variables, and indeed return addresses.
The real complexity in having a second stack for return addresses is that the OS has to empty/fill it on task switches in order to have different return stacks for each program, in addition to presumably a system return address stack (for nested interrupts). So I imagine it would behave like a small specialized RAM more than a built-in stack. If the return stack overflows or underflows, the OS can simply catch the exception and partially empty/refill the return stack, so effectively unlimited stack depth is definitely possible.
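A minimal sketch of that spill/refill idea, with a deliberately tiny on-chip depth and an invented trap interface (the real hardware/OS boundary would of course look different):

```python
# Sketch of OS-side spill/refill of a small hardware return stack.
# HW_DEPTH is kept tiny so the demo actually overflows; the real
# hardware depth and the trap mechanism are illustrative assumptions.

HW_DEPTH = 4

class ReturnStack:
    def __init__(self):
        self.hw = []             # on-chip part of the return stack
        self.spilled = []        # part the OS has moved to main RAM

    def push(self, addr):
        if len(self.hw) == HW_DEPTH:          # overflow "trap"
            half = HW_DEPTH // 2              # OS spills the oldest half
            self.spilled.extend(self.hw[:half])
            self.hw = self.hw[half:]
        self.hw.append(addr)

    def pop(self):
        if not self.hw:                       # underflow "trap"
            half = HW_DEPTH // 2              # OS refills from main RAM
            self.hw = self.spilled[-half:]
            del self.spilled[-half:]
        return self.hw.pop()

rs = ReturnStack()
for a in range(10):              # 10 nested calls, far deeper than HW_DEPTH
    rs.push(a)
unwound = [rs.pop() for _ in range(10)]
print(unwound)                   # returns still come back in LIFO order
```

The program never notices the spills; it only pays a trap cost on the rare deep excursions, which matches the "effectively unlimited depth" point above.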
- Posts: 7
- Joined: 2020-04-17, 11:04:32
Re: Separate call stack and data stack
What I like about the idea of a separate call stack and data stack is that parallel resources tend to be much faster than mixed-purpose data on the same data path. Such a feature is also more than useful for avoiding some of the pitfalls of unsafe programming.
After all, you will have to specify some way it actually gets implemented in your ISA, so it is straightforward to make it an architecture feature from the beginning. This means you have to offer some numbers and logic diagrams that every revision of ForwardCom will be bound to in order to stay compatible with the ISA. 1023 return addresses sounds like a good pick.
So how to choose the numbers and the logic? Keep it simple if you can! Simple is often faster! Let the stacks grow in only one direction for the instructions that access them. You should not add any complexity to the ISA for extended bidirectional stack access. If someone needs that in their software (I have never heard of such a case), they can write a program for it. Chances are nobody really needs it.
For the call stack, I really wish you could save not only the return address but also some kind of additional value. This value could be used for security features, for example: the called function would have no chance to tamper with a value that is checked after the return. This would be very useful for checking whether memory bounds were violated or whether the data stack is still in an expected range.
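One way such a check value could work, sketched in software: the call pushes the current data stack pointer alongside the return address, and the return traps if the callee left the data stack unbalanced. All names, values, and the trap behavior here are illustrative assumptions, not ForwardCom definitions:

```python
# Sketch: each call pushes (return address, check value); the return
# verifies the check value. Here the check value is the data stack
# pointer at call time, so a mismatch reveals a corrupted data stack.

class CheckedCallStack:
    def __init__(self):
        self.frames = []          # each frame: (return address, check value)

    def call(self, return_addr, data_sp):
        # push the data stack pointer as the tamper-check value
        self.frames.append((return_addr, data_sp))

    def ret(self, data_sp):
        return_addr, expected_sp = self.frames.pop()
        if data_sp != expected_sp:        # callee corrupted the data stack
            raise RuntimeError("data stack out of expected range")
        return return_addr

cs = CheckedCallStack()
cs.call(return_addr=0x104, data_sp=0x8000)
addr = cs.ret(data_sp=0x8000)             # balanced callee: returns normally

cs.call(return_addr=0x204, data_sp=0x8000)
trapped = False
try:
    cs.ret(data_sp=0x7FF0)                # unbalanced callee: trap
except RuntimeError:
    trapped = True
print(hex(addr), trapped)
```

Because the check value lives on the call stack, which only call and return can touch, the callee has no way to forge it.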
Re: Separate call stack and data stack
FORTH has always used a separate data stack from call stack. It's worked well for them.
w.r.t. hardware, this should make "function return prediction" somewhat easier, huh?
w.r.t. software a separate call stack can make some (modern) (functional) procedural patterns easier to implement - I'm thinking of mechanisms for tail call and general continuations. Both could be accomplished by simply rearranging arguments in registers, followed by a jump instruction (per "Lambda: The Ultimate GOTO").
There are probably very good reasons to prevent modifying the call stack except through call and return instructions, but there might be a need for a "pop N returns" instruction in order to do certain kinds of non-local return.
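A hypothetical "pop N returns" primitive might look like the following in software; the semantics are an assumption for illustration, not part of any spec:

```python
# Sketch of a "pop N returns" primitive for non-local return, assuming
# the call stack is otherwise only touched by call and return.

def ret_n(call_stack, n):
    """Discard the innermost n-1 return addresses and return to the
    n-th caller up, e.g. for a longjmp-style escape from nested helpers."""
    for _ in range(n - 1):
        call_stack.pop()         # frames being skipped over
    return call_stack.pop()      # the address actually resumed at

stack = [0x100, 0x200, 0x300]    # outer, middle, inner callers
resume = ret_n(stack, 3)         # escape three levels at once
print(hex(resume), stack)
```

With such a primitive, a non-local exit never needs to forge arbitrary call stack entries; it can only discard them, which preserves the security property of the separate stack.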
Re: Separate call stack and data stack
Still, I'm not convinced that this is a net decrease in complexity.
This creates a small new memory area, with its own addressing scheme separate from the main RAM. Since it's not affected by RAM area mapping, it needs to be integrally copied to/from main RAM during context switches. That means you need instructions to read and write from this small memory (separate from the reads/writes happening in CALL and RET instructions) so that the OS can swap it.
In an out-of-order CPU, you furthermore need to buffer all the writes to this memory area caused by CALL instructions and context-switch OS writes so that you can rewind the effect of a CALL, and read operations need to check against these potentially buffered writes.
I don't think it makes branch prediction easier, compared to the case where you have a return stack predictor in your branch predictor. Even without a special separate call stack, you'd simply treat jump-and-link instructions as pushing into the return stack, and jump-to-link-register as popping from the return stack.
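For comparison, the conventional return-address-stack (RAS) predictor described here can be sketched as follows; the depth and interface are invented for illustration, and a real predictor would also handle speculation repair:

```python
# Sketch of a return-address-stack (RAS) predictor: pushed on
# jump-and-link, popped on jump-to-link-register. The prediction is
# speculative; a mismatch with the actual target forces a flush.

class RASPredictor:
    def __init__(self, depth=16):
        self.ras = []
        self.depth = depth       # bounded: oldest entries can be lost

    def on_call(self, return_addr):
        if len(self.ras) == self.depth:
            self.ras.pop(0)      # silently drop the oldest entry
        self.ras.append(return_addr)

    def predict_return(self):
        return self.ras.pop() if self.ras else None

    def check(self, predicted, actual):
        return predicted == actual    # False would mean a pipeline flush

p = RASPredictor()
p.on_call(0x104)                 # outer call
p.on_call(0x208)                 # nested call
ok = p.check(p.predict_return(), 0x208)   # well-behaved code: a hit
print(ok)
```

The key difference from the architectural call stack is in the last line of `check`: the RAS is only a hint, so a wrong entry costs a flush rather than incorrect execution.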
I guess the upshot is that the instruction fetch unit works rather autonomously in an in-order core setting, since it removes many cases where it has a dependency on results from the execution unit. It could definitely make sense for a microcontroller design.
Re: Separate call stack and data stack
The Mill CPU also does this, for similar reasons.
I've had call stacks several thousand levels deep on multiple occasions, and it's always been due to a program bug. (Well, one time it wasn't a bug, but it was still unintended behaviour.) 1023 is plenty.
Languages that depend on deep levels of recursion usually also guarantee tail call optimization. If your algorithm isn't tail recursive, you can still get around the problem using continuation-passing style.