When relatively short code segments are repeated many times, much of the theoretical performance is not available: depending on the exact architecture, there are slowdowns after a jump. Unrolling loops is a common attempt to raise performance in such cases, but it has drawbacks such as larger code size and more complex loop conditions.
I would like to propose a special hardware feature that repeats a small number of instructions without the typical slowdown of a jump.
How could it be done?
1. A single instruction is inserted to let the control logic know that a loop follows. This instruction carries two values: the first indicates how many of the following instruction words belong to the control expression of the loop, and the second gives the actual number of instruction words inside the loop body.
2. The control logic can use this information to keep decoding the right instructions. The program counter is halted at the beginning of the loop. The idea is a very small buffer that fills itself after the start instruction of the loop and then acts as a ring buffer, feeding the instruction decoder as long as the exit condition of the loop is not met.
A capacity of 25 to 50 instructions in such a cache should be enough to speed up most loops that compilers currently try to unroll. And the cost in energy usage and transistor count would certainly be far below what any form of branch prediction requires for the same use case.
What about optimising for faster loops?
Moderator: agner
Re: What about optimising for faster loops?
I agree that loop unrolling can be a bad thing. It is a waste of code cache, and compilers often unroll loops excessively.
Intel CPUs do indeed have a buffer similar to what you are proposing. In fact, they have two. There is a loopback buffer that can recycle decoded instructions after the decoder in tiny loops, and there is a micro-op cache. AMD has something similar: A micro-op queue and a micro-op cache.
The need for such buffers is smaller on ForwardCom because instruction decoding should not be a bottleneck. Decoding is a serious bottleneck in x86 systems because of the legacy of very complicated instruction coding that makes it difficult to determine the length of an instruction.
The loop overhead is seldom a bottleneck in out-of-order processors when the loop body is more time-consuming than updating the loop counter; the main remaining cost is perhaps one branch misprediction after the last iteration.
I have proposed a semi out-of-order design for ForwardCom: Simple integer instructions, such as incrementing loop counters and pointers, are done in an in-order front end if the operands are ready. Branches are also resolved in the front end if the operands are ready, which a simple loop counter would always be because it is updated in a single clock cycle. These branches have short misprediction delays because the front end has only a few pipeline stages. The time-consuming floating point instructions, vector instructions, and memory operations that we typically find in the loop body are sent to an out-of-order back end. Integer register operands such as loop counters and pointers that have been calculated in the front end are replaced by their values before the instructions go to the back end. Branches and loops can be resolved a long way ahead of the more time-consuming calculations as long as they depend only on simple integer register instructions. I think that this would mask the branch misprediction delays in many cases.
A branch predictor will typically predict the backward branch at the end of a loop to be taken the first time. There will be one misprediction after the last iteration unless an advanced branch predictor can remember the loop count.
I don't think it is necessary to insert extra instructions to help predicting the loop in this case.