What about optimising for faster loops?
Posted: 2020-04-17, 18:44:15
When there are relatively short code segments that are repeated many times, a lot of theoretical performance is not available. Depending on the
exact architecture there are slowdowns after a jump. To unroll loops is a common attempt to raise performance in such cases. But in general
this can have drawbacks like increasing code size and more complex loop conditions.
I would propose a special hardware feature to repeat very few instructions, but without the typical slowdown of a jump.
How could it be done?
1.A single instruction is inserted to let the control logic know that there will be a loop. This instruction includes 2 values.
The first value indicates how many following instruction words belong to the control expression of the loop. The second value indicates the
actual number of instruction words inside the loop.
2.The control logic can use this information to constantly decode the right instructions. The program counter should be halted at the beginning
of the loop. The idea is to have a very small buffer that fills itself after the start instruction of the loop and functions as a ring buffer to feed the
instruction decoder as long as the control expression to leave the loop is not met.
A number of 25 to 50 instructions in such a cache could be enough to speed most loops up that currently would be tried to unroll. And
the expense in energy usage and transistor count would be certainly far below what any form of branch prediction required for the same
use case.
exact architecture there are slowdowns after a jump. To unroll loops is a common attempt to raise performance in such cases. But in general
this can have drawbacks like increasing code size and more complex loop conditions.
I would propose a special hardware feature to repeat very few instructions, but without the typical slowdown of a jump.
How could it be done?
1.A single instruction is inserted to let the control logic know that there will be a loop. This instruction includes 2 values.
The first value indicates how many following instruction words belong to the control expression of the loop. The second value indicates the
actual number of instruction words inside the loop.
2.The control logic can use this information to constantly decode the right instructions. The program counter should be halted at the beginning
of the loop. The idea is to have a very small buffer that fills itself after the start instruction of the loop and functions as a ring buffer to feed the
instruction decoder as long as the control expression to leave the loop is not met.
A number of 25 to 50 instructions in such a cache could be enough to speed most loops up that currently would be tried to unroll. And
the expense in energy usage and transistor count would be certainly far below what any form of branch prediction required for the same
use case.