Branch instruction with a "one-hot" bitmask equality comparison
Posted: 2021-09-10, 21:36:17
This is a proposal for a branch instruction that allows flexible equality comparisons.
The proposed branch instruction would have a bitmask, where each bit corresponds to a value of a source register (for example, 32 bit bitmask corresponds to source register range from 0 to 31) - if the corresponding bit in the bitmask is set, the branch is taken.
E. g. consider bitmask "... 0010 0110":
If the source register value is equal to 0, then the branch isn't taken, because bit 0 of the bitmask is 0.
But if the source register is equal to 1, then the branch is taken, because bit 1 of the bitmask is 1.
Same for number 2, bit 2 of the bitmask is 1, so the branch is taken.
This way the proposed branch instruction can incorporate several equality comparisons into a single instruction.
The mask can be signed or unsigned, and can be represented either as an immediate (in that case it's probably 32 bit wide) or can be loaded from a register (in that case it's probably 64 bit wide).
32 bit signed range: -15 to 16
32 bit unsigned: 0 to 31
64 bit signed range: -31 to 32
64 bit unsigned: 0 to 63
If the value of the source register is out of range, then the branch is not taken.
The range is quite limited, but should be sufficient for majority of conditions. Also, the signed range doesn't have to begin at -15/-31, the beginning can be shifted around a bit.
Usage:
1. Branches like shown below can be implemented as a single instruction with a mask of "... 0010 0100 1000"
This has a couple of effects:
2. Masking off loop iterations
The bitmask can be set up such that the branch will skip certain iterations of a loop - with a mask of "...0010 0110":
When loop_iterator is equal to zero, the loop is executed. When loop iterator is equal to one or two, the loop is skipped. Three and four are executed and so on.
The bitmask can be either generated at compile time and stored as an immediate (if it is known at compile time that certain iterations will be skipped), or it can be generated at runtime and stored in a register (for example to dynamically skip some elements of an array).
While the decreased BTB pollution and I believe1 quite significantly fewer bytes per two/three/four way equality comparisons are nice, in my eyes the main improvement is the option to improve the predictability of some branches through combining them - as branch prediction creates a fundamental limit to the CPU single threaded performance, I believe ForwardCom should try to improve the predictability of branches as much as possible.
Of course, predictability can be brute forced with even stronger, larger, more complicated and costly branch predictors, but even brute forcing has its limits and for sure isn't an efficient nor elegant solution.
1. I'm no expert in ForwardCom instruction templates, and so I'm not 100% sure as in which template(s) would this instruction be implemented, and therefore I cannot say concrete numbers how many bytes less are needed.
The proposed branch instruction would have a bitmask, where each bit corresponds to a value of a source register (for example, 32 bit bitmask corresponds to source register range from 0 to 31) - if the corresponding bit in the bitmask is set, the branch is taken.
E. g. consider bitmask "... 0010 0110":
If the source register value is equal to 0, then the branch isn't taken, because bit 0 of the bitmask is 0.
But if the source register is equal to 1, then the branch is taken, because bit 1 of the bitmask is 1.
Same for number 2, bit 2 of the bitmask is 1, so the branch is taken.
This way the proposed branch instruction can incorporate several equality comparisons into a single instruction.
The mask can be signed or unsigned, and can be represented either as an immediate (in that case it's probably 32 bit wide) or can be loaded from a register (in that case it's probably 64 bit wide).
32 bit signed range: -15 to 16
32 bit unsigned: 0 to 31
64 bit signed range: -31 to 32
64 bit unsigned: 0 to 63
If the value of the source register is out of range, then the branch is not taken.
The range is quite limited, but should be sufficient for majority of conditions. Also, the signed range doesn't have to begin at -15/-31, the beginning can be shifted around a bit.
Usage:
1. Branches like shown below can be implemented as a single instruction with a mask of "... 0010 0100 1000"
Code: Select all
if( (x == 3) || (x == 6) || (x == 9) ){
.
.
}
- Higher WPI/lower number of Bytes to implement the same condition
- Fewer branches means fewer BTB entries -> less BTB pollution
- Joining branches can lead to better predictability
https://youtu.be/HG6c4Kwbv4I - in Matt Godbolts example, he had two branches with a probability of being taken roughly 50%, both branches had a ton of mispredictions (they were mispredicted 44-46% of time). But when he joined these two branches into one, their probability of being taken changed to 99% and almost all mispredictions were gone.
Proposed branch instruction gives the compiler/programmer the option to join a couple of equality comparison branches into a single branch - which I believe isn't currently possible in ForwardCom ISA, correct me if I'm wrong.
Of course, joining branches can also lead to worse predictability and can also affect the predictability of other branches in the code (for the good and also for the worse) - has there been done any research into predictability of separate/joined branches, or predictability of branches in general?
2. Masking off loop iterations
The bitmask can be set up such that the branch will skip certain iterations of a loop - with a mask of "...0010 0110":
When loop_iterator is equal to zero, the loop is executed. When loop iterator is equal to one or two, the loop is skipped. Three and four are executed and so on.
Code: Select all
loop:
if loop_counter is not in mask -> goto incrementation
.
. //do work
.
increment loop_counter
jump to top of the loop if condition satisfied
While the decreased BTB pollution and I believe1 quite significantly fewer bytes per two/three/four way equality comparisons are nice, in my eyes the main improvement is the option to improve the predictability of some branches through combining them - as branch prediction creates a fundamental limit to the CPU single threaded performance, I believe ForwardCom should try to improve the predictability of branches as much as possible.
Of course, predictability can be brute forced with even stronger, larger, more complicated and costly branch predictors, but even brute forcing has its limits and for sure isn't an efficient nor elegant solution.
1. I'm no expert in ForwardCom instruction templates, and so I'm not 100% sure as in which template(s) would this instruction be implemented, and therefore I cannot say concrete numbers how many bytes less are needed.