Author: csdt Date: 2017-01-27 10:21
The branch predictor is currently a hard part of a CPU.
I propose to help it a bit in the same way we can help the cache: tell it in advance where we want to jump.
I think this can be done by a couple of instructions:
- choose the destination from two immediates based on a condition
- unconditionally set the destination of a jump from a register
- unconditionally jump to the previously set destination
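The intended semantics of these three instructions can be sketched as a small software model (a sketch only: the names set_jmp/jmp2target and the trap behavior are my reading of this proposal, not an existing ISA):

```python
# Software model of the proposed jump-target register. The valid flag
# implements the security rule: it is cleared by every jump, and using
# jmp2target with an invalid target traps.

class JumpTargetRegister:
    def __init__(self):
        self.target = None
        self.valid = False

    def set_jmp(self, target):
        # unconditionally set the destination of a jump
        self.target = target
        self.valid = True

    def set_jmp_cond(self, cond, target0, target1):
        # choose the destination from two immediates based on a condition:
        # target0 if cond is 0, target1 otherwise
        self.target = target1 if cond else target0
        self.valid = True

    def jmp2target(self):
        # unconditionally jump to the previously set destination
        if not self.valid:
            raise RuntimeError("trap: jump target register is not valid")
        self.valid = False  # reset after each jump
        return self.target

jr = JumpTargetRegister()
jr.set_jmp_cond(0, "loop_end", "loop")
assert jr.jmp2target() == "loop_end"
```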
This can help basic control flow, as long as the program is not limited by instruction bandwidth (each jump now needs a second instruction).
And it can also help indirect calls (a virtual call through a vtable, for instance).
For safety reasons, this "special register" should be saved and restored during context switches.
For security reasons, the "special register" should be reset after each jump, whatever instruction is used for the jump, and a signal should be sent if the third instruction is executed while the "special register" is not in a valid state.
What do you think of this?
Author: Agner Date: 2017-01-27 11:45
csdt wrote:
The branch predictor is currently a hard part of a CPU.
I propose to help it a bit in the same way we can help the cache: tell it in advance where we want to jump.
The hard part of branch prediction is not to predict the target address, but to predict whether it will jump or not. The target address can be calculated at an early stage in the pipeline for direct conditional and unconditional jumps. Certain instruction formats are reserved for jump instructions so that these instructions can be identified easily. The earlier in the pipeline we can calculate the target address, the smaller the penalty for not knowing the address of the next instruction, and the smaller the need for large branch target buffers (BTB).
It is much harder to predict whether it will jump or not. A good predictor contains large pattern tables that consume a lot of silicon space.
We can reduce the branch misprediction penalty by having a short pipeline and perhaps by fetching and decoding both targets of a two-way branch.
Multiway branches are even harder to predict, but we cannot do without them.
Function returns are easy to predict. If we have separate data stack and call stack, then the call stack will usually be so small that it can reside on the chip.
Author: csdt Date: 2017-01-30 05:06
The point of this solution is to replace conditional jumps by unconditional jumps to predefined targets.
It is always possible to consider a conditional jump as an unconditional jump whose target is either the one we want or the instruction just after the jump, depending on the condition.
Doing this, we can use unconditional jumps even with branches and loops, with a predefined address (set with a previous instruction).
This makes it possible to have zero-latency branches and loops (with no need for prediction), but it requires an extra instruction per jump in the instruction stream.
Author: Agner Date: 2017-01-30 07:14
csdt wrote:
The point of this solution is to replace conditional jumps by unconditional jumps to predefined targets.
It is not very clear what you mean here. It is possible, of course, to put the jump target address in a register and then do an indirect jump to the register value. This doesn't solve the problem in current hardware designs because the value of the target register is not known at the time the next instruction has to be fetched into the pipeline. You would need to store the target address several clock cycles in advance and then have a mechanism for signalling to the instruction fetcher at what point it would have to fetch from the new address. The compiler then has to place the switch-target instruction at the right position in the instruction sequence. The distance from the switch-target instruction to the point of divergence has to fit the length of the pipeline, which depends on the specific hardware implementation. This may be possible, but I am not aware of any attempt to construct such machinery.
It is always possible to consider a conditional jump as an unconditional jump whose target is either the one we want or the instruction just after the jump, depending on the condition.
Some instruction sets have an instruction for conditionally skipping the next instruction. If the instruction that you skip is not a jump instruction, then this is in effect a conditional execution of the instruction. ForwardCom can execute instructions conditionally by using a predicate or mask. This avoids misprediction, but it doesn't save time because the time it takes to not-execute an instruction is the same as the time it takes to execute it. The reason for this is that the value of the predicate register is not known at the time the instruction is scheduled. It might be possible to construct a scheduler that checks whether the value of the predicate register is known sufficiently early to skip the instruction completely and not schedule it for execution. This would require a quite complicated hardware design, and I don't think it has ever been done.
If you have a system that conditionally skips one instruction, and the instruction you skip is a jump instruction, then you have a situation that is equivalent to a conditional jump. The instruction fetcher will not know in advance which instruction comes next.
This all boils down to the issue of pipelining. Current high-end microprocessors have a pipeline of typically 10-20 stages. This means that the processor has to know 10-20 clock cycles in advance which instruction to fetch next. If this information is not actually known 10-20 clock cycles in advance, then you cannot avoid a waste of time if you make the wrong guess, no matter how you implement the branching. Any mechanism that makes this information known to the instruction fetcher sufficiently early would be quite complicated to implement both in the hardware and in the compiler.
Author: csdt Date: 2017-01-30 09:53
Agner wrote:
csdt wrote:
The point of this solution is to replace conditional jumps by unconditional jumps to predefined targets.
It is not very clear what you mean here. It is possible, of course, to put the jump target address in a register and then do an indirect jump to the register value. This doesn't solve the problem in current hardware designs because the value of the target register is not known at the time the next instruction has to be fetched into the pipeline. You would need to store the target address several clock cycles in advance and then have a mechanism for signalling to the instruction fetcher at what point it would have to fetch from the new address. The compiler then has to place the switch-target instruction at the right position in the instruction sequence. The distance from the switch-target instruction to the point of divergence has to fit the length of the pipeline, which depends on the specific hardware implementation. This may be possible, but I am not aware of any attempt to construct such machinery.
The idea is not to help the prediction, but to "override" the prediction for a single given jump instruction. So there is no failed prediction in the same way as you cannot "mispredict" an unconditional jump to a fixed target.
Let me give you a few examples (pseudo-asm syntax).
The 3 instructions are:
- set_jmp target
- set_jmp cond target0 target1 (set target to "target0" if "cond" is 0 and to "target1" otherwise)
- jmp2target
Here is a simple loop:
(prolog)
loop:
%r01 = sub.i32 %r01 1
set_jmp %r01 loop_end loop
(loop body)
jmp2target
loop_end:
(epilog)
So basically, the jump is always taken, and you know exactly the address of the jump in advance.
Of course, you need to know when to load the instructions at the target address, but this can be handled by the branch predictor like a regular unconditional jump to a fixed address.
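Under these proposed semantics (assumed here, since there is no hardware implementation), the loop above can be simulated in software to check that every jump target is fully determined before the jump executes:

```python
# Software simulation of the pseudo-asm loop: set_jmp %r01 loop_end loop
# chooses the next target (loop_end if %r01 is 0, loop otherwise), and the
# jump at the end of the body is always taken to the pre-chosen target.

def run_loop(n):
    r01 = n
    trace = []          # record each resolved jump target
    pc = "loop"
    while True:
        if pc == "loop":
            r01 -= 1                                    # %r01 = sub.i32 %r01 1
            target = "loop" if r01 else "loop_end"      # set_jmp %r01 loop_end loop
            # (loop body runs here, target already known)
            trace.append(target)                        # jmp2target
            pc = target
        elif pc == "loop_end":
            return r01, trace                           # (epilog)

r01, trace = run_loop(3)
assert r01 == 0
assert trace == ["loop", "loop", "loop_end"]
```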
The same applies to if conditions:
%r01 = (compute the condition)
set_jmp %r01 if_end if_then
(some unrelated computation)
jmp2target
if_then:
(conditional computation)
if_end:
(unconditional computation)
Of course, if you set the target too late, you will have to wait, so you need to execute the "set_jmp" early enough, but the same problem arises with data prefetching anyway.
Author: Agner Date: 2017-01-31 01:40
csdt wrote:
The 3 instructions are:
- set_jmp target
- set_jmp cond target0 target1 (set target to "target0" if "cond" is 0 and to "target1" otherwise)
- jmp2target
Here is a simple loop:
(prolog)
loop:
%r01 = sub.i32 %r01 1
set_jmp %r01 loop_end loop
(loop body)
jmp2target
loop_end:
(epilog)
So basically, the jump is always taken, and you know exactly the address of the jump in advance.
OK. Now I think I understand you.
Your idea is somewhat like an extension to a "delay slot". Some old RISC architectures (MIPS, SPARC, OpenRisc) have a delay slot of one or two instructions after a jump. The instructions in the delay slot are executed before the jump is executed. The main problem with these delay slots is that they are optimized for a particular hardware implementation. A new implementation with a longer pipeline and more parallelism will need a longer delay slot.
This makes me wonder if it would be useful to have a variable-length delay slot. A branch instruction could have an 8-bit field indicating a delay slot of 0-255 instruction words. The compiler will place the delayed branch instruction as early in the code as it can. The instruction fetcher in the CPU will then know in advance what to fetch after the branching point. If the fetcher has already fetched past the branching point at the time the delayed branch is executed then it has to discard the extra instructions and fetch again. This will give a bubble in the pipeline, but this bubble will be smaller the longer the delay slot is.
There are known problems with delay slots: Interrupts, debug breakpoints, and jumps in the delay slot lead to extra complications.
We might overcome these problems by making the delayed jump instruction a "prediction hint" to be confirmed later by a regular branch instruction at the actual branching point. Only in the rare case that the prediction hint is lost or compromised due to interrupts or whatever will we have a misprediction at the branch point.
The delay slot has to be quite long. Modern microprocessors can execute 3 or 4 instructions per clock cycle and have a pipeline of 10-20 stages. With 3 instructions per clock on average and a 10-stage pipeline we would need 30 instructions to fill the delay slot. If no other jump instructions are allowed in the delay slot then we would rarely be able to make a sufficiently long delay slot.
Now, back to your proposal. I understand that you want to have a separate register for jump targets only. We would have to set this jump target register some 30 instructions in advance in order to keep the pipeline full. I think it is unlikely that we have no other branch instructions in this 30 instruction space, so we need more than one jump target register. If we need many jump target registers, then we might as well use the general purpose registers for jump targets instead. So, now your set-jump-target instruction just becomes ordinary instructions that calculate a jump target and put it into a register. And your jump-to-target instruction becomes a register-indirect jump. Now the problem becomes how to make the register value available to the branch prediction front end of a superscalar processor. The instructions that set the jump target are executed in the out-of-order back end, while the branch predictor operates in the in-order front end.
This makes me think, is it possible to execute simple instructions already in the in-order front end of a superscalar processor? If we can execute the instructions that calculate the jump target already in the in-order front end, then we have reduced the necessary delay to a few clock cycles. I am imagining a dual execution system: We have a system of in-order ALUs and out-of-order ALUs. Simple instructions are executed in the in-order ALUs if the operands are available already while the instruction is decoded in the in-order front end. If the operands are not available yet, then the instruction is passed on to the out-of-order back end.
Here, we need a system to track all the general purpose registers to see if they are ready. We can have a "register cache" in the in-order front end. Every time the decoder sees an instruction that writes to a register, it will mark the entry for this register as dirty and remember which instruction it was. When this instruction is later executed (in order or out of order) the value is written to the register cache and marked as clean (unless another instruction writing to the same register has been encountered in the meantime). Simple instructions can execute in the in-order system if all their input operands are ready. The result is stored both in the register cache and passed on to the register renamer/allocater.
Branch instructions, jumps, calls, etc. should execute in the in-order front end. The compiler should make sure that any registers that the branch depends on are calculated as early as possible, and preferably using simple instructions that can execute in the in-order front end. If a branch depends on a memory operand, such as a jump table or table of function pointers, then the compiler should load the memory operand into a register as early as possible and then make a register-indirect jump.
A simple loop with a loop counter could benefit from this. The increment and compare instructions for the loop counter will be simple instructions that are likely to execute in the in-order front end so that the values are available early.
Multiway branches are particularly difficult to predict, and they could benefit from a system that calculates the target address in advance. For example, a byte code interpreter typically loads one byte at a time and uses this code as index into a multiway jump table. The software can easily read the next byte code in advance, fetch the target from the jump table and store it in a register.
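This software technique can be sketched in Python with an invented three-opcode bytecode (PUSH/ADD/HALT are illustrative, not from any real interpreter): while one handler runs, the dispatch target of the next opcode is already fetched from the jump table.

```python
# Byte-code interpreter that reads the next opcode in advance and looks up
# its handler in the multiway jump table before jumping, as described above.

def run(code):
    stack = []

    def op_push(pc):                 # opcode 0: push immediate
        stack.append(code[pc + 1])
        return pc + 2
    def op_add(pc):                  # opcode 1: add top two values
        b, a = stack.pop(), stack.pop()
        stack.append(a + b)
        return pc + 1
    def op_halt(pc):                 # opcode 2: stop
        return None

    table = {0: op_push, 1: op_add, 2: op_halt}   # the multiway jump table
    pc = 0
    handler = table[code[pc]]        # target of the first dispatch, known early
    while True:
        next_pc = handler(pc)
        if next_pc is None:
            return stack[-1]
        handler = table[code[next_pc]]   # prefetch the NEXT dispatch target
        pc = next_pc

# PUSH 2, PUSH 3, ADD, HALT
assert run([0, 2, 0, 3, 1, 2]) == 5
```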
This is an interesting discussion. The software may be able to calculate branch targets early in many cases, but the intricacies of out-of-order execution make it complicated to forward this information to the in-order front end. Whatever solution we may come up with might be complicated, but still use less silicon space than the complicated branch prediction mechanisms that are used in modern microprocessors today.
Author: csdt Date: 2017-01-31 04:31
We now understand each other.
The solution I propose shares some features with a delay slot, but solves the main issue: the delay slot is implementation-dependent.
In my solution, if you set the target too late, you will just wait and create a bubble in the pipeline, but the CPU will execute what you expect.
Agner wrote:
There are known problems with delay slots: Interrupts, debug breakpoints, and jumps in the delay slot lead to extra complications.
For interrupts, the target must be saved and restored at the end of the interrupt handler. (You will probably lose the benefit of setting the target early, but there is no undefined behavior here.)
For jumps in the "delay slot" (between setting the target and the actual jump), my first idea was to forbid them, and to trap on "jmp2target" when it happens.
The way to detect this is to have a flag in the jump register indicating an invalid value. If we try to jump while this flag is set -> trap.
This flag is set by default and set again every time a jump happens, whatever instruction caused it. The flag is cleared when we define a target with "set_jmp".
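This invalidation rule can be modeled in a few lines (a software sketch of the proposed behavior; "trap" is returned here instead of raising a hardware signal, and the flag is modeled as a valid bit rather than an invalid bit):

```python
# Model of the flag semantics: the jump register is invalid by default,
# becomes valid on set_jmp, and is invalidated again by ANY taken jump.
# A jump in the "delay slot" therefore makes the next jmp2target trap.

class JumpState:
    def __init__(self):
        self.valid = False          # invalid by default
        self.target = None

    def set_jmp(self, target):
        self.target = target
        self.valid = True           # flag cleared: target is usable

    def any_jump_taken(self):
        # every jump, whatever instruction caused it, resets the flag
        self.valid = False

    def jmp2target(self):
        if not self.valid:
            return "trap"
        self.any_jump_taken()       # jmp2target is itself a jump
        return self.target

s = JumpState()
s.set_jmp("if_then")
s.any_jump_taken()                  # an ordinary jump in the "delay slot"
assert s.jmp2target() == "trap"     # the stale target is never used
```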
Of course, these are just ideas and are adjustable.
Agner wrote:
Multiway branches are particularly difficult to predict, and they could benefit from a system that calculates the target address in advance. For example, a byte code interpreter typically loads one byte at a time and uses this code as index into a multiway jump table. The software can easily read the next byte code in advance, fetch the target from the jump table and store it in a register.
It was exactly this kind of code that made me start thinking about this subject (vtables are basically jump tables).
Now, I should mention that I'm not a hardware guy, but I have taken advanced computer architecture courses. (Basically, I know how it behaves, but I don't know how to implement it.)
I think it would be better to have separate registers that can live in the front-end world, with the "set_jmp" instruction as the interface between the back-end and front-end worlds.
And in order to know whether a "set_jmp" is still in flight when the front end has to fetch the after-jump instruction, we can have a flag in the special register, set by the front end, telling that an update of this register is pending.
In this case, you have to wait for the register update from the back end. To speed things up further, the back end could execute this instruction with priority.
Maybe you could add regular branch prediction for the cases where you would have to wait. You would still benefit from knowing the jump target by avoiding a flush of the whole pipeline (only a part of it), but this could be much more complicated (I have no clue on this point).
I think it depends if you take this approach as the only way to branch, or if it is a complementary mechanism to branch.
I wonder about having multiple targets at the same time. It would be very handy for nested loops and nested control-flow in general, but some security issues arise.
For instance, with only one jump target, it is easy to detect a jump between "set_jmp" and "jmp2target" (see the beginning of this post). But with several of them, it becomes hard to detect without removing the benefit of having multiple jump targets.
I hope my explanations are understandable enough.
Author: Hubert Lamontagne Date: 2017-02-01 01:54
Agner wrote:
[...]
This makes me think, is it possible to execute simple instructions already in the in-order front end of a superscalar processor? If we can execute the instructions that calculate the jump target already in the in-order front end, then we have reduced the necessary delay to a few clock cycles. I am imagining a dual execution system: We have a system of in-order ALUs and out-of-order ALUs. Simple instructions are executed in the in-order ALUs if the operands are available already while the instruction is decoded in the in-order front end.
That is sounding a lot like a Decoupled Access Execute architecture. Some documents about this:
https://cseweb.ucsd.edu/classes/wi09/cs ... oupled.pdf
http://personals.ac.upc.edu/jmanel/pape ... der CPUs)
http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf
http://spirit.cs.ucdavis.edu/pubs/conf/dynamicarch.pdf
One general problem with that kind of architecture is that it's hard to keep them simple, and they tend to end up with designs that are more complex than simpler OOO CPUs (in which case it's hard to argue against going with the OOO design).
If the operands are not available yet, then the instruction is passed on to the out-of-order back end.
Here, we need a system to track all the general purpose registers to see if they are ready.
A lot of simpler Out-Of-Order CPUs do something like this for memory address operands... For instance, the classic Athlon initiates memory loads and stores in-order. Due to this, instructions that generate memory addresses and don't depend on loads/stores will generally run quite early in the execution stream, and the readiness of the operands gets tracked by the scheduler. Maybe you could add a priority flag to instructions that affect control flow so that the scheduler will run them early (and possibly instructions that affect load/store addresses too).
We can have a "register cache" in the in-order front end. Every time the decoder sees an instruction that writes to a register, it will mark the entry for this register as dirty and remember which instruction it was. When this instruction is later executed (in order or out of order) the value is written to the register cache and marked as clean (unless another instruction writing to the same register has been encountered in the meantime). Simple instructions can execute in the in-order system if all their input operands are ready. The result is stored both in the register cache and passed on to the register renamer/allocater.
Branch instructions, jumps, calls, etc. should execute in the in-order front end. The compiler should make sure that any registers that the branch depends on are calculated as early as possible, and preferably using simple instructions that can execute in the in-order front end. If a branch depends on a memory operand, such as a jump table or table of function pointers, then the compiler should load the memory operand into a register as early as possible and then make a register-indirect jump.
A simple loop with a loop counter could benefit from this. The increment and compare instructions for the loop counter will be simple instructions that are likely to execute in the in-order front end so that the values are available early.
Multiway branches are particularly difficult to predict, and they could benefit from a system that calculates the target address in advance. For example, a byte code interpreter typically loads one byte at a time and uses this code as index into a multiway jump table. The software can easily read the next byte code in advance, fetch the target from the jump table and store it in a register.
This is an interesting discussion. The software may be able to calculate branch targets early in many cases, but the intricacies of out-of-order execution make it complicated to forward this information to the in-order front end. Whatever solution we may come up with might be complicated, but still use less silicon space than the complicated branch prediction mechanisms that are used in modern microprocessors today.
Afaik, the case that really kills performance is this: Somewhat predictable branches that depend on memory operands (especially rarely used memory values that are loaded late and come in from L2 or later).
In C++ code this shows up as checks on rarely used or rarely changed flags to trigger rare special cases - for instance, logging errors on exceptional cases. This leaves you with jumps that you're 99.99% sure will never be taken but that will still take dozens of cycles to retire. I'm not sure I've seen any other solution to this than speculation and branch prediction (aside from the Transmeta Crusoe's strange backup-restore of the whole CPU state and Itanium's "insert omniscient compiler here" scheme)... and this mess has a kinda massive cost as far as I can tell (need to back up the result of every single instruction going through the pipeline, need a large register file, need aggressive register renaming due to having so many renames per cycle).
Author: Agner Date: 2017-02-01 03:12
Hubert Lamontagne wrote:
That is sounding a lot like a Decoupled Access Execute architecture.
Thanks a lot for the links. Yes, this resembles what I am thinking about. The papers you are referring to are from the 1990s. I found a few newer articles on the topic, but no real progress:
http://www.jarnau.site.ac.upc.edu/isca12.pdf
http://www.diva-portal.org/smash/get/di ... FULLTEXT02
These articles still don't solve the problem of how to split the instruction stream into a control stream and a compute stream. This makes me think of the following idea. Let the compiler insert an instruction telling which registers are part of the control stream. For example, if register r1 is used as a loop counter, and r2 is controlling a branch, then tell the microprocessor that r1 and r2 should be privileged. Any instruction that has r1 or r2 as destination will go into the control stream. These instructions, as well as all branches, jumps and calls will be executed in the in-order front end as far as possible.
The front end should have access to the permanent register file. The back end may have out-of-order execution with register renaming. The renamed registers will be written to the permanent register file when they retire. We have to keep track of whether the permanent registers are up to date. When the decoder sees an instruction that writes to a register, it will mark this register as invalid in the permanent register file and remember which instruction it belongs to. When this instruction is executed (whether in the front end or the back end), the register entry is marked as valid, unless another instruction writing to the same register has been seen in the meantime.
The main problem we are trying to solve with this design is that the very long pipeline of superscalar processors is bad for branch instructions. The solution with a branch target register puts the branch in the front end, but not the calculation of the branch target. We would need perhaps 30 or 50 instructions between the calculation of the branch target and the branch itself in order to keep the pipeline full. To reduce this distance, we need to also put the calculation of the branch target, and everything that the branch depends on, in the front end as well. This includes not only branch targets, but also loop counters, branch conditions, indexes into jump tables, and all preceding calculations that these depend on. A separate set of registers and instructions for all this would be too complicated. I think it is easier to use the existing registers and tell the microprocessor which registers are part of the control stream. This wouldn't be too hard to implement in a compiler, I think. The question is how efficient we can make it in the front end.
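The proposed steering rule can be sketched as a decoder-side filter (a sketch under stated assumptions: the register-marking hint instruction is hypothetical, and instructions are modeled as simple (dest, srcs, is_branch) tuples):

```python
# Steering sketch: the compiler marks registers that feed control flow
# (here r1 and r2), and the decoder sends every instruction writing a
# marked register -- plus all branches -- to the in-order front end.

CONTROL_REGS = {"r1", "r2"}        # declared by a hypothetical hint instruction

def steer(program):
    front, back = [], []
    for insn in program:
        dest, srcs, is_branch = insn
        if is_branch or dest in CONTROL_REGS:
            front.append(insn)     # control stream: in-order front end
        else:
            back.append(insn)      # compute stream: out-of-order back end
    return front, back

program = [
    ("r1", [],           False),   # r1 = 100       (loop counter init)
    ("r4", ["r5"],       False),   # compute stream
    ("r1", ["r1"],       False),   # r1 = r1 - 1    (control stream)
    ("r4", ["r4", "r6"], False),   # compute stream
    (None, ["r1"],       True),    # branch on r1   (control stream)
]
front, back = steer(program)
assert len(front) == 3 and len(back) == 2
```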
Author: Hubert Lamontagne Date: 2017-02-01 17:21
There are two different ways to select values that can be sent to back-end:
- Value is never used in any memory address calculation
- Value is never used in any branch address calculation
They are actually somewhat orthogonal:
- A CPU can have a front-end that runs in-order but has register renaming and can be rolled back, which allows conditionals on the back-end but not memory index calculations
- A CPU can have a front-end that supports reordering but not speculative execution (all writebacks must be real!), allowing back-end indexing but not conditionals
Non-indexing and non-conditional values can be marked in SSA form: starting from the bottom, any value that gets "garbage collected" (goes out of scope, runs out of destination operations) without ever going to a memory load/store address or branch/jump-address can be marked, which in turn allows marking preceding values as non-indexing and/or non-conditional. The proportion of non-indexing and non-conditional values depends on the program (the test I did ended up with about 80% non-indexing values and 60% non-conditional values but this might not be typical at all). It might be very hard to perform this kind of marking through function calls or complex control flows though.
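The bottom-up marking described above can be sketched for the non-conditional case (the instruction format (dest, srcs, kind) is invented for illustration; the non-indexing case would work the same way with loads/stores in place of branches):

```python
# Backward marking over an SSA-like instruction list: a value is
# "conditional" if it feeds a branch directly or through other
# conditional values; everything else can be marked non-conditional.

def mark_non_conditional(insns):
    # insns: list of (dest, srcs, kind) with kind in {"alu", "branch"}
    conditional = set()
    for dest, srcs, kind in reversed(insns):
        if kind == "branch" or dest in conditional:
            conditional.update(srcs)   # these inputs feed control flow
    all_values = {d for d, _, _ in insns if d}
    return all_values - conditional

insns = [
    ("a", [],    "alu"),
    ("b", ["a"], "alu"),             # feeds the branch -> conditional
    ("c", ["a"], "alu"),             # never reaches a branch
    (None, ["b"], "branch"),
]
assert mark_non_conditional(insns) == {"c"}
```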
Author: Agner Date: 2017-02-02 01:11
Hubert Lamontagne wrote:
They are actually somewhat orthogonal:
- A CPU can have a front-end that runs in-order but has register renaming and can be rolled back, which allows conditionals on the back-end but not memory index calculations
- A CPU can have a front-end that supports reordering but not speculative execution (all writebacks must be real!), allowing back-end indexing but not conditionals
My intention is clearly number 2. The purpose is to shorten the branch misprediction delay to probably 3 clock cycles: fetch, decode, execute. The front end may allow reordering in the sense that an instruction can wait for an operand without delaying subsequent independent instructions. But it should not have register renaming in the front end.
I am focusing on branching here, which is clearly a weak spot in current superscalar designs. I am not intending to use this as a way of prefetching memory operands. The out-of-order mechanism is able to handle delayed memory operands satisfactorily.
My vision is this: The compiler is indicating which registers it intends to use for control flow. For example, it may use register r1 for a loop counter, r2 for a branch condition, and r3 for an index into a jump table. All instructions that modify r1, r2, and r3 will execute in the front end if possible. The remaining instructions are sent to the back end. Instructions that have any of these registers as input operand will have the register operand replaced by its value before it is sent to the back end. Other register operands that are available in the permanent register file can likewise be replaced by their values. Instructions that execute in the front end may write their result not only to the permanent register file but also to any pending instruction that needs it.
The result of this mechanism is that a loop that is controlled by the marked registers only will be unrolled in the front end. The instructions in the loop body will be passed on to the back end with the loop counter replaced by its value. The back end has register renaming so that it can execute the body of the second iteration before it has finished the body of the first iteration.
All other control flow, such as branches, jumps, calls, returns, etc., will be unrolled in the front end as well so that control flow is effectively resolved before execution as far as it does not depend on delayed data.
The demand for advanced branch prediction is somewhat relaxed if we can drastically reduce the branch misprediction delay in this way. The branch prediction machinery in modern processors is very complicated and requires a lot of silicon space for branch target buffers and pattern/history tables. We can cut down on this and use a simpler branch prediction scheme if branches are resolved in the front end.
The front end may fetch and decode speculatively, but not execute speculatively. It may even fetch both sides of a branch that it expects to have poor prediction. Fetching both sides of a branch becomes cheaper here than with a traditional superscalar design because it only needs to duplicate two pipeline stages (fetch and decode) in order to minimize branch misprediction bubbles.
The fetch and decode stages should have higher throughput than the rest of the pipeline so that they can catch up and fill the pipeline after a branch bubble. The fetch stage should be able to fetch a large block of code in one clock cycle, possibly crossing a cache line boundary. The decoder should be able to decode the first one or two instructions in a fetched code block in the first clock cycle, and determine the starting points of the remaining instructions. This will allow it to decode maybe four or more instructions in the second clock cycle after fetching a block of code. The aim of this is to keep the branch misprediction delay as low as possible.
In case the front end ALU is idle because there are no instructions that write to marked registers, it may execute any other instructions if their operands are ready.
Author: Agner Date: 2017-02-14 13:20
This idea about decoupling the control flow from execution is now included as a proposal in the manual version 1.06 (chapter 8.1).
http://www.agner.org/optimize/forwardcom.pdf
Author: Hubert Lamontagne Date: 2017-01-31 01:55
csdt wrote:
[...]
The idea is not to help the prediction, but to "override" the prediction for a single given jump instruction. So there is no failed prediction in the same way as you cannot "mispredict" an unconditional jump to a fixed target.
Let me give you a few examples (pseudo-asm syntax).
The 3 instructions are:
- set_jmp target
- set_jmp cond target0 target1 (set target to "target0" if "cond" is 0 and to "target1" otherwise)
- jmp2target
[...]
So basically, the jump is always taken, and you know exactly the address of the jump in advance. Of course, you need to know when to load the instructions at the target address, but this can be handled by the branch predictor like a regular unconditional jump to a fixed address.
[...]
You'd need to compute the conditional at minimum ~10 instructions before the jump, even in the best case on the shortest pipelines (and with a bypass path straight from the register bypass network to instruction cache address input!). For instance, with a 3-instruction-per-cycle cpu:
set_jmp %r01 loop_end loop
nop ; loads on same cycle as the jump, conditional execution impossible
nop ; loads on same cycle as the jump, conditional execution impossible
nop ; starts loading while the jump is coming out of instruction cache
nop ; starts loading while the jump is coming out of instruction cache
nop ; starts loading while the jump is coming out of instruction cache
nop ; starts loading while the instruction aligner forms a 3 instruction group
nop ; starts loading while the instruction aligner forms a 3 instruction group
nop ; starts loading while the instruction aligner forms a 3 instruction group
nop ; starts loading while the register renamer is mapping %r01
nop ; starts loading while the register renamer is mapping %r01
nop ; starts loading while the register renamer is mapping %r01
jmp2target ; earliest instruction that could execute conditionally in the MOST optimistic case imaginable (but real CPUs have way more front-end latency cycles)
If the conditional value doesn't depend on memory, and it doesn't depend on results very low down the computation chain, then this could theoretically work - for instance this could probably work on loop indexing (if LLVM can hoist the conditional all the way up the SSA), especially if the branch predictor serves as a fallback.
I guess this could also sorta make sense for floating point or SIMD conditional branches (which often have scheduler penalties and slow value hoisting paths).
Also, you don't need to calculate the branch target early (that can be supplied by the branch target predictor), only the conditional value. So maybe instead of a special jump, you could simply do this by letting the branch predictor have some special port that can read flag values early, which means that it could get 100% prediction if the flag values have been "cold" for long enough when it reaches the jump.