I recently stumbled upon this interesting, pretty old paper:
https://dl.acm.org/doi/pdf/10.1145/1162690.1162694
In short, the paper proposes storing descriptors for instruction blocks separately from the instructions themselves. It replaces the branch target buffer with a cache for these descriptors, and the front end of the processor core uses them to steer program flow.
This makes control-flow prediction largely independent of the instruction cache and simplifies identifying branches, which in turn shortens predictor training time and raises prediction throughput and accuracy.
I think most of those benefits can be achieved purely through a hardware implementation, but it would still simplify doing so, which is why I wanted to discuss it here.
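To make the idea a bit more concrete, here is a rough Python sketch of how a front end might follow block descriptors without ever looking at the instruction bytes. This is only my own illustration, not the paper's encoding: the field names (`code_addr`, `length`, `end`, `target`), the three block-ending kinds, the use of descriptor indices as branch targets, and the toy predictor are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class BlockEnd(Enum):
    FALL_THROUGH = 0  # block ends without a branch
    COND_BRANCH = 1   # block ends in a conditional branch
    JUMP = 2          # block ends in an unconditional direct jump


@dataclass(frozen=True)
class BlockDescriptor:
    # Hypothetical layout: everything the front end needs to know about a block.
    code_addr: int                 # address of the block's first instruction
    length: int                    # number of instructions in the block
    end: BlockEnd                  # how the block terminates
    target: Optional[int] = None   # descriptor index of the taken target


def walk(descriptors: List[BlockDescriptor],
         start: int,
         predict_taken: Callable[[int], bool],
         max_blocks: int) -> List[Tuple[int, int]]:
    """Return the (code_addr, length) fetch requests the front end would issue.

    Only descriptors are consulted; the instruction bytes are never decoded
    to find branches.
    """
    trace = []
    idx: Optional[int] = start
    for _ in range(max_blocks):
        if idx is None or idx >= len(descriptors):
            break  # ran off the end of the descriptor table
        d = descriptors[idx]
        trace.append((d.code_addr, d.length))
        if d.end is BlockEnd.JUMP:
            idx = d.target
        elif d.end is BlockEnd.COND_BRANCH and predict_taken(idx):
            idx = d.target
        else:
            idx += 1  # fall through to the next descriptor
    return trace


def taken_n_times(n: int) -> Callable[[int], bool]:
    """Toy stand-in for a branch predictor: predicts taken for the first n queries."""
    remaining = [n]

    def predict(_idx: int) -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return True
        return False

    return predict


# A small loop: block 1 branches back to itself, then falls through to block 2.
demo_descriptors = [
    BlockDescriptor(0x100, 4, BlockEnd.FALL_THROUGH),
    BlockDescriptor(0x110, 3, BlockEnd.COND_BRANCH, target=1),
    BlockDescriptor(0x11C, 2, BlockEnd.FALL_THROUGH),
]

demo_trace = walk(demo_descriptors, 0, taken_n_times(2), max_blocks=10)
```

With the predictor saying "taken" twice, `demo_trace` contains the first block once, the loop block three times, and the exit block once. The point of the sketch is that the branch location and type come straight from the descriptor, so the predictor is indexed by block rather than by whatever the instruction cache happens to return.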
Decoupling control flow from program code