JoeDuarte wrote: ↑2018-06-07, 7:07:20
By the way, the Itanium is looking sweeter than ever given its immunity to Meltdown and Spectre. I feel like the industry needs to
move forward at a faster clip.
For Itanium, I'm going to say something controversial here: it's actually a worse architecture than x86 (!).
Itanium's idea is that the compiler schedules instructions ahead of time, and that the silicon you save gets spent on more execution units. This just doesn't work in practice. Every time a program loads a value from memory, the compiler has to guess, crystal-ball style, whether it's going to come from L1 cache (in which case it schedules the dependent instructions close to the load to minimize latency), or from L2+ cache or RAM (in which case it needs to put more independent instructions in between to avoid a stall). This decision is made by the compiler, is set in stone, and cannot vary dynamically as the program runs.
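To put rough numbers on that guess, here's a minimal Python sketch of an in-order core that simply waits whenever it reaches a consumer before the load has come back. The latencies and the names (LATENCY, stall_cycles, stalls_for_schedule) are made up for illustration; they are not measured Itanium figures.

```python
# Illustrative latencies in cycles: assumed round numbers, not real Itanium data.
LATENCY = {"L1": 2, "L2": 12, "RAM": 200}


def stall_cycles(scheduled_gap: int, actual_latency: int) -> int:
    """Cycles an in-order core waits when the consumer was placed
    `scheduled_gap` cycles after its load but the data takes `actual_latency`."""
    return max(0, actual_latency - scheduled_gap)


def stalls_for_schedule(scheduled_gap: int) -> dict:
    """Stall cost of one fixed, compile-time schedule for each place the data might be."""
    return {level: stall_cycles(scheduled_gap, lat) for level, lat in LATENCY.items()}


# The compiler has to pick one gap for all executions of the load.
tuned_for_l1 = stalls_for_schedule(LATENCY["L1"])
tuned_for_ram = stalls_for_schedule(LATENCY["RAM"])

print("tuned for L1: ", tuned_for_l1)
print("tuned for RAM:", tuned_for_ram)
```

In this model the schedule tuned for RAM never stalls, but only because it assumes the compiler can find 200 cycles of independent work to put after every load, which real code rarely offers. The schedule tuned for L1 is cheap to build and pays the full difference on every miss.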
There's an extra problem: typical code will do a bunch of calculations, then a store, then a second load of some other data item for more calculations, etc. To keep the CPU busy, the compiler has to move the second load before the store, but then it needs a safeguard in case the second load hits the same memory address as the store (otherwise the load picks up a stale value and you get all sorts of crashes). Because of this, Itanium needs a special second kind of load (the advanced load, ld.a) that records the memory address in a table (the ALAT), stores that knock matching entries out of that table, check instructions (ld.c, chk.a) that redo the load if its entry is gone, and management instructions for the table.
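Here's a toy Python model of that address table, just to show the mechanism. The class and method names are mine, memory is a plain dict, and the real hardware table has a fixed size and other details this ignores.

```python
class AliasTable:
    """Toy table tracking the addresses of loads that were hoisted above stores."""

    def __init__(self):
        self.entries = {}  # destination register name -> address it was loaded from

    def advanced_load(self, memory, reg, addr):
        """Load early and remember which address the value came from."""
        self.entries[reg] = addr
        return memory.get(addr, 0)

    def store(self, memory, addr, value):
        """Every store removes table entries that match its address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check_load(self, memory, reg, addr, early_value):
        """At the load's original position: keep the early value if the entry
        survived, otherwise redo the load to pick up the stored data."""
        if self.entries.get(reg) == addr:
            return early_value
        return memory.get(addr, 0)


def hoisted_sequence(store_addr, load_addr):
    """Program order is 'store 99, then load'; the load has been moved above the store."""
    memory = {0x10: 1, 0x20: 2}
    table = AliasTable()
    early = table.advanced_load(memory, "r5", load_addr)
    table.store(memory, store_addr, 99)
    return table.check_load(memory, "r5", load_addr, early)


print(hoisted_sequence(0x10, 0x20))  # different addresses: early value 2 is kept
print(hoisted_sequence(0x10, 0x10))  # same address: load is redone and sees 99
```

The point is that the check has to stay behind at the load's original position, so the hoisting saves latency only when the addresses don't collide, and costs an extra instruction either way.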
A third problem is that your second load may need to be hoisted above a conditional branch. Then the branch might or might not be taken, and your load might never really happen in the program logic. This is not normally a problem, except that your hoisted load might trigger a page fault. So Itanium has yet a THIRD special kind of load (the speculative load, ld.s) that doesn't page fault directly, but instead loads a poison value (NaT, "Not a Thing") if the load was going to fault. Then all your registers need an extra bit to indicate this poison value, plus management instructions for saving/restoring those bits (for interrupts etc).
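A rough Python sketch of the poison-value idea, assuming a "page fault" just means the address is missing from a dict. Reg, speculative_load, add and compute are hypothetical names, and raising PageFault stands in for the recovery code that would redo the load for real.

```python
from dataclasses import dataclass


class PageFault(Exception):
    """Stand-in for the fault that is finally delivered once the value is needed."""


@dataclass(frozen=True)
class Reg:
    """A register value plus the extra poison bit."""
    value: int = 0
    poisoned: bool = False


def speculative_load(memory, addr):
    """Load that never faults: an unmapped address yields a poisoned register."""
    if addr not in memory:
        return Reg(poisoned=True)
    return Reg(memory[addr])


def add(a, b):
    """Ordinary arithmetic propagates the poison bit to its result."""
    return Reg(a.value + b.value, a.poisoned or b.poisoned)


def compute(memory, addr, take_branch):
    # Load and add were hoisted above the branch, so they always run.
    tmp = add(speculative_load(memory, addr), Reg(1))
    if not take_branch:
        return None  # value never needed, so the fault never surfaces
    if tmp.poisoned:
        raise PageFault(hex(addr))  # the check at the original load position
    return tmp.value


print(compute({0x10: 41}, 0x10, take_branch=True))   # 42
print(compute({}, 0x10, take_branch=False))          # None, no fault raised
```

Note that the poison bit has to travel through every instruction that consumes the value, which is why it ends up as an extra bit on every register rather than a property of the load alone.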
A fourth problem is that once you have long-latency instruction sequences explicitly scheduled, loops have to be overlapped across more than one iteration (since it takes longer to get the results than to start the next iteration). Since the first and last iterations are partial, you'd end up with long prologues and epilogues, so to help with this, Itanium also has predication: basically, every instruction can be made conditional, so that you can easily run only parts of the loop the first and last times around. And since multiple overlapped iterations need to operate on different values, Itanium tacks on a register rotation engine that can dynamically rotate the names of a bunch of registers and is also used to automatically spill/refill registers to the stack on function calls.
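A toy Python version of such an overlapped loop, computing out[i] = src[i] * 2 + 1 in three stages (load, multiply, add-and-store). This is only a sketch: the function name, the three-stage split and the use of a deque as the rotating register file are my assumptions, and the stages run one after another here instead of in parallel.

```python
from collections import deque

STAGES = 3  # load, multiply, add-and-store


def pipelined_double_plus_one(src):
    n = len(src)
    out = [None] * n
    regs = deque([None] * STAGES)  # toy rotating register file

    # One kernel body, run n + STAGES - 1 times; no separate prologue or epilogue.
    for cycle in range(n + STAGES - 1):
        # Rotation: the value written as regs[0] last time is now named regs[1], etc.
        regs.rotate(1)

        def active(stage):
            """Predicate: does this stage have a real iteration to work on?"""
            return 0 <= cycle - stage < n

        if active(0):                      # stage 0 works on iteration `cycle`
            regs[0] = src[cycle]
        if active(1):                      # stage 1 works on iteration `cycle - 1`
            regs[1] = regs[1] * 2
        if active(2):                      # stage 2 works on iteration `cycle - 2`
            out[cycle - 2] = regs[2] + 1
    return out


print(pipelined_double_plus_one([5, 7]))  # [11, 15]
```

The predicates switch stages on one by one while the pipeline fills and off again while it drains, and the rotation is what lets each stage use a fixed register name while three different iterations are in flight.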
---
If this is starting to sound complex, that's because it absolutely is. It's basically most of the machinery of an OOO core, but explicitly exposed to the programmer.
And if you try to implement Itanium in an out-of-order core, you run into the problem that all your instructions are conditional, all your instructions can generate an interrupt (because of the poison values), your load instructions don't only load but also poke bits here and there in this memory aliasing table, the register rename system can randomly issue a large blob of memory loads/stores, you have tons of short-lived memory address values hammering your register file write ports since the ISA doesn't even have a [register+immediate] addressing mode, plus it has a [register]+postincrement addressing mode which turns memory ops into two-result instructions (= probably have to be split into 2 micro-ops!), plus you have to deal with possibly rolling back the register renamer + the poison bits + the memory aliasing table if a branch goes the other way than predicted.
In other words, Itanium doesn't even have x86's saving grace: that you can throw an oversized instruction decoder and a flags register engine at it and mercifully split all the braindead '80s instructions into actually reasonable micro-ops. :3