One flexible register

Posted: 2018-04-21, 0:07:10
by JoeDuarte
Hi Agner – Is it possible to engineer one 1024- or 2048-bit register that the programmer could subdivide almost arbitrarily into whatever combination of 16-, 32-, 64-, 80-, 128-, 256-bit, etc. registers is optimal for the program?

I don't know what a register fundamentally is as a physical or silicon entity. Something I'm curious about is the physical engineering difference between:
  • A 128-bit register that operates on four 32-bit operands.
  • Four 32-bit registers that each operate on one 32-bit operand.

Re: One flexible register

Posted: 2018-04-21, 6:01:23
by agner
The difference between a vector register and a scalar register is that you can handle the entire vector register with a single operation. If you want to add 1 to four 32-bit registers, you need four instructions. If you want to add 1 to all four elements of a 128-bit vector register, you need only one instruction. If you want to add 1 to only one of the four elements in a 128-bit register, you do the addition on the entire vector register with a mask that enables only the element you want. So, yes, it is possible to subdivide a 1024-bit register if the hardware has a maximum vector length of 1024 bits. You might actually use this as a kind of array with 32 elements of 32 bits each.
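To make the masked addition concrete, here is a minimal Python sketch that models a 128-bit register as one integer holding four 32-bit lanes. The helper names (pack, unpack, masked_add) and the mask encoding (bit i enables lane i) are assumptions made for this illustration; they do not correspond to any particular instruction set.

```python
LANE_BITS = 32                      # width of one element
LANES = 4                           # 4 x 32 bits = one 128-bit register
LANE_MASK = (1 << LANE_BITS) - 1

def pack(lanes):
    """Pack lane values (lane 0 = lowest bits) into one 128-bit integer."""
    reg = 0
    for i, value in enumerate(lanes):
        reg |= (value & LANE_MASK) << (i * LANE_BITS)
    return reg

def unpack(reg):
    """Split a 128-bit integer back into its four 32-bit lanes."""
    return [(reg >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]

def masked_add(reg, value, mask):
    """Add value to every lane whose mask bit is set (hypothetical encoding:
    bit i of mask enables lane i). Disabled lanes pass through unchanged,
    and each lane wraps around on overflow without carrying into the next."""
    out = []
    for i, lane in enumerate(unpack(reg)):
        if (mask >> i) & 1:
            lane = (lane + value) & LANE_MASK
        out.append(lane)
    return pack(out)

v = pack([10, 20, 30, 0xFFFFFFFF])
all_lanes = unpack(masked_add(v, 1, 0b1111))  # one operation, all four lanes
one_lane = unpack(masked_add(v, 1, 0b0100))   # same operation, only lane 2 enabled
```

In both calls the operation is applied to the whole register; the mask alone decides which elements change. Note that the last lane wraps to 0 in the first call instead of carrying into a neighbour, which is what keeps the lanes independent.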

I guess you want to use the 32 elements as independent registers that you can use for unrelated purposes. This is possible in principle, but it causes problems in out-of-order processors. The original Intel 8086 processor had 16-bit registers that could be divided into two 8-bit registers. For example:

MOV AX,0102H          ; AH = 1, AL = 2
ADD AL,4              ; AL = 6
ADD AH,1              ; AH = 2
MOV BX,AX             ; BX = 0206H

This worked fine until superscalar processors with out-of-order processing were invented. Some superscalar processors treat AL and AH as independent registers that can be operated on simultaneously or out of order. But when they are joined together, as in the last line, you have to wait until the in-flight AL and AH temporary registers retire into the physical register AX before you can access them as one register. This takes several clock cycles. Other hardware implementations keep AL and AH together, which avoids the penalty when they are joined, but you lose the advantage of out-of-order processing because the two halves cannot be accessed independently. Nobody foresaw this problem in the original design, because out-of-order processing had not been invented yet, but it keeps causing problems and suboptimal solutions in today's superscalar processors. That's why I designed ForwardCom so that no instruction writes a partial register and leaves the rest of the register unchanged, except for instructions intended explicitly for this.
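
The dependency behind this problem can be sketched in a few lines of Python. This is only a toy model of the register contents, not of any real pipeline: the function names are hypothetical, and zero extension is just one assumed way of writing a whole register (x86-64 does this for 32-bit writes); the thread does not say that ForwardCom works exactly like this.

```python
def write_al(ax, value):
    """8086-style partial write: replace the low byte, preserve AH.
    The result depends on the old AX, so the old value must be available."""
    return (ax & 0xFF00) | (value & 0xFF)

def write_ah(ax, value):
    """8086-style partial write: replace the high byte, preserve AL."""
    return (ax & 0x00FF) | ((value & 0xFF) << 8)

def write_low8_zero_extended(value):
    """Assumed alternative: write the whole register and zero the upper bits.
    The old register contents are not an input at all."""
    return value & 0xFF

# Replay the assembly example from the thread.
ax = 0x0102                           # MOV AX,0102H
ax = write_al(ax, (ax & 0xFF) + 4)    # ADD AL,4  -> AX = 0106H
ax = write_ah(ax, (ax >> 8) + 1)      # ADD AH,1  -> AX = 0206H
bx = ax                               # MOV BX,AX needs both halves merged

# A full-register write gives the same result whatever the register held before.
independent = write_low8_zero_extended(0x06)
```

The partial writes take the old AX as an argument, and that extra input is exactly the merge the hardware has to wait for. The zero-extending write takes no old value, so it can be renamed and scheduled freely.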