Emulating multiple output instructions with caching

csdt · Post by **csdt** » 2018-04-13, 10:01:26

Hi Agner,

Some operations have naturally multiple outputs: like multiplication (low part and high part), division (quotient and remainder), addition (result and carry)...
As far as I understand, you want to avoid multiple outputs to simplify the hardware.

What if those multiple outputs were single output of multiple instructions:

Code: Select all

int64 r2 = mul_lo r0 r1
int64 r3 = mul_hi r0 r1

but to avoid the recomputation, we would have a "cache" within the ALU/FPU (or anywhere else) to compute this one single time.
I think this would be very interesting for divisions. And overflowing arithmetic.

The most interesting thing I can see here is: the architecture specification keeps simple having only single output instructions, but allows some implementations to optimize those pair of instructions by avoiding recomputation.

What do you think?

Post by **agner** » 2018-04-13, 10:43:43

The ForwardCom instruction set has 'mul' instructions which give the low part of a product, 'mul_hi' gives the high part, and 'mul_ex' gives double-size products as a vector. Vector elements with even-numbered position (0, 2, 4, ..) contain the low parts while odd elements (1, 3, 5, ..) contain the high parts. 'div_ex' gives the quotients in even-numbered vector elements and the remainders in odd-numbered elements. 'add_c' has the sum in even-numbered vector elements and the carry in odd-numbered elements. In other words, multiple outputs are returned as vectors in a single register.

Your proposal is possible, but it would be quite complicated to implement in hardware because the cache has to keep track of the two input registers and check if they have been changed since the first instruction.

csdt · Post by **csdt** » 2018-04-13, 13:28:40

What I like with the "caching" approach: it is completely transparent.
If it is not implemented by the hardware, the software still behaves exactly the same (only slower).
And it does not require any cross-lane communication.
After some thinking, such a caching would be beneficial only for slow operations like division (and multiplication maybe?)

That might be personal taste, but I'm not a big fan of using odd and even indices in a vector to have different meaning.

Back on the add with carry, is it really necessary to have a single instruction to get both the result and the carry?
It might be worth considering recalculating the sum to get the carry. An integer addition is pretty fast, so recalculating it would not incur too much overhead.
And at this point, would it be interesting to not have saturating instructions, but a set of instructions to do saturated arithmetic?

Let me introduce 3 instructions and give some examples:
add_hi src1 src2: compute (src1 + src2) >> width(src) (just compute the carry in fact
overflow src1 src2: if src1 is zero, return src2; if src1 is positive, return MAX_INT; if src1 is negative, return MIN_INT
overflow_u src1 src2: same as overflow for unsigned integers

If you need to check for overflow, just check for add_hi or mul_hi being non-zero
If you want to add 2 int128:

Code: Select all

// int128 A = r0 r1
// int128 B = r2 r3
int64 r4 = add(r0,r2)
int64 r6 = add_hi(r0,r2)
int64 r5 = add_add(r1,r3,r6)
// int128 R = r4 r5

If you want saturating addition (for multiplication, replace add and add_hi with mul and mul_hi):

Code: Select all

int64 r3 = add(r0,r1)
int64 r4 = add_hi(r0,r1)
int64 r2 = overflow(r4,r3)

This approach requires more instructions being executed to get the job done, but much fewer instructions in the ISA:
There is no need to have overflow check instructions, nor saturating instructions, nor *_ex instructions.

I assume those features are rare enough that the overhead has almost no impact on performance.

Would that be acceptable?

HubertLamontagne · Post by **HubertLamontagne** » 2018-04-13, 17:09:54

Afaik, most multiplications in 64bit code are 32x32->32 (int * int), 32x32->64 (int * sizeof(X) for indexing), 32x64->64 (int * size_t) or 64x64 (size_t * size_t). 64x64->128 is very rare, except in code processing ultra-larger integers (like finding that 2^77232917 − 1 is a prime). And a lot of code that needs to calculate multiplications of large numbers uses 64bit floating point instead. So I don't think it's a problem if doing a 64x64 -> 128 multiply eats up 4 regfile read port cycles instead of 2, and occupies 2 slots in the multiplier pipeline instead of 1 (it needs 2 register write port cycles in every case).

Post by **agner** » 2018-04-13, 19:03:49

csdt wrote:

Back on the add with carry, is it really necessary to have a single instruction to get both the result and the carry?
It might be worth considering recalculating the sum to get the carry. An integer addition is pretty fast, so recalculating it would not incur too much overhead.

That's possible. It would have to be a 3-input instruction because we need carry-in and carry-out.
To calculate (C + carryout) = (A + B + carryin) we would need:

Code: Select all

int64 C = add_add(A, B, carryin)
int64 carryout = add_carry(A, B, carryin)

I have made the 3-input add_add instruction optional because it may or may not be possible to implement it efficiently in hardware. The proposed add_carry instruction would probably have to be optional too.

Your two-instruction solution would work for extended multiplication, add with carry, and add with overflow check, but not for extended division.

csdt · Post by **csdt** » 2018-04-14, 8:47:03

Actually, the multiple instruction scheme is also applicable to the division (just call div and rem)
The only thing: in that case it will be slow.
That's were I started to think about caching to not pay for the second divsion (required for rem).

I wonder: is there any performance critical codes that require to compute div and rem for the same inputs?
I can think of several use-cases, but they are not performance critical...
And with such an interface, it is possible to implement caching in hardware to avoid the double computation if it is considered worth it, without breaking binary compatibility.

On the carry, that's true I forgot you also need add_add_hi (or add_carry as you called it).
Is it really difficult to implement 3 input instructions?
Isn't it taken for granted nowadays?

Post by **agner** » 2018-04-14, 9:23:01

csdt wrote:

the multiple instruction scheme is also applicable to the division (just call div and rem)

Yes, but not to extended division (divide a 64 bit integer by a 32 bit integer to get a 32 bit quotient and a 32 bit remainder).

is there any performance critical codes that require to compute div and rem for the same inputs?

Radix conversion.

forwardcom forum

Emulating multiple output instructions with caching

Emulating multiple output instructions with caching

Re: Emulating multiple output instructions with caching

Re: Emulating multiple output instructions with caching

Re: Emulating multiple output instructions with caching

Re: Emulating multiple output instructions with caching

Re: Emulating multiple output instructions with caching

Re: Emulating multiple output instructions with caching