Load From Const Array Instruction

Discussion of the ForwardCom instruction set and corresponding hardware and software

Moderator: agner

Dom324
Posts: 5
Joined: 2019-08-06, 10:40:57

Load From Const Array Instruction

Post by Dom324 »

Programs contain many read-only arrays and lookup tables. Currently they are stored in static data memory. To load from such an array, the CPU needs to calculate the address as pointer + scaled index and load the value from the dcache.
This solution requires a pointer and doesn't exploit locality very well.

It would be more efficient to have a Load From Const Array instruction that is directly followed by an array of integers or floats, effectively offloading read-only arrays from the dcache into the icache and embedding them right next to the instruction that accesses them.
Such an instruction would have an Operand Size field, an immediate giving the number of elements in the array, a register field specifying the index to read, and another register field for the destination register.

The load address would be calculated as:

Instruction pointer + Index * Operand Size
The data would then be fetched directly from the icache.
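
A minimal sketch of this load-address calculation, written in Python as a behavioural model rather than as hardware. It assumes that the instruction pointer already points to the first byte after the instruction (that is, to the start of the embedded array), that elements are little-endian unsigned integers, and that an out-of-range index is treated as an error. None of these details are fixed by the proposal, and the function name is made up for illustration.

```python
def load_from_const_array(code: bytes, ip: int, index: int,
                          operand_size: int, num_elements: int) -> int:
    """Model of the proposed load: read element `index` of the array at `ip`.

    Assumptions (not part of the proposal): `ip` is the address of the first
    array element, elements are little-endian unsigned integers, and an
    out-of-range index raises an error.
    """
    if not 0 <= index < num_elements:
        raise IndexError("index outside the embedded array")
    # Instruction pointer + Index * Operand Size
    address = ip + index * operand_size
    # The bytes come from the code image, i.e. from the instruction stream.
    raw = code[address:address + operand_size]
    return int.from_bytes(raw, "little")


# Hypothetical code image: 4 placeholder bytes standing in for the
# instruction itself, followed by a table of four 16-bit values.
table = [10, 20, 30, 40]
image = bytes(4) + b"".join(v.to_bytes(2, "little") for v in table)

value = load_from_const_array(image, ip=4, index=2,
                              operand_size=2, num_elements=4)
print(value)
```

Here the index register holds 2, so the model reads the third table entry.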

The address of the following instruction would be calculated as:

Instruction pointer + align_to_4B( Number of elements * Operand Size )
This effectively creates an unconditional jump from the beginning of the array to its end.
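
The next-instruction calculation can be sketched the same way in Python. This assumes, as an illustration only, that the instruction pointer refers to the first array element and that padding always rounds up to the next multiple of 4 bytes; the function names are hypothetical.

```python
def align_to_4b(n: int) -> int:
    """Round n up to the next multiple of 4 bytes."""
    return (n + 3) & ~3


def next_instruction_address(ip: int, num_elements: int,
                             operand_size: int) -> int:
    """Address of the instruction that follows the embedded array.

    Assumption: `ip` is the address of the first array element.
    """
    return ip + align_to_4b(num_elements * operand_size)


# Five 2-byte elements occupy 10 bytes, which is padded to 12 bytes,
# so the following instruction starts 12 bytes after the array start.
print(hex(next_instruction_address(0x1000, num_elements=5, operand_size=2)))
```

An array whose size is already a multiple of 4 bytes needs no padding, so the skip distance equals the array size in that case.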

This instruction eliminates the need for a pointer to the array and improves the locality of read-only arrays, because they are stored right next to the instruction that accesses them, which makes the array easier to cache.
This should make Lookup Tables, which are already very fast and very useful for optimization, even faster.

It also allows a trade-off of pressure and capacity between the icache and the dcache, which compilers or programmers can use to balance the load on the caches.
If a program makes heavy use of the dcache, read-only arrays can be offloaded to the icache, which decreases dcache pressure and increases its usable capacity.
If a program makes heavy use of the icache, arrays can still be implemented the standard way in order not to pollute the icache (it may still be beneficial to offload small arrays).

ForwardCom already employs a similar system for constants. Citing the ForwardCom manual, section 1.4:
"The ForwardCom design makes it possible to store constant data in instruction codes instead of constants scattered in static data memory. This reduces cache misses."

This suggestion would also work well with the pipeline design from section 8.2 of the manual, which proposes executing simple instructions right in the front end if their operands are available in the Permanent register file.
If the register that holds the array index is available in the Permanent register file, then the array lookup can be completely resolved in the front end, which should be both more energy efficient and have lower latency than doing the load in the out-of-order back end.
agner
Site Admin
Posts: 192
Joined: 2017-10-15, 8:07:27

Re: Load From Const Array Instruction

Post by agner »

Thank you for your suggestion. A problem with your proposal is that jumps are costly, especially if the pipeline is long, because they interrupt the prefetching and decoding of instructions. Another problem is that the table needs multiple copies if it is accessed from multiple points in the code. I would prefer to load the table into a vector register, if it fits. In this way, you can access the table many times after just reading it once.
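
The duplication cost can be estimated with a rough sketch like the following (Python, for illustration only). It counts table bytes only and ignores the bytes of the load instructions themselves; the 4-byte padding and the function names are assumptions.

```python
def align_to_4b(n: int) -> int:
    """Round n up to the next multiple of 4 bytes."""
    return (n + 3) & ~3


def inline_table_bytes(num_elements: int, operand_size: int,
                       access_sites: int) -> int:
    """Table bytes when every access site embeds its own padded copy."""
    return access_sites * align_to_4b(num_elements * operand_size)


def shared_table_bytes(num_elements: int, operand_size: int) -> int:
    """Table bytes when one copy lives in a shared read-only section."""
    return num_elements * operand_size


# A 256-entry table of 1-byte values that is read from three places.
inline = inline_table_bytes(256, 1, access_sites=3)
shared = shared_table_bytes(256, 1)
print(inline, shared)
```

With three access sites the embedded variant stores three times as many table bytes as a single shared copy.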
HubertLamontagne
Posts: 80
Joined: 2017-11-17, 21:39:51

Re: Load From Const Array Instruction

Post by HubertLamontagne »

It's an interesting idea, for sure. Though I think it would require a 3rd cache - I don't think it can share the instruction cache:

- The instruction cache doesn't do word addressing. It loads whole aligned 128-bit or 256-bit chunks, which are then queued into the so-called "prefetch queue" (the part of the CPU that aligns and reads individual instructions from the chunky stream).
- The table access would have to steal access cycles from the instruction cache. This means you'd need a source address selector (instruction-pointer/branch-predictor or table-access) and a special stall condition for the prefetch queue (so that it waits out the missed cycle).
- If the instruction cache hits an illegal-access exception, it has to figure out whether the illegal access was caused by fetching the instruction stream (in which case the exception happens more or less immediately) or by a lookup-table access, in which case the faulting instruction was issued quite some time ago and the CPU needs to rewind all the operations that happened in the meantime (on an out-of-order core).
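
To illustrate the first point, here is a small Python sketch of what a chunk-oriented cache would have to do to serve a single table element. The 16-byte default corresponds to a 128-bit chunk; the function name, the power-of-two chunk size, and the handling of elements that straddle two chunks are assumptions made for the example.

```python
def chunk_fetch(address: int, operand_size: int, chunk_bytes: int = 16):
    """Map a byte address onto an aligned chunk of a chunk-oriented cache.

    Returns (chunk_base, offset, crosses):
      chunk_base  address of the aligned chunk that would be delivered
      offset      byte offset of the element within that chunk
      crosses     True if the element straddles two chunks and would
                  therefore need a second fetch

    Assumption: chunk_bytes is a power of two.
    """
    chunk_base = address & ~(chunk_bytes - 1)
    offset = address - chunk_base
    crosses = offset + operand_size > chunk_bytes
    return chunk_base, offset, crosses


# A 4-byte element at 0x1006 sits inside the chunk starting at 0x1000.
print(chunk_fetch(0x1006, 4))
# A 4-byte element at 0x100E spills into the next chunk.
print(chunk_fetch(0x100E, 4))
```

The extra extraction step, and the possible second fetch, are work that an instruction cache does not normally have to do for individual data words.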

Doing this with a separate read-only-data-cache could achieve similar benefits:

- You could use IP-relative addressing to locate the data (leading to instructions like "loadreadonly R24, int32 [IP + 0x308 + R3 * 4]").
- You wouldn't need to jump because the lookup-table would be located after the function body instead of inside it.
- You wouldn't cause a 1-cycle stall on instruction issue because you wouldn't need to steal an instruction-cache cycle.
- The instruction cache would not get any more complex because it would only load instructions.
- If the read-only-data-cache is separate from the data cache, then you wouldn't have to use up a data-cache cycle either.
- Because the read-only-cache is read-only, it doesn't have to go through the memory-ordering-buffer and can happen truly out of order.
- Any kind of data could be put in the read-only-data-cache, not just lookup-tables in code, as long as you signal to the OS that your data block is read-only (so that it can purge the data from the ordinary data-cache).
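
The IP-relative variant can be modelled with a short Python sketch. The addresses, the table contents, and the helper names are invented for the example, and elements are assumed to be little-endian signed 32-bit integers. Two access sites with different IP values reach the same single copy of the table by using different offsets.

```python
def ip_relative_address(ip: int, offset: int, index: int, scale: int) -> int:
    """Effective address for an [IP + offset + index * scale] operand."""
    return ip + offset + index * scale


def load_int32(section: bytes, section_base: int, address: int) -> int:
    """Read a little-endian signed 32-bit value from a read-only section."""
    start = address - section_base
    return int.from_bytes(section[start:start + 4], "little", signed=True)


# Hypothetical layout: one table placed after the function body at 0x2308.
TABLE_BASE = 0x2308
table = [-1, 0, 1, 4, 9, 16, 25]
section = b"".join(v.to_bytes(4, "little", signed=True) for v in table)

# Site A: IP = 0x2000, offset 0x308. Site B: IP = 0x2100, offset 0x208.
addr_a = ip_relative_address(0x2000, 0x308, index=5, scale=4)
addr_b = ip_relative_address(0x2100, 0x208, index=5, scale=4)

print(hex(addr_a), load_int32(section, TABLE_BASE, addr_a))
print(hex(addr_b), load_int32(section, TABLE_BASE, addr_b))
```

Both sites resolve to the same address, so the table is stored once and no jump over embedded data is needed.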

Mainstream CPUs have dealt with this by adding more ports to the data cache, with 2-read-1-write caches and now even 3-read-1-write caches (but that design might be reaching its limits).
agner
Site Admin
Posts: 192
Joined: 2017-10-15, 8:07:27

Re: Load From Const Array Instruction

Post by agner »

The solution that Hubert proposes is already possible with the present design. ForwardCom supports a separate read-only data section addressed relative to IP. An addressing mode with [IP + offset + scaled index] is also supported. I don't remember if we have discussed this before, but it is certainly possible to make a separate cache for the read-only data section.