Proposal to drop tiny instructions

Post by **agner** » 2020-05-14, 12:58:47

At present, ForwardCom has four different instruction sizes. Most instructions use a single 32-bit code word. Instructions that need extra bits for constants, extra registers, extra option bits, etc. can use two or three 32-bit words. There is also a tiny instruction format. Two tiny instructions can be packed into a single 32-bit code word. A single tiny instruction will use 32 bits if there is no other tiny instruction that it can pair with. The tiny instructions were planned to compete with ARM Thumb instructions with no need for switching between modes.

When I make code for ForwardCom, I have noticed that most of the tiny instructions go unpaired. The situations where we have two consecutive tiny instructions that can be paired, are actually quite rare, except for the push and pop instruction pairs.

It is a design decision that no ForwardCom instruction can have more than one register output. This makes it easier for the hardware to check interdependencies, out-of-order scheduling, and to avoid the need for an extra result bus. A traditional POP instruction has two outputs: the stack pointer and the register that is popped off the stack. Therefore, the POP instruction was split into two tiny instructions that output one register each. PUSH instructions were split in a similar way to reduce complexity.

Now I am thinking that if the decoder can split a tiny instruction pair into two instructions, it might as well be able to split a word-size POP instruction into two micro-operations (µops). The extra complexity needed to handle the tiny instruction format is actually more than the complexity needed to split a POP instruction into two µops. And, while we are at it ... If a POP instruction can generate two µops, it may as well generate more than two µops. A word-size POP instruction has enough bits to specify a range of registers to pop, which register to use as stack pointer, and whether to grow the stack upwards or downwards. And similarly with push instructions. These multi-register PUSH and POP instructions may be useful for saving and restoring all registers during a task switch, as well as for multiple stacks and for saving data in general while incrementing or decrementing a pointer. I don't want to use microcode for this. Microcode is slow and inefficient. I want all complexity to be handled in hardware by state machines rather than by microcode. This limits the possible complexity of instructions, but makes the hardware more efficient.

If we decide to drop the tiny instruction format and instead allow more complexity for PUSH and POP instructions, we get the advantage of a more consistent instruction format template. The code space that is currently reserved for tiny instruction pairs can be freed to use for other purposes. In particular, the overcrowded instruction format 1.3B/1.3C could be divided into two formats.

There may be a problem with longer response times for hardware interrupts if the CPU is busy pushing a range of large vector registers. On the other hand, we have the advantage that we need fewer instructions to save all registers during an interrupt.

HubertLamontagne · Post by **HubertLamontagne** » 2020-05-15, 17:23:58

If tiny instructions are rare and are mostly used by register saving/loading to stack, and load/store multiple or load/store pair (the ARM64 equivalent) is simpler to implement, then it makes total sense to go for the multi-register loads/stores instead yeah.

I think the extra complexity in the register renamer and retirement unit from multiple-register-write instructions is still less than all the complexity from the memory unit anyways, so I think multiple-uop operations are all ok, as long as they are useful.

Post by **agner** » 2020-05-16, 5:18:17

I think it was you who advised me four years ago to avoid multi-output instructions. Now that I have started to work with a soft core I can see that you were right. It is good to reduce the number of register write ports and result buses. But splitting an instruction into multiple µops appears to be practicable in simple cases.

Without the tiny instructions, we need 32-bits of code space to do simple things like zeroing a register. But on the other hand, we can do more things with a single instruction. How many ARM thumb instructions do you need for loading an operand from a 32-bit address when each instruction has only 8 bits of space for constants? A double-size ForwardCom instruction can calculate the address of an array element with base address and index and make an arithmetic operation on the loaded value. You cannot do the same with four 16-bit thumb instructions, and certainly not with a throughput of one operand per clock cycle.

HubertLamontagne · Post by **HubertLamontagne** » 2020-05-20, 19:48:25

I guess this is about the limit of where I can help, because there are like half a dozen styles of pipelines (single-isssue, atom-style load-alu single issue, dual-issue, VLIW, simple OOO where every load/store/ALU op is a full independent micro-op, OOO with micro-op fusion, OOO with clustering), and what's optimal for tiny-instructions and multi-uop instructions changes between each style (and I'm not sure which types you want to prioritize). It seems to mostly affect the instruction cache size and the register file port utilization, and those are pretty specific to the underlying technology.

Post by **agner** » 2020-05-21, 18:16:44

The decision has been made now. I have removed the tiny instructions and updated all documentation and tools according to this change. New push and pop instructions are added. This change allowed me to get rid of a lot of complexity.

This is version 1.10.

Kulasko · Post by **Kulasko** » 2020-06-04, 12:27:18

As there already needs to be some logic for emitting multi-register push and pop instructions, wouldn't it make sense to add other multi-register instruction versions, for example for a move?
To be honest, I didn't think much further than this up to now, but a multi-register move could certainly make the code more compact in some places.

Post by **agner** » 2020-06-05, 5:11:35

Regarding multi-register instructions: We should not have too many such instructions, because they are complicated to implement in the decoder. I have thought about instructions for zeroing many registers or clearing many vector registers. If you need a multi-register move, you could probably use a vector register instead.

Kulasko · Post by **Kulasko** » 2020-06-09, 18:50:48

Is the logic for multi-register push and pop too specific to be shared?
I thought that perhaps, the multi-register logic that already is needed because of this could be used as a more general multi-register instruction generator. That would enable multi-register instructions without much added complexity.
If this is not feasible, I agree that there shouldn't be too much such instructions.

Post by **agner** » 2020-06-10, 5:00:40

The x86 instruction set has many complex instructions with multiple output registers and multiple µops. Such complex instructions are implemented as microcode in both Intel and AMD processors. They are quite slow because the normal pipeline flow is interrupted while code is being fetched from a microcode ROM. This is what I want to avoid. If there are only few instructions that need multiple µops, then they can be implemented by making a state machine that produces the µops directly in the decoder. This is the solution I am working on right now. I want as few complex instructions as possible because the decoder would become too big.

I am working on a soft core with the following limitations: The register file for the general purpose registers has four read ports: three for operands and one for mask. The register file for vector registers has the same. This means that you cannot have an instruction like mul_add(r1, r2, [r3+r4]), mask = r5, but you can have mul_add(v1, v2, [r3+r4]), mask = v5.

Speaking of complex instructions, I think it is possible to make an instruction for multiple-precision arithmetic that does an addition of big numbers consisting of all the bits in a large vector register. It can be implemented with a single µop and a latency of two clock cycles, one clock for adding sets of 64-bit numbers, and one for sorting out the generate/propagate carry logic. There is a problem if we want a carry out in a separate output register.

HubertLamontagne · Post by **HubertLamontagne** » 2020-06-10, 15:38:24

Kulasko wrote: ↑2020-06-09, 18:50:48 Is the logic for multi-register push and pop too specific to be shared?
I thought that perhaps, the multi-register logic that already is needed because of this could be used as a more general multi-register instruction generator. That would enable multi-register instructions without much added complexity.
If this is not feasible, I agree that there shouldn't be too much such instructions.

I imagine that in a typical out-of-order CPU, the multi-register operations would only really exist for the decoder (and retirement logic) with the goal of making the code smaller on average (= basically increasing your instruction cache size), and they would decode into the same sequence of multiple micro-ops as if you did it with individual register write/stores along with a stack pointer addition. It cannot be a single micro-op because each micro-op can only write to one register. Issuing the multiple micro-op instructions still takes multiple cycles, because your register renaming logic can only rename a fixed number of registers per cycle, and your instruction queues to load/store units and arithmetic units can still only queue up a fixed number of new micro-ops per cycle.

For other code that uses multi-register operations, I imagine that in cases where it really counts, you'd be doing math-heavy code. Math-heavy code typically involves loops that fit in the instruction cache, so there's not much real downside to using multiple operations there. In fact, having multiple separate instructions is probably preferable because instead of hammering the load/store unit with a bunch of operations all at the same time while not generating any work for the arithmetic unit, you're probably going to end up with a more balanced workload that can be optimally queued to both the load/store and arithmetic units.

Kulasko · Post by **Kulasko** » 2021-11-15, 10:08:38

By random browsing, I just stumbled upon this paper on compressed RISC-V instructions:
https://people.eecs.berkeley.edu/~krste ... man-ms.pdf

I don't think any of the specific results are of too much interest for ForwardCom, but the data-driven methodology make me think.
RISC-VC seems to be the only compressed instruction set used by modern hardware, as ARM Thumb-2 isn't supported in AArch64, at least to my knowledge. The most remarkable feature of it is trading the general word alignment requirement in code for the ability to freely mix them with regular-sized instructions, providing a net benefit.

I would like to revisit the topic of compressed instructions someday, when we acquired more data about the instruction mix of different kind of ForwardCom programs, to select instruction candidates for compression. Instruction cache performance and throughput to the decoder can still be a bottleneck in CPUs and companies go to great lengths to push this limit further out. The best current example is Apple with their Firestorm cores, boasting 192 kilobytes of L1 instruction cache.

However, being able to reasonably easily extend ForwardCom with tiny instructions again would require reserving code space for it now. I am not sure if that trade-off is worth it.

Post by **agner** » 2021-11-15, 15:18:37

Thanks for the link.

If you want to mix half-size (16 bits) with word size (32-bits) instructions then all addresses would need 16-bit granularity. Adding one bit at the bottom means removing one bit at the top. In other words, the length you can jump with an 8-bit address offset is halved. Furthermore, instructions need more bits to indicate the instruction size.

The early proposal for tiny instructions in ForwardCom had the requirement that tiny instructions must be paired so that address pointers still had 32-bit granularity, and you can't jump to the second tiny instruction in a pair. I dropped this idea because of the extra complexity in the decoder, and because most tiny instructions went unpaired anyway.

I think you can gain more by doing more work per instruction. For example, ForwardCom can do the following in single instructions:
[*] vector operations
[*] conditional execution predicated by mask
[*] calculating an array element address + loading data + doing calculation on data
[*] arithmetic instruction + conditional jump depending on result
[*] read relative jump address from list + convert to absolute address + jump or call to this address
[*] compare operands + boolean operation on result

forwardcom forum

Proposal to drop tiny instructions

Proposal to drop tiny instructions

Re: Proposal to drop tiny instructions

Re: Proposal to drop tiny instructions

Re: Proposal to drop tiny instructions

Re: Proposal to drop tiny instructions

Re: Proposal to drop tiny instructions

Re: Proposal to drop tiny instructions

Multi-register instructions

Re: Proposal to drop tiny instructions

Re: Multi-register instructions

Re: Proposal to drop tiny instructions

Re: Proposal to drop tiny instructions