More work per instruction

ForwardCom can do more work per instruction than pure RISC systems without reducing the instructions-per-clock throughput. This is possible because of the flexible instruction format.

A simple example illustrates this. Assume that we want to calculate X = A + B, where A is a variable in static memory and B is an integer constant. We want to store the result X in a register for further processing. A typical RISC system, such as ARM, requires six instructions to do this. The size of a RISC instruction is usually 32 bits. Such an instruction cannot contain a 32-bit address or a 32-bit constant because that would leave no room for the instruction code. A 32-bit value must therefore be split into two 16-bit values, and two RISC instructions are needed to load it into a register. The ARM compiler generates two instructions for loading the 32-bit address of A, one instruction for relative address calculation, two instructions for loading the 32-bit constant B, and one instruction for doing the addition. Other RISC architectures use four to eight instructions for the same code. The X86 CISC instruction set uses two instructions. ForwardCom uses one or two instructions depending on the address size.
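The splitting of a 32-bit value into two 16-bit halves can be sketched as follows. This is a minimal Python illustration, not real ARM code. The function names `split_halves` and `load_in_two_steps` are hypothetical, and the two steps only loosely mirror the pair of instructions that a RISC compiler would emit to load the low half and then insert the high half.

```python
def split_halves(value):
    """Split a 32-bit value into its (low, high) 16-bit halves."""
    value &= 0xFFFFFFFF
    return value & 0xFFFF, value >> 16


def load_in_two_steps(value):
    """Rebuild a 32-bit value from two 16-bit immediates (illustrative only)."""
    low, high = split_halves(value)
    reg = low                              # step 1: load low half, upper bits are zero
    reg = (reg & 0xFFFF) | (high << 16)    # step 2: insert high half, keep low half
    return reg
```

Each step carries only 16 bits of payload, which is why one 32-bit address or constant costs two instructions in this scheme.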

The main disadvantage of RISC systems is the limited amount of information that can be contained in a 32-bit code word. CISC systems like X86 do not have this limitation because the instruction size is variable. An X86 instruction can use anywhere from one to fifteen bytes. However, it is very complicated to determine the size of an X86 instruction. This is a serious bottleneck for a modern microprocessor that can execute several instructions per clock cycle. It is difficult to decode multiple instructions per clock cycle because you need to determine where one instruction ends before you can decode the next one. ForwardCom avoids this problem by allowing only a few different instruction sizes and making it easy to determine the size of an instruction. The size of a ForwardCom instruction is determined by just two bits in the first code word.
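A small sketch may show why a two-bit length field makes it cheap to find instruction boundaries. It assumes that the length field occupies the top two bits of the first 32-bit code word, and that field values 0 and 1 mean one word, 2 means two words, and 3 means three words. These encoding details and the names `instruction_length` and `instruction_starts` are assumptions made for illustration; they are not stated in the text above.

```python
def instruction_length(first_word):
    """Return the instruction length in 32-bit words (assumed encoding)."""
    il = (first_word >> 30) & 0b11   # assumed: length field in bits 31-30
    return max(il, 1)                # assumed mapping: 0,1 -> 1; 2 -> 2; 3 -> 3


def instruction_starts(words):
    """Return the index of the first word of every instruction in a stream."""
    starts = []
    pos = 0
    while pos < len(words):
        starts.append(pos)
        pos += instruction_length(words[pos])
    return starts
```

The length depends only on two bits of one word, so a decoder does not have to parse prefixes or opcode bytes before it knows where the next instruction begins.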

The flexible instruction size makes it possible to use long instructions if large constants, addresses, complex addressing modes, extra option bits, etc. are needed, while shorter instructions can be used in simpler cases. A single instruction that does some calculation, for example addition, can contain both memory operands and immediate constants. X86 instructions can contain memory operands with different addressing modes, but the possibilities for including constants are limited. X86 instructions can contain immediate constants in integer instructions, but not in vector instructions, and they cannot contain floating point constants. ForwardCom instructions can contain both integer and floating point constants, and constants can be compressed in various ways to make instructions shorter.
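One plausible form of constant compression is to store a small integer in a narrower, sign-extended field. The sketch below picks the smallest field that can hold a given constant. The candidate widths of 8, 16, and 32 bits and the function name `smallest_signed_field` are assumptions chosen for illustration; the text does not specify which compression schemes are used.

```python
def smallest_signed_field(value, widths=(8, 16, 32)):
    """Return the narrowest signed field width (in bits) that holds value."""
    for bits in widths:
        low = -(1 << (bits - 1))
        high = (1 << (bits - 1)) - 1
        if low <= value <= high:
            return bits
    raise ValueError("constant does not fit in any available field")
```

An assembler following this idea would choose a shorter instruction format whenever the constant fits in a narrower field.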

There are many other ways in which a ForwardCom instruction can do more than just a single operation. Integer instructions can be predicated with the use of an extra register containing a boolean value called a mask. The destination register will receive the result only if the mask is true. It will receive a fallback value if the mask is false. The fallback value may be a register value or zero. Vector instructions can be predicated in the same way on a per-element basis so that each vector element can be enabled or disabled based on a boolean vector.
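The behavior of a masked instruction can be modeled in a few lines. This is only a behavioral sketch in Python, with hypothetical function names `masked_add` and `masked_vector_add`; it says nothing about how the hardware implements predication.

```python
def masked_add(a, b, mask, fallback=0):
    """Scalar add: return a + b if mask is true, else the fallback value."""
    return a + b if mask else fallback


def masked_vector_add(a, b, mask, fallback=None):
    """Element-wise add where each element has its own mask bit.

    fallback is a list of per-element fallback values, or None for zero.
    """
    if fallback is None:
        fallback = [0] * len(a)
    return [x + y if m else f for x, y, m, f in zip(a, b, mask, fallback)]
```

For example, `masked_vector_add([1, 2, 3], [10, 20, 30], [True, False, True])` leaves the middle element at the zero fallback while the other two elements receive their sums.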

Instructions that generate a boolean output, such as compare or bit test, can have an extra boolean operation (AND, OR, XOR) on the result. The rationale for this feature is that boolean results are often used in further boolean operations. ForwardCom allows a compare instruction and a subsequent boolean operation on the result to be joined into a single instruction. This feature is very cheap to implement in hardware because it is just a simple boolean operation with 1-bit operands. Yet, this feature makes it possible to replace two instructions with one in many cases.
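A range check is a typical case where this helps. The sketch below models a compare whose result is combined with an earlier boolean in the same step. The function names `compare_less` and `in_range` and the way the extra operand is passed are hypothetical; they illustrate the idea rather than the actual instruction syntax.

```python
import operator

# The three boolean operations named in the text.
BOOL_OPS = {"and": operator.and_, "or": operator.or_, "xor": operator.xor}


def compare_less(a, b, extra=True, op="and"):
    """Compute (a < b) and combine it with an extra boolean in one step."""
    return BOOL_OPS[op](a < b, bool(extra))


def in_range(x, low, high):
    """Check low <= x < high with one plain compare and one combined compare."""
    not_below = x >= low                              # plain compare
    return compare_less(x, high, extra=not_below)     # compare fused with AND
```

Without the fused operation, the same check would need two compares followed by a separate AND.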

Compare instructions can also be combined with branching. ForwardCom does not have a flags register or status bit to contain the result of a comparison or other calculation for use in a subsequent branch instruction. Instead, it combines arithmetic or logic operations with branch operations in single instructions such as compare-and-branch-if-above, or add-and-branch-if-overflow. Various kinds of loops can be implemented very efficiently with such instructions. A count-down loop can be implemented with a subtract-and-jump-if-positive instruction. A count-up loop or for-loop can be implemented with an instruction that increments a counter and jumps back if the value is below a specified limit. A vector loop can be implemented with an instruction that subtracts the maximum vector length from a register that counts down the number of remaining vector elements and jumps if positive.
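The vector loop described above can be modeled as follows. This is a behavioral sketch in Python under stated assumptions: the maximum vector length of four elements, the function name `vector_loop_sum`, and the use of summation as the loop body are all chosen for illustration. The last two statements of the loop body stand in for the single subtract-and-jump-if-positive instruction.

```python
def vector_loop_sum(data, max_vector_length=4):
    """Sum data in vector-sized chunks; return (total, iterations)."""
    total = 0
    iterations = 0
    pos = 0
    remaining = len(data)          # register counting remaining elements
    if remaining <= 0:
        return total, iterations
    while True:
        # The vector length is the remaining count, capped at the maximum.
        n = min(remaining, max_vector_length)
        total += sum(data[pos:pos + n])
        pos += n
        iterations += 1
        # One combined instruction in the text: subtract the maximum
        # vector length and jump back if the result is positive.
        remaining -= max_vector_length
        if not remaining > 0:
            break
    return total, iterations
```

With ten elements and a maximum vector length of four, the loop runs three times and the last iteration handles only the two remaining elements.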

Multiway branches are implemented with an instruction that reads from an indexed table of relative addresses and makes a jump or call to the indicated address. This is useful for switch/case statements and function tables.
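A multiway branch through a table of relative addresses can be sketched like this. The reference point `base`, the dictionary that stands in for code memory, and the function name `dispatch` are hypothetical choices for illustration; the text does not say what the addresses are relative to.

```python
def dispatch(selector, base, offsets, code):
    """Jump through a table of relative addresses and run the target."""
    target = base + offsets[selector]   # relative entry -> absolute address
    return code[target]()               # 'call' the code at that address


# Hypothetical code memory: address -> handler for one switch case.
CODE = {
    0x1000: lambda: "case zero",
    0x1010: lambda: "case one",
    0x1040: lambda: "case two",
}
OFFSETS = [0x00, 0x10, 0x40]            # table of offsets relative to base
```

Calling `dispatch(1, 0x1000, OFFSETS, CODE)` reads the second table entry, adds it to the base, and runs the handler stored at address 0x1010. Relative entries can be narrower than full addresses, which keeps the table compact.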

The guiding principle for designing instructions that do multiple things has been that the complexity is limited to what fits into a general hardware structure and pipeline design without clumsy extra patches. Each instruction should have no more than three input operands and a mask, and no more than one output operand. Complex instructions that need microcode are avoided because microcode has turned out to be inefficient in other microarchitectures.

An exception is made for the push and pop instructions with multiple registers. These instructions are useful enough to justify the extra complexity. The current soft-core implements push and pop instructions by generating multiple micro-operations in the decoder. The number of such complex instructions should be kept to a minimum in order to limit the complexity of the decoder.