RISC or CISC?

The debate over RISC versus CISC architectures seems never to end. It started in the late 1970s and continues today, with both RISC and CISC architectures in wide use.

A typical RISC architecture has a fixed instruction size of 32 bits and a limited set of simple instructions that each do only one thing.

A typical CISC architecture has a variable instruction size and a large set of instructions that can do complex things.

A disadvantage of CISC architectures is that the decoding of instructions is difficult. Instructions can have any length from 1 to 15 bytes in the most common CISC architecture – x86. It is very complicated to determine the length of an x86 instruction. This makes decoding a serious bottleneck in x86 processors, especially when it is desired to decode multiple instructions per clock cycle. Modern x86 processors have a micro-op cache after the decoder in order to deal with this bottleneck. Decoding is much more efficient on RISC architectures where all instructions have the same size.

However, the fixed 32-bit instruction size of typical RISC architectures is a serious limitation because it is impossible to contain a 32-bit constant or a 32-bit memory address in a 32-bit instruction when several bits are needed for coding the instruction itself. It requires at least two RISC instructions to load a 32-bit constant into a register. Larger constants are usually stored in memory. It may require up to four instructions to load such a constant from memory: Two instructions for loading a 32-bit relative address, one instruction for converting this to an absolute address, and one instruction for reading the value from this address.
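The two-instruction sequence for loading a 32-bit constant can be sketched as follows. This is a minimal illustration, not the encoding of any particular RISC instruction set: the 16/16 split and the function names split_constant and load_constant are assumptions made for the example, and real architectures divide the bits differently.

```python
def split_constant(value: int) -> tuple:
    """Split an unsigned 32-bit constant into (upper 16 bits, lower 16 bits).

    The 16/16 split is an assumption for illustration only.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("expected an unsigned 32-bit value")
    return value >> 16, value & 0xFFFF


def load_constant(value: int) -> int:
    """Emulate a hypothetical two-instruction load on a toy 32-bit register."""
    upper, lower = split_constant(value)
    reg = (upper << 16) & 0xFFFFFFFF  # instruction 1: set upper half, clear lower half
    reg |= lower                      # instruction 2: OR in the lower half
    return reg
```

Each half fits in the immediate field of one fixed-size instruction, which is why two instructions are needed where a variable-size encoding could use one.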

The ForwardCom architecture provides an efficient compromise between RISC and CISC to avoid these problems. A ForwardCom instruction can have a size of one, two, or three 32-bit words. This makes it possible to contain more information in a single instruction code when needed. Decoding a stream of instructions is still simple and efficient because the instruction length is determined by only two bits in the first code word of each instruction. This is so simple that it is possible to decode multiple instructions in each clock cycle without compromising on the clock frequency.
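The length decoding described above can be sketched in a few lines. The text only says that two bits in the first code word determine the length. The position of those bits (the top two bits of the word) and the mapping of the four field values to one, two, or three words are assumptions made for this sketch, as are the function names.

```python
def instruction_length(first_word: int) -> int:
    """Return the instruction length in 32-bit words.

    Assumption: the length field is the top two bits of the first word,
    with values 0 and 1 meaning one word, 2 meaning two, 3 meaning three.
    """
    length_field = (first_word >> 30) & 0b11
    return {0: 1, 1: 1, 2: 2, 3: 3}[length_field]


def split_instructions(words):
    """Split a stream of 32-bit words into a list of instructions."""
    instructions = []
    pos = 0
    while pos < len(words):
        n = instruction_length(words[pos])
        if pos + n > len(words):
            raise ValueError("truncated instruction at word %d" % pos)
        instructions.append(words[pos:pos + n])
        pos += n
    return instructions
```

Only the first word of each instruction is inspected, so finding the boundaries never requires decoding the rest of the instruction.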

The flexible instruction size makes it possible to include an immediate constant of up to 64 bits within a single instruction and to eliminate the need for storing constants in data memory. Even floating-point constants with single or double precision can be included within a single instruction. Most constants occurring in a typical program are small, simple values. The ForwardCom system uses fewer bits for storing small or simple values than for large random values. For example, an instruction that adds the value 50 to an integer register and stores the result in another register can be coded in a single-word instruction where the constant is stored in an 8-bit data field. Both integer and floating-point constants can be compressed in various ways to fit into data fields of 8, 16, 32, or 64 bits. A large integer with many trailing zeroes can be coded as a smaller value with a left shift. For example, 0x1500000 can be coded as 0x15 << 20. Small floating-point values without decimals can be coded as 8-bit signed integers. Simple rational numbers such as 2.5 can be represented exactly with half precision, using only 16 bits. The assembler automatically represents constants in the smallest possible form in order to minimize the code size.
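The compression ideas in this paragraph can be illustrated with a short sketch. It does not reproduce the selection rules or encodings actually used by the ForwardCom assembler. The helper names are hypothetical and only check whether a value would fit a given form.

```python
import struct


def smallest_signed_field(value: int) -> int:
    """Return the smallest of 8, 16, 32, 64 bits that holds value as a signed integer."""
    for bits in (8, 16, 32, 64):
        if -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
            return bits
    raise ValueError("value does not fit in 64 bits")


def as_shifted(value: int):
    """Return (base, shift) with value == base << shift and all trailing zero bits removed."""
    if value == 0:
        return 0, 0
    shift = (value & -value).bit_length() - 1  # number of trailing zero bits
    return value >> shift, shift


def fits_half_precision(x: float) -> bool:
    """True if x survives a round trip through IEEE half precision unchanged."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0] == x
    except OverflowError:  # magnitude too large for half precision
        return False
```

For the example in the text, as_shifted(0x1500000) yields the base 0x15 and a shift of 20, and the base fits an 8-bit field while the unshifted value needs 32 bits.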

It is very convenient to be able to include a constant in an instruction. This makes it possible to do an arithmetic operation with a register operand and a constant operand in a single instruction without loading the constant from memory, for example r1 = r2 + 1234. ForwardCom can even do this with floating-point constants, unlike most other instruction sets.

It is also possible to include a memory operand in a ForwardCom instruction. For example, you can do r1 = r2 + [memory_operand] in a single instruction. Such load-operate instructions are typically available in CISC instruction sets, but not in RISC instruction sets. The advantage of load-operate instructions is that you need fewer instructions for doing the same job. The disadvantage is that the pipeline in the CPU gets longer. A long pipeline does not affect the throughput in linear code, but it can increase the delay after mispredicted branches. This problem can be mitigated for instructions that have no memory operands. An out-of-order processor can send instructions to the execution unit as soon as the operands are ready, while an in-order processor may have an extra execution unit with a shorter pipeline to handle simple instructions with no memory operands.

The RISC principle was originally conceived based on considerations that paid more attention to linear code than to heavy loops (Tanenbaum, 1978). However, a typical program spends most of its execution time in the innermost loop. If we want to optimize performance, we need to focus on inner loops. Performance-critical programs often process large amounts of data in the time-consuming inner loops. Large data sets cannot be contained in registers. Therefore, the hot spot of performance-critical code is likely to load data from large arrays, do some calculations, and store the results in some other array. We can improve performance considerably by combining a memory load and an arithmetic operation into a single load-operate instruction. This includes instructions with vector registers to process multiple data values simultaneously. There are typically more loads than stores. Therefore, it is more important to have load-operate instructions than operate-store instructions. We still want an efficient streamlined pipeline structure, so we will not have any instructions with more than one memory operand. The ForwardCom architecture has load-operate instructions based on these considerations.
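The saving from load-operate instructions can be made concrete with a toy model of one loop iteration computing c = a + b on values held in memory. The mnemonics (load, add, add_mem, store), the register names, and the run function are hypothetical and exist only for this sketch. Loop counter and branch overhead are left out, and instruction count is only a rough proxy for performance.

```python
def run(program, memory):
    """Execute a toy instruction list on memory and return the instruction count."""
    regs = {}
    for op, dst, a, b in program:
        if op == "load":        # dst = memory[a]
            regs[dst] = memory[a]
        elif op == "add":       # dst = register a + register b
            regs[dst] = regs[a] + regs[b]
        elif op == "add_mem":   # dst = register a + memory[b] (load-operate)
            regs[dst] = regs[a] + memory[b]
        elif op == "store":     # memory[dst] = register a
            memory[dst] = regs[a]
        else:
            raise ValueError("unknown operation: " + op)
    return len(program)


# Load/store style: every memory operand needs its own load instruction.
LOAD_STORE = [
    ("load", "r1", "a", None),
    ("load", "r2", "b", None),
    ("add", "r3", "r1", "r2"),
    ("store", "c", "r3", None),
]

# Load-operate style: the second operand is read from memory by the add itself.
LOAD_OPERATE = [
    ("load", "r1", "a", None),
    ("add_mem", "r3", "r1", "b"),
    ("store", "c", "r3", None),
]
```

Both sequences store the same result, but the load-operate version needs three instructions instead of four, and it never uses more than one memory operand per instruction.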

The flexible instruction size makes it possible to include extra bits for selecting different types of operands and different addressing modes for memory operands. Instructions can use a two-word or three-word instruction size if they need a large immediate constant, a 32-bit address, complex addressing modes, extra operands, or extra option bits for other purposes. The smaller single-word instruction size can be used where fewer bits are needed for constants, addresses, options, etc. Most common instructions are multi-format instructions that can be coded in many different variants with all the different features discussed here.

The many different variants should not be too demanding for the decoder. All instructions fit into a consistent template system that makes the decoding process relatively simple and streamlined. The decoder can determine what kind of operand goes where, as indicated by just a few bits that can easily be decoded in a single clock cycle. The advantage of having many variants of each instruction is that coding gets simpler because you can have almost any combination of operand types, addressing modes, integer sizes, floating point precisions, vector lengths, etc. with the same instruction. There is no need to have different instructions for different operand types, different register types, different precisions, etc. This makes both hardware and software simpler.

It is possible to do more work per instruction than in a RISC architecture, but the design of each instruction must be decided judiciously. Extra functionality in an instruction should be implemented only if it fits into the existing template system and pipeline structure. Most instructions have a latency of one clock cycle and a throughput of one instruction per clock cycle per execution unit. Multiplication, division, and floating-point arithmetic have longer latencies. Extra functionality in an instruction should not be implemented if this increases the latency or decreases the throughput.

The ForwardCom architecture imposes certain restrictions on the instructions that can be implemented in order to enable an efficient and streamlined hardware structure for both in-order and out-of-order hardware designs. No instruction can have more than one memory operand. No instruction can have more than three register inputs and one mask. No instruction can have more than one output register (except push and pop). Complex instructions that need microcode are avoided. New instructions should only be added if the performance gain is significant.
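These restrictions are simple enough to express as a checklist. The sketch below is only an illustration of the rules as stated here. The InstructionSpec record and the violations function are hypothetical and not part of any ForwardCom tool.

```python
from dataclasses import dataclass


@dataclass
class InstructionSpec:
    """Hypothetical description of a proposed instruction."""
    name: str
    memory_operands: int = 0
    register_inputs: int = 0
    masks: int = 0
    output_registers: int = 1
    needs_microcode: bool = False


def violations(spec: InstructionSpec) -> list:
    """Return the list of rules from the text that the proposed instruction breaks."""
    problems = []
    if spec.memory_operands > 1:
        problems.append("more than one memory operand")
    if spec.register_inputs > 3:
        problems.append("more than three register inputs")
    if spec.masks > 1:
        problems.append("more than one mask")
    # push and pop are the stated exceptions to the single-output rule
    if spec.output_registers > 1 and spec.name not in ("push", "pop"):
        problems.append("more than one output register")
    if spec.needs_microcode:
        problems.append("requires microcode")
    return problems
```

An empty list means the proposal fits the stated limits. The final rule, a significant performance gain, cannot be checked mechanically and is left to judgment.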

These design principles have made it possible to combine the fast decoding and streamlined pipeline design of RISC architectures with the more work done per instruction of CISC architectures. Furthermore, the orthogonal design gives a free choice of general-purpose registers, vector registers of arbitrary length, immediate constants, and memory operands with different addressing modes for all common instructions. ForwardCom is neither RISC nor CISC, but gets the best of both worlds. It has few instructions, but many variants of each instruction. It can do more work per instruction than a RISC processor, but avoids the excessive complexity of many CISC instructions. This design makes both software and hardware simpler and more efficient.