ForwardCom

An open forward-compatible instruction set architecture

Introduction

ForwardCom is an experimental project for the development of a new open instruction set architecture and the corresponding open hardware and software standards for high performance microprocessors and vector processors. The purpose is to investigate what an ideal computer architecture may look like and to develop a complete open source and open license computer system that is more efficient than the currently prevailing systems, such as x86, ARM, etc.

ForwardCom is used as a sandbox for experimenting with various ways of improving performance when we are not constrained by the need for compatibility with existing hardware and software.

ForwardCom is also useful as a high-end alternative to RISC-V and other open hardware architectures. Starting from scratch and making a complete vertical redesign of the hardware-software ecosystem allows us to learn from the history of past mistakes and get rid of the heritage of old quirks that hamper contemporary computer systems.

Highlights

The ForwardCom instruction set is neither RISC nor CISC, but a new paradigm with the advantages of both. ForwardCom has few instructions, but many variants of each instruction. This makes the code more compact and efficient with more work done per instruction.
The ForwardCom design is scalable to support small embedded systems and system-on-a-chip designs, as well as large supercomputers and vector processors without losing binary compatibility.
The instruction set is fully orthogonal. The same instruction can be coded with integer operands of different sizes and floating point operands of different precisions. The operands can be scalars or vectors of any length. Operands can be registers, memory operands with different addressing modes, or immediate constants.
Vector registers of variable length are provided for efficient handling of large data sets.
Array loops are implemented in a new flexible way that automatically uses the maximum vector length supported by the microprocessor in all but the last iteration of a loop. The last iteration automatically uses a vector length that fits the remaining number of elements. No extra code is needed to deal with remaining data and special cases. There is no need to compile the code separately for different microprocessor versions with different vector lengths (see below).
No recompilation or update of software is needed when a new microprocessor with a different vector register length becomes available. The software is guaranteed to be forward compatible and take advantage of the longer vectors of new microprocessor models without recompilation.
Strong security features are a fundamental part of the hardware and software design.
Memory management is simpler and more efficient than in traditional systems. Several techniques are used for avoiding memory fragmentation. This makes it possible to manage memory blocks in an on-chip memory map with a limited number of sections with variable size. The traditional translation lookaside buffers (TLB) and large multi-level page tables can be avoided in cases where it is possible to keep the number of separate memory blocks small. All code is position-independent.
There are no dynamic link libraries (DLLs) or shared objects. Instead, there is only one type of function libraries that can be used for both static and dynamic linking. Only the part of the library that is actually used is loaded and linked. The library code is kept contiguous with the main program code in almost all cases. An executable file can be relinked to update a function library or to adapt the program to a particular hardware configuration, operating system, or user interface framework.
A mechanism for calculating the required stack size is provided. This can prevent stack overflow in most cases without making the stack bigger than necessary.
A mechanism for optimal register allocation across program modules and function libraries is provided. This makes it possible to keep most variables in registers without spilling to memory. Vector registers can be saved in an efficient way that stores only the part of the register that is actually used.
A new system for handling floating point errors is more suitable for vector processing and out-of-order processing than existing systems.
ForwardCom breaks the old tradition of a cryptic assembly language. ForwardCom assembly code looks like C or Java, and is immediately comprehensible.
Standards for software tools, ABI, file formats, system libraries, etc. are defined in order to establish compatibility between different programming languages and different platforms. It is possible to code different parts of a program in different programming languages.
Everything is open source with free licenses, including standards, development tools, and FPGA softcore.

Neither RISC nor CISC

Common RISC designs have a serious limitation in the instruction size. Typically, the instruction size is fixed at 32 bits. This limits the amount of information that can be contained in a single RISC instruction. For example, you need multiple RISC instructions for loading a 32-bit constant or a 32-bit address. You cannot have more than one or a few variants of each instruction because there are not enough bits in the code word to indicate the different variants. Instead, you must use multiple operation codes (opcodes) if you want multiple variants of an instruction.

The most popular CISC design is the x86 instruction set with its many extensions. The CISC instruction set typically has many opcodes for different variants of the same instruction. In fact, there are now more than two thousand different opcodes in the x86 instruction set if you include all the extensions. Decoding x86 instructions is very complicated and a serious bottleneck.

The ForwardCom instruction set is designed to overcome the problems of both RISC and CISC. ForwardCom instructions can have a size of one, two, or three 32-bit words. The length of each instruction is specified by just two bits in the first code word. This makes it possible to decode multiple instructions per clock cycle without the need for a micro-operations cache. Decoding is simple because all instructions fit into the same standardized template system.

Most ForwardCom instructions can be coded in several different formats. This includes compact instructions with a single-word size, as well as double and triple size versions when there is a need for more registers, bigger constant operands, bigger memory addresses, or additional option bits. Common instructions have many different variants with different types of registers, different precisions, constant operands with different sizes, memory operands with different addressing modes, predicate masks for conditional execution, and extra option bits for sign change or other extra features.

It is more efficient to have many different variants of the same instruction than to have different instructions or opcodes for each variant. This makes the hardware simpler because the variants are coded in the same consistent way for all instructions. It also makes the code more efficient because it can do more work per instruction. It is possible to define complex instructions that do multiple things, but only as long as they fit into a standardized template system, pipeline structure, and timing restrictions. This makes sure that even quite complex instructions that are doing multiple things can execute at a speed of one instruction per clock cycle per hardware pipeline. This system is further explained here.

A flexible instruction format

The ForwardCom instruction set is based on a consistent and flexible modular format suitable for fast superscalar processors. Each instruction uses one, two, or three 32-bit words. The bigger formats are used if large constants, large memory addresses, complex addressing modes, or extra option bits are needed. The assembler will automatically select the smallest possible instruction format that fits the specified operands. This template system is further explained here.

Variable length vector registers

Vector registers are used for handling multiple data simultaneously. The advanced computers that are commonly used today have vector registers with fixed lengths. Every time a new CPU model with longer vectors comes on the market, the software has to be recompiled using a new instruction set extension that supports the new vector size. Software developers have to develop a new version of their software every time a new CPU model comes on the market, and they have to maintain and support several different versions of their software for the different CPU models if they want to use all CPU models optimally. This is so expensive for software developers that it is hardly ever done. Most of the software that is sold today is optimized for CPU models that are already obsolete.

A further problem with traditional designs is that it is impossible to make the software save a vector register in a way that will be compatible with future extensions of the vector length, because the instructions for doing so have not yet been defined.

The need to solve these problems was a strong motivation for developing ForwardCom. The ForwardCom architecture has variable-length vector registers. The software can use the maximum vector length supported by the CPU it is running on, or it can specify any vector length less than this. The length of a vector register is stored in the register itself. This is useful when a vector register is saved to memory and you do not want to save more data than the register actually contains.

The variable-length vector registers can be used in a new and very efficient type of loops that automatically uses the maximum vector length, even if this vector length was not supported at the time the software was written. This is what we call forward compatibility.

A new efficient type of loops

Let us consider a simple loop that does something with an array of 10 floats. It may look something like this in C code:

float my_array[10];
for (int i = 0; i < 10; i++) {
    do_something(my_array[i]);
}

A simple implementation will use i as an index relative to the start address of the array while counting i up to 10, and load one element at a time into a register:

A vector implementation in a traditional system will load a number of consecutive array elements, for example four, into a vector register and increment i by four for each iteration of the loop:

In this example, the loop will iterate two times and handle four array elements in each iteration. There are two remaining elements in the end because the length of the array is not divisible by the vector length. These remaining elements are handled separately outside the loop.

The ForwardCom system can code this loop in a more efficient way. We are using a backward index from the end of the array. The backward index counts down from 10 so that it always contains the remaining number of array elements to handle. The backward index is also used for specifying the desired vector length. If we ask for a longer vector than the CPU supports, then we will automatically get the maximum vector length. In this example the maximum length is four elements. In the first iteration we ask for ten elements and get four. The backward index is now decremented by four. In the next iteration we ask for six elements and get four. In the last iteration we ask for two elements and get two.

This method has several advantages. First, we do not need any extra code to handle the remaining array elements if the array length is not divisible by the vector length. And second, it adjusts automatically to the maximum vector length of the CPU it is running on. If we run the same code on a CPU with a maximum vector length of 8 then the loop will run two iterations, handling 8 elements in the first iteration and 2 elements in the second iteration. If the maximum vector length is 16 then the loop will run only one iteration with a vector length of 10 elements.

The ForwardCom instruction set has a special addressing mode to support this loop method. The special addressing mode has a memory operand with a pointer register containing the end address, and a backward index register that is subtracted from this pointer. A vector memory operand always uses an extra register to specify the length of the vector. Here, we are using the same register for backward index and vector length. This works because we will get the maximum vector length when the specified length is more than the maximum length.

The loop may contain function calls. Assume, for example, that the code in our example involves the calculation of the logarithm of each vector element. The logarithm function is contained in a standard math function library. Now, this function uses a vector register for input and a vector register for output. The information about the vector length is contained in the vector register itself. Therefore, the logarithm function can handle a vector of any length and calculate the logarithms of all vector elements simultaneously. A scalar (single element) parameter is simply handled by the function as a vector with one element. This makes it easy for an optimizing compiler to convert scalar code to vector code, even if the code contains function calls.

Security features

Security is an integral part of the hardware and software design. This includes the following features:

A flexible and efficient memory protection mechanism
Separation of call stack and data stack so that return addresses cannot be compromised by buffer overflow
Jump tables and function pointer tables are placed in read-only memory
Features for array bounds checking are built in
Optional methods for checking integer overflow
Each thread can have its own protected memory space, which is not accessible to parent and sibling threads within the same process
Device drivers and system functions have carefully controlled access rights. These functions only have access to a specific block of memory that the calling process chooses to give it access to. A device driver has only access to a controlled range of input/output ports and system registers
Application programs have only access to specific resources as specified in the executable file header and controlled by the operating system
Mandatory standardized procedure for installing and uninstalling programs
There is no “undefined” behavior. There is always a limited set of permissible responses to an error condition

The security features are further described here.

Development tools

The following development tools are currently available: High-level assembler, disassembler, linker, library manager, emulator, and debugger. These tools are described here. A compiler is not available yet.

Softcore

A hardware implementation of a ForwardCom CPU is available as an FPGA softcore. The hardware description code is published with an open license. The current softcore model A supports only integer instructions. The softcore is described here.

Visions for the application of ForwardCom

ForwardCom will not readily replace the present commercial systems, even if it is better, because the users need compatibility with existing hardware and software. However, the development of an ideal instruction set architecture and a complete redesign of the ecosystem of hardware and software standards is a worthwhile exercise in itself which may produce useful results and unexpected new discoveries. This project has already generated so many valuable ideas that it is worth pursuing further.

Let us assume that the need for a new instruction set will arise in the future, for whatever reason. Then it will be good to have a ready proposal that has been through a long development process rather than starting from scratch with a limited time budget and end up with a suboptimal solution. An open ongoing development process with inputs from anybody interested is likely to generate better results than the usual closed industry process with its short-term commercial priorities.

ForwardCom may, for example, be useful for the following purposes:

FPGA soft cores
System on a chip designs
Supercomputers with very long vector registers
Applications where the patent and license restrictions of other architectures would be an obstacle
Applications where the security features of ForwardCom are needed
Niche products where compatibility with older systems is not required
Real-time systems where the efficient memory management and fast task switching of ForwardCom is useful
Applications that need application-specific instruction set extensions
Some of the new ideas generated by the ForwardCom project may be applied to other systems

ForwardCom will also be useful as a sandbox for university projects and experiments with new ideas such as:

Testing the concept of forward compatibility
Testing the efficiency of large vectors, variable vector length, and efficient array loops
Hardware development and research on the compromise between RISC and CISC
Research on control flow decoupling, as discussed in chapter 8.1 of the manual
Custom instructions with on-chip FPGA
Testing the efficiency of memory management with variable-size memory pages with no translation lookaside buffer (TLB) and no large page tables, using various methods for minimizing memory fragmentation
Relinkable executable files as an alternative to DLLs, shared objects, and plugins
Secure design to prevent common software attacks and vulnerabilities, as described here
Experiments with parallel detection of floating point exceptions through NAN propagation as discused here and in chapter 6.3 of the manual
Testing half-precision and quadruple-precision floating point vector arithmetics
Research on metaprogramming. The ForwardCom assembler includes planned and partially implemented metaprogramming features
Calculation of required stack size by the linker
Optimization of register allocation by providing information about register use in object files and library files

Current status of the ForwardCom project

The ForwardCom project is under continuous development.

The basic instruction set architecture has been designed and a complete set of application-level instructions is defined. Some system-level instructions are not fully developed yet.

The structure of the binary file format for object files, function libraries, and executable files has been defined in details.

The details of application binary interface standards (ABI), function calling convention, etc. have been defined.

A high-level assembly language has been developed.

The following binary tools have been developed: high-level assembler, disassembler, linker, library manager, emulator, and debugger.

A hardware implementation in an FPGA soft core is available with full documentation. The current version supports integer instructions only.

License conditions

Website and manual: Creative commons, CC-BY.
Binary tools: Gnu general public license, GPLv3
Soft-core: CERN open hardware licence, CERN-OHL-W

By Agner Fog, 2017 - 2026.

284565