ForwardCom

Proposal for a forward compatible instruction set architecture

Contents:

Introduction
Highlights
A flexible instruction format
Variable length vector registers
A new efficient type of loops
Efficient memory management
Security features
Visions for the application of ForwardCom
Current status
Resources

Introduction

ForwardCom is a project to develop a new open instruction set architecture and the corresponding hardware and software standards for high performance microprocessors. The intention is to experiment and investigate what an ideal computer architecture may look like, and to develop a complete computer system that is more efficient than the currently prevailing systems, such as x86, ARM, etc. Starting from scratch and making a complete vertical redesign allows us to learn from past mistakes and get rid of the heritage of old quirks that hamper contemporary systems.


Highlights


A flexible instruction format

The ForwardCom instruction set is based on a consistent and flexible modular format suitable for fast superscalar processors. Each instruction uses one, two, or three 32-bit words. It is possible to add still longer instructions for application-specific purposes. Often-used instructions can also be coded in a tiny format, where a 32-bit instruction word contains two tiny instructions. Tiny instructions are always paired.

A simplified sketch of the instruction format is shown here:

[Figure: instruction templates]

The basic instruction word is 32 bits, divided into the following fields:

Instruction length
Tells whether the instruction uses one or more 32-bit words.
Mode
Tells which template is used, what the different fields are used for, whether the instruction uses general purpose registers or vector registers, whether there is a memory operand, and which addressing mode is used.
Operation
Tells which instruction to do. There can be up to 64 multi-format instructions. A multi-format instruction can have many different formats, instruction lengths, and addressing modes. In addition, there can be a large number of single-format instructions. One operation code in ForwardCom corresponds to multiple different operation codes in other systems because it can have several different operand types, register types, vector lengths, masks, addressing modes, etc.
Destination register
There are 32 general purpose registers and 32 vector registers. The register specified in this field is used for the destination (output) of the instruction. The same register is also used as source (input) if there are not enough source registers in the other fields.
Operand type
The operands can be 8-bit, 16-bit, 32-bit, and 64-bit integers and single and double precision floating point numbers. There is optional support for 128-bit integers and quadruple precision floating point numbers, and limited support for half precision floating point. One bit of the operand type field is used for other purposes when general purpose registers are used.
Source register
There can be one source register when template B is used or two source registers when template A is used. Instructions with double length can have three source registers. These can be general purpose registers or vector registers. They can also be used for memory pointers, array index, or vector length.
Mask
A register can be used as a mask or predicate to enable or disable the operation and to specify various options. Masks are particularly useful for vector operations where an operation can be enabled or disabled for each vector element separately.
Data
Data fields can be used for immediate operands and for relative addresses. Instructions with double length can have 32-bit data fields. Instructions with triple length and 64-bit data fields are optionally supported. Data fields can contain integer or floating point numbers or option bits.
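
The field list above can be illustrated with a small decoder for a single 32-bit instruction word with two source registers (the template A case). This is only a sketch. The widths of the operation field (6 bits for up to 64 codes) and of the register fields (5 bits for 32 registers) follow from the text, but the remaining widths and all bit positions are assumptions made for illustration and should be checked against the ForwardCom manual. The names insn_fields, bits, and decode_word are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Decoded fields of one 32-bit instruction word.
   ASSUMPTION: bit positions and most widths are illustrative only. */
typedef struct {
    unsigned length;    /* instruction length code, assumed 2 bits      */
    unsigned mode;      /* template / addressing mode, assumed 3 bits   */
    unsigned operation; /* operation code, 6 bits (up to 64 codes)      */
    unsigned dest;      /* destination register, 5 bits (32 registers)  */
    unsigned op_type;   /* operand type, assumed 3 bits                 */
    unsigned src1;      /* first source register, 5 bits                */
    unsigned mask;      /* mask register, assumed 3 bits                */
    unsigned src2;      /* second source register, 5 bits               */
} insn_fields;

/* Extract `width` bits starting at bit `shift` (bit 0 is least significant). */
static unsigned bits(uint32_t word, unsigned shift, unsigned width)
{
    assert(width > 0 && width < 32 && shift + width <= 32);
    return (unsigned)((word >> shift) & ((1u << width) - 1u));
}

/* Split a 32-bit word into its fields using the assumed layout. */
insn_fields decode_word(uint32_t word)
{
    insn_fields f;
    f.length    = bits(word, 30, 2);
    f.mode      = bits(word, 27, 3);
    f.operation = bits(word, 21, 6);
    f.dest      = bits(word, 16, 5);
    f.op_type   = bits(word, 13, 3);
    f.src1      = bits(word,  8, 5);
    f.mask      = bits(word,  5, 3);
    f.src2      = bits(word,  0, 5);
    return f;
}
```

The eight fields add up to exactly 32 bits, which is why a template with a data field (template B) has room for only one source register.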


Variable length vector registers

Vector registers are used for handling multiple data elements simultaneously. The computer systems that are commonly used today have vector registers with fixed lengths. Every time a new CPU model with longer vectors comes on the market, the software has to be recompiled using a new instruction set extension that supports the new vector size. Software developers have to develop a new version of their software every time a new CPU model comes on the market, and they have to maintain and support several different versions of their software for the different CPU models if they want to use all CPU models optimally. This is so expensive that it is hardly ever done. Most of the software that is sold today is optimized for CPU models that are already obsolete.

A further problem with current designs is that it is impossible to make your software save a vector register in a way that will be compatible with future extensions of the vector length, because the instructions for doing so have not yet been defined.

The need to solve these problems was a strong motivation for developing ForwardCom. The ForwardCom architecture has variable-length vector registers. The software can use the maximum vector length supported by the CPU it is running on, or it can specify any vector length less than this. The length of a vector register is stored in the register itself. This is useful when a vector register is saved to memory and you don't want to save more data than the register actually contains. It is possible to make software that automatically uses the maximum vector length that the CPU supports, even if this vector length was not supported at the time the software was written. This is what we call forward compatibility.
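
The idea of a register that carries its own length can be modelled in C as follows. This is a minimal sketch, not the ForwardCom save format: the type vreg, the constant MAX_VECTOR_BYTES, and the choice to store the length in front of the data are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_VECTOR_BYTES 64   /* hypothetical maximum of this particular CPU */

/* Model of a vector register that stores its own length. */
typedef struct {
    size_t length;                        /* number of bytes currently in use */
    unsigned char data[MAX_VECTOR_BYTES];
} vreg;

/* Save the register: write the length, then only the bytes in use.
   Returns the number of bytes written to mem. */
size_t vreg_save(const vreg *r, unsigned char *mem)
{
    assert(r->length <= MAX_VECTOR_BYTES);
    memcpy(mem, &r->length, sizeof r->length);
    memcpy(mem + sizeof r->length, r->data, r->length);
    return sizeof r->length + r->length;
}

/* Restore the register from a saved image.
   Returns the number of bytes read from mem. */
size_t vreg_restore(vreg *r, const unsigned char *mem)
{
    memcpy(&r->length, mem, sizeof r->length);
    assert(r->length <= MAX_VECTOR_BYTES);
    memcpy(r->data, mem + sizeof r->length, r->length);
    return sizeof r->length + r->length;
}
```

Neither function contains a hard-coded vector size apart from the capacity check, so the same save and restore code would keep working if MAX_VECTOR_BYTES were larger on a later CPU.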

The variable-length vector registers can be used in a new and very efficient type of loops that automatically uses the optimal vector length. This is described in the next section.


A new efficient type of loops

Let's consider a simple loop that does something with an array of 10 floats. It may look something like this:

float my_array[10];
for (int i = 0; i < 10; i++) {
    do_something(my_array[i]);
}

A simple implementation will use i as an index relative to the start address of the array while counting i up to 10, and load one element at a time into a register:

[Figure: simple loop]

A vector implementation in current systems will load a number of consecutive array elements, e.g. four, into a vector register, and increment i by four for each iteration of the loop:

[Figure: vector loop]

In this example, the loop will iterate two times and handle four array elements in each iteration. There are two remaining elements in the end because the length of the array is not divisible by the vector length. These remaining elements must be handled separately outside the loop.
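
The fixed-length scheme can be sketched in plain C. Real code would use vector instructions or compiler intrinsics; here an inner scalar loop stands in for one vector operation. The names VLEN, double_elements, and fixed_vector_loop are hypothetical, and doubling each element is just a stand-in for do_something.

```c
#include <stddef.h>

#define VLEN 4   /* fixed vector length in elements, as in the example */

/* Stand-in for one vector operation on `count` consecutive elements. */
static void double_elements(float *p, size_t count)
{
    for (size_t k = 0; k < count; k++) {
        p[k] *= 2.0f;
    }
}

/* Process n floats in chunks of VLEN, then handle the remainder separately.
   Returns the number of full-vector iterations of the main loop. */
size_t fixed_vector_loop(float *a, size_t n)
{
    size_t i = 0;
    size_t iterations = 0;

    /* Main loop: only full vectors of VLEN elements. */
    for (; i + VLEN <= n; i += VLEN) {
        double_elements(a + i, VLEN);
        iterations++;
    }
    /* Extra code for the leftover elements, one at a time. */
    for (; i < n; i++) {
        double_elements(a + i, 1);
    }
    return iterations;
}
```

For n = 10 the main loop runs two times and the second loop handles the last two elements. The value of VLEN is fixed at compile time, so the code must be rebuilt to exploit a CPU with longer vectors.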

The ForwardCom system can make this loop in a more efficient way. We are using a backward index from the end of the array. The backward index counts down from 10 so that it always contains the remaining number of array elements to handle. The backward index is also used for specifying the desired vector length. If we ask for a longer vector than the CPU supports, then we will automatically get the maximum vector length. In this example the maximum length is four elements. In the first iteration we ask for ten elements and get four. The backward index is now decremented by four. In the next iteration we ask for six elements and get four. In the last iteration we ask for two elements and get two.

[Figure: loop with variable vector length]

This method has several advantages. First, we don't need any extra code to handle the remaining array elements if the array length is not divisible by the vector length. And second, it adjusts automatically to the maximum vector length of the CPU it is running on. If we run the same code on a CPU with a maximum vector length of 8 then the loop will run two iterations, handling 8 elements in the first iteration and 2 elements in the second iteration. If the maximum vector length is 16 then the loop will run only one iteration with a vector length of 10 elements.

The ForwardCom instruction set has a special addressing mode to support this loop method. It has a memory operand with a pointer register containing the end address and a backward index register that is subtracted from this pointer. A vector memory operand always uses an extra register to specify the length of the vector. We can use the same register for backward index and vector length, because we will get the maximum vector length when the specified length is more than the maximum length.
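
The loop method and its addressing mode can be emulated in C to make the control flow concrete. This is a sketch under stated assumptions: max_len plays the role of the CPU's maximum vector length in elements, the min operation models the hardware clamping the requested length, and the lengths array exists only to record what happened. The names backward_index_loop, scale_chunk, and min_size are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Stand-in for the loop body applied to one vector of `count` elements. */
static void scale_chunk(float *p, size_t count)
{
    for (size_t k = 0; k < count; k++) {
        p[k] *= 2.0f;
    }
}

/* Process n floats using a backward index from the end of the array.
   The length used in each iteration is stored in lengths[] (up to cap
   entries). Returns the number of iterations. */
size_t backward_index_loop(float *a, size_t n, size_t max_len,
                           size_t *lengths, size_t cap)
{
    assert(max_len > 0);
    float *end = a + n;          /* pointer register holding the end address */
    size_t iterations = 0;

    for (size_t remaining = n; remaining > 0; ) {
        /* Ask for `remaining` elements; get at most max_len. */
        size_t len = min_size(remaining, max_len);
        /* Address = end pointer minus backward index. */
        scale_chunk(end - remaining, len);
        if (iterations < cap) {
            lengths[iterations] = len;
        }
        iterations++;
        remaining -= len;        /* same variable serves as index and length request */
    }
    return iterations;
}
```

With n = 10 the recorded lengths are 4, 4, 2 for max_len = 4, then 8, 2 for max_len = 8, and a single vector of 10 for max_len = 16, matching the description above. There is no separate remainder loop, and max_len is a run-time property rather than a compile-time constant.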

The loop may contain function calls. Assume, for example, that the code in our example involves the calculation of the logarithm of each vector element. The logarithm function is contained in a standard math function library. Now, this function uses a vector register for input and a vector register for output. The information about the vector length is contained in the vector register itself. Therefore, the logarithm function can handle a vector of any length and calculate the logarithms of all vector elements simultaneously. A scalar (single element) parameter is simply handled by the function as a vector with one element. This makes it easy for an optimizing compiler to convert scalar code to vector code, even if the code contains function calls.
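
A library function that accepts vectors of any length can be modelled like this. The sketch is only an analogy in C: the type fvec, the constant MAX_ELEMENTS, and the functions vec_square and fvec_from_scalar are hypothetical, and squaring is used as a stand-in for the logarithm so that the example does not depend on the math library.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ELEMENTS 16   /* hypothetical maximum vector length in elements */

/* Model of a vector register: the length travels with the data. */
typedef struct {
    size_t length;
    float element[MAX_ELEMENTS];
} fvec;

/* Stand-in for a vector math library function. It needs no separate
   length parameter because the input vector carries its own length. */
fvec vec_square(fvec x)
{
    assert(x.length <= MAX_ELEMENTS);
    fvec y = { x.length, {0} };
    for (size_t k = 0; k < x.length; k++) {
        y.element[k] = x.element[k] * x.element[k];
    }
    return y;
}

/* A scalar is passed as a vector with one element. */
fvec fvec_from_scalar(float s)
{
    fvec v = { 1, { s } };
    return v;
}
```

The same vec_square serves a scalar call site and a vectorized loop body, which is the property that lets a compiler vectorize code containing such calls without needing a second version of the function.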


Efficient memory management

The ForwardCom system includes standards for the application binary interface (ABI), binary file format, memory organization, etc. These standards are designed so that memory fragmentation can be avoided, or at least minimized. A typical running application will have only three memory blocks: program code, read-only data, and read/write data (including static data, stack and heap). This makes memory management more efficient. The number of memory blocks that a running process or thread has access to is so small that it all can be contained in a memory map inside the CPU chip. This is very different from most common systems that have very large page tables. A large page table requires fixed-size memory pages in order to make table lookup simple. But if we can keep the number of table entries small then it is feasible to have variable-size table entries. The ForwardCom design has the goal of keeping all code or data that a process has access to contiguous and to avoid memory fragmentation as much as possible. This may make it possible to replace the huge multi-level page tables and translation-lookaside-buffers of current systems with a small on-chip memory map. Each process and each thread has its own memory map.
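
A memory map with a handful of variable-size entries could be checked roughly as follows. This is a sketch under assumptions, not the ForwardCom memory map format: the struct map_entry, the permission flags, and the linear search in access_allowed are invented for illustration, and address translation is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical permission flags. */
enum { PERM_READ = 1, PERM_WRITE = 2, PERM_EXEC = 4 };

/* One variable-size block of memory that a process may access. */
typedef struct {
    uint64_t start;   /* first address of the block */
    uint64_t size;    /* block size in bytes, not limited to a page size */
    unsigned perms;   /* combination of PERM_* flags */
} map_entry;

/* Return true if addr lies in some block that grants all `wanted` permissions.
   A linear search is adequate because the map has very few entries. */
bool access_allowed(const map_entry *map, size_t count,
                    uint64_t addr, unsigned wanted)
{
    for (size_t i = 0; i < count; i++) {
        if (addr >= map[i].start && addr - map[i].start < map[i].size) {
            return (map[i].perms & wanted) == wanted;
        }
    }
    return false;   /* address is outside every block */
}
```

A typical process as described above would need only three entries: one executable block for code, one readable block for constant data, and one readable and writable block for static data, stack, and heap.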

Some of the techniques that are used for keeping data contiguous are:


Security features

Security is an integral part of the hardware and software design. This includes the following planned features:


Visions for the application of ForwardCom

ForwardCom will not readily replace the commonly used systems, even if it is better, because the users need compatibility with existing hardware and software. However, the development of an ideal instruction set architecture and a complete redesign of the ecosystem of hardware and software standards is a worthwhile exercise in itself which may produce useful results and unexpected new discoveries. This project has already generated so many valuable ideas that it is worth pursuing further.

Let's assume that the need for a new instruction set will arise in the future, for whatever reason. Then it will be good to have a ready proposal that has been through a long development process, rather than starting from scratch with a limited time budget and ending up with a suboptimal solution. An open, ongoing development process with input from anybody interested is likely to generate better results than the usual closed industry process with its short-term commercial priorities.

ForwardCom may, for example, be useful for the following purposes:


ForwardCom will also be useful as a sandbox for university projects and experiments with new ideas such as:


Current status of the ForwardCom project

The ForwardCom project is still under development. The basic instruction set architecture has been designed and a complete set of application-level instructions is defined. Some system-level instructions are not fully developed yet.

The structure of the binary file format for object files, function libraries, and executable files has been defined.

The fundamentals of application binary interface standards (ABI), memory management standard, etc. have been defined.

A high-level assembler and a disassembler have been developed. An emulator has not been developed yet.

When this groundwork is complete, we can start to make implementations in FPGA and ASIC chips.


Resources

Comparison of ForwardCom with other instruction sets

Complete manual for ForwardCom

Public repository on Github

Discussion forum for ForwardCom development

Agner's optimization resources, mainly for x86 microprocessors

By Agner Fog, 2017.