ForwardCom
Proposal for a forward compatible open
instruction set architecture
Contents:
Introduction
Highlights
A flexible instruction format
Variable length vector registers
A new efficient type of loops
Efficient memory management
Security features
Development tools
Visions for the application of ForwardCom
Current status
Resources
Introduction
ForwardCom is a project for development of a new open instruction set
architecture and the corresponding hardware and software standards for high
performance microprocessors. The intention is to experiment and to investigate what an
ideal computer architecture may look like, and to develop a complete
computer system that is more efficient than the currently prevailing
systems, such as x86 and ARM.
ForwardCom is also useful as a high-end alternative to RISC-V.
Starting from scratch and making a
complete vertical redesign allows us to learn from the history of past
mistakes and get rid of the heritage of old quirks that hamper
contemporary systems.
Highlights
- The ForwardCom instruction set is neither RISC nor CISC, but a new
paradigm with the advantages of both. ForwardCom has few
instructions, but many variants of each instruction. A consistent
template system with few instruction sizes combines the fast and
streamlined decoding and pipeline design of RISC systems with the
compactness and more work-done-per-instruction of CISC systems.
- An instruction can do multiple things, but only if it fits into
the pipeline system. There is no need for microcode.
- The ForwardCom design is scalable to support small embedded
systems as well as large supercomputers and vector processors
without losing binary compatibility.
- The instruction set is fully orthogonal. The same instruction can
be coded with integer operands of different sizes and floating point
operands of different precisions. The operands can be scalars or
vectors of any length. One operand of each instruction can be a
register, a memory operand with different addressing modes, or an
immediate constant. The other operands must be registers.
- Vector registers of variable length are provided for efficient
handling of large data sets.
- Array loops are implemented in a new flexible way that
automatically uses the maximum vector length supported by the
microprocessor in all but the last iteration of a loop. The last
iteration automatically uses a vector length that fits the remaining
number of elements. No extra code is needed to deal with remaining
data and special cases. There is no need to compile the code
separately for different microprocessors with different vector
lengths.
- No recompilation or update of software is needed when a new
microprocessor with a different vector register length becomes
available. The software is guaranteed to be forward compatible and
take advantage of the longer vectors of new microprocessor models
without recompilation.
- Strong security features are a fundamental part of the hardware
and software design.
- Memory management is simpler and more efficient than in
traditional systems. Various techniques are used for avoiding memory
fragmentation. There is no memory paging and no translation
lookaside buffer (TLB). Instead, there is a memory map with a
limited number of sections with variable size. All code is
position-independent.
- There are no dynamic link libraries (DLLs) or shared objects.
Instead, there is only one type of function library that can be
used for both static and dynamic linking. Only the part of the
library that is actually used is loaded and linked. The library code
is kept contiguous with the main program code in almost all cases.
An executable file can be re-linked to update a function library
or to adapt the program to a particular hardware
configuration, operating system, or user interface framework.
- A mechanism for calculating the required stack size is provided.
This can prevent stack overflow in most cases without making the
stack bigger than necessary.
- A mechanism for optimal register allocation across program modules
and function libraries is provided. This makes it possible to keep
most variables in registers without spilling to memory. Vector
registers can be saved in an efficient way that stores only the part
of the register that is actually used.
- Standards for software tools, ABI, file formats, system libraries,
etc. are defined in order to establish compatibility between
different programming languages and different platforms. It is
possible to code different parts of a program in different
programming languages.
A flexible instruction format
The ForwardCom instruction set is based on a consistent and flexible
modular format suitable for fast superscalar processors. Each
instruction uses one, two, or three 32-bit words.
It is possible to add still longer
instructions for application-specific purposes. Often-used
instructions can also be coded in a tiny format, where a 32-bit
instruction word contains two tiny instructions. Tiny instructions are
always paired.
[Figure: simplified sketch of the instruction format]
The basic instruction word is 32 bits, divided into the following
fields:
- Instruction length
- Tells whether the instruction uses one or more 32-bit words.
- Mode
- Tells which template is used, what the different fields are used
for, whether the instruction uses general purpose registers or
vector registers, whether there is a memory operand, and which
addressing mode is used.
- Operation
- Tells which operation to perform. There can be up to 64 multi-format
instructions. A multi-format instruction can have many different
formats, instruction lengths, and addressing modes. In addition,
there can be a large number of single-format instructions. One
operation code in ForwardCom corresponds to multiple different
operation codes in other systems because it can have several
different operand types, register types, vector lengths, masks,
addressing modes, etc.
- Destination register
- There are 32 general purpose registers and 32 vector registers.
The register specified in this field is used for the destination
(output) of the instruction. The same register is also used as
source (input) if there are not enough source registers in the other
fields.
- Operand type
- The operands can be 8-bit, 16-bit, 32-bit, and 64-bit integers and
half, single and double precision floating point numbers.
There is optional support for 128-bit integers and quadruple precision floating point numbers.
- Source register
- There can be one source register when template B is used or two
source registers when template A is used. Instructions with double
length can have three source registers. These can be general purpose
registers or vector registers. They can also be used for memory
pointers, array index, or vector length.
- Mask
- A register can be used as a mask or predicate to enable or disable
the operation and to specify various options. Masks are particularly
useful for vector operations where an operation can be enabled or
disabled for each vector element separately.
- Data
- Data fields can be used for immediate operands and for relative
addresses. Instructions with double length can have 32-bit data
fields. Instructions with triple length and 64-bit data fields are
optionally supported. Data fields can contain integer or floating
point numbers or option bits. Data can be compressed into
the smallest field size that fits the actual value.
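The field list above can be made concrete with a small decoder. This is only a sketch: the bit positions and the names `fields_t`, `bits`, and `decode_word` are assumptions made for illustration, chosen so that the widths agree with the 64 multi-format operations and 32 registers mentioned above. The authoritative layout is the one given in the manual.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of one 32-bit instruction word with two source
   registers. Bit positions are assumed for this sketch only. */
typedef struct {
    unsigned il;    /* instruction length code                 */
    unsigned mode;  /* template and addressing mode selector   */
    unsigned op;    /* operation, 0..63                        */
    unsigned rd;    /* destination register, 0..31             */
    unsigned ot;    /* operand type code                       */
    unsigned rs;    /* first source register, 0..31            */
    unsigned mask;  /* mask register code                      */
    unsigned rt;    /* second source register, 0..31           */
} fields_t;

/* Extract `width` bits of `word`, starting at bit `lsb`. */
unsigned bits(uint32_t word, unsigned lsb, unsigned width) {
    assert(width > 0 && width < 32 && lsb + width <= 32);
    return (unsigned)((word >> lsb) & ((1u << width) - 1u));
}

/* Split an instruction word into its fields (assumed positions). */
fields_t decode_word(uint32_t w) {
    fields_t f;
    f.il   = bits(w, 30, 2);
    f.mode = bits(w, 27, 3);
    f.op   = bits(w, 21, 6);
    f.rd   = bits(w, 16, 5);
    f.ot   = bits(w, 13, 3);
    f.rs   = bits(w,  8, 5);
    f.mask = bits(w,  5, 3);
    f.rt   = bits(w,  0, 5);
    return f;
}
```

Because every field sits at a fixed position within the word, a decoder of this kind needs only shifts and masks, which is the property that makes fixed templates attractive for fast decoding.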
Variable length vector registers
Vector registers are used for handling multiple data elements simultaneously.
The computer systems that are commonly used today have vector
registers with fixed lengths. Every time a new CPU model with longer
vectors comes on the market, the software has to be recompiled using a
new instruction set extension that supports the new vector size.
Software developers have to develop a new version of their software
every time a new CPU model comes on the market, and they have to
maintain and support several different versions of their software for
the different CPU models if they want to use all CPU models optimally.
This is so expensive that it is hardly ever done. Most of the software
that is sold today is optimized for CPU models that are already
obsolete.
A further problem with current designs is that it is impossible to
make your software save a vector register in a way that will be
compatible with future extensions of the vector length, because the instructions for doing
so have not yet been defined.
The need to solve these problems was a strong motivation for
developing ForwardCom. The ForwardCom architecture has variable-length
vector registers. The software can use the maximum vector length
supported by the CPU it is running on, or it can specify any vector
length less than this. The length of a vector register is stored in
the register itself. This is useful when a vector register is saved to
memory and you don't want to save more data than the register actually
contains. It is possible to make software that automatically uses the
maximum vector length that the CPU supports, even if this vector
length was not supported at the time the software was written. This is
what we call forward compatibility.
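The idea of a register that carries its own length can be modeled in C as follows. This is a minimal sketch, not the hardware mechanism: the type `vreg_t`, the constant `MAX_VECTOR_BYTES`, the functions `vreg_save` and `vreg_restore`, and the saved layout (a length header followed by the used bytes) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_VECTOR_BYTES 64  /* hypothetical maximum supported by the CPU */

/* Model of a variable-length vector register that stores its own length. */
typedef struct {
    size_t length;                          /* bytes currently in use */
    unsigned char data[MAX_VECTOR_BYTES];
} vreg_t;

/* Save only the used part: the length, then `length` data bytes.
   Returns the number of bytes written to `buf`. */
size_t vreg_save(const vreg_t *v, unsigned char *buf) {
    assert(v->length <= MAX_VECTOR_BYTES);
    memcpy(buf, &v->length, sizeof v->length);
    memcpy(buf + sizeof v->length, v->data, v->length);
    return sizeof v->length + v->length;
}

/* Restore a register saved by vreg_save. Returns the bytes consumed. */
size_t vreg_restore(vreg_t *v, const unsigned char *buf) {
    memcpy(&v->length, buf, sizeof v->length);
    assert(v->length <= MAX_VECTOR_BYTES);
    memset(v->data, 0, sizeof v->data);     /* unused part is cleared */
    memcpy(v->data, buf + sizeof v->length, v->length);
    return sizeof v->length + v->length;
}
```

The saving code never mentions a particular vector size, so the same code keeps working if `MAX_VECTOR_BYTES` grows on a later CPU model.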
The variable-length vector registers can be used in a new and very
efficient type of loop that automatically uses the optimal vector
length. This is described in the next section.
A new efficient type of loops
Let's consider a simple loop that does something with an array of 10
floats. It may look something like this:
    float my_array[10];
    for (int i = 0; i < 10; i++) {
        do_something(my_array[i]);
    }
A simple implementation will use i as an index relative to the start address of the array
while counting i up to 10, and load one element at a time into a register.
A vector implementation in a current system will load a number of consecutive array elements, e.g. four, into a vector register,
and increment i by four for each iteration of the loop.
In this example, the loop will iterate two times and handle four array elements in each iteration.
There are two remaining elements in the end because the length of the array is not divisible by the vector length.
These remaining elements must be handled separately outside the loop.
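The conventional approach can be sketched in plain C. The fixed vector length `VLEN`, the helper `do_something_block` (a stand-in for do_something that doubles each element), and the function `process_fixed` are hypothetical names introduced for this illustration. The inner helper call with `VLEN` elements plays the role of one vector instruction.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4  /* fixed vector length of the hypothetical CPU, in elements */

/* Stand-in for do_something(): doubles `n` consecutive elements in place. */
static void do_something_block(float *p, size_t n) {
    for (size_t k = 0; k < n; k++) p[k] *= 2.0f;
}

/* Fixed-length vectorization: full vectors in the main loop, leftover
   elements in a separate tail. Returns the number of full-vector
   iterations of the main loop. */
size_t process_fixed(float *a, size_t n) {
    size_t i = 0, iterations = 0;
    for (; i + VLEN <= n; i += VLEN) {   /* main loop, VLEN elements each */
        do_something_block(&a[i], VLEN);
        iterations++;
    }
    for (; i < n; i++) {                 /* extra code for the remainder */
        do_something_block(&a[i], 1);
    }
    return iterations;
}
```

With 10 elements and `VLEN` equal to 4, the main loop runs twice and the tail loop handles the last two elements. Changing `VLEN` requires recompiling, which mirrors the situation described above.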
The ForwardCom system can make this loop in a more efficient way. We are using a backward index from the end of the array.
The backward index counts down from 10 so that it always contains the remaining number of array elements to handle.
The backward index is also used for specifying the desired vector length. If we ask for a longer vector than the CPU supports,
then we will automatically get the maximum vector length. In this example the maximum length is four elements.
In the first iteration we ask for ten elements and get four. The backward index is now decremented by four.
In the next iteration we ask for six elements and get four. In the last iteration we ask for two elements and get two.

This method has several advantages. First, we don't need any extra code to handle the remaining array elements
if the array length is not divisible by the vector length. And second, it adjusts automatically to the maximum vector length of
the CPU it is running on. If we run the same code on a CPU with a maximum vector length of 8 then the loop will run two iterations,
handling 8 elements in the first iteration and 2 elements in the second iteration. If the maximum vector length is 16 then the loop
will run only one iteration with a vector length of 10 elements.
The ForwardCom instruction set has a special addressing mode
to support this loop method. It has a memory operand with a pointer register containing the end address and a backward index register
that is subtracted from this pointer. A vector memory operand always uses an extra register to specify the length of the vector.
We can use the same register for backward index and vector length, because we will get the maximum vector length when the specified
length is more than the maximum length.
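The loop method described above can be emulated in C to check the iteration counts. This is a sketch under stated assumptions: `process_backward` is a hypothetical function, `max_len` stands for the maximum vector length of the CPU, the inner for loop stands for a single vector operation, and the index is counted in array elements for simplicity. How the real addressing mode scales the index is defined in the manual and is not modeled here.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates the backward-index loop on a CPU whose maximum vector length
   is `max_len` elements. Adds 1 to every element of a[0..n-1].
   Returns the number of loop iterations. */
size_t process_backward(float *a, size_t n, size_t max_len) {
    assert(max_len > 0);
    float *end = a + n;        /* pointer register: end of the array     */
    size_t remaining = n;      /* backward index: elements still to do   */
    size_t iterations = 0;

    while (remaining > 0) {
        /* Ask for `remaining` elements; at most max_len are granted. */
        size_t len = remaining < max_len ? remaining : max_len;

        /* Address = end pointer minus backward index. */
        float *p = end - remaining;

        /* Stands for one vector operation on `len` elements. */
        for (size_t k = 0; k < len; k++) p[k] += 1.0f;

        remaining -= len;
        iterations++;
    }
    return iterations;
}
```

Calling `process_backward` on a 10-element array gives three iterations for a maximum length of 4, two for 8, and one for 16, with no separate code for the remainder and no change to the loop itself.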
The loop may contain function calls. Assume, for example, that the code in our example
involves the calculation of the logarithm of each vector element. The logarithm function is contained in a standard math function library.
Now, this function uses a vector register for input and a vector register for output. The information about the vector length is contained
in the vector register itself. Therefore, the logarithm function can handle a vector of any length and calculate the logarithms of all
vector elements simultaneously. A scalar (single element) parameter is simply handled by the function as a vector with one element.
This makes it easy for an optimizing compiler to convert scalar code to vector code, even if the code contains function calls.
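A library function of this kind can be modeled in C with a vector type that carries its own length. The sketch below is an illustration only: `fvec_t`, `MAX_ELEMENTS`, and `vec_square` are hypothetical names, and squaring is used in place of the logarithm so that the example has no library dependencies.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ELEMENTS 16  /* hypothetical maximum vector length */

/* Model of a vector register: the length travels with the data. */
typedef struct {
    size_t length;               /* number of elements in use */
    float data[MAX_ELEMENTS];
} fvec_t;

/* Hypothetical library function: squares every element of x.
   The length is read from the argument itself, so the function has no
   separate length parameter and works for any vector length. */
fvec_t vec_square(fvec_t x) {
    assert(x.length <= MAX_ELEMENTS);
    fvec_t r = { x.length, { 0 } };
    for (size_t i = 0; i < x.length; i++) {
        r.data[i] = x.data[i] * x.data[i];
    }
    return r;
}
```

A scalar argument is passed as a vector with `length` equal to 1, so the same function serves both the scalar and the vectorized version of a loop.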
Efficient memory management
The ForwardCom system includes standards for the application binary interface (ABI),
binary file format, memory organization, etc. These standards are designed so that memory fragmentation can be avoided,
or at least minimized. A typical running application will have only three memory blocks: program code, read-only data,
and read/write data (including static data, stack and heap). This makes memory management more efficient.
The number of memory blocks that a running process or thread has access to is so small that they can all be contained in a
memory map inside the CPU chip. This is very different from most common systems that have very large page tables.
A large page table requires fixed-size memory pages in order to make table lookup simple. But if we can keep the number of
table entries small then it is feasible to have variable-size table entries. The ForwardCom design has the goal of keeping
all code or data that a process has access to contiguous and to avoid memory fragmentation as much as possible.
This may make it possible to replace the huge multi-level page tables and translation-lookaside-buffers of current systems
with a small on-chip memory map. Each process and each thread has its own memory map.
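A memory map with a few variable-size entries can be sketched as a short table that is searched linearly. Everything in this example is an assumption made for illustration: the entry format `map_entry_t`, the permission flags, and the function `access_allowed`. Address translation and the actual hardware format are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { PERM_READ = 1, PERM_WRITE = 2, PERM_EXEC = 4 };

/* One entry of a hypothetical memory map. Sizes are arbitrary, not
   multiples of a fixed page size. */
typedef struct {
    uint64_t start;   /* first address of the block */
    uint64_t size;    /* length of the block in bytes */
    unsigned perms;   /* combination of PERM_* flags */
} map_entry_t;

/* Returns true if the range [addr, addr + len) lies inside a single map
   entry that grants all permissions in `need`. */
bool access_allowed(const map_entry_t *map, size_t entries,
                    uint64_t addr, uint64_t len, unsigned need) {
    for (size_t i = 0; i < entries; i++) {
        const map_entry_t *e = &map[i];
        /* Written without addr + len to avoid integer overflow. */
        if (addr >= e->start && len <= e->size &&
            addr - e->start <= e->size - len) {
            return (e->perms & need) == need;
        }
    }
    return false;  /* address range is not mapped */
}
```

With only three entries (code, read-only data, read/write data), the whole table fits in a handful of registers, and a lookup needs at most three comparisons.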
Some of the techniques that are used for keeping data contiguous are:
- All addresses are relative and all code is position-independent. Code is addressed relative to the instruction pointer.
Static data are addressed relative to a special register called the data section pointer.
Code address and data address are independent of each other.
- The stack size is calculated by the compiler and linker so that the necessary stack size is known in advance,
except when the code contains recursive function calls.
- The heap size may be predicted by statistical methods.
The heap is expanded exponentially if the required size exceeds the predicted size.
- There are no dynamic link libraries (DLLs) or shared objects. A new re-linking feature is provided instead.
There is only one type of function library, which can be used for both static and dynamic linking.
Function libraries are kept contiguous with the program that calls them, even in the case of dynamic linking.
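The stack size calculation mentioned in the list above amounts to finding the deepest path through the call graph. The following sketch assumes that each function's own stack use and its list of callees are known, for example from information stored in object files. The record type `func_info_t` and the function `stack_need` are hypothetical, and the method works only when there are no recursive calls.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLEES 4

/* Hypothetical per-function record, as a linker might collect it. */
typedef struct {
    size_t frame_bytes;          /* stack used by the function itself */
    size_t n_callees;            /* number of valid entries in callees */
    int callees[MAX_CALLEES];    /* indices into the function table */
} func_info_t;

/* Worst-case stack need of function `f`: its own frame plus the need of
   its most demanding callee. Assumes the call graph has no cycles. */
size_t stack_need(const func_info_t *table, int f) {
    size_t deepest = 0;
    assert(table[f].n_callees <= MAX_CALLEES);
    for (size_t i = 0; i < table[f].n_callees; i++) {
        size_t d = stack_need(table, table[f].callees[i]);
        if (d > deepest) deepest = d;
    }
    return table[f].frame_bytes + deepest;
}
```

Applied to the entry function of a program, `stack_need` gives the amount of stack to reserve, which is why recursion is the one case where the size cannot be known in advance.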
Security features
Security is an integral part of the hardware and software design. This includes the following planned features:
- A flexible and efficient memory protection mechanism.
- Separation of call stack and data stack so that return addresses cannot be compromised by buffer overflow.
- Jump tables and function pointer tables are placed in read-only memory.
- Features for array bounds checking are built in.
- Optional methods for checking integer overflow.
- Each thread can have its own protected memory space, which is not accessible to parent and sibling threads within the same process.
- Device drivers and system functions have carefully controlled access rights.
These functions have access only to a specific block of memory that the calling process chooses to give them access to.
A device driver has access only to a controlled range of input/output ports and system registers.
- Application programs have access only to specific resources as specified in the executable file
header and controlled by the system.
- Mandatory standardized procedure for installing and uninstalling programs.
- There is no "undefined" behavior. There is always a limited set of permissible responses to an error condition.
Development tools
The following development tools are available:
- High-level assembler. The assembly language for ForwardCom looks like C or Java.
It understands all common operators and C-style branches and loops.
- Disassembler. The output of the disassembler can be assembled again to functional code in most cases.
- Linker. The ForwardCom linker supports relinking of executable files. Other features include communal sections,
function-level linking, and weak symbols.
- Library manager. The libraries produced by the library manager can be used for static linking,
relinking, and dynamic linking.
- Emulator. A ForwardCom executable program can be emulated under Windows, Linux, or other systems.
- Debugger. The emulator can also be used as a debugger. There is no interactive debugging feature yet,
but the debugging process produces a list of executed instructions and their results.
- Libraries. A standard C library includes the most common C functions.
A math library currently contains only a few functions for demonstration purposes,
including trigonometric functions and numerical integration.
The same mathematical functions can be used with scalars and vectors as parameters.
- Code examples. A selection of code examples is provided as a starting point for experimentation.
Visions for the application of ForwardCom
ForwardCom will not readily replace the commonly used systems, even if it is better, because the users
need compatibility with existing hardware and software. However, the development of an ideal instruction set
architecture and a complete redesign of the ecosystem of hardware and software standards is a worthwhile exercise
in itself which may produce useful results and unexpected new discoveries. This project has already generated so
many valuable ideas that it is worth pursuing further.
Let's assume that the need for a new instruction set
will arise in the future, for whatever reason. Then it will be good to have a ready proposal that has been through
a long development process rather than starting from scratch with a limited time budget and ending up with a suboptimal solution.
An open ongoing development process with inputs from anybody interested is likely to generate better results than the usual
closed industry process with its short-term commercial priorities.
ForwardCom may, for example, be useful for the following purposes:
- Supercomputers with very long vector registers.
- Applications where the security features of ForwardCom are needed.
- Niche products where compatibility with older systems is not required.
- Applications where the patent and license restrictions of other architectures would be an obstacle.
- Real-time systems where the efficient memory management and fast task switching of ForwardCom are useful.
- Applications that need application-specific instruction set extensions.
- Some of the new ideas generated by the ForwardCom project may be applied to other systems.
ForwardCom will also be useful as a sandbox for university projects and experiments with new ideas such as:
- Testing the concept of forward compatibility.
- Hardware development and research on the compromise between RISC and CISC.
- Research on control flow decoupling, as discussed in chapter 8.1 of the
manual.
- Custom instructions with on-chip FPGA.
- Testing the efficiency of large vectors, variable vector length, and efficient array loops.
- Testing the efficiency of memory management without translation lookaside buffer (TLB), and methods for minimizing memory fragmentation.
- Re-linkable executable files as an alternative to DLLs and plugins.
- Secure design to prevent common software attacks and vulnerabilities.
This includes: separation of call stack from data stack, code pointers in read-only memory,
private memory space for each thread, limited access rights for device drivers,
specific access rights for each executable file, and standardized software installation procedure.
- Experiments with improved NAN propagation as discussed in chapter 6.3 of the
manual.
- Testing the efficiency of half-precision floating point vectors.
- Research on metaprogramming. The ForwardCom assembler includes planned and partially implemented metaprogramming features.
- Calculation of required stack size by the linker.
- Optimization of register allocation by providing information about register use in object files and library files.
Current status of the ForwardCom project
The ForwardCom project is still under development. The basic instruction set architecture has been
designed and a complete set of application-level instructions is defined.
Some system-level instructions are not fully developed yet.
The structure of the binary file format for object files,
function libraries, and executable files has been defined in detail.
The details of application binary interface standards (ABI), memory management standard, etc. have been defined.
The following binary tools have been developed: high-level assembler, disassembler, linker, library manager,
emulator, and debugger.
Hardware implementations in FPGA are being discussed.
Discussion forum
A discussion forum for ForwardCom development is provided at
www.forwardcom.info/forum.
Resources
Comparison of ForwardCom with other instruction sets
Complete manual for ForwardCom
Public repository on GitHub
Agner's optimization resources, mainly for x86 microprocessors
By Agner Fog, 2017 - 2020.