The original proposal, 2 years ago
Posted: 2017-11-02, 6:56:19
Two years ago I made the first proposal for a new instruction set. I am copying the discussion thread from the old messageboard (http://www.agner.org/optimize/blog/read.php?i=421) to the new one here. As you can see, a lot has happened since then.
Original date: 2015-12-27
Table of Contents
An instruction set is a standardized set of machine instructions that a computer can run.
There are many instruction sets in use. To introduce a new instruction set is not an easy thing to do because it breaks the compatibility with existing software and hardware. Therefore, the successful introduction of a new instruction set is a rare occurrence in the evolution of computer technology, while extensions to existing instruction sets occurs frequently. Some commonly used instruction sets are poorly designed from the beginning and amended with many extensions and patches. One of the worst cases is the widely used x86 instruction set family. This instruction set is the result of a long series of short-sighted extensions and patches. The result of this development is a very complicated code system which is very difficult and costly to decode in a microprocessor. We need to learn from past failures in order to be prepared to make better choices from the start, in case the opportunity to design a new instruction set should come up. The purpose of this article is to construct an example of a new instruction set that is better designed from the start, based on the experience we have with existing instruction sets. The following principles are important to have in mind:
Basic architecture
Instruction format
A pure RISC instruction set has the advantage that all instructions have the same length. This makes it easy to decode multiple instructions in parallel. But it has the disadvantage of using a lot of precious space in the code cache. A CISC instruction format can have a variable instruction length. The well-known x86 format allows instructions of any length from 1 to 15 bytes. This makes the code more compact, but it is very complicated and expensive to decode. It is difficult for the microprocessor to decode multiple instructions in parallel because it needs to find the length of the first instruction before it knows where the second instruction begins, and the instruction length is determined by a complicated algorithm involving many elements of the instruction. Instruction decoding is therefore often a serious bottleneck.
The proposed instruction format is a compromise between these two principles. Instructions can have a few standardized lengths and formats, and the determination of the length is simple. This allows for smaller instructions to save size, and longer instructions when there is a need for more bits for address, data, registers or extra options. Many instructions exist in multiple versions with different sizes. The instruction format is completely orthogonal in the sense that the same instruction can be specified with register, memory or immediate operands, different integer sizes, different floating point precisions, different vector lengths, and different addressing modes.
An instruction can use one, two or three dwords of 32 bits each - that is 32, 64 or 96 bits. No other sizes are permitted. Instructions must be aligned to dword addresses. The first two bits (most significant bits) of the first dword of an in struction indicates the length:
00 = 1 dword
01 = 1 dword
10 = 2 dwords
11 = 3 dwords
In order to further save space in the code cache, there can be certain instructions which do multiple operations in a single short instruction, such as:
Registers
There are 32 universal registers, named r0 - r31. The proposed instruction set has only one type of registers. These registers can be used for all types of data: Boolean, 8-, 16-, 32-, 64- and (optionally) 128-bit signed and unsigned integers, floating point numbers with single, double and (optionally) quadruple precision, pointers, references, flags and predication masks. This reduces the number of different instructions because the same instruction can be used on different types of data, and because no instructions are needed for transferring data from one type of register to another. For example, the same 'AND' instruction can be used for operations on Booleans, for manipulating bits in integers, for manipulating the sign bit of floating point numbers, and for manipulating predication masks.
The same registers can also be used as vectors of any of these data types. The microprocessor must support vectors of at least 128 bits. Support for larger sizes is optional. Vector sizes up to 8192 bits can be specified by 3 bits in the instruction code. It is also possible to specify the largest available vector size in an instruction. This can be anything from 128 bits and up, with no upper limit. Software can take advantage of future extensions by specifying the largest available vector size. The largest size can be modified by a control register to any power of 2 from 128 to the largest size supported by the microprocessor.
The unused part of a register is always set to zero whenever a register is modified. No instruction leaves part of a register unchanged except for instructions intended for blending or interleaving data. This is important in order to avoid false dependencies on the previous value of the full register, which is known to cause serious performance problems in some existing processors (known as partial register stall). The processor does not actually need to spend power on setting all the superfluous bits to zero. Typically, it will simply turn off the unused parts of execution units and data buses in order to save power.
Stack
There is one stack. The stack register is r31. Including the stack register as one of the universal registers makes it possible to use it as a base pointer in memory addressing and to modify the stack frame with arithmetic instructions. The stack register needs only be 64 bits.
Instruction pointer
The instruction pointer is 64 bits. It is not included in the universal registers. The reason for this decision is to avoid the possible modification of the instruction pointer by arithmetic instructions, which would make branch prediction difficult.
Flags
There is no dedicated flags register. Registers r1 - r7 can be used as predicate registers or mask registers. Many instructions can be predicated. A predicated non-vector instruction will use one of these registers as predicate, and execute the instruction only when bit 0 in the predicate register is 1. The predicate register is thus also a Boolean variable. Execution is unconditional when r0 is specified as the predicate register.
The predication mechanism can be vectorized. A predicate vector is also known as a mask. A masked vector instruction works in the following way. Each element in the vector is processed only if the corresponding element in the mask register has 1 in its least significant bit. The mask register is treated as a vector of Booleans, where each element in the Boolean vector has the same number of bits as the data vectors in the instruction, and only the least significant bit in each Boolean vector element is used, while the remaining bits are ignored. (Other systems use the most significant bit, or all bits, in the mask, but it is preferred to use the least significant bit for the sake of compatibility between Boolean scalars and Boolean vectors). Results that are masked off are either unchanged or set to zero, depending on the instruction. Some instructions support both options to be selected with a feature bit.
Instructions for extended precision arithmetic, such as add-with-carry and subtract-with borrow work in the following way. One register is specified in the predicate register field of the instruction code. Bit 0 of this register is used as both carry-in and carry-out. The traditional arithmetic flags are written to a few bits of the predicate register:
bit 0: carry flag
bit 1: zero flag
bit 2: overflow of signed arithmetic
bit 3: sign bit
bit 4: negative = sign xor overflow
If the predicate register for an add-with-carry instruction is specified as r0 then the carry-in will be 0, but the arithmetic flags for the result will be written to r0. Shift and rotate instructions can output a carry to a predicate register, but may not have a carry-in. There are no rotate-through-carry instructions, but an add-with-carry of a register to itself can be used as a rotate-left-through-carry (Rotate through carry instructions are rarely used anyway, and they are very inefficient on many processors). Integer and floating point compare instructions also produce these flags.
The carry mechanism can be vectorized so that multiple add-with-carry operations can be executed in parallel.
Branches
Branching is done with combined arithmetic-and-branch instructions. These are ALU instructions such as add, subtract, compare, bit test, etc. combined with a conditional jump, for example: subtract and jump if not zero, compare and jump if above, test a specific bit and jump if it is zero. These instructions cannot be vectorized. The vector size field is used as condition code. There is no need to support predicated jump instructions because these can be replaced by a combined bit test and conditional jump instruction. Multiway branches can be implemented with indirect jump or indirect call.
Debug and interrupt flags
There are various control registers which can be used for debugging purposes, interrupt control, etc.
Addressing modes
The address space uses 64-bit addresses only. Addresses are always relative to the instruction pointer, stack pointer or a register pointer. Absolute addressing does not need not be supported. The following addressing modes are supported:
Direct conditional and unconditional jumps and calls are always relative to the instruction pointer with 8-bit or 32-bit sign-extended offset, scaled by 4 because all instructions are aligned to addresses divisible by 4.
CPUID
A CPUID instruction must have functions for telling whether optional features are supported, e.g. 128-bit integers, quadruple precision floating pont, and the maximum vector size for each type of operands. There should also be features for telling how efficient certain instructions are, to help software determine the optional coding version.
Proposed code structure
An instruction code contains a combination of the fields described below, where some of the fields can be omitted. The total size of all the fields must be 32, 64 or 96 bits.
Instruction length: 2 bits.
00 = 1 dword = 32 bits
01 = 1 dword = 32 bits
10 = 2 dwords = 64 bits
11 = 3 dwords = 96 bits
Instruction format: 2 or more bits.
Each combination of the instruction length and instruction format bits defines a class of instructions having a particular combination of the remaining fields. In other words, the combination of instruction length and instruction format bits determines which of the following fields are present, and their sizes.
Instruction code: 6 or more bits.
These bits are used for distinguishing the individual instructions, such as add, move, jump, etc. The number of instruction code bits is simply the number of bits not used for anything else. Therefore, the number of instruction code bits can be different for different instruction formats. The instruction bits are not necessarily contiguous if the placement of other fields on fixed positions has higher priority in the design.
Register: 1 - 4 fields of 5 bits each.
Used for both operand registers, base pointer and index register.
Predicate register: 3 bits.
Specifies a register used for predicated scalar instructions, masked vector instructions, and flags input and output. Only r1 - r7 can be used as predicate register. r0 means no predicate.
Operand size and type: 3 bits.
Defines the type, size and precision of operands, integer or floating point. The size for integer operands can be 8, 16, 32, 64, and optionally 128 bits. The precision of floating point operands can be single, double, and optionally quadruple precision. Half precision is not supported, except in conversion instructions.
Vector length: 3 bits.
Specifies the length of vectors in bits. Possible values are: scalar, 128, 256, 512, 1024, 2048, 4096, and max. Support for values above 128 are optional. The size of operands is as determined by the operand size/type when "scalar" is specified. A vector will contain as many elements of the specified operand size as can be contained in the vector size. For example, a 256 bit vector can contain 8 elements of 32 bits each. The "max" specification gives the largest vector size supported by the processor. This value depends on the processor and must be a power of 2. The minimum allowed value is 128, with no upper limit. The max value may be different for different operand sizes. A piece of software can take advantage of future extensions by specifying the max vector size. The max value can be reduced by settings in a control register.
Addressing modes: 2 bits.
The following addressing modes are defined for memory operands. An instruction can have no more than one explicit memory operand with this specifiation.
00: IP + index + 32 bits offset (specify r31 for no index)
01: base + 8 or 32 bits offset
10: base + index + 8 or 32 bits offset
11: base + index * operand size + 8 or 32 bits offset
The base and index registers are specified in register fields. The offset size (8 or 32 bits) depends on the instruction format. 8-bit offsets are always multiplied by the specified operand size in bytes. For example, an operand size of 32 bits = 4 bytes will multiply the value in the offset field by 4. 32-bit offsets are not multiplied by this factor. Offsets are always sign-extended. It is not required to support 16-bit or 64-bit offsets or absolute addressing.
Jumps and calls have an offset of 8 or 32 bits relative to the instruction pointer. This offset is multiplied by 4 because all instructions have sizes that are multiples of 4 bytes.
Address offset: 8 or 32 bits.
This field is used as specified above under addressing mode.
Immediate data operand: 8, 32 or 64 bits.
An 8-bit immediate value is interpreted as an integer, sign extended to the specified operand size. The signed value is converted to floating point if a floating point operation is specified.
A 32-bit or 64-bit immediate is interpreted as an integer for integer operations or a single or double precision float for floating point operations. The integer immediate constant is sign-extended if necessary. The floating point immediate constant is converted to the desired precision if necessary.
16-bit immediates are not necessarily supported.
Rounding mode: 2 bits.
Optionally specifies the rounding mode used in floating point operations and conversions. Possible values are: round to nearest or even, round down, round up, truncate towards zero. The default value if there is no rounding mode field is "round to nearest or even". This option field is useful in float-to-integer conversion instructions, but rarely needed in other contexts. May be included in long versions of floating point instructions.
Exception control: 1 bit.
Enables or suppresses interrupts in case of numerical errors. This can be used for controlling exceptions in case of overflow and other errors in floating point operations. Can also be used for checking for overflow in integer arithmetic. An unsigned integer compare instruction with exception enabled can be used for checking if an array index is out of bounds. This feature may be included in long versions of arithmetic instructions.
Broadcast: 1 bit.
If 1, specifies that the last source operand is a scalar to be broadcast into all the vector elements. (Unnecessary when this is an immediate operand).
Zero masked data: 1 bit.
Specifies whether masked-out elements are set to zero or left unchanged. This bit may replace the broadcast bit on instructions with only register operands, or it may be a separate bit.
Other fields.
Other fields may be added if specific features are needed. Alternatively, an immediate data field may be used for specifying additional feature options.
Formats.
Commonly used instructions may be implemented in several different formats and instruction lengths, preferably with the same value in the instruction code bits. For example, an addition instruction, A = B + C, might be implemented in the following formats:
Combined ALU and conditional jump instructions may be implemented in the following formats:
3-input instructions, such as fused multiply-and-add may be implemented in the following formats:
Less commonly used instructions may be implemented in just one or a few different formats.
FPGA
The microprocessor can have an optional FPGA or similar programmable hardware. This structure is used for making application-specific instructions or functions, e.g. for coding, encryption, data compression, signal processing, text processing, etc. This reduces the need for hard-coded application-specific instructions.
If the processor has multiple CPU cores then each core may have its own FPGA. The hardware definition code is stored in its own cache for each core. The operating system should prevent, as far as possible, that the same core is used for different tasks that require different hardware codes. If this cannot be avoided then the code, as well as the contents of any memory cells in the FPGA, must be saved on each task switch. This saving may be implemented as lazy, i.e. the swap of contents is only made when the second task needs the FPGA structure that contains code for the first task.
Extensibility
The evolution of the x86 instruction set is full of short-sighted decisions with no room for future extensions. All kind of weird patches have been used to extend an instruction set that was never designed to be extensible. We must learn from these mistakes and consider the predictable need for future extensions when designing an instruction set.
There is reason to suspect that many of the instructions in the current x86 instruction set have been added with short-sighted marketing reasons in mind. Every new generation of microprocessors must have some new features that the competitors don't have, or new features that can be hyped to make customers buy the new version, according to the marketing logic. Some of these instructions are now obsolete, but still supported by the hardware.
The design of a stable instruction set should not be subject to competition and marketing whims, but designed by an open process with participation of the international hardware and software community, similar to the standardization work in other technological areas. A collective decision process reduces the risk of mistakes and short-sighted decisions.
The proposed instruction set is orthogonal, which reduces the number of different instructions. The inclusion of an FPGA reduces the need for application-specific instructions.
In addition to these considerations, it is necessary to add space for future extensions of the instruction set. Some of the instruction formats should have a large number of unused instruction code bits that can be used for future instructions or option bits. A few instruction format codes should be reserved for future extensions. All codes that begin with 111, i.e. 11 in the instruction length bits and 1 in the first bit of the instruction format field, should be reserved for future extensions. These bits could be used in the future either for more 3-dword formats with many instruction bits, or for 4-dword formats. This decision will be left to the future.
An attempt to execute an instruction with an unknown instruction code should cause an interrupt (exception) in most cases. This makes it possible to emulate new instructions on old microprocessors. In some cases, however, it is desired to make extensions that do not cause interrupts on microprocessors that don't support them. Historically, this has been done with extensions that affect performance, but not functionality, such as memory prefetching and branch prediction hints. This kind of extensions can be made by using various option bits in contexts where they previously made no sense, for example rounding mode bits in an integer instruction. Thus, the processor should ignore certain unused option bits in certain instructions to make this kind of performance extensions possible. Also, a small range of instruction codes should be reserved for future performance-tuning instructions, which will be ignored on processors that don't support them. To be more specific, we will have three categories of unused codes:
Extension of the vector register size is straightforward without the need to define any new instructions. This makes it possible for software to utilize a new extended vector size even without recompilation. The software can simply specify the maximum vector size and get information from a CPUID instruction about what this maximum vector size is.
We have seen in the history of x86 processors that the first processor generation to support a new and larger vector size has typically done so with poor efficiency. In most cases, it has used a half-size execution unit twice to do a full-size vector operation. This was not necessarily a bad design choice because the software that supports a new instruction set extension typically lags several years behind the hardware.
The situation is different with the extension mechanism proposed here. The software will be able to utilize a vector size extension immediately. The microprocessor should not allow a larger vector size than it can execute more efficiently than if software used the next smaller vector size twice. It may still be worthwhile to allow a vector size that is larger than the execution unit and use this unit multiple times to process a full-size vector. This will save bandwidth in the decoder and use fewer registers than the alternative of repeating the instruction in software. The CPUID instruction should provide complete information about this. This means that it should specify both the maximum vector size that can execute at full throughput and the maximum vector size that can execute at all. These values may be different for different types of operands.
Portability
The ABI, object file format, etc. should be standardized as far as possible in order to allow the same code to be compatible with different operating systems and platforms. This would make it possible, for example, to use the same function libraries in different operating systems. This can easily be achieved for libraries that are doing some mathematical operation and not using any system functions. A more ambitious goal is to establish portability even when some common system functions are involved, such as time functions or handling of multithreaded code. Most importantly, there should be a portable way of generating error messages from a function library. This could be obtained with an error message instruction. This instruction should generate an interrupt (throw an exception). A few register operands can contain codes indicating the type of error, and a memory operand can point to an error message text. All platforms should be able to handle this error condition in a way that is appropriate for the type of user interface. In console mode applications, for example, the error message should go to the stderr output. In graphical user interface (GUI) applications, the error message should be shown in a pop-up window, or whatever method is appropriate for the specific GUI framework.
Error messages should be in the English language by default, with an optional feature for multi-language support. We can expect the need for multi-language support to be decreasing. The problems with multi-language applications are discussed in this document https://en.wikibooks.org/wiki/Usabilit ... nalization.
ABI and calling conventions
This is an example of how an efficient ABI (Application Binary Interface) can be designed.
Function calls will use registers for parameters as far as possible. The first 24 parameters to a function are transferred in register r0 - r23. Any additional parameters are transferred on the stack in C language order. These parameters are removed from the stack by the caller. The function return value is in r0. Push and pop instructions are rarely used. The return instruction has no offset. The stack is kept aligned by 128 before any call instruction.
Variable argument lists are transferred on the stack with 64 bits for each argument.
Tuples: A structure or class or encapsulated array for which all non-static elements have the same type is transferred in a single vector register if the total size does not exceed 128 bits.
Parameters that do not fit into a single register are transferred by a reference to a memory object allocated by the caller. This applies to: structures and classes with elements of different types, or bigger than 128 bits, or having a non-standard copy constructor or destructor or virtual member function. It is the responsibility of the caller to call any copy constructor or destructor. Any parameters beyond the first 24 parameters are transferred in the same way as if they were in a register, using 64 or 128 bits of stack space, as appropriate.
A return value that does not fit into a register is transferred to a memory location allocated by the caller. A reference to this memory location is treated as the first parameter (before any 'this' pointer).
There are no registers with callee-save status in the case of static linking. This is because the called function does not know the maximum vector register size required by the caller. Instead, there is a mechanism that allows the caller to know which registers are modified by the called function. This information is stored in the object file for static link libraries. The object file format must support this information, which must be stored in the same file as the library function in order to make sure it has the right version. Compilers supporting "whole program optimization" can read this information from a library file before allocating registers.
This mechanism cannot be applied to dynamic linking. Instead, dynamic link libraries are prohibited from modifying certain registers.
The object file format should be a modified ELF format. Dynamic linking should use the Windows DLL method rather than the UNIX shared object method. The code uses position-independent addressing as far as possible. Any remaining relocation is performed at load time. Symbol imputation is not used. This eliminates the need for the inefficient global offset table (GOT) and procedure linkage table (PLT).
Information used for exception handling and stack unrolling should use a standardized and portable table-based method. Debugging information should also be standardized.
Assembly language syntax
The syntax for x86 assembly code has never been officially standardized, but each assembler uses its own dialect. The definition of a new instruction set should include the definition of a standardized assembly language syntax, preferably with the destination operand first.
Summary of advantages
The instruction set proposed here is a compromise between the RISC and CISC principles. A RISC instruction set with a fixed instruction size makes it easy to decode multiple instructions in parallel, but it is a vaste of precious code cache space. If the fixed instruction size is big enough to accommodate all possible instruction types, then it must necessarily be too big for the most common simple instructions and therefore take up too much space in the code cache. The code cache is a precious resource because it is impossible to make the cache bigger without also making it slower. A CISC instruction set with many different instruction lengths makes it difficult to decode multiple instructions in parallel, and this can be a serious bottleneck. The proposed instruction set has just a few standardized instruction lengths: one, two and three dwords of 32 bits each. The length of the instruction is determined by the first few bits. This makes it possible to determine the lengths of multiple instructions in a single clock cycle by a process that resembles the look-ahead carry mechanism in binary adders.
The instruction set is completely orthogonal. An ordinary arithmetic or logic instruction, such as e.g. addition, can have many different versions for different types of operands. It can handle integers of 8, 16, 32, 64, and possibly 128 bits, as well as floating point numbers of single, double, and possibly quadruple precision. The last source operand can be a register, a memory operand, or an immediate constant. The same instruction can handle scalars or vectors of any length. This makes programming simpler and reduces the number of different instructions.
There is only one type of register. The same registers can be used for many different purposes, including integers and floating point numbers of all different sizes and precisions, as well as for Booleans, flags, pointers and references. The registers can also be used for vectors and masks.
Many instructions can be predicated, so that the instruction is either executed or not, depending on a Boolean variable stored in a predicate register. The predicate mechanism can be vectorized, so that the operation is turned on or off for each element in a vector, depending on a mask register containing a vector of Booleans.
Instructions can have short versions that save space by using only two operands, by omitting certain option bits, by using an 8-bit scaled offset in a memory operand, or by using a signed 8-bit constant as the immediate operand. For example, a double precision floating point addition can have immediate operands of three different sizes: a signed 8-bit integer which will be converted to floating point, a single precision float, or a double precision float. This constant is available at an early stage in the CPU pipeline so that there is enough time for converting it to the necessary size and precision without delaying the execution. The need to fetch numeric constants from data memory is eliminated in most cases because numeric constants can be contained in the instructions. This will increase the speed and reduce the load on the data cache. In most cases, the code size will not be increased by the inclusion of numeric constants in the instructions because they replace the addresses (typically 32 bits) of constants stored in data memory, and because it will fit the constants into smaller formats whenever possible. Immediate constants can even be used with vector instructions where the same constant will be applied to all elements in the vector.
The size of the registers is not fixed in the design. It is possible to make bigger and more powerful microprocessors simply by making the registers bigger so that they can hold larger vectors. This mechanism is orthogonal as well. There are three bits in the instruction code which determines the vector length (or a scalar). This makes it possible to write software for future microprocessors with bigger vector registers that do not exist yet. Setting the vector length bits to 111 will give the largest vector size that the microprocessor supports, whatever this is. This makes it simple to support all vector sizes in the same piece of software. This feature also makes it possible to save an entire register even though the maximum register size is not known when the software is compiled. This can be useful in task switches, exception handlers, device drivers and system libraries. There is no limit to how big the maximum vector size can be. A CPUID instruction will tell the software what the maximum vector size is, and there will be a feature that enables a software program to reduce the maximum vector size if it is excessive.
The conventions for function calling, as well as other ABI details, should be specified together with the instruction set. This will improve compatibility and make it possible to use the same function libraries with different compilers, different programming languages, and different operating systems. There are 32 registers. This makes it possible to use registers for function parameters in almost all cases.
Can existing instruction sets be fixed?
The commonly used instruction sets can be divided into two general types, RISC and CISC. The RISC instruction sets generally have a more or less fixed instruction size. All instructions have the same number of bits. The advantage of a RISC design is that the fetching and decoding of instructions is simple and fast. The disadvantege is that commonly used simple instructions take more space than necessary while complicated instructions do not fit into the limited instruction size. Instructions that need many bits for addresses or constants do not fit into the RISC design.
A CISC instruction set has a variable instruction length. The advantage of this is that simple, commonly used instructions can be as small as a single byte, while more complex instructions or instructions with large addresses or constants can have a length that fits the purpose. This provides optimal use of the code cache. The disadvantage is that it is complicated to decode the instructions. Modern microprocessors can execute three or four instructions in parallel in a single clock cycle if no data dependence prevents this. But it is difficult to decode multiple instructions simultaneously when you have to determine the length of the first instruction before you know where the second instruction begins. Therefore, the bottleneck in a CISC processor is quite often decoding rather than execution.
The present article has proposed a compromise between RISC and CISC. The widely used x86 instruction set is a CISC design. Mosts other instruction sets in common use today are RISC designs.
x86 instruction set
The x86 instruction set has a long heritage dating back to the 8086 processor in 1978 where code density was of prime importance. It has been developed through many generations of additions and extensions without ever loosing backwards compatibility. It is a CISC instruction set where instructions can have any length from 1 to 15 bytes, and it is quite complicated to determine the length of each instruction. It has many different types of registers. The latest extension, AVX-512 has 16 general purpose registers of 64 bits each, 6 segment registers of which only 3 are used in 64-bit mode, 8 floating point registers of 80 bits each, 8 MMX registers of 64 bits each which are overlaid on the floating point registers, 32 vector registers of 512 bits each, 8 mask registers of 64 bits each, a flags register and an instruction pointer. This patchwork could certainly need a redesign. Can it be combined with the design principles that are proposed here?
An easy solution would allow the two kinds of code to be used interchangeably and mixed. The new instructions would have to use some bit patterns that are not already in use in the old system. The x86 instruction set has 20 byte-codes that are currently used only in 16-bit and 32-bit mode, mostly for obsolete instructions. These codes can be used for other purposes in 64-bit mode. Therefore, it is possible to make new extensions that can be used only in 64-bit mode. We would prefer 64-bit mode anyway, so it would be possible to make extensions that implement some of the principles described here and still preserve backwards compatibility, but this would still be only patches on a fundamentally flawed, inefficient and outdated design. The 20 unused code bytes are scattered around the code map with only few adjacent code bytes, so it would be impossible to use more than a few of these code bytes without making the whole system completely messy. Most of the bits in the first byte of any new code would therefore be fixed and unusable in such a hypothetical new code design.
A better solution would be to implement a separate mode for the new instruction design and a system for switching between the new mode and the legacy modes. The improvement in performance that can be obtained with a new instruction design is probably not high enough to justify the complications of a dual code system. Instead, we should be prepared to seize the opportunity in case the need for a major revision should arise for other reasons. It is not possible to make a decoder that translates the old codes to the new ones at runtime, because the new system does not support the many different types of registers that the old system has. A translation from the new system to the old one is also not possible. Instead, we would need two seperate decoders that translate the old and the new codes, respectively, to the internal micro-operation format. This micro-operation format probably needs to be expanded to make space for 64-bit immediate constants, but the extra bits can be disabled when they are not needed, in order to save power.
The existing execution units could relatively easily be modified to support the new code design. The 32 universal registers of the new design should obviously be identical to the 32 vector registers of the old design. Combined ALU-and-conditional-jump instructions are already implemented internally in both Intel and AMD processors even though they are not available as x86 instructions.
It is a problem that many current processors have their execution units divided into two main clusters: One cluster is connected to the general purpose registers and handles integer scalar operations, pointer addressing and jumps. The other cluster is connected to the floating point and vector registers and handles all floating point and vector operations. All transfers of data between these two clusters typically involve a delay of one clock cycle. This two-cluster design would be a problem for the new instruction set where all units need access to the same register file.
Itanium instruction set
The Itanium instruction set is a very ambitions RISC instruction set. Itanium instructions are joined into bundles with a fixed size of 128 bits, containing three instruction codes of 41 bits each and a 5-bit template. The three instructions in a bundle will execute in parallel. This explicit parallelism puts a lot of work on the compiler to schedule instructions that can execute in parallel without violating the program logic. The Itanium has a rotating register stack where each program function allocates the number of registers it needs. It has many other advanced features, such as explicitly speculative instructions. The itanium design has not been very successful, due mainly to the difficulties of making a suitable compiler. Another obstackle to the commercial success of the Itanium was a poor support for backwards compatibility with existing software. The Itanium system is so different from other systems that there would be little advantage in combining it with a new instruction set.
Other RISC instruction sets
Most other commonly used instruction sets today are RISC designs. These designs are generally simple and efficient. The instruction length is typically 32 bits. Some systems, such as ARM-Thumb-2 and AVR32 can use a mixture of short 16-bit instructions and longer 32-bit instructions. Most systems have several different register types. Some RISC processors support vector instructions with 128-bit vectors. There is a limit to the number of different instructions that can be coded in an instruction with a fixed size. It is a general problem with RISC instruction sets that they cannot support complex instructions with many option bits. This makes it difficult to add new options and features that the x86 instruction set has, such as masked vector operations, options for controlling rounding mode, etc. The limited instruction size of the RISC systems is also a problem where large addresses or large numeric constants are needed. It is not possible to define a large numerical constant or a jump to a distant address with a single instruction in a RISC design with a limited instruction size. Some of the RISC processors already have support for more than one instruction set and features for switching between these modes. An additional mode for a new instruction set could be added to these processors without serious problems.
Original date: 2015-12-27
Table of Contents
- Introduction
- Basic architecture
- Proposed code structure
- FPGA
- Extensibility
- Portability
- ABI and calling conventions
- Assembly syntax
- Summary of advantages
- Can existing instruction sets be fixed?
An instruction set is a standardized set of machine instructions that a computer can run.
There are many instruction sets in use. To introduce a new instruction set is not an easy thing to do because it breaks the compatibility with existing software and hardware. Therefore, the successful introduction of a new instruction set is a rare occurrence in the evolution of computer technology, while extensions to existing instruction sets occurs frequently. Some commonly used instruction sets are poorly designed from the beginning and amended with many extensions and patches. One of the worst cases is the widely used x86 instruction set family. This instruction set is the result of a long series of short-sighted extensions and patches. The result of this development is a very complicated code system which is very difficult and costly to decode in a microprocessor. We need to learn from past failures in order to be prepared to make better choices from the start, in case the opportunity to design a new instruction set should come up. The purpose of this article is to construct an example of a new instruction set that is better designed from the start, based on the experience we have with existing instruction sets. The following principles are important to have in mind:
- The instruction set should have a simple and consistent design.
- The instruction set should represent a suitable compromise between the RISC principle that enables fast decoding, and the CISC principle that makes more efficient use of code cache resources.
- The design should be expandable so that new instructions and extensions can be added in a consistent and predictable way.
- The instruction set should be designed through an open process with the participation of the international hardware and software community.
- The instruction set should be non-proprietary and allow anybody to make compatible software, hardware and equipment for test, debugging and emulation.
- Decisions about design and extensions should not be determined by the short term marketing considerations of an oligopolistic microprocessor industry but by the long term needs of the entire hardware and software community and NGOs.
- The design should allow appliction-specific extensions.
Basic architecture
Instruction format
A pure RISC instruction set has the advantage that all instructions have the same length. This makes it easy to decode multiple instructions in parallel. But it has the disadvantage of using a lot of precious space in the code cache. A CISC instruction format can have a variable instruction length. The well-known x86 format allows instructions of any length from 1 to 15 bytes. This makes the code more compact, but it is very complicated and expensive to decode. It is difficult for the microprocessor to decode multiple instructions in parallel because it needs to find the length of the first instruction before it knows where the second instruction begins, and the instruction length is determined by a complicated algorithm involving many elements of the instruction. Instruction decoding is therefore often a serious bottleneck.
The proposed instruction format is a compromise between these two principles. Instructions can have a few standardized lengths and formats, and the determination of the length is simple. This allows for smaller instructions to save size, and longer instructions when there is a need for more bits for address, data, registers or extra options. Many instructions exist in multiple versions with different sizes. The instruction format is completely orthogonal in the sense that the same instruction can be specified with register, memory or immediate operands, different integer sizes, different floating point precisions, different vector lengths, and different addressing modes.
An instruction can use one, two or three dwords of 32 bits each - that is 32, 64 or 96 bits. No other sizes are permitted. Instructions must be aligned to dword addresses. The first two bits (most significant bits) of the first dword of an in struction indicates the length:
00 = 1 dword
01 = 1 dword
10 = 2 dwords
11 = 3 dwords
In order to further save space in the code cache, there can be certain instructions which do multiple operations in a single short instruction, such as:
- Set multiple registers to the same constant value (typically zero). The first register and the number of registers is specified.
- Read multiple registers from consecutive memory addresses. Optionally increment pointer by the size. Can use the stack pointer or any other pointer register.
- Save multiple registers to consecutive memory addresses. Optionally increment or decrement pointer by the size. Can use the stack pointer or any other pointer register.
- Combined arithmetic operation and conditional jump.
Registers
There are 32 universal registers, named r0 - r31. The proposed instruction set has only one type of registers. These registers can be used for all types of data: Boolean, 8-, 16-, 32-, 64- and (optionally) 128-bit signed and unsigned integers, floating point numbers with single, double and (optionally) quadruple precision, pointers, references, flags and predication masks. This reduces the number of different instructions because the same instruction can be used on different types of data, and because no instructions are needed for transferring data from one type of register to another. For example, the same 'AND' instruction can be used for operations on Booleans, for manipulating bits in integers, for manipulating the sign bit of floating point numbers, and for manipulating predication masks.
The same registers can also be used as vectors of any of these data types. The microprocessor must support vectors of at least 128 bits. Support for larger sizes is optional. Vector sizes up to 8192 bits can be specified by 3 bits in the instruction code. It is also possible to specify the largest available vector size in an instruction. This can be anything from 128 bits and up, with no upper limit. Software can take advantage of future extensions by specifying the largest available vector size. The largest size can be modified by a control register to any power of 2 from 128 to the largest size supported by the microprocessor.
The unused part of a register is always set to zero whenever a register is modified. No instruction leaves part of a register unchanged except for instructions intended for blending or interleaving data. This is important in order to avoid false dependencies on the previous value of the full register, which is known to cause serious performance problems in some existing processors (known as partial register stall). The processor does not actually need to spend power on setting all the superfluous bits to zero. Typically, it will simply turn off the unused parts of execution units and data buses in order to save power.
Stack
There is one stack. The stack register is r31. Including the stack register as one of the universal registers makes it possible to use it as a base pointer in memory addressing and to modify the stack frame with arithmetic instructions. The stack register needs only be 64 bits.
Instruction pointer
The instruction pointer is 64 bits. It is not included in the universal registers. The reason for this decision is to avoid the possible modification of the instruction pointer by arithmetic instructions, which would make branch prediction difficult.
Flags
There is no dedicated flags register. Registers r1 - r7 can be used as predicate registers or mask registers. Many instructions can be predicated. A predicated non-vector instruction will use one of these registers as predicate, and execute the instruction only when bit 0 in the predicate register is 1. The predicate register is thus also a Boolean variable. Execution is unconditional when r0 is specified as the predicate register.
The predication mechanism can be vectorized. A predicate vector is also known as a mask. A masked vector instruction works in the following way. Each element in the vector is processed only if the corresponding element in the mask register has 1 in its least significant bit. The mask register is treated as a vector of Booleans, where each element in the Boolean vector has the same number of bits as the data vectors in the instruction, and only the least significant bit in each Boolean vector element is used, while the remaining bits are ignored. (Other systems use the most significant bit, or all bits, in the mask, but it is preferred to use the least significant bit for the sake of compatibility between Boolean scalars and Boolean vectors). Results that are masked off are either unchanged or set to zero, depending on the instruction. Some instructions support both options to be selected with a feature bit.
Instructions for extended precision arithmetic, such as add-with-carry and subtract-with borrow work in the following way. One register is specified in the predicate register field of the instruction code. Bit 0 of this register is used as both carry-in and carry-out. The traditional arithmetic flags are written to a few bits of the predicate register:
bit 0: carry flag
bit 1: zero flag
bit 2: overflow of signed arithmetic
bit 3: sign bit
bit 4: negative = sign xor overflow
If the predicate register for an add-with-carry instruction is specified as r0 then the carry-in will be 0, but the arithmetic flags for the result will be written to r0. Shift and rotate instructions can output a carry to a predicate register, but may not have a carry-in. There are no rotate-through-carry instructions, but an add-with-carry of a register to itself can be used as a rotate-left-through-carry (Rotate through carry instructions are rarely used anyway, and they are very inefficient on many processors). Integer and floating point compare instructions also produce these flags.
The carry mechanism can be vectorized so that multiple add-with-carry operations can be executed in parallel.
Branches
Branching is done with combined arithmetic-and-branch instructions. These are ALU instructions such as add, subtract, compare, bit test, etc. combined with a conditional jump, for example: subtract and jump if not zero, compare and jump if above, test a specific bit and jump if it is zero. These instructions cannot be vectorized. The vector size field is used as condition code. There is no need to support predicated jump instructions because these can be replaced by a combined bit test and conditional jump instruction. Multiway branches can be implemented with indirect jump or indirect call.
Debug and interrupt flags
There are various control registers which can be used for debugging purposes, interrupt control, etc.
Addressing modes
The address space uses 64-bit addresses only. Addresses are always relative to the instruction pointer, stack pointer or a register pointer. Absolute addressing does not need not be supported. The following addressing modes are supported:
- Instruction pointer + 32 bit sign-extended offset
- Instruction pointer + index register + 32 bit sign-extended offset
- Base register + 8 or 32 bit sign-extended offset
- Base register + index register + 32 bit sign-extended offset
- Base register + scaled index register + 32 bit sign-extended offset
Direct conditional and unconditional jumps and calls are always relative to the instruction pointer with 8-bit or 32-bit sign-extended offset, scaled by 4 because all instructions are aligned to addresses divisible by 4.
CPUID
A CPUID instruction must have functions for telling whether optional features are supported, e.g. 128-bit integers, quadruple precision floating pont, and the maximum vector size for each type of operands. There should also be features for telling how efficient certain instructions are, to help software determine the optional coding version.
Proposed code structure
An instruction code contains a combination of the fields described below, where some of the fields can be omitted. The total size of all the fields must be 32, 64 or 96 bits.
Instruction length: 2 bits.
00 = 1 dword = 32 bits
01 = 1 dword = 32 bits
10 = 2 dwords = 64 bits
11 = 3 dwords = 96 bits
Instruction format: 2 or more bits.
Each combination of the instruction length and instruction format bits defines a class of instructions having a particular combination of the remaining fields. In other words, the combination of instruction length and instruction format bits determines which of the following fields are present, and their sizes.
Instruction code: 6 or more bits.
These bits are used for distinguishing the individual instructions, such as add, move, jump, etc. The number of instruction code bits is simply the number of bits not used for anything else. Therefore, the number of instruction code bits can be different for different instruction formats. The instruction bits are not necessarily contiguous if the placement of other fields on fixed positions has higher priority in the design.
Register: 1 - 4 fields of 5 bits each.
Used for both operand registers, base pointer and index register.
Predicate register: 3 bits.
Specifies a register used for predicated scalar instructions, masked vector instructions, and flags input and output. Only r1 - r7 can be used as predicate register. r0 means no predicate.
Operand size and type: 3 bits.
Defines the type, size and precision of operands, integer or floating point. The size for integer operands can be 8, 16, 32, 64, and optionally 128 bits. The precision of floating point operands can be single, double, and optionally quadruple precision. Half precision is not supported, except in conversion instructions.
Vector length: 3 bits.
Specifies the length of vectors in bits. Possible values are: scalar, 128, 256, 512, 1024, 2048, 4096, and max. Support for values above 128 are optional. The size of operands is as determined by the operand size/type when "scalar" is specified. A vector will contain as many elements of the specified operand size as can be contained in the vector size. For example, a 256 bit vector can contain 8 elements of 32 bits each. The "max" specification gives the largest vector size supported by the processor. This value depends on the processor and must be a power of 2. The minimum allowed value is 128, with no upper limit. The max value may be different for different operand sizes. A piece of software can take advantage of future extensions by specifying the max vector size. The max value can be reduced by settings in a control register.
Addressing modes: 2 bits.
The following addressing modes are defined for memory operands. An instruction can have no more than one explicit memory operand with this specifiation.
00: IP + index + 32 bits offset (specify r31 for no index)
01: base + 8 or 32 bits offset
10: base + index + 8 or 32 bits offset
11: base + index * operand size + 8 or 32 bits offset
The base and index registers are specified in register fields. The offset size (8 or 32 bits) depends on the instruction format. 8-bit offsets are always multiplied by the specified operand size in bytes. For example, an operand size of 32 bits = 4 bytes will multiply the value in the offset field by 4. 32-bit offsets are not multiplied by this factor. Offsets are always sign-extended. It is not required to support 16-bit or 64-bit offsets or absolute addressing.
Jumps and calls have an offset of 8 or 32 bits relative to the instruction pointer. This offset is multiplied by 4 because all instructions have sizes that are multiples of 4 bytes.
Address offset: 8 or 32 bits.
This field is used as specified above under addressing mode.
Immediate data operand: 8, 32 or 64 bits.
An 8-bit immediate value is interpreted as an integer, sign extended to the specified operand size. The signed value is converted to floating point if a floating point operation is specified.
A 32-bit or 64-bit immediate is interpreted as an integer for integer operations or a single or double precision float for floating point operations. The integer immediate constant is sign-extended if necessary. The floating point immediate constant is converted to the desired precision if necessary.
16-bit immediates are not necessarily supported.
Rounding mode: 2 bits.
Optionally specifies the rounding mode used in floating point operations and conversions. Possible values are: round to nearest or even, round down, round up, truncate towards zero. The default value if there is no rounding mode field is "round to nearest or even". This option field is useful in float-to-integer conversion instructions, but rarely needed in other contexts. May be included in long versions of floating point instructions.
Exception control: 1 bit.
Enables or suppresses interrupts in case of numerical errors. This can be used for controlling exceptions in case of overflow and other errors in floating point operations. Can also be used for checking for overflow in integer arithmetic. An unsigned integer compare instruction with exception enabled can be used for checking if an array index is out of bounds. This feature may be included in long versions of arithmetic instructions.
Broadcast: 1 bit.
If 1, specifies that the last source operand is a scalar to be broadcast into all the vector elements. (Unnecessary when this is an immediate operand).
Zero masked data: 1 bit.
Specifies whether masked-out elements are set to zero or left unchanged. This bit may replace the broadcast bit on instructions with only register operands, or it may be a separate bit.
Other fields.
Other fields may be added if specific features are needed. Alternatively, an immediate data field may be used for specifying additional feature options.
Formats.
Commonly used instructions may be implemented in several different formats and instruction lengths, preferably with the same value in the instruction code bits. For example, an addition instruction, A = B + C, might be implemented in the following formats:
- 3 registers (1 dword).
- 2 registers and a predicate (1 dword).
- 1 register and a predicate and an 8-bit immediate (1 dword).
- 1 register and a memory operand with base and 8 bit offset, scalar only (1 dword).
- 2 registers and a predicate and a 32-bit immediate (2 dwords).
- 1 register and a memory operand with base and 32 bit offset (2 dwords).
- 2 registers and a predicate and a memory operand with base and index and 32 bit offset (3 dwords).
- 2 registers and a predicate and a 64-bit immediate (3 dwords).
Combined ALU and conditional jump instructions may be implemented in the following formats:
- 2 registers of 32-bit integers only, a condition code and an 8 bit displacement (1 dword).
- 2 registers, a condition code and a 32 bit displacement (2 dwords).
- 1 register, an 8 bit immediate, a condition code and a 32 bit displacement (2 dwords).
3-input instructions, such as fused multiply-and-add may be implemented in the following formats:
- 3 registers and a predicate and option bits (2 dwords).
- 2 registers and a predicate and a memory operand with base and index and 32-bit offset and option bits (3 dwords).
Less commonly used instructions may be implemented in just one or a few different formats.
FPGA
The microprocessor can have an optional FPGA or similar programmable hardware. This structure is used for making application-specific instructions or functions, e.g. for coding, encryption, data compression, signal processing, text processing, etc. This reduces the need for hard-coded application-specific instructions.
If the processor has multiple CPU cores then each core may have its own FPGA. The hardware definition code is stored in its own cache for each core. The operating system should prevent, as far as possible, that the same core is used for different tasks that require different hardware codes. If this cannot be avoided then the code, as well as the contents of any memory cells in the FPGA, must be saved on each task switch. This saving may be implemented as lazy, i.e. the swap of contents is only made when the second task needs the FPGA structure that contains code for the first task.
Extensibility
The evolution of the x86 instruction set is full of short-sighted decisions with no room for future extensions. All kind of weird patches have been used to extend an instruction set that was never designed to be extensible. We must learn from these mistakes and consider the predictable need for future extensions when designing an instruction set.
There is reason to suspect that many of the instructions in the current x86 instruction set have been added with short-sighted marketing reasons in mind. Every new generation of microprocessors must have some new features that the competitors don't have, or new features that can be hyped to make customers buy the new version, according to the marketing logic. Some of these instructions are now obsolete, but still supported by the hardware.
The design of a stable instruction set should not be subject to competition and marketing whims, but designed by an open process with participation of the international hardware and software community, similar to the standardization work in other technological areas. A collective decision process reduces the risk of mistakes and short-sighted decisions.
The proposed instruction set is orthogonal, which reduces the number of different instructions. The inclusion of an FPGA reduces the need for application-specific instructions.
In addition to these considerations, it is necessary to add space for future extensions of the instruction set. Some of the instruction formats should have a large number of unused instruction code bits that can be used for future instructions or option bits. A few instruction format codes should be reserved for future extensions. All codes that begin with 111, i.e. 11 in the instruction length bits and 1 in the first bit of the instruction format field, should be reserved for future extensions. These bits could be used in the future either for more 3-dword formats with many instruction bits, or for 4-dword formats. This decision will be left to the future.
An attempt to execute an instruction with an unknown instruction code should cause an interrupt (exception) in most cases. This makes it possible to emulate new instructions on old microprocessors. In some cases, however, it is desired to make extensions that do not cause interrupts on microprocessors that don't support them. Historically, this has been done with extensions that affect performance, but not functionality, such as memory prefetching and branch prediction hints. This kind of extensions can be made by using various option bits in contexts where they previously made no sense, for example rounding mode bits in an integer instruction. Thus, the processor should ignore certain unused option bits in certain instructions to make this kind of performance extensions possible. Also, a small range of instruction codes should be reserved for future performance-tuning instructions, which will be ignored on processors that don't support them. To be more specific, we will have three categories of unused codes:
- Code reserved for future use. Generates interrupt so that it can be emulated.
- Code reserved for future use. Generates no interrupt, but behaves as a nop (no operation). Can be used for future purposes that allow the code to be ignored on processors that do not support it.
- Code guaranteed to never be used. Will generate interrupt also on all future processors. Can be used for application-specific emulation or forced error messages.
Extension of the vector register size is straightforward without the need to define any new instructions. This makes it possible for software to utilize a new extended vector size even without recompilation. The software can simply specify the maximum vector size and get information from a CPUID instruction about what this maximum vector size is.
We have seen in the history of x86 processors that the first processor generation to support a new and larger vector size has typically done so with poor efficiency. In most cases, it has used a half-size execution unit twice to do a full-size vector operation. This was not necessarily a bad design choice because the software that supports a new instruction set extension typically lags several years behind the hardware.
The situation is different with the extension mechanism proposed here. The software will be able to utilize a vector size extension immediately. The microprocessor should not allow a larger vector size than it can execute more efficiently than if software used the next smaller vector size twice. It may still be worthwhile to allow a vector size that is larger than the execution unit and use this unit multiple times to process a full-size vector. This will save bandwidth in the decoder and use fewer registers than the alternative of repeating the instruction in software. The CPUID instruction should provide complete information about this. This means that it should specify both the maximum vector size that can execute at full throughput and the maximum vector size that can execute at all. These values may be different for different types of operands.
Portability
The ABI, object file format, etc. should be standardized as far as possible in order to allow the same code to be compatible with different operating systems and platforms. This would make it possible, for example, to use the same function libraries in different operating systems. This can easily be achieved for libraries that are doing some mathematical operation and not using any system functions. A more ambitious goal is to establish portability even when some common system functions are involved, such as time functions or handling of multithreaded code. Most importantly, there should be a portable way of generating error messages from a function library. This could be obtained with an error message instruction. This instruction should generate an interrupt (throw an exception). A few register operands can contain codes indicating the type of error, and a memory operand can point to an error message text. All platforms should be able to handle this error condition in a way that is appropriate for the type of user interface. In console mode applications, for example, the error message should go to the stderr output. In graphical user interface (GUI) applications, the error message should be shown in a pop-up window, or whatever method is appropriate for the specific GUI framework.
Error messages should be in the English language by default, with an optional feature for multi-language support. We can expect the need for multi-language support to be decreasing. The problems with multi-language applications are discussed in this document https://en.wikibooks.org/wiki/Usabilit ... nalization.
ABI and calling conventions
This is an example of how an efficient ABI (Application Binary Interface) can be designed.
Function calls will use registers for parameters as far as possible. The first 24 parameters to a function are transferred in register r0 - r23. Any additional parameters are transferred on the stack in C language order. These parameters are removed from the stack by the caller. The function return value is in r0. Push and pop instructions are rarely used. The return instruction has no offset. The stack is kept aligned by 128 before any call instruction.
Variable argument lists are transferred on the stack with 64 bits for each argument.
Tuples: A structure or class or encapsulated array for which all non-static elements have the same type is transferred in a single vector register if the total size does not exceed 128 bits.
Parameters that do not fit into a single register are transferred by a reference to a memory object allocated by the caller. This applies to: structures and classes with elements of different types, or bigger than 128 bits, or having a non-standard copy constructor or destructor or virtual member function. It is the responsibility of the caller to call any copy constructor or destructor. Any parameters beyond the first 24 parameters are transferred in the same way as if they were in a register, using 64 or 128 bits of stack space, as appropriate.
A return value that does not fit into a register is transferred to a memory location allocated by the caller. A reference to this memory location is treated as the first parameter (before any 'this' pointer).
There are no registers with callee-save status in the case of static linking. This is because the called function does not know the maximum vector register size required by the caller. Instead, there is a mechanism that allows the caller to know which registers are modified by the called function. This information is stored in the object file for static link libraries. The object file format must support this information, which must be stored in the same file as the library function in order to make sure it has the right version. Compilers supporting "whole program optimization" can read this information from a library file before allocating registers.
This mechanism cannot be applied to dynamic linking. Instead, dynamic link libraries are prohibited from modifying certain registers.
The object file format should be a modified ELF format. Dynamic linking should use the Windows DLL method rather than the UNIX shared object method. The code uses position-independent addressing as far as possible. Any remaining relocation is performed at load time. Symbol imputation is not used. This eliminates the need for the inefficient global offset table (GOT) and procedure linkage table (PLT).
Information used for exception handling and stack unrolling should use a standardized and portable table-based method. Debugging information should also be standardized.
Assembly language syntax
The syntax for x86 assembly code has never been officially standardized, but each assembler uses its own dialect. The definition of a new instruction set should include the definition of a standardized assembly language syntax, preferably with the destination operand first.
Summary of advantages
The instruction set proposed here is a compromise between the RISC and CISC principles. A RISC instruction set with a fixed instruction size makes it easy to decode multiple instructions in parallel, but it is a vaste of precious code cache space. If the fixed instruction size is big enough to accommodate all possible instruction types, then it must necessarily be too big for the most common simple instructions and therefore take up too much space in the code cache. The code cache is a precious resource because it is impossible to make the cache bigger without also making it slower. A CISC instruction set with many different instruction lengths makes it difficult to decode multiple instructions in parallel, and this can be a serious bottleneck. The proposed instruction set has just a few standardized instruction lengths: one, two and three dwords of 32 bits each. The length of the instruction is determined by the first few bits. This makes it possible to determine the lengths of multiple instructions in a single clock cycle by a process that resembles the look-ahead carry mechanism in binary adders.
The instruction set is completely orthogonal. An ordinary arithmetic or logic instruction, such as e.g. addition, can have many different versions for different types of operands. It can handle integers of 8, 16, 32, 64, and possibly 128 bits, as well as floating point numbers of single, double, and possibly quadruple precision. The last source operand can be a register, a memory operand, or an immediate constant. The same instruction can handle scalars or vectors of any length. This makes programming simpler and reduces the number of different instructions.
There is only one type of register. The same registers can be used for many different purposes, including integers and floating point numbers of all different sizes and precisions, as well as for Booleans, flags, pointers and references. The registers can also be used for vectors and masks.
Many instructions can be predicated, so that the instruction is either executed or not, depending on a Boolean variable stored in a predicate register. The predicate mechanism can be vectorized, so that the operation is turned on or off for each element in a vector, depending on a mask register containing a vector of Booleans.
Instructions can have short versions that save space by using only two operands, by omitting certain option bits, by using an 8-bit scaled offset in a memory operand, or by using a signed 8-bit constant as the immediate operand. For example, a double precision floating point addition can have immediate operands of three different sizes: a signed 8-bit integer which will be converted to floating point, a single precision float, or a double precision float. This constant is available at an early stage in the CPU pipeline so that there is enough time for converting it to the necessary size and precision without delaying the execution. The need to fetch numeric constants from data memory is eliminated in most cases because numeric constants can be contained in the instructions. This will increase the speed and reduce the load on the data cache. In most cases, the code size will not be increased by the inclusion of numeric constants in the instructions because they replace the addresses (typically 32 bits) of constants stored in data memory, and because it will fit the constants into smaller formats whenever possible. Immediate constants can even be used with vector instructions where the same constant will be applied to all elements in the vector.
The size of the registers is not fixed in the design. It is possible to make bigger and more powerful microprocessors simply by making the registers bigger so that they can hold larger vectors. This mechanism is orthogonal as well. There are three bits in the instruction code which determines the vector length (or a scalar). This makes it possible to write software for future microprocessors with bigger vector registers that do not exist yet. Setting the vector length bits to 111 will give the largest vector size that the microprocessor supports, whatever this is. This makes it simple to support all vector sizes in the same piece of software. This feature also makes it possible to save an entire register even though the maximum register size is not known when the software is compiled. This can be useful in task switches, exception handlers, device drivers and system libraries. There is no limit to how big the maximum vector size can be. A CPUID instruction will tell the software what the maximum vector size is, and there will be a feature that enables a software program to reduce the maximum vector size if it is excessive.
The conventions for function calling, as well as other ABI details, should be specified together with the instruction set. This will improve compatibility and make it possible to use the same function libraries with different compilers, different programming languages, and different operating systems. There are 32 registers. This makes it possible to use registers for function parameters in almost all cases.
Can existing instruction sets be fixed?
The commonly used instruction sets can be divided into two general types, RISC and CISC. The RISC instruction sets generally have a more or less fixed instruction size. All instructions have the same number of bits. The advantage of a RISC design is that the fetching and decoding of instructions is simple and fast. The disadvantege is that commonly used simple instructions take more space than necessary while complicated instructions do not fit into the limited instruction size. Instructions that need many bits for addresses or constants do not fit into the RISC design.
A CISC instruction set has a variable instruction length. The advantage of this is that simple, commonly used instructions can be as small as a single byte, while more complex instructions or instructions with large addresses or constants can have a length that fits the purpose. This provides optimal use of the code cache. The disadvantage is that it is complicated to decode the instructions. Modern microprocessors can execute three or four instructions in parallel in a single clock cycle if no data dependence prevents this. But it is difficult to decode multiple instructions simultaneously when you have to determine the length of the first instruction before you know where the second instruction begins. Therefore, the bottleneck in a CISC processor is quite often decoding rather than execution.
The present article has proposed a compromise between RISC and CISC. The widely used x86 instruction set is a CISC design. Mosts other instruction sets in common use today are RISC designs.
x86 instruction set
The x86 instruction set has a long heritage dating back to the 8086 processor in 1978 where code density was of prime importance. It has been developed through many generations of additions and extensions without ever loosing backwards compatibility. It is a CISC instruction set where instructions can have any length from 1 to 15 bytes, and it is quite complicated to determine the length of each instruction. It has many different types of registers. The latest extension, AVX-512 has 16 general purpose registers of 64 bits each, 6 segment registers of which only 3 are used in 64-bit mode, 8 floating point registers of 80 bits each, 8 MMX registers of 64 bits each which are overlaid on the floating point registers, 32 vector registers of 512 bits each, 8 mask registers of 64 bits each, a flags register and an instruction pointer. This patchwork could certainly need a redesign. Can it be combined with the design principles that are proposed here?
An easy solution would allow the two kinds of code to be used interchangeably and mixed. The new instructions would have to use some bit patterns that are not already in use in the old system. The x86 instruction set has 20 byte-codes that are currently used only in 16-bit and 32-bit mode, mostly for obsolete instructions. These codes can be used for other purposes in 64-bit mode. Therefore, it is possible to make new extensions that can be used only in 64-bit mode. We would prefer 64-bit mode anyway, so it would be possible to make extensions that implement some of the principles described here and still preserve backwards compatibility, but this would still be only patches on a fundamentally flawed, inefficient and outdated design. The 20 unused code bytes are scattered around the code map with only few adjacent code bytes, so it would be impossible to use more than a few of these code bytes without making the whole system completely messy. Most of the bits in the first byte of any new code would therefore be fixed and unusable in such a hypothetical new code design.
A better solution would be to implement a separate mode for the new instruction design and a system for switching between the new mode and the legacy modes. The improvement in performance that can be obtained with a new instruction design is probably not high enough to justify the complications of a dual code system. Instead, we should be prepared to seize the opportunity in case the need for a major revision should arise for other reasons. It is not possible to make a decoder that translates the old codes to the new ones at runtime, because the new system does not support the many different types of registers that the old system has. A translation from the new system to the old one is also not possible. Instead, we would need two seperate decoders that translate the old and the new codes, respectively, to the internal micro-operation format. This micro-operation format probably needs to be expanded to make space for 64-bit immediate constants, but the extra bits can be disabled when they are not needed, in order to save power.
The existing execution units could relatively easily be modified to support the new code design. The 32 universal registers of the new design should obviously be identical to the 32 vector registers of the old design. Combined ALU-and-conditional-jump instructions are already implemented internally in both Intel and AMD processors even though they are not available as x86 instructions.
It is a problem that many current processors have their execution units divided into two main clusters: One cluster is connected to the general purpose registers and handles integer scalar operations, pointer addressing and jumps. The other cluster is connected to the floating point and vector registers and handles all floating point and vector operations. All transfers of data between these two clusters typically involve a delay of one clock cycle. This two-cluster design would be a problem for the new instruction set where all units need access to the same register file.
Itanium instruction set
The Itanium instruction set is a very ambitions RISC instruction set. Itanium instructions are joined into bundles with a fixed size of 128 bits, containing three instruction codes of 41 bits each and a 5-bit template. The three instructions in a bundle will execute in parallel. This explicit parallelism puts a lot of work on the compiler to schedule instructions that can execute in parallel without violating the program logic. The Itanium has a rotating register stack where each program function allocates the number of registers it needs. It has many other advanced features, such as explicitly speculative instructions. The itanium design has not been very successful, due mainly to the difficulties of making a suitable compiler. Another obstackle to the commercial success of the Itanium was a poor support for backwards compatibility with existing software. The Itanium system is so different from other systems that there would be little advantage in combining it with a new instruction set.
Other RISC instruction sets
Most other commonly used instruction sets today are RISC designs. These designs are generally simple and efficient. The instruction length is typically 32 bits. Some systems, such as ARM-Thumb-2 and AVR32 can use a mixture of short 16-bit instructions and longer 32-bit instructions. Most systems have several different register types. Some RISC processors support vector instructions with 128-bit vectors. There is a limit to the number of different instructions that can be coded in an instruction with a fixed size. It is a general problem with RISC instruction sets that they cannot support complex instructions with many option bits. This makes it difficult to add new options and features that the x86 instruction set has, such as masked vector operations, options for controlling rounding mode, etc. The limited instruction size of the RISC systems is also a problem where large addresses or large numeric constants are needed. It is not possible to define a large numerical constant or a jump to a distant address with a single instruction in a RISC design with a limited instruction size. Some of the RISC processors already have support for more than one instruction set and features for switching between these modes. An additional mode for a new instruction set could be added to these processors without serious problems.