ForwardCom softcore

A softcore processor for ForwardCom is currently available (model A, version 1.01) with support for all integer instructions. It is useful for embedded systems.

A manual for this softcore is provided in the GitHub repository.

Main features:

Code language: SystemVerilog
Runs on Nexys A7-100T FPGA board
One instruction per clock cycle
Maximum clock frequency 50 - 70 MHz, depending on configuration
32-bit or 64-bit registers
Data memory 32 kB. Code memory 64 kB. Call stack 1023 entries
RS232 serial interface for standard input and output
On-chip loader
On-chip debugger
On-chip event counter
Open source with free license

Operand types supported:

Instruction operands can be registers, immediate constants, or memory operands.

8-bit, 16-bit, 32-bit, and 64-bit integers, signed and unsigned.

Half-precision, single-precision, and double-precision floating point are not supported in this version.

Variable-length vector registers are not supported in this version.

Performance metrics

The maximum throughput is one instruction per clock cycle. The latency is one clock cycle for most instructions.

Multiplication and mul_add instructions have a latency of five clock cycles and a throughput of one multiplication per clock cycle. Division has a latency of three clock cycles plus one additional clock cycle for every two significant bits in the result. It is not possible to start a new division before a previous division is finished. Multi-register push and pop instructions take one clock cycle for each register plus a single additional clock cycle for adjusting the stack pointer. All other arithmetic and logic instructions have a latency of one clock cycle.
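
The division and push/pop figures above can be turned into a small estimator. The following Python sketch is only an illustration: the function names are hypothetical, and rounding an odd number of significant bits up to a whole clock cycle is an assumption that the text does not state.

```python
def division_latency(quotient: int) -> int:
    """Estimated clock cycles for a division producing `quotient`.

    Three base cycles plus one cycle for every two significant bits
    in the result, as quoted in the text.
    """
    significant_bits = abs(quotient).bit_length()
    # Assumption: an odd bit count is rounded up to a whole cycle.
    return 3 + (significant_bits + 1) // 2


def push_pop_latency(register_count: int) -> int:
    """One cycle per register plus one cycle to adjust the stack pointer."""
    if register_count < 1:
        raise ValueError("at least one register is required")
    return register_count + 1


# 1000 // 7 == 142, which has 8 significant bits.
print(division_latency(1000 // 7))
print(push_pop_latency(4))
```

Under these assumptions a division with a small quotient finishes in far fewer cycles than one whose result fills the register, so the cost depends on the data rather than on the operand size alone.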

Unconditional direct jumps, calls, and returns have a latency of 2 clock cycles. Conditional jumps have a latency of 7 clock cycles when taken and 6 clock cycles when not taken. Indirect and multiway jumps and calls have a latency of 7 clock cycles.
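
As a rough illustration of what the conditional jump latencies mean for a program, the Python sketch below computes the average cost of a conditional jump for a given probability of being taken, and the total jump cost of a simple loop. The function names are hypothetical, and the loop model assumes that the loop closes with a single conditional jump that is taken on every iteration except the last.

```python
TAKEN_CYCLES = 7      # conditional jump, taken
NOT_TAKEN_CYCLES = 6  # conditional jump, not taken


def expected_branch_cycles(taken_probability: float) -> float:
    """Average latency of a conditional jump taken with the given probability."""
    if not 0.0 <= taken_probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return (taken_probability * TAKEN_CYCLES
            + (1.0 - taken_probability) * NOT_TAKEN_CYCLES)


def loop_branch_cycles(iterations: int) -> int:
    """Cycles spent in the closing jump of a loop running `iterations` times."""
    if iterations < 1:
        raise ValueError("the loop body runs at least once")
    # Taken on every iteration except the last, where it falls through.
    return (iterations - 1) * TAKEN_CYCLES + NOT_TAKEN_CYCLES


print(expected_branch_cycles(0.25))
print(loop_branch_cycles(10))
```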

Memory reads have a delay of 2 clock cycles after the modification of a pointer or index register that is needed in the address calculation. There is no delay for memory reads if the address registers are not modified in the preceding two instructions. Memory writes have a similar delay if address registers are modified within the preceding two instructions. The same delay applies if the register holding the value to write is modified within the preceding two instructions.
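
The rule above can be checked mechanically when scheduling code by hand. The Python sketch below flags every instruction whose address registers, or whose stored value register, were modified by one of the two preceding instructions. The `Instr` record and the instruction texts are hypothetical and written in an informal notation, not verified ForwardCom assembly syntax.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instr:
    text: str                                        # informal description only
    writes: set = field(default_factory=set)         # registers modified
    address_regs: set = field(default_factory=set)   # pointer/index registers used
    stored_reg: Optional[str] = None                 # value register of a memory write


def find_address_hazards(program):
    """Return (index, registers) for each instruction that would be delayed."""
    hazards = []
    for i, instr in enumerate(program):
        needed = set(instr.address_regs)
        if instr.stored_reg is not None:
            needed.add(instr.stored_reg)
        # Registers written by the two preceding instructions.
        recent = program[max(0, i - 2):i]
        recently_written = set().union(*(p.writes for p in recent))
        clash = needed & recently_written
        if clash:
            hazards.append((i, sorted(clash)))
    return hazards


program = [
    Instr("r1 = r1 + 8", writes={"r1"}),
    Instr("r2 = [r1]", writes={"r2"}, address_regs={"r1"}),   # r1 just modified
    Instr("r3 = r3 + 1", writes={"r3"}),
    Instr("r4 = r4 + 1", writes={"r4"}),
    Instr("[r5] = r2", address_regs={"r5"}, stored_reg="r2"), # r2 is old enough
]
print(find_address_hazards(program))
```

Moving two independent instructions between the pointer update and the memory access, as in the last three lines of the example, is one way to hide the delay.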

All instructions start to execute in order, but they do not necessarily finish in order. Two instructions can finish in the same clock cycle. Multiple values of the same logical register can be in flight at the same time. There is no performance penalty for masked (predicated) instructions.

Instructions implemented

Data move instructions:

move: read, write, or move data.

push: save one or more registers on stack. Optional direction and operand size.

pop: restore one or more registers from stack. Optional operand size.

address: get address of variable or function.

Arithmetic instructions:

add, subtract, multiply, multiply high, divide, modulo, abs, min, max. Signed and unsigned.

Complex arithmetic instructions:

add_add: 3-operand add/subtract Y = ±A ±B ±C

mul_add: Y = ±A*B ±C

roundp2: round up or down to nearest power of 2.
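
The three complex arithmetic instructions can be modeled as follows. This Python sketch is an illustration under stated assumptions, not the softcore's implementation: the keyword arguments that select the signs and the rounding direction are hypothetical stand-ins for the instruction's option bits, results are shown as unsigned bit patterns of the chosen width, and the behavior of roundp2 for zero is not described in the text, so the sketch rejects it.

```python
def add_add(a, b, c, neg_a=False, neg_b=False, neg_c=False, bits=64):
    """Y = ±A ±B ±C, wrapped to `bits` bits."""
    total = (-a if neg_a else a) + (-b if neg_b else b) + (-c if neg_c else c)
    return total & ((1 << bits) - 1)


def mul_add(a, b, c, neg_product=False, neg_c=False, bits=64):
    """Y = ±A*B ±C, wrapped to `bits` bits."""
    product = -(a * b) if neg_product else a * b
    return (product + (-c if neg_c else c)) & ((1 << bits) - 1)


def roundp2(x, up=False):
    """Round a positive integer down (default) or up to a power of 2."""
    if x <= 0:
        raise ValueError("only positive inputs are modeled here")
    down = 1 << (x.bit_length() - 1)
    if not up or down == x:
        return down
    return down << 1


print(add_add(5, 3, 2, neg_b=True))   # 5 - 3 + 2
print(mul_add(3, 4, 5, neg_c=True))   # 3*4 - 5
print(roundp2(100), roundp2(100, up=True))
```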

Boolean instructions:

and, or, xor, select_bits

truth_tab3: universal 3-operand boolean instruction, using truth table
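
A truth-table instruction of this kind can be understood bit by bit: the three input bits at each position form a 3-bit index that selects one bit of an 8-bit table. The Python sketch below models that idea. The assignment of operands to index bits (first operand as bit 0, third as bit 2) is an assumption; the example tables are symmetric in the three inputs, so they do not depend on it.

```python
def truth_tab3(a, b, c, table, bits=64):
    """Apply an 8-bit truth table to each bit position of a, b, and c."""
    result = 0
    for pos in range(bits):
        # Assumed ordering: a supplies index bit 0, b bit 1, c bit 2.
        index = (((a >> pos) & 1)
                 | (((b >> pos) & 1) << 1)
                 | (((c >> pos) & 1) << 2))
        result |= ((table >> index) & 1) << pos
    return result


AND3 = 0x80       # only index 7 (all inputs 1) gives 1
XOR3 = 0x96       # indices with an odd number of 1-bits: 1, 2, 4, 7
MAJORITY = 0xE8   # indices with at least two 1-bits: 3, 5, 6, 7

a, b, c = 0b1100, 0b1010, 0b0110
print(bin(truth_tab3(a, b, c, MAJORITY, bits=4)))
```

Any of the 256 possible three-input boolean functions can be expressed by choosing the table value, which is what makes a single instruction of this kind universal.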

Bit manipulation instructions:

shift left, shift right signed, shift right unsigned, funnel shift, rotate.

Set bit, clear bit, toggle bit, bit scan forward/reverse.

move_bits: universal bit field manipulation

popcount: count 1-bits.
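
For readers unfamiliar with these operations, the following Python sketch gives reference models of popcount, bit scan, and rotate. It is an illustration only. Treating "forward" as a search from the lowest bit and "reverse" as a search from the highest bit follows the common x86 convention and is an assumption here, as is returning -1 when no bit is set.

```python
def popcount(x, bits=64):
    """Count the 1-bits in the low `bits` bits of x."""
    return bin(x & ((1 << bits) - 1)).count("1")


def bit_scan_forward(x):
    """Index of the lowest set bit, or -1 if x is zero (assumed convention)."""
    if x == 0:
        return -1
    return (x & -x).bit_length() - 1


def bit_scan_reverse(x):
    """Index of the highest set bit, or -1 if x is zero (assumed convention)."""
    return x.bit_length() - 1


def rotate_left(x, count, bits=64):
    """Rotate the low `bits` bits of x left by `count` positions."""
    mask = (1 << bits) - 1
    x &= mask
    count %= bits
    return ((x << count) | (x >> (bits - count))) & mask


print(popcount(0b1011))
print(bit_scan_forward(0b101000), bit_scan_reverse(0b101000))
print(hex(rotate_left(0x81, 1, bits=8)))
```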

Compare and test instructions:

compare: <, <=, >, >=, ==, !=. Signed and unsigned. Additional boolean operation can be added at no cost.

test_bit: test an indicated bit. Additional boolean operation can be added at no cost.

test_bits_and: Test AND-combination of indicated bits. Additional boolean operation can be added at no cost.

test_bits_or: Test OR-combination of indicated bits. Additional boolean operation can be added at no cost.

Jump and call instructions:

jump: direct or indirect jump. Multiway jump with index into table of relative addresses.

call: direct or indirect function call. Multiway call with index into table of relative function pointers.

return: return from function.

Other control transfer instructions:

add/subtract and branch if zero, positive, negative, overflow, carry; or the inverse of these.

compare and branch if <, <=, >, >=, ==, !=. Signed and unsigned.

increment counter and branch if below specified limit (for loop).

subtract maximum vector length and branch if positive (vector loop).

test_bit: branch if an indicated bit is true/false

test_bits_and: branch if AND-combination of indicated bits is true/false

test_bits_or: branch if OR-combination of indicated bits is true/false
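
The two loop instructions in the list above combine an arithmetic step with the closing branch of a loop. The Python sketch below models the control flow they imply. It is a hedged illustration: the function names are hypothetical, the body is assumed to run before the branch is evaluated (so it executes at least once), and lengths may be in any consistent unit.

```python
def counted_loop_indices(start, limit):
    """Model of 'increment counter and branch if below specified limit'."""
    indices = []
    i = start
    while True:
        indices.append(i)       # loop body runs with the current counter
        i += 1                  # increment counter
        if not i < limit:       # branch back only while below the limit
            break
    return indices


def vector_loop_chunks(total_length, max_vector_length):
    """Model of 'subtract maximum vector length and branch if positive'."""
    chunks = []
    remaining = total_length
    while True:
        # The body handles at most one full vector per iteration.
        chunks.append(min(remaining, max_vector_length))
        remaining -= max_vector_length   # subtract maximum vector length
        if not remaining > 0:            # branch back only if still positive
            break
    return chunks


print(counted_loop_indices(0, 3))
print(vector_loop_chunks(10, 4))
```

In the vector loop model the last iteration handles whatever is left over, so no separate remainder loop is needed.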