A softcore processor for ForwardCom is currently available (model A, version 1.01) with support for all integer instructions. It is useful for embedded systems.
A manual for this softcore is provided in the GitHub repository.
Instruction operands can be registers, immediate constants, or memory operands.
8-bit, 16-bit, 32-bit, and 64-bit integers, signed and unsigned.
Half-precision, single-precision, and double-precision floating point are not supported in this version.
Variable-length vector registers are not supported in this version.
The maximum throughput is one instruction per clock cycle. The latency is one clock cycle for most instructions.
Multiplication and mul_add instructions have a latency of five clock cycles and a throughput of one multiplication per clock cycle. Division has a latency of three clock cycles plus one additional clock cycle for every two significant bits in the result. It is not possible to start a new division before a previous division is finished. Multi-register push and pop instructions take one clock cycle for each register plus a single additional clock cycle for adjusting the stack pointer. All other arithmetic and logic instructions have a latency of one clock cycle.
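The division and push/pop timings above can be turned into a small cycle estimate. The following Python sketch is only an illustration: the function names are hypothetical, and it assumes that an odd number of significant bits is rounded up to the next full clock cycle and that the bit count is taken from the magnitude of the result. The text above does not state either of these details.

```python
def division_latency(quotient: int) -> int:
    """Estimated division latency in clock cycles.

    3 base cycles plus 1 cycle per two significant bits in the result.
    Assumption: a leftover single bit costs a full extra cycle (round up),
    and the significant bits are counted on the magnitude of the quotient.
    """
    significant_bits = abs(quotient).bit_length()
    return 3 + (significant_bits + 1) // 2


def push_pop_cycles(num_registers: int) -> int:
    """Cycles for a multi-register push or pop: one per register,
    plus one cycle for adjusting the stack pointer."""
    return num_registers + 1


if __name__ == "__main__":
    for q in (0, 1, 255, 256):
        print(f"quotient {q}: about {division_latency(q)} cycles")
    print(f"push of 4 registers: {push_pop_cycles(4)} cycles")
```

Under these assumptions, a small quotient finishes much sooner than a full 64-bit one, which may matter when estimating worst-case timing in an embedded system.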
Unconditional direct jumps, calls, and returns have a latency of 2 clock cycles. Conditional jumps have a latency of 7 clocks when taken and 6 clocks when not taken. Indirect and multiway jumps and calls have a latency of 7 clocks.
Memory reads have a delay of 2 clocks after the modification of a pointer or index register that is needed in the address calculation. There is no delay for memory reads if address registers are not modified in the preceding two instructions. Memory writes have a similar delay if address registers are modified within the preceding two instructions. The same delay applies if the register holding the value to write is modified within the preceding two instructions.
All instructions start to execute in order, but they do not necessarily finish in order. Two instructions can finish in the same clock cycle. Multiple values of the same logical register can be in flight at the same time. There is no performance penalty for masked (predicated) instructions.
move: read, write, or move data.
push: save one or more registers on stack. Optional direction and operand size.
pop: restore one or more registers from stack. Optional operand size.
address: get address of variable or function.
add, subtract, multiply, multiply high, divide, modulo, abs, min, max; signed and unsigned.
add_add: 3-operand add/subtract Y = ±A ±B ±C
mul_add: Y = ±A*B ±C
roundp2: round up or down to nearest power of 2.
and, or, xor, select_bits
truth_tab3: universal 3-operand boolean instruction, using truth table
shift left, shift right signed, shift right unsigned, funnel shift, rotate.
set bit, clear bit, toggle bit, bit scan forward/reverse.
move_bits: universal bit field manipulation
popcount: count 1-bits.
compare: <, <=, >, >=, ==, !=; signed and unsigned. An additional boolean operation can be added at no cost.
test_bit: test an indicated bit. An additional boolean operation can be added at no cost.
test_bits_and: test AND-combination of indicated bits. An additional boolean operation can be added at no cost.
test_bits_or: test OR-combination of indicated bits. An additional boolean operation can be added at no cost.
jump: direct or indirect jump. Multiway jump with index into table of relative addresses.
call: direct or indirect function call. Multiway call with index into table of relative function pointers.
return: return from function.
add/subtract and branch if zero, positive, negative, overflow, carry; or the inverse of these.
compare and branch if <, <=, >, >=, ==, !=; signed and unsigned.
increment counter and branch if below specified limit (for loop).
subtract maximum vector length and branch if positive (vector loop).
test_bit: branch if an indicated bit is true/false
test_bits_and: branch if AND-combination of indicated bits is true/false
test_bits_or: branch if OR-combination of indicated bits is true/false
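Two of the less common instructions in the list above, truth_tab3 and roundp2, may be easier to follow with a software model. The Python sketch below is an illustration only and is not taken from the softcore's source. It assumes that the truth table is an 8-bit value indexed by one bit from each of the three operands, with the first operand supplying the least significant index bit. It also assumes that roundp2 is applied to positive integers only. The function names, the `width` parameter, and the error handling are hypothetical.

```python
def truth_tab3(a: int, b: int, c: int, table: int, width: int = 64) -> int:
    """Bitwise 3-input boolean function selected by an 8-bit truth table.

    Assumed index order: bit from a = index bit 0, b = bit 1, c = bit 2.
    """
    result = 0
    for i in range(width):
        index = ((a >> i) & 1) | (((b >> i) & 1) << 1) | (((c >> i) & 1) << 2)
        result |= ((table >> index) & 1) << i
    return result


def roundp2_down(x: int) -> int:
    """Largest power of 2 that is <= x (sketch: positive x only)."""
    if x <= 0:
        raise ValueError("sketch handles positive values only")
    return 1 << (x.bit_length() - 1)


def roundp2_up(x: int) -> int:
    """Smallest power of 2 that is >= x (sketch: positive x only)."""
    if x <= 0:
        raise ValueError("sketch handles positive values only")
    return 1 << (x - 1).bit_length()


if __name__ == "__main__":
    a, b, c = 0b1100, 0b1010, 0b1111
    print(bin(truth_tab3(a, b, c, 0x80, width=4)))  # table 0x80: three-input AND
    print(bin(truth_tab3(a, b, c, 0xFE, width=4)))  # table 0xFE: three-input OR
    print(bin(truth_tab3(a, b, c, 0x96, width=4)))  # table 0x96: three-input XOR
    print(roundp2_down(5), roundp2_up(5))
```

The three tables used in the demonstration are symmetric in their inputs, so they give the same result regardless of the assumed operand order. A non-symmetric table, such as a bitwise select, would depend on that order.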