
Proposals for next version

Posted: 2022-10-17, 12:41:12
by agner
I am beginning to work on version 1.12 of ForwardCom. I have several proposals for changes that I will open up for discussion:

1. Speculative memory read
I would like an option for reading from memory without causing a trap in case the address is invalid. A read from an illegal address should produce zero in a speculative read.

This can be useful in several situations:
  • Speculative execution of a code branch before knowing whether it will be used. Here, I mean software speculation, not hardware speculation. The software will execute two branches and afterwards select one of the results with a conditional move rather than a branch. A read error in the discarded branch will be ignored.
  • Vectorizing a 'while' loop before the loop count is known. Assume, for example, that a while loop reads one vector element per iteration. We want to vectorize the loop and read, for example, 16 vector elements at a time into a vector register. After handling 16 elements, we find that the loop should be iterated only 5 times. The remaining 11 vector elements are discarded. We don't want a trap in case the unused vector elements happen to lie beyond the limit of readable memory.
  • Searching for the end of a zero-terminated string. This is probably the most common case of the above point. When implementing a strlen() function, we want to read one full-length vector register at a time and test if it contains a zero. We don't want an error trap if the full-length read happens to go beyond the limit of readable memory.
  • Popping a vector register off the stack. We don't know the length of the vector in advance. We can do the pop in two steps: first read the length, and then read the vector with this length. But we may want to save time by reading both the length and a maximum-length vector in a single memory operation. By this method we risk reading beyond the limit of readable memory.
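The strlen case can be sketched in plain C (an illustration of block-wise reading, not ForwardCom code; block_strlen is a made-up name). Note that the over-read past the terminator is only safe here because the test buffer is padded, which is precisely the restriction a speculative read would remove:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Block-wise strlen: read 8 bytes per iteration, the scalar analogue of
   reading one full vector register at a time. The read may extend past
   the terminator; on real hardware that read could fault, which is why
   a speculative (non-trapping) read is wanted. */
size_t block_strlen(const char *s) {
    const uint64_t ones  = 0x0101010101010101ULL;
    const uint64_t highs = 0x8080808080808080ULL;
    for (size_t i = 0;; i += 8) {
        uint64_t v;
        memcpy(&v, s + i, 8);                 /* one full-width read */
        if ((v - ones) & ~v & highs) {        /* nonzero iff some byte is 0 */
            for (int k = 0; k < 8; k++)       /* locate the first zero byte */
                if (s[i + k] == 0)
                    return i + k;
        }
    }
}
```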
The Mill computer can do speculative reads by tagging register values as invalid (see link). ForwardCom does not have such tags. Instead I think we can just make the values zero. NAN is also a possibility, but only for floating point values.

We may discuss how to implement speculative reads. One possibility is to have a global flag that suppresses memory read traps. This has the disadvantage that errors in non-speculative code may go undetected. Instead, I would like to have a dedicated instruction for speculative read. This instruction should read a maximum-length vector and set any invalid part to zero. It could be implemented as a variant of the vector pop instruction.

The solution with an instruction that reads a full vector register speculatively would cover the most relevant uses, but it has certain limitations: it does not work for general purpose registers, it does not support all addressing modes, and it does not support read-modify instructions.

2. Reduce the number of instruction codes
Some similar multi-format instructions can be merged to use the same opcode and be distinguished by option bits instead. This will free opcodes for future extensions.
  • max, min, max_u, min_u. These instructions could be joined together and have option bits for distinguishing between min and max and between signed and unsigned. There should be an additional option bit for specifying how to handle NAN inputs.
  • MUL_ADD and MUL_ADD2. These instructions have option bits anyway for specifying operand signs. An extra option bit could specify the order of operands that distinguish MUL_ADD and MUL_ADD2.
  • div and div_u. Signed and unsigned division. These instructions have option bits anyway for specifying rounding mode. div_rev would be similar.
  • rem and rem_u. Signed and unsigned remainder.
  • mul_hi and mul_hi_u. Signed and unsigned high part of extended multiply.
Opcode values for multi-format instructions are a limited resource. We can save 7 values if we make all of the above changes. The assembler will be able to insert the relevant option bits automatically. The disadvantage of this proposal is that the instructions will have to be coded in a two-word format if the option bits are not zero, while the current implementation allows a single-word encoding of instructions that do not need option bits for other purposes anyway.
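The min/max merger can be sketched in C as follows (an illustration only; the bit assignment of the option field is hypothetical, not taken from the spec):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the merged min/max instruction: one opcode, with option
   bit 0 selecting max vs. min and option bit 1 selecting unsigned.
   The bit positions are hypothetical. */
int64_t minmax(int64_t a, int64_t b, unsigned options) {
    int want_max    = (options >> 0) & 1;
    int is_unsigned = (options >> 1) & 1;
    int a_greater = is_unsigned ? ((uint64_t)a > (uint64_t)b) : (a > b);
    return (a_greater == want_max) ? a : b;
}
```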

3. Change naming of template fields
The field names IM2, IM3, and IM4 in the code templates are not unique. This should be changed to non-ambiguous names. This does not affect the code, only the documentation.

4. Optional dot product instruction
An optional vector instruction doing r[0] = ±a[0]*b[0] ±a[1]*b[1] with extended precision of the intermediate products. This will be useful for complex number multiplication, division, and dot products.
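How the proposed primitive maps onto complex multiplication can be shown in plain doubles (my own sketch; real hardware would keep the two intermediate products in extended precision before a single rounding, which this emulation does not do):

```c
#include <assert.h>

/* Plain-double emulation of the proposed instruction
   r = ±a[0]*b[0] ± a[1]*b[1]. */
static double dot2(double a0, double b0, double a1, double b1, int negate2) {
    double p = a1 * b1;
    return a0 * b0 + (negate2 ? -p : p);
}

/* Complex multiply built from two such operations:
   (ar + i*ai)*(br + i*bi) = (ar*br - ai*bi) + i*(ar*bi + ai*br) */
void cmul(double ar, double ai, double br, double bi, double *cr, double *ci) {
    *cr = dot2(ar, br, ai, bi, 1);
    *ci = dot2(ar, bi, ai, br, 0);
}
```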

Your comments and suggestions for other improvements are welcome.

Re: Proposals for next version

Posted: 2022-10-21, 14:31:00
by HubertLamontagne
1. Speculative memory read:

I'm not totally sure that this is a net win. I think there are arguments for and against:
- Speculative memory reads can be partly emulated for vectorized code on paged memory architectures. As long as the first byte read in a memory page really happens, we know that all subsequent reads in the same 4 KB memory page are automatically valid. AFAIK this property is frequently used in library code for things like string functions: check alignment, check that the first read happens, then just plow ahead unrolled all the way to the next 4 KB page boundary, and you're guaranteed not to have page faults even if you over-read. (ForwardCom doesn't have 4 KB paging, but I imagine it has some other mechanisms that would allow for this kind of thing.)
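The page-boundary argument in numbers (a conceptual sketch, not library code; the helper name is my own): once one byte of a 4 KB page is known readable, a loop may over-read up to the next boundary without faulting, and this helper computes how far that is.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { PAGE_SIZE = 4096 };

/* Bytes remaining in the 4 KB page containing p; an over-reading loop
   may safely read at most this many bytes starting at p. */
size_t bytes_to_page_end(const void *p) {
    uintptr_t a = (uintptr_t)p;
    return PAGE_SIZE - (a & (PAGE_SIZE - 1));
}
```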

- Itanium ran into big problems with this sort of design (it made out-of-order implementations difficult). Though it went much further with all the predicated, checked, speculative, advanced-load etc operations...

- On out-of-order processors, conditional loads, masked loads, branch-then-load and prefetch loads have similar latency-mitigation properties.

- Afaik, load traps are heavily used in applications like CPU virtualization. You'd have to set a flag that turns speculative loads into normal loads in those cases. Or it might be possible to avoid this if you could have speculative loads return a flag or a bitfield indicating load success, and then retry if it doesn't pass. Or alternatively, follow up speculative loads with a real load after computing, and only write the result if the late real load matches with the early speculative load. This requires somewhat more programmer discipline, and it still leaks just a bit of privileged data (which parts of memory are directly accessible or not, though this might not be a real problem).

2. Reduce the number of instruction codes : Sounds good to me

3. Change naming of template fields : Sounds good to me

4. Optional dot product instruction : Sounds good to me. It overlaps somewhat with multiply-then-pairwise-add, perhaps it could share hardware with those operations.

Re: Proposals for next version

Posted: 2022-10-21, 16:32:26
by Kulasko
These are my current thoughts:

1. Speculative memory read
This would eliminate the need for padding space as described in chapter 9.1 of the current specification. As far as I understand, the ForwardCom linker (and maybe also the assembler) currently allocates a static memory region between constants and code as padding space. This effectively limits the "safe" size of vectors, which goes against the idea of vector-size-independent code. Therefore, I am generally in favor of this suggestion.

As for the actual implementation, I am not sure what the best approach is.

2. Reduce the number of instruction codes
Requiring option bits from the E template for anything even remotely common would inflate the code size of simple operations. I believe that this would be optimizing in the wrong direction, negating some gains from putting more work in an average instruction.

Aside from just not doing it, I see two alternatives to this suggestion:
  • Introducing option bits into one or multiple single-word formats, or generally using an immediate field as option bits. This has the advantage of enabling instructions requiring only two registers and no immediate field to still be coded in a single instruction word.
  • Allocating multi-format instruction space using OP2 in the E templates. This might require rearranging the few single format instructions currently using them, maybe changing multi-word formats as a result.
Also, I generally would like to see some form of tiny instructions reintroduced, but only after a lengthy period of analyzing written programs. I believe that aggressively optimizing the most common instructions will yield the highest net efficiency.

3. Change naming of template fields
I really like this suggestion and I see absolutely no caveats in it.

4. Optional dot product instructions
I believe that with the way current CPU computing goes, this is a good idea to implement. This also fits with the more-work-per-instruction philosophy.


In addition to this, I found a few noteworthy things during writing ForwardCom programs:
A. There is a horizontal boolean combination instruction in "bool_reduce"
I would be interested in the rationale of supporting this when other horizontal combinations aren't. I can imagine a few things:
  • They only need two wires per element, making them more economical than sum or product instructions.
  • They can save multiple combinations in a single element, saving an additional register when both OR and AND combinations are needed.
In any case, since boolean vectors in ForwardCom are sparse, this instruction still has variable latency and needs additional hardware in the ALUs.

B. Combining vector loops with normal loops doesn't work too well
A use case would be an algorithm that combines each element inside an array with each element beyond itself in the same array, as seen in my "n_body" benchmark program.
Since efficient vector loops are more or less forced to use negative offset addressing and scalars don't have access to it, not even the base pointer can be shared. A bit more flexibility would go a long way in this case.

C. Vector for loops are confusing in assembly
They are currently written like so:

for (float64 v0 in [r0 - r1]) {}
It is not clear what they actually assemble to. Only after assembling and disassembling again did I find out that the instruction touches neither the vector register nor the base pointer.
There are at least two possible solutions to this:
  • Make the vector loop perform a memory read in order to use all given information.
  • Redefine the syntax to only mention the length. An example:

       for length r0 {}  // keyword for vector loops
       for (int64 r0) {} // implicit vector, explicit type
       

Re: Proposals for next version

Posted: 2022-10-21, 17:51:24
by Kulasko
Oh, I forgot one thing.
D. No gather & scatter instructions
Granted, these would probably be implemented by a state machine splitting them into individual loads and stores; they are not really needed when hand-coding assembly, and their efficiency gain is debatable. Even so, compilers love these instructions for auto-vectorization.
In higher-level code, there often are loops working on an array of objects but using just one or two fields. Compilers tend not to vectorize such loops at all, but will do so when these instructions are available. I think that in ForwardCom, we want to give a program as many opportunities to run vector code as possible.
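The loop shape in question looks like this in C (my own minimal example of an array-of-structs loop; field and function names are made up). The mass values are not contiguous in memory, so a vectorizer needs gather loads for it, or gives up:

```c
#include <assert.h>
#include <stddef.h>

struct body { double x, y, z, mass; };

/* Only one field per struct is read, so the loads are strided:
   exactly the pattern a gather instruction serves. */
double total_mass(const struct body *b, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += b[i].mass;
    return sum;
}
```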

Re: Proposals for next version

Posted: 2022-10-23, 14:06:55
by agner
Thank you for your comments.

Regarding speculative read
The proposed speculative read should be a special instruction for this purpose. Advantages: no waste of RAM for padding space. Disadvantages: messy hardware logic; does not allow all addressing modes.

The alternative is to insert a padding space whenever a readable memory section is not followed by another readable section. The padding space may have a fixed size or a size equal to the maximum vector length. A fixed size is easier to implement, but not optimal. A padding space equal to the maximum vector length cannot be inserted by the linker if the linker does not know the maximum vector length. Instead, it can be inserted by the loader if we assume that the loader knows the maximum vector length. In this case, the executable file must contain relocation records indicating all relative links between the read-only data section and a non-readable code section. The executable file already contains this information in the case that it is re-linkable. The compiler may perhaps insert information about whether this padding space is needed or not.

The traditional solution is to rely on memory page boundaries. The strlen function does this in current systems. This will not work well on ForwardCom because there are no fixed-size memory pages. We could specify a minimum granularity size for memory sections. This would provide a workable solution for the strlen function, but it would be too complicated to use for the general case of vectorizing a while() loop where the loop count is not known in advance.

The Mill computer has never been implemented in hardware, AFAIK, so we don't know how efficient its speculative read is.

Reducing the number of instruction codes
You are right, Kulasko, that some instructions will need a double-size instruction format as a consequence of this, though I have chosen some less common instructions. Maybe it is better to reserve the compact instruction formats for the current instructions and let future instructions have double size instead. This assumes that the most common instructions are the ones already defined.

I don't see any useful potential for tiny (half-size) instructions. Tiny instructions packed into pairs have only 14 bits for the opcode and operands. My initial work showed that they were not very useful. A model analogous to ARM Thumb instructions has 16 bits for each instruction, but then you need extra instructions for switching between normal and Thumb mode, and you need a smaller granularity in the code cache and code addresses. Thumb instructions have very little space for operands, constants, and addresses, so you will need more instructions for doing the same thing. I think that the system with more work per instruction is more efficient.

Optional dot product instruction
The main purpose of this instruction is not to do more work per instruction, but to solve a precision problem. When I code a complex number multiply function using fused multiply-and-add, I have the problem that one multiplication has higher precision than the other so that complex number products A*B and B*A may give slightly different results. It is more complicated to get full precision and assure A*B = B*A. Intel's new AVX512-FP16 instruction set has a complex number multiply instruction, but it is reducing the precision to avoid A*B ≠ B*A.

Horizontal bool reduce
I have defined this instruction because it is often needed and it is not too costly to implement in hardware. In vector code, we often have to check for special conditions that require extra code. We can use a horizontal bool instruction to check if any of the vector elements need the special condition branch. If not, then we can skip a possibly costly branch.
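The typical use can be sketched in C (the reduce is emulated here with a scalar OR loop; the function name is my own): test whether any element needs a special-case path, and skip that path entirely if not.

```c
#include <assert.h>
#include <stddef.h>

/* Horizontal OR of a per-element predicate, the scalar analogue of a
   horizontal bool reduce: returns 1 if any element is negative. */
int any_negative(const double *x, size_t n) {
    int any = 0;
    for (size_t i = 0; i < n; i++)
        any |= (x[i] < 0.0);
    return any;
}
```

A caller would branch on the result once per vector slice, e.g. taking a slow path for negative inputs to a square root only when needed.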

It may be wise to split this instruction into a horizontal 'and' and a horizontal 'or' instruction. In this way, we need only one cross-cutting wire per vector element rather than two. I assume that it is rare that we need both horizontal 'and' and 'or' on the same vector.

Horizontal addition instructions, and the like, are too costly to implement and they would have variable latency. Horizontal addition exists in x86 but it is less efficient than simpler alternatives.

Vector for loops are confusing in assembly
It is true that the vector loop instruction touches only the index register. The reason for the current syntax is that a vector loop will always need at least one data register and one base pointer. The instruction shows these registers to the reader. It does not show whether there is more than one data register or more than one base pointer. The description in the manual can be improved, I admit.

Combining vector loops with normal loops
You can easily combine a vector loop with other kinds of loops. An extra array that has a different offset, a different element size, or a different stride, can be accessed by setting up a pointer to the first element and incrementing this pointer inside the loop. This is no different from traditional systems. The vector loop instruction gives you the extra advantage of adjusting the vector length and fitting the last slice with a smaller vector.
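The pattern described here, expressed in C (a minimal sketch of the idea, not ForwardCom code): the second array, with a different element size, gets its own pointer that is advanced inside the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Main loop runs over the double array a; the byte array b has a
   different element size, so it is accessed through a separate
   pointer incremented inside the loop. */
void add_bytes(const double *a, const unsigned char *b, double *out, size_t n) {
    const unsigned char *pb = b;      /* own pointer, own stride */
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + (double)*pb;
        pb += 1;                      /* advanced inside the loop */
    }
}
```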

Gather and scatter
Gather and scatter instructions are complicated to implement in hardware. The advantage is dubious because the throughput is limited by the available memory ports anyway. It is more efficient to use permutation instructions if the memory range is limited to the size of one or a few vector registers.

It is possible to add gather and scatter instructions to a specific ForwardCom implementation if you are willing to throw enough hardware at it, but I don't think it should belong to the standard ForwardCom instruction set.

Re: Proposals for next version

Posted: 2022-11-30, 18:12:20
by HubertLamontagne
High/Low Priority Hint
One idea (I don't know if it makes any sense) would be to add a hint to instructions telling whether they should run with high or low priority. High priority would be used for instructions whose results influence conditional jumps and memory addressing, and low priority for instructions whose results are just going to be stored to memory. This would let conditionals and memory load/store operations run as fast as possible, reducing the amount of speculative execution needed.

I'm mostly playing with this idea in the context of an architecture that would be out-of-order but not speculative... Without speculation, load/store operations have to be initiated as early as possible because they might generate an access fault, and no downstream instruction can run until the address has been determined, leading to massive stalls unless addressing runs at very top priority...

Re: Proposals for next version

Posted: 2022-12-02, 8:02:12
by agner
Hubert, the idea of a priority hint makes sense. A NOP instruction has many unused bits that can be used as various hints. On the other hand, if the compiler knows which instructions should have high priority, it would simply put these instructions first. An out-of-order processor will reorder instructions only if it has a reason to do so. I wonder if there are any situations where an out-of-order processor will reduce performance by reordering unless there are hints telling it not to?

Re: Proposals for next version

Posted: 2022-12-02, 18:56:22
by HubertLamontagne
The compiler can't always reorder instructions in order of priority... for instance:

int32 r0 = [r1] // high priority
int64 r4 += r5 // high priority
int32 r0 += 9
int32 r0 *= 7
int32 r0 >>= 2
int32 r0 += 1
int32 r0 |= 32
int32 r0 >>= 1
int32 r0 += 1
int32 [r1] = r0 // high priority
int32 r2 = [r3+r4*4] // high priority

The first 2 instructions and last 2 instructions should be high priority (load and store instructions, plus address determination) while the other instructions are low priority. In theory you'd want to run the last load (int32 r2 = [r3+r4*4]) as fast as possible, but it's only possible if r1 and r3+r4*4 don't point to the same memory address, which means that unless the compiler's alias analysis can prove that r1 and r3+r4*4 can never be the same, the last load operation can't be run early. As far as I can tell, a big part of why out-of-order CPUs have been so successful is that they can deal with [r1] vs [r3+r4*4] aliasing in real time, and run the load/store/load operations pretty much back-to-back and defer the math operations and the storing of the calculated value.

On bigger out-of-order CPUs, the cpu can even reorder all 3 of the load and store operations, which makes priority hints useless I think, so I'm not sure any of this makes sense indeed.

Re: Proposals for next version

Posted: 2022-12-03, 6:31:31
by agner
You are right that speculation is quite expensive in terms of hardware. ForwardCom needs less speculation because there are no numerical error traps. A processor with out-of-order capabilities but no speculation might be attractive. (It could decode both sides of a 2-way branch to reduce pipeline bubbles).

Hubert wrote:
you'd want to run the last load (int32 r2 = [r3+r4*4]) as fast as possible
What about prefetch? ( int32 prefetch ([r3+r4*4]) )

Re: Proposals for next version

Posted: 2023-01-26, 9:44:02
by Moonchild
I would like to see a vector sum scan instruction, at least for floating-point numbers; possibly also integers, but the advantage is not so clear there.

Re: Proposals for next version

Posted: 2023-01-26, 13:38:46
by agner
Moonchild wrote:
vector sum scan instruction
I don't understand what you mean. The manual explains how to make a horizontal vector sum.

Re: Proposals for next version

Posted: 2023-01-29, 15:54:35
by HubertLamontagne
I've played around with how to reduce the stalls from not knowing if instructions are going to run or not... and I've come up with "on the fly conditionalization"... for instance, the code:

int32 r0 = [r1]
int32 r2 = r3 + r4

You'd want r2 = r3 + r4 to run early, but you can't, because the memory read might have an access fault and the add might never run. But what you can do is have the front end of your CPU turn it into a conditional operation:

int32 r0 = [r1], set flag1 true if this operation runs and doesn't page fault
int32 r2 = if flag1, r3 + r4, else r2

The difference with the conditionalized version is that once the cpu's instruction decoder has turned it into a conditional operation, it's 100% guaranteed to run, regardless of what happens with the memory load or any previous operations. For register renaming, this lets you already permanently rename registers in the front end, instead of having to speculatively rename registers and have a rollback mechanism. I guess this is more micro-architecture than architecture though, and isn't specifically bound to forwardcom, and is more a way of making out-of-order execution easier.

Re: Proposals for next version

Posted: 2023-01-30, 6:53:28
by agner
Good idea, Hubert

I have thought about a cheaper solution, though less universal. ForwardCom is intended to avoid memory fragmentation. In small systems and embedded systems you may not need demand paged memory. If you don't have demand paging you don't need to be able to recover from memory access faults. You can simply 'panic' like in Rust. You will not need recoverable software traps at all. Floating point errors are signaled by NANs rather than traps. If you have no traps you don't need speculation at all. This makes the hardware much simpler, but of course only in systems where demand paging can be avoided completely.

Re: Proposals for next version

Posted: 2023-01-31, 8:17:35
by Moonchild
agner wrote (2023-01-26, 13:38:46):
Moonchild wrote:
vector sum scan instruction
I don't understand what you mean. The manual explains how to make a horizontal vector sum.
Sum scan of an array [a, b, c, d, ...] is [a, a+b, a+b+c, a+b+c+d, ...]. Having an instruction to do this on a single vector is useful in its own right, and can be used to build the full-array version more efficiently.
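The definition, made concrete in C (plain scalar code; function names are my own). The second function computes the same scan as log2(n) rounds of shifted adds (the Hillis-Steele scheme), which is the kind of structure a parallel hardware implementation could use:

```c
#include <assert.h>
#include <stddef.h>

/* Inclusive sum scan: out[i] = x[0] + ... + x[i]. */
void sum_scan(const double *x, double *out, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += x[i];
        out[i] = acc;
    }
}

/* Same result as log2(n) rounds of shifted adds (Hillis-Steele). */
void sum_scan_log_step(double *x, size_t n) {
    for (size_t shift = 1; shift < n; shift <<= 1)
        for (size_t i = n; i-- > shift; )   /* descend so old values are read */
            x[i] += x[i - shift];
}
```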

Re: Proposals for next version

Posted: 2023-01-31, 9:19:39
by agner
@Moonchild:
Sum scan of an array is a complex instruction with long data-dependent latency. This would require microcode. ForwardCom avoids microcode.

Re: Proposals for next version

Posted: 2023-01-31, 10:01:21
by Moonchild
Why does it require microcode? I mean, microcode would presumably not be faster than a software solution; but presumably it _is_ possible to accelerate with dedicated hardware, no?

Re: Proposals for next version

Posted: 2023-02-01, 5:57:39
by agner
@Moonchild

It is too complex for a single instruction. It requires many additions and horizontal data transfer across vector lanes. The latency would depend on the vector length.

People could propose a zillion other functions that they would like to have implemented in hardware, including elementary mathematical functions. The possible gain in performance should be weighed against the increase in hardware complexity. High complexity requires a large silicon area, which means a lower clock frequency.