Forwardcom possible execution pipeline?
I'm looking at the ForwardCom ops and trying to figure out what the pipeline of a typical implementation would look like. Here's what I have so far:
Stage 1: Align
The goal of this stage is to take the raw current and previous icache lines (presumably something like ~16 bytes each) and align individual instructions to the instruction decoders, using the IL field and detecting tiny ops. This probably requires some kind of simple prefetch queue, because instructions can straddle cache line boundaries.
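Here's a rough C sketch of what that align loop could do. The IL-to-length mapping and the tiny-op packing below are my own guesses for illustration, not the exact encoding from the manual:

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed mapping from the IL field (taken here as the top two bits of the
   first code word) to instruction length in 32-bit words. This is a guess
   for illustration, not the manual's exact encoding:
   IL 0 or 1 -> 1 word (an IL=0 word assumed to hold two tiny half-size ops),
   IL 2 -> 2 words, IL 3 -> 3 words. */
static unsigned length_in_words(uint32_t first_word)
{
    unsigned il = first_word >> 30;
    return (il <= 1) ? 1u : il;
}

/* Walk a fetch buffer (current + previous cache line already merged by the
   prefetch queue) and hand instruction start offsets to the decoders.
   Splitting a tiny pair into two decoder slots is left to the decoder here. */
static size_t align_instructions(const uint32_t *buf, size_t words,
                                 size_t *starts, size_t max_decoders)
{
    size_t n = 0, pos = 0;
    while (pos < words && n < max_decoders) {
        unsigned len = length_in_words(buf[pos]);
        if (pos + len > words)
            break;              /* instruction straddles the buffer end;
                                   leftover words wait for the next cycle */
        starts[n++] = pos;
        pos += len;
    }
    return n;
}
```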
Stage 2: Decode into micro-ops
Looking at the instruction set, a typical implementation would need five micro-op queues: ALU, Memory, Vector, plus ALU-Recombine and Vector-Recombine for all operations with masks and interrupts, in order to combine newly calculated results and issue faults if necessary. This stage must also figure out which register port accesses are needed by the next stage (rename), which means it must distinguish 0-, 1-, 2- and 3-operand operations (e.g. mov is different from add because one of the register file ports is inactive). Which micro-ops are present for a given instruction depends on the format (a rough sketch of this routing follows the list):
- ALU: 0.0, 0.1, 1.0, 1.1, 1.4, 1.5, Tiny 1..13, 2.0.0, 2.0.1, 2.0.2, 2.0.3, 2.0.6, 2.0.7, 2.1, 2.5, 2.0 M=1, 2.1 M=1, 3.0.0, 3.0.2, 3.0.3, 3.0.7, 3.1, 3.0 M=1, PLUS all vector ops that use a mask, because of the dependency on rounding modes...
- Memory: 0.4, 0.5, 0.6, 0.7, 0.0 M=1, 0.1 M=1, Tiny 14, 15, 30, 31, 2.0.0 M=0, 2.0.1 M=0, 2.0.2 M=0, 2.0.3 M=0, 2.1 M=0, 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.2.4, 2.4, 3.0.0 M=0, 3.0.2 M=0, 3.0.3 M=0, 3.2.0, 3.2.1, 3.2.2, 3.2.3. This probably needs an extra micro-op for all instructions with a variable-size operand (which includes all Vector+Memory ops...) and variable range checks, though these could possibly be done in the recombine pipeline too.
- Vector: 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 1.2, 1.3, Tiny 16..31, 2.2.0, 2.2.1, 2.2.2, 2.2.3, 2.2.4, 2.2.6, 2.2.7, 2.3, 2.4, 2.6, 3.2.0, 3.2.1, 3.2.2, 3.2.3, 3.2.7, 3.3
- ALU-Recombine: All ALU ops if NUMCONTR bits 6 or 7 are set or any mask byte is 0; all ALU ops that use a mask
- Vector-Recombine: All Vector ops if NUMCONTR bits 6, 7, or 26-29 are set or any mask byte is 0; and all vector ops that use a mask
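As a rough illustration of that routing (the names and the exact queue split are my own sketch, not anything from the manual), the decode stage could emit something like this per instruction:

```c
#include <stdint.h>
#include <stdbool.h>

/* The five micro-op queues proposed above. */
enum uop_queue { Q_ALU, Q_MEM, Q_VEC, Q_ALU_RECOMB, Q_VEC_RECOMB, Q_COUNT };

/* Hypothetical per-instruction decode result: which queues get an entry
   and which register read ports the rename stage must reserve. */
typedef struct {
    bool    queue_used[Q_COUNT];
    uint8_t reg_read_ports;   /* 0..3 register operands actually read     */
    bool    has_mask;         /* a mask adds a read port and a recombine  */
                              /* micro-op in this scheme                   */
} decoded_insn;

/* Illustration only: a masked vector op with a memory source operand would
   need a Memory micro-op, a Vector micro-op and a Vector-Recombine micro-op. */
static decoded_insn example_masked_vector_load_op(void)
{
    decoded_insn d = {0};
    d.queue_used[Q_MEM]        = true;
    d.queue_used[Q_VEC]        = true;
    d.queue_used[Q_VEC_RECOMB] = true;
    d.reg_read_ports           = 3;   /* base pointer, source vector, mask */
    d.has_mask                 = true;
    return d;
}
```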
Stage 3: Rename and queue to execution units
From here on, the pipeline is probably going to look a lot more like other CPUs.
Stage 4..N: Execution
Re: Forwardcom possible execution pipeline?
Thanks for your detailed analysis.
I think it is more efficient not to split instructions into micro-ops. The same entry in the pipeline or reservation station will access the address generation unit, read memory operands, wait for missing operands, and go to an execution unit. You are right that the decoder needs to detect the number of operands. But the assembler actually fills all unused register fields with the same number as a used register operand to avoid false dependencies.
I don't think we need any recombination of results. All unused parts of a destination register are set to zero by design, rather than left unchanged. A masked-off part of the destination is not left unchanged, but set to a fallback value. The fallback value can be specified if there is a vacant register field, i.e. if an instruction with fewer than three operands uses a format that allows three operands. Otherwise the fallback value is the first source operand.
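In C-like pseudocode, the per-element rule described above is simply the following (the names are illustrative only, with scalars standing in for vector elements):

```c
#include <stdint.h>

/* Per-element semantics of a masked instruction as described above:
   if the mask bit is set, the element gets the computed result;
   if not, it gets the fallback value (a separately specified register,
   or the first source operand when no register field is vacant). */
static int64_t masked_element(int mask_bit, int64_t op_result,
                              int64_t fallback)
{
    return mask_bit ? op_result : fallback;
}

/* Example: masked add over an n-element vector. */
static void masked_add(const int64_t *a, const int64_t *b,
                       const uint8_t *mask, const int64_t *fallback,
                       int64_t *result, int n)
{
    for (int i = 0; i < n; i++)
        result[i] = masked_element(mask[i] & 1, a[i] + b[i], fallback[i]);
}
```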
I don't think it will be worthwhile to distinguish between instructions that can generate traps and those that cannot, because the mask value may not be known in advance.
I have started to make an emulator now, and it is pretty straightforward. It will be easier to make the FPGA code when the logic of the emulator can be reused. (It was much more complicated to make the linker with relinking and other novel features).
Re: Forwardcom possible execution pipeline?
This is a sketch I made while trying to follow the proposal in the manual. It lacks detail, but gives a general idea of what a ForwardCom CPU core might look like. It is microcode-less, but has a fairly complex pipeline compared to a pure RISC design.
Re: Forwardcom possible execution pipeline?
Thank you Kulasko.
This is very similar to what I have in mind. The number of parallel units may vary, of course. Memory write may come after the ALUs, but there are few, if any, instructions that use both ALU and memory write.
Re: Forwardcom possible execution pipeline?
Kulasko:
Looking at your pipeline, Load+ALU ops being scheduled twice (once in the address scheduler going to the AGU, once in the scheduler going to the ALUs) has the same effect as having separate micro-ops for the load and the ALU part. This indeed addresses the reason for having split micro-ops: memory operations need to happen in a different order than ALU operations to be efficient, especially if you want to build a core where AGU operations run in-order to keep the pipeline simpler.
Agner is right that the actual memory store to the data cache and L2 cache needs to happen in the retirement part of the pipeline, because otherwise it's impossible to roll back a memory write, and most written values are known in roughly ALU processing order anyway.
It's true that recombination micro-ops are only needed if register file read ports are a limited resource, and I don't know how true that is with modern processes.
The trade-off is often different between the integer ALU and the vector unit. For the integer ALU, port economy is important: many operations read the general-purpose register file, you can't add latency cycles without making everything slower, and non-masked 2-operand operations typically dominate classic scalar integer code, so it makes sense to optimize for that.
For the vector unit, latency is much less important since you're running downstream from the data cache and often even from the L2, and you don't have to compete against all the memory addressing operations for ports, so an extra cycle of latency to make the register file very wide isn't much of a problem. Plus you always have the option of halving the number of vector ALUs while doubling the width of your vectors.
Re: Forwardcom possible execution pipeline?
Looking at the specs, there's one feature that's going to generate a ton of pipeline hardware complexity and that I'm 99% sure will never get used in any code:
Mask bit #6 "Generate a trap if unsigned integer overflow"
Mask bit #7 "Generate a trap if signed integer overflow"
- No language other than LISP uses these traps.
- This comment https://news.ycombinator.com/item?id=8767751 suggests that it can cost "5%+" of clock speed or even more.
- This forces every single ALU operation to have specialized hardware to generate both signed and unsigned overflow bits and produce a result that will be ignored 99% of the time. This is particularly bad for multiplication, because it forces you to do a full-width multiplication (64x64->128 instead of 64x64->64) on the off chance that there is a mask and a trap dependency using it. It is also bad for left shifts, because you now need a second barrel shifter just to compute the occasional potential trap condition.
- It makes mul_add different from the composition of mul and add, because the mul could overflow and the overflow could then be cancelled by the add after it; for instance, in 32-bit signed arithmetic, 65537 * 32769 would trap but 65537 * 32769 - 100000 would not (see the sketch after this list). This makes it impossible to break mul_add into two micro-ops for processor cores where that would be beneficial (for example, smaller cores). It also makes it impossible to break masked arithmetic operations into parts for pipelining reasons.
- It prevents compilers from optimizing code like if (bitfield & 1) a += 8; into branchless form, because the value computed on the untaken path could contain garbage and trigger a spurious trap.
- It makes the physical register file larger because the trap conditions or overflow values need to be saved. Granted, this only makes entries 66 bits instead of 64, but it generates heat and makes circuits bigger for something that will see little use.
- If this slows down the pipeline, you cannot do the Intel/AMD trick of replacing it with a multi-operation handler, because you'd have to do that on every single masked instruction, since the mask value can arrive extremely late (it could even have to wait on the L2 cache).
- All operations downstream from every single integer instruction become conditional. This adds more logic in the retirement unit because any other operation that retires in the same cycle now has to check for integer faults. Floating point exceptions cause a similar issue.
- It adds a whole "interrupt handling from integer faults" path all over the CPU, which will take design and verification time away from other, actually useful things.
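To make the mul_add point above concrete, here is a small C illustration assuming 32-bit signed arithmetic with trap-on-overflow semantics: a fused mul_add that checks only the final value would not trap, while a split mul micro-op would already have trapped on the intermediate product.

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int64_t a = 65537, b = 32769, c = -100000;

    int64_t product = a * b;        /* 2147581953: above INT32_MAX      */
    int64_t fused   = a * b + c;    /* 2147481953: back inside range    */

    printf("mul alone overflows 32-bit signed: %d\n", product > INT32_MAX);
    printf("fused mul_add overflows:           %d\n", fused   > INT32_MAX);
    return 0;
}
```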
Re: Forwardcom possible execution pipeline?
Regarding integer fault traps.
Yes, I would love to avoid fault trapping for both integer and floating point calculations altogether. In addition to the problems that Hubert points to, there is the problem that the behavior depends on the vector length. A trap may happen at different times in a loop depending on the vector length, and you may have multiple faults of different kinds happening in the same vector instruction. Another complication is that add-and-conditional-jump instructions need to support fault trapping, because the assembler or compiler may fuse an add instruction and a conditional jump into a single instruction.
A good alternative to trapping floating point errors is NaN propagation. I am currently fighting with the IEEE standard people to convince them to fix the vague rules for NaN payload propagation.
I have discussed the possible solutions for integer overflow in the manual. We need some way of detecting integer overflow, because the C/C++ languages in particular make overflow checking nasty (the gcc compiler can optimize away an overflow check because signed overflow is officially undefined). A global sticky overflow flag is out of the question because it doesn't work well with out-of-order processing and vectorization. I have provided the following experimental solutions in ForwardCom:
- Integer instructions with overflow check. These use vector registers where the even-numbered elements hold the actual calculations while the odd-numbered elements propagate overflow flags (a rough model follows the list).
- Instructions for saturated integer arithmetic
- Add-and-jump-if-overflow instructions
- Mask bits for enabling signed and unsigned overflow trapping
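A rough C model of the first option, the even/odd element scheme (the flag encoding here is only an illustration, not the actual instruction definition):

```c
#include <stdint.h>

/* Model of the overflow-check instructions: even-numbered vector elements
   carry the values, odd-numbered elements accumulate overflow flags, so a
   whole chain of calculations can be checked once at the end. */
#define OVF_SIGNED   1u
#define OVF_UNSIGNED 2u

static void add_with_overflow_check(const uint64_t *a, const uint64_t *b,
                                    uint64_t *r, int pairs)
{
    for (int i = 0; i < pairs; i++) {
        uint64_t x = a[2*i], y = b[2*i];
        uint64_t sum = x + y;
        uint64_t flags = a[2*i + 1] | b[2*i + 1];   /* propagate old flags */

        if (sum < x)                                 /* unsigned wraparound */
            flags |= OVF_UNSIGNED;
        if (((x ^ sum) & (y ^ sum)) >> 63)           /* signed overflow     */
            flags |= OVF_SIGNED;

        r[2*i]     = sum;
        r[2*i + 1] = flags;
    }
}
```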
Re: Forwardcom possible execution pipeline?
I think the best approach is simply to compare the old and new values to detect the wraparound and branch if it happens. Sure, it's going to take a few extra arithmetic operations to check everything (1 or 2 if the operand is an immediate, roughly 4 if it's a signed variable of unknown sign, depending on which operations are available and how many of them are 3-input or ALU+branch). But typically, the kind of code that wants this checking is something like banking code: not very math-heavy, but very high on hard-to-predict loads/stores and branches, so the extra math operations are likely to have no impact at all, since the core will usually be stalled waiting for memory, L1 latency, L1 throughput, or a branch misprediction.
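For example, in C the unsigned case is one compare-and-branch after the add, and the signed case with operands of unknown sign takes a few more operations:

```c
#include <stdint.h>
#include <stdbool.h>

/* Unsigned add: wraparound happened iff the result is smaller than an
   operand -- one compare and one branch after the add. */
static bool add_u64_checked(uint64_t a, uint64_t b, uint64_t *out)
{
    uint64_t sum = a + b;
    *out = sum;
    return sum >= a;              /* false -> overflow, branch to handler */
}

/* Signed add with operands of unknown sign: a handful of extra ops, done
   on unsigned copies so the check itself cannot overflow. Overflow iff
   both operands have the same sign and the result's sign differs. */
static bool add_i64_checked(int64_t a, int64_t b, int64_t *out)
{
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;
    uint64_t sum = ua + ub;
    *out = (int64_t)sum;
    return !((~(ua ^ ub) & (ua ^ sum)) >> 63);   /* false -> overflow */
}
```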
If you're running a high-level programming language, which I think is the kind of thing you'd do in an application where overflow handling is important, then it matters even less, because high-level interpreters have much more costly things going on, such as all object accesses going through hashes and multiple layers of pointers.
For moderate-level programming languages, C# specifically has a checked {} block (everything inside the block is checked for overflow), which lets you check only the small portion of your application where you really need it, minimizing the speed impact (in fact, you're unlikely to see any difference unless you loop over large amounts of data inside a checked {} block).
Re: Forwardcom possible execution pipeline?
The C language is particularly bad for overflow checking. It's not safe to detect signed integer overflow after it has occurred, because the compiler is allowed to optimize the check away. I've seen a very nasty bug because I checked for overflow in this way. See https://codereview.stackexchange.com/qu ... t-overflow
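For illustration, the unsafe pattern looks like the first function below, which gcc may legally delete because it relies on the overflow having already happened; the check has to be written so that the overflow never actually occurs in C (a sketch only):

```c
#include <stdbool.h>
#include <limits.h>

/* BROKEN: signed overflow is undefined behavior, so the compiler is
   allowed to assume "a + b" did not wrap and delete this check. */
static bool add_checked_broken(int a, int b, int *out)
{
    int sum = a + b;
    if ((b > 0 && sum < a) || (b < 0 && sum > a))
        return false;             /* "detected" overflow -- may be removed */
    *out = sum;
    return true;
}

/* Safe: test the ranges before the addition, so no UB ever occurs.
   gcc/clang also provide __builtin_add_overflow(a, b, out), which maps
   to the hardware overflow flag where available. */
static bool add_checked_safe(int a, int b, int *out)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return false;
    *out = a + b;
    return true;
}
```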
It would be easier to check for overflow in high level languages if there is hardware support behind it. A software implementation can be quite voluminous if you have to check every step in a long sequence of calculations.
I agree that traps are costly and it is better to avoid them. You have made me think about whether it is possible to avoid speculative execution altogether, but more about that later.
Re: Forwardcom possible execution pipeline?
The fused add-and-conditional-jump instructions you already have are a pretty heavyweight way to essentially build a trapping instruction, no? All you have to do is jump to a trap routine.
Re: Forwardcom possible execution pipeline?
Yes, but not for multiplication and division. I don't want ALU operations with different latencies combined with a conditional jump, because it would complicate the pipeline.
Re: Forwardcom possible execution pipeline?
For multiplication, if you want to detect a wrap, there's really no other way than doing the multiplication at full width and checking the top part of the result, regardless of whether it's done in software or hardware. It's going to take a long time in either case (electrical delay), and the only thing that changes depending on which instruction does it is which intermediate results are "exposed" in a register, which can save at most a couple of register renaming slots and a couple of physical register file port cycles. And if you use pipeline clustering, these intermediate values can be sort of hidden even in software.
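In C with a 128-bit type (a gcc/clang extension), the software version of that check looks like this; a hardware implementation does the same widening internally, it just doesn't have to expose the high half architecturally:

```c
#include <stdint.h>
#include <stdbool.h>

/* Software detection of 64x64 signed multiply wraparound: compute the
   full 128-bit product and check that the high half is just the sign
   extension of the low half. */
static bool mul_i64_checked(int64_t a, int64_t b, int64_t *out)
{
    __int128 wide = (__int128)a * (__int128)b;   /* gcc/clang extension */
    *out = (int64_t)wide;
    return wide == (__int128)(int64_t)wide;      /* false -> overflowed */
}
```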
Division is of secondary importance, to be honest. ARM completely lacked hardware division for the longest time and did fine.