Fushed push with bounds check
Moderator: agner
Fushed push with bounds check
I don't know if it makes sense at the hardware level, but the common dynamic array push operation wants to do a bounds check either immediately before or after the write.
Re: Fushed push with bounds check
The addressing mode with bounds check works only if the array size is fixed and known at compile time.
The push instruction will signal an error if writing to inaccessible memory.
The push instruction is already quite complex at the hardware level. Adding still more complexity might be critical. It would also need an extra operand for the stack limit.
I think it is easier to just insert a compare instruction before or after the push. The compare instruction can jump to an error branch which is more efficient than an error trap.
The push instruction will signal an error if writing to inaccessible memory.
The push instruction is already quite complex at the hardware level. Adding still more complexity might be critical. It would also need an extra operand for the stack limit.
I think it is easier to just insert a compare instruction before or after the push. The compare instruction can jump to an error branch which is more efficient than an error trap.
-
- Posts: 80
- Joined: 2017-11-17, 21:39:51
Re: Fushed push with bounds check
Yeah, this is std::vector::push_back(), right? This is at least 5 micro-ops even in the very best case:
- Load [array size, current allocation size, pointer to data]
- (second micro-op from 24byte unaligned vector load)
- Increment array size and check that it's lower or equal to the allocated size, else jump to an allocation routine
- Write [new array size]
- Write [data] at [pointer + old array size]
Compare this to 7 micro-ops in a more conventional architecture:
- Load [array size]
- Load [current allocation size]
- Load [data pointer]
- Increment array size
- Compare and jump to allocation routine if larger
- Write [new array size]
- Write [data] at [pointer + old array size]
Here's example of what a C++ compiler outputs for this: https://godbolt.org/z/vYhMYbeWE
Essentially your savings would come from 2 sources:
- Squeezing the 3 initial loads into 2 loads: That's a lot of lifting to save 1 contiguous memory load in my opinion. I think conventional architectures have better schemes to deal with this, such as adding more data cache load ports, data cache banking (so that the loads come from different banks), and contiguous multi-register load instructions such as ARM64's Load Pair instruction (LDP).
- Combining the array size increment and the compare-and-jump together: That's actually not crazy. Incrementing a register by a small immediate and comparing to another register and jumping is a common sequence of operations, you see it often on loop ends. It would still be a kinda large instruction but it has many properties that make it OK for the typical pipelines: it only reads 2 registers, it only writes 1 register, it only does 1 conditional jump and always to the same address, it doesn't load or save from memory or cause traps. It's also OK for compilers (they can simply combine the contiguous ADD and the COMPARE-AND-BRANCH-IF-LARGER).
- Load [array size, current allocation size, pointer to data]
- (second micro-op from 24byte unaligned vector load)
- Increment array size and check that it's lower or equal to the allocated size, else jump to an allocation routine
- Write [new array size]
- Write [data] at [pointer + old array size]
Compare this to 7 micro-ops in a more conventional architecture:
- Load [array size]
- Load [current allocation size]
- Load [data pointer]
- Increment array size
- Compare and jump to allocation routine if larger
- Write [new array size]
- Write [data] at [pointer + old array size]
Here's example of what a C++ compiler outputs for this: https://godbolt.org/z/vYhMYbeWE
Essentially your savings would come from 2 sources:
- Squeezing the 3 initial loads into 2 loads: That's a lot of lifting to save 1 contiguous memory load in my opinion. I think conventional architectures have better schemes to deal with this, such as adding more data cache load ports, data cache banking (so that the loads come from different banks), and contiguous multi-register load instructions such as ARM64's Load Pair instruction (LDP).
- Combining the array size increment and the compare-and-jump together: That's actually not crazy. Incrementing a register by a small immediate and comparing to another register and jumping is a common sequence of operations, you see it often on loop ends. It would still be a kinda large instruction but it has many properties that make it OK for the typical pipelines: it only reads 2 registers, it only writes 1 register, it only does 1 conditional jump and always to the same address, it doesn't load or save from memory or cause traps. It's also OK for compilers (they can simply combine the contiguous ADD and the COMPARE-AND-BRANCH-IF-LARGER).
Re: Fushed push with bounds check
Nice
It would also allow pushing vector registers without needing to carry around their size separately
It would also allow pushing vector registers without needing to carry around their size separately
Re: Fused push with bounds check
Is pop with length similarly feasible in hardware?
It would be more convenient than requesting hardware vector lengths and setting up backwards indexing
pop v r length vecptr
len := max hardwareveclen length
v := vecptr[0 .. len - 1]
r := vecptr + len
It would be more convenient than requesting hardware vector lengths and setting up backwards indexing
pop v r length vecptr
len := max hardwareveclen length
v := vecptr[0 .. len - 1]
r := vecptr + len
Re: Fushed push with bounds check
A vector register in ForwardCom always contains a length and a sequence of data with that length.
HubertLamontagne wrote:
The main stack is intended for temporary storage in a function. push/pop will save and restore a register. If it is a vector register, you need to store both the length and the contents in order to restore an identical vector. The push instruction takes care of this. The memory image of a vector saved with the push instruction is implementation specific. The hardware is allowed to store the length and data in any useful way as long as it can be recovered with a pop instruction on the same machine.
What you are requesting, as I understand it, is to make a dynamic array in memory. You can do this by making a second stack somewhere in memory. The push and pop instructions will work for general purpose registers here, but not for vector registers. You should use normal store and move instructions for putting data from vector registers into your dynamic array if you want to store the value but not the length of the vector register. The vector loop functionality may be handy here.
ForwardCom does not like microcode. Microcode on contemporary processors is quite slow, and my intention is to avoid it completely. The push and pop instructions are currently implemented with a state machine in the decoder rather than a microcode ROM. The complexity of instructions is therefore limited. The general rule is that instructions should have no more complexity than what fits into the standard pipeline design. I have allowed the push and pop instructions to be an exception to this rule because they are so important and widely used.
HubertLamontagne wrote:
Not really. The stack is not equivalent to std::vector, but you can make multiple stacks and use one of them as a std::vector. This requires a start point, a limit, and a stack pointer.this is std::vector::push_back(), right?
The main stack is intended for temporary storage in a function. push/pop will save and restore a register. If it is a vector register, you need to store both the length and the contents in order to restore an identical vector. The push instruction takes care of this. The memory image of a vector saved with the push instruction is implementation specific. The hardware is allowed to store the length and data in any useful way as long as it can be recovered with a pop instruction on the same machine.
What you are requesting, as I understand it, is to make a dynamic array in memory. You can do this by making a second stack somewhere in memory. The push and pop instructions will work for general purpose registers here, but not for vector registers. You should use normal store and move instructions for putting data from vector registers into your dynamic array if you want to store the value but not the length of the vector register. The vector loop functionality may be handy here.
ForwardCom does not like microcode. Microcode on contemporary processors is quite slow, and my intention is to avoid it completely. The push and pop instructions are currently implemented with a state machine in the decoder rather than a microcode ROM. The complexity of instructions is therefore limited. The general rule is that instructions should have no more complexity than what fits into the standard pipeline design. I have allowed the push and pop instructions to be an exception to this rule because they are so important and widely used.