Yeah, this is std::vector::push_back(), right? That's at least 5 micro-ops even in the very best case:
- Load [array size, current allocation size, pointer to data]
- (a second micro-op from the 24-byte, possibly unaligned vector load)
- Increment array size and check that it's less than or equal to the allocated size, else jump to an allocation routine
- Write [new array size]
- Write [data] at [pointer + old array size]
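To make the 24-byte figure concrete, here's a rough sketch of the control block that wide load would fetch. The field layout is my own guess; libstdc++ and libc++ actually store three pointers (begin, end, end-of-capacity) rather than explicit size/capacity fields, but either way it's 24 bytes of contiguous header on a 64-bit target:

    #include <cstddef>

    // Hypothetical vector control block: three 8-byte fields, 24 bytes total
    // on a 64-bit target, fetchable with one wide (possibly unaligned) load.
    struct vec_header {
        std::size_t size;      // number of elements in use
        std::size_t capacity;  // current allocation size, in elements
        int*        data;      // pointer to the backing storage
    };
    static_assert(sizeof(vec_header) == 24, "assumes a 64-bit target");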
Compare this to 7 micro-ops in a more conventional architecture:
- Load [array size]
- Load [current allocation size]
- Load [data pointer]
- Increment array size
- Compare and jump to allocation routine if larger
- Write [new array size]
- Write [data] at [pointer + old array size]
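As a sanity check on that breakdown, here's a minimal hand-written sketch of the fast path (my own toy code, not any particular standard library's); each statement in push_back maps roughly onto one of the micro-ops above, with the slow path pushed into grow():

    #include <cstddef>

    // Toy vector of ints, just enough to show the push_back fast path.
    struct tiny_vec {
        std::size_t size = 0;
        std::size_t capacity = 0;
        int*        data = nullptr;

        void push_back(int value) {
            std::size_t s = size;      // load [array size]
            std::size_t c = capacity;  // load [current allocation size]
            int*        d = data;      // load [data pointer]
            std::size_t n = s + 1;     // increment array size
            if (n > c) {               // compare, jump to allocation routine if larger
                grow(value);
                return;
            }
            size = n;                  // write [new array size]
            d[s] = value;              // write [data] at [pointer + old array size]
        }

        void grow(int value) {         // slow path: reallocate, then append
            std::size_t new_cap = capacity ? capacity * 2 : 4;
            int* new_data = new int[new_cap];
            for (std::size_t i = 0; i < size; ++i) new_data[i] = data[i];
            delete[] data;
            data = new_data;
            capacity = new_cap;
            data[size++] = value;
        }
    };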
Here's an example of what a C++ compiler outputs for this:
https://godbolt.org/z/vYhMYbeWE
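I can't vouch for the exact source behind that link, but a typical snippet to reproduce this kind of codegen on Compiler Explorer (compiled with -O2 or similar) is just:

    #include <vector>

    // The fast path loads the header, bumps the size, and stores the element;
    // the slow path calls out to the reallocation routine.
    void append(std::vector<int>& v, int x) {
        v.push_back(x);
    }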
Essentially your savings would come from 2 sources:
- Squeezing the 3 initial loads into 2 loads: that's a lot of lifting to save 1 contiguous memory load, in my opinion. I think conventional architectures have better schemes for dealing with this, such as adding more data cache load ports, data cache banking (so that the loads hit different banks), and contiguous multi-register load instructions such as ARM64's Load Pair (LDP) instruction (see the first example after this list).
- Combining the array size increment with the compare-and-jump: that's actually not crazy. Incrementing a register by a small immediate, comparing it against another register, and conditionally jumping is a common sequence of operations; you see it all the time at loop ends (second example after this list). It would still be a fairly large instruction, but it has several properties that make it OK for typical pipelines: it only reads 2 registers, it only writes 1 register, it only does 1 conditional jump and always to the same address, and it doesn't load from or store to memory or cause traps. It's also OK for compilers: they can simply fuse the adjacent ADD and COMPARE-AND-BRANCH-IF-LARGER.
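On the first point, merging adjacent loads with LDP needs no source-level changes; compilers already do it. A toy example of my own (not from the link above): on AArch64 at -O2, both GCC and Clang will typically fetch the two fields below with a single LDP:

    #include <cstdint>

    struct pair64 {
        std::int64_t a;
        std::int64_t b;
    };

    // Two adjacent 8-byte loads; an AArch64 compiler can usually merge them
    // into one Load Pair (LDP) instruction.
    std::int64_t sum(const pair64* p) {
        return p->a + p->b;
    }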
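And on the second point, the shape such a fused instruction would cover is just the back edge of an ordinary counted loop, for example:

    #include <cstddef>

    // The loop back edge is exactly the sequence described above: add a small
    // immediate to i, compare against n, conditionally branch back to the top.
    long sum_first_n(const long* values, std::size_t n) {
        long total = 0;
        for (std::size_t i = 0; i < n; ++i) {
            total += values[i];
        }
        return total;
    }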