forwardcom forum

HubertLamontagne

Yeah, this is std::vector::push_back(), right? This is at least 5 micro-ops even in the very best case:
- Load [array size, current allocation size, pointer to data]
- (second micro-op from the 24-byte unaligned vector load)
- Increment array size and check that it's lower than or equal to the allocated size,...

HubertLamontagne

I've played around with how to reduce the stalls from not knowing if instructions are going to run or not... and I've come up with "on the fly conditionalization"... for instance, the code:

int32 r0 = [r1]
int32 r2 = r3 + r4

You'd want r2 = r3 + r4 to run early, but you can't because th...

HubertLamontagne

The compiler can't always reorder instructions in order of priority... for instance:

int32 r0 = [r1] // high priority
int64 r4 += r5 // high priority
int32 r0 += 9
int32 r0 *= 7
int32 r0 >>= 2
int32 r0 += 1
int32 r0 |= 32
int32 r0 >>= 1
int32 r0 += 1
int32 [r1] = r0 // high priority
int32 r2 = [r3+r...

HubertLamontagne

High/Low Priority Hint

One idea, and I don't know if it makes any sense, would be to add a hint to instructions to tell whether they should run with high priority or low priority. High priority would be used for instructions whose result would influence conditional jumps and memory addressing, and low pr...

HubertLamontagne

Clearly it's best if a % b equals a - (a/b)*b in all cases, including the error ones (/0, and -0x80000000/-1 in signed 32 bits). That way the CPU doesn't need a modulo instruction and can replace it with a - d*b (where d = a/b) in all cases. As for n/0 returning 0 or int min/max, there's something i...
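To make the identity concrete, here is a small Python model of signed 32-bit division and remainder. It is only a sketch: the names `wrap32`, `div32` and `rem32` are made up for illustration, and the choice of what n/0 returns is a parameter because the thread leaves that open. The remainder is always computed as a - d*b, so the error cases fall out without any special handling.

```python
INT32_MIN = -0x80000000
INT32_MAX = 0x7FFFFFFF
MASK32 = 0xFFFFFFFF


def wrap32(x):
    """Wrap a Python int to signed 32-bit two's complement."""
    x &= MASK32
    return x - 0x100000000 if x & 0x80000000 else x


def div32(a, b, div_by_zero_result=0):
    """Truncating signed 32-bit division with defined results for the error cases.

    div_by_zero_result is an assumed policy knob (0, INT32_MIN, INT32_MAX, ...).
    """
    if b == 0:
        return div_by_zero_result
    q = abs(a) // abs(b)          # truncate toward zero, like hardware dividers
    if (a < 0) != (b < 0):
        q = -q
    return wrap32(q)              # INT32_MIN / -1 overflows and wraps to INT32_MIN


def rem32(a, b, div_by_zero_result=0):
    """Remainder defined purely as a - (a/b)*b, wrapped to 32 bits."""
    d = div32(a, b, div_by_zero_result)
    return wrap32(a - d * b)
```

With this definition `rem32(a, 0)` is always `a` regardless of which value n/0 returns (the quotient gets multiplied by zero), and `rem32(INT32_MIN, -1)` comes out as 0 because the overflowed quotient wraps consistently in the multiply and subtract.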

HubertLamontagne

1. Speculative memory read: I'm not totally sure that this is a net win. I think there are arguments for and against:
- Speculative memory reads can be partly emulated for vectorized code on paged memory architectures. As long as the first byte read in a memory page happens for real, then we know that a...

HubertLamontagne

Casey Muratori recently had a stream about what seems to be a design mistake in the very new RISC-V SIMD proposal: https://twitter.com/cmuratori/status/1538622391307251713 Basically, it can only use v0 (the first vector register) for masking, and it only uses the first bits of v0 as mask where each ...

HubertLamontagne

There's presumably something nice to be done with succinct trees, as per the paper you linked. I like how they avoid having strings of pointers (which are a massive problem, speed-wise). But this doesn't imply that you should do bit addressing: instead, it's probably optimal to store the tree-structu...

HubertLamontagne

Some STM32 microcontrollers have bit-banded memory aliased regions for that purpose: https://micromouseonline.com/2010/07/14/bit-banding-in-the-stm32/ For instance, the bits of the memory byte 0x20000000 also appear as eight single-bit locations, one per 32-bit word, at 0x22000000 0x22000004 0x22000008 0x2200000C 0x22000010 0x22000...
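For reference, here is the address arithmetic as I understand it from the Cortex-M3 bit-banding scheme: each bit of the bit-band region gets its own 32-bit word in the alias region, so consecutive bits sit 4 bytes apart and each byte covers 32 bytes of alias space. The function name `bitband_alias` and the 1 MB region size check are assumptions for this sketch, not taken from the linked article.

```python
SRAM_BITBAND_BASE = 0x20000000   # start of the SRAM bit-band region
SRAM_ALIAS_BASE = 0x22000000     # start of the matching alias region
BITBAND_REGION_SIZE = 0x100000   # assumed 1 MB bit-band region


def bitband_alias(byte_addr, bit):
    """Return the alias word address for one bit of a byte in the SRAM bit-band region."""
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in 0..7")
    byte_offset = byte_addr - SRAM_BITBAND_BASE
    if not 0 <= byte_offset < BITBAND_REGION_SIZE:
        raise ValueError("address is outside the bit-band region")
    # 8 bits per byte, one 4-byte alias word per bit: 32 alias bytes per data byte.
    return SRAM_ALIAS_BASE + byte_offset * 32 + bit * 4
```

So `bitband_alias(0x20000000, 1)` gives 0x22000004, and the next byte's bit 0 starts at 0x22000020. Writing 0 or 1 to the alias word sets or clears just that bit, which is the whole point of the feature.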

HubertLamontagne

I gotta admit, I haven't seen other stuff like that yet. Interesting goal, tracking the allowed range of all memory addresses and if they're heap/stack/global data, and even object encapsulation. Not quite sure what to think of it, it reminds me of 16-bit x86's protected mode FAR pointers (and the h...

HubertLamontagne

I imagine these usages could be done with an OS call but wouldn't require a full interrupt. So you'd have an intermediary performance level (privilege level change, but no full pipeline-flush-and-context-switch from an interrupt). Which I guess still makes sense since longjmp and triggering exceptio...

HubertLamontagne

Looking at the memory layout of various programs using Microsoft's VMMap tool ( https://docs.microsoft.com/en-us/sysinternals/downloads/vmmap ), Windows does manage to keep some fairly large blocks of memory contiguous, often 10+ MB large. Very few memory blocks are just a single 4 KB page, typical blo...

HubertLamontagne

One important part of how the PC became common was the standardized timers, DMA controllers and interrupt controllers based on the chips IBM used in early PCs:
- Intel 8237 DMA controller (x2 in PC AT)
- Intel 8259 IRQ controller (x2 in PC AT)
- Intel 8253/8254 timer
On modern PCs these were supplem...

HubertLamontagne

I imagine that Intel's implementation is probably just a whole bunch of 8:1 multiplexers (and high-fanout buffers), and I think that part of the idea is that the compiler could fold any number of operations on the same 3 inputs by changing the 8-entry lookup table.
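A quick sketch of the lookup-table idea in Python. The bit ordering (first input selects the high index bit) follows my understanding of Intel's VPTERNLOG instructions, which is presumably what is meant here; the function names `ternary_logic` and `make_table` are made up for this example.

```python
def ternary_logic(imm8, a, b, c, width=64):
    """Apply an arbitrary 3-input boolean function, given as an 8-entry table, bitwise."""
    mask = (1 << width) - 1
    result = 0
    for index in range(8):
        if not (imm8 >> index) & 1:
            continue  # this input combination maps to 0
        # Select the bit positions where (a, b, c) match this table index.
        ta = a if index & 4 else (~a & mask)
        tb = b if index & 2 else (~b & mask)
        tc = c if index & 1 else (~c & mask)
        result |= ta & tb & tc
    return result & mask


def make_table(fn):
    """What a compiler would do: fold any expression of 3 inputs into one 8-bit table."""
    imm8 = 0
    for index in range(8):
        a, b, c = (index >> 2) & 1, (index >> 1) & 1, index & 1
        if fn(a, b, c) & 1:
            imm8 |= 1 << index
    return imm8
```

For example, `make_table(lambda a, b, c: a ^ b ^ c)` gives 0x96 and `make_table(lambda a, b, c: (a & b) | (~a & c))` gives 0xCA (a bitwise select), so a chain of several boolean ops on the same three registers collapses into a single instruction with a different immediate.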

HubertLamontagne

Cuminies wrote (2021-06-26, 19:55:26): multiple versions of forwardcom cpu that have compatible base instruction sets and different extended instruction sets to satisfy hubert an agner both?

ForwardCom is 100% Agner's baby, I wouldn't dare attempt to fork it :P

HubertLamontagne

One extra question here... Not to be too inquisitive, but I was wondering what your exact motivation is for doing LOAD+MATH in a single instruction instead of a LOAD and MATH instruction sequence on ForwardCom. Is it:
- Is the goal to build in-order CPU pipelines like the 1st generation Intel A...

HubertLamontagne

Can you make it a flag in the cpu? Like a mode? 32/64 bit mode. When reading some instructions it can convert them before anything else happens. I don't know anything about CPUs.

Normally, 32/64/32-in-64 bits for memory addresses/pointers is a CPU flag, yes, as it affects a ton of other stuff (page ...

HubertLamontagne

It's an interesting idea, for sure. Though I think it would require a 3rd cache - I don't think it can share the instruction cache:
- The instruction cache doesn't do word addressing. It loads whole 128-bit or 256-bit aligned chunks, which are then queued into the so-called "prefetch queue" (...

HubertLamontagne

I imagine that ARM decided on specifically making 4 operand operations into a fusable two-instruction sequence in order to avoid introducing 64bit opcodes in ARM64 when everything else is 32bit only, and to have an escape chute in case nobody used SVE (or if they have so many register file write por...

HubertLamontagne

Arm v9 (basically adding Scalable Vector Extensions to the main instruction set) has an interesting new instruction called MOVPRFX: https://developer.arm.com/documentation/ddi0602/latest/SVE-Instructions/MOVPRFX--unpredicated---Move-prefix--unpredicated--?lang=en The problem that they had is that wi...

HubertLamontagne

Arm64 has a fairly extensive solution to this, so it might make sense to look it up. One solution is to use 64 bit instructions to do 32 bit math and let the top 32 bits be junk, except for specific cases (generally operations that propagate bits rightwards): - Right shift and arithmetic right shift...
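The "junk in the top 32 bits" trick works because add and multiply only propagate information leftwards (carries go up, never down), so the low 32 bits of the result don't depend on the high 32 bits of the inputs. Right shifts break that. Here is a small Python model of it; the helper names are made up and this models plain 64-bit registers, not any particular Arm64 instruction.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def with_junk(value32, junk):
    """Build a 64-bit register whose low half is value32 and whose high half is garbage."""
    return ((junk & MASK32) << 32) | (value32 & MASK32)


def low32(x):
    """The part of the register that a 32-bit program actually cares about."""
    return x & MASK32


def add64(a, b):
    return (a + b) & MASK64


def mul64(a, b):
    return (a * b) & MASK64


def shr64(a, n):
    """Logical right shift on the full 64-bit register."""
    return (a & MASK64) >> n
```

With `x = with_junk(0x10, 0xDEADBEEF)`, `low32(add64(x, x))` is still 0x20 and `low32(mul64(x, x))` is still 0x100, but `low32(shr64(x, 4))` is 0xF0000001 instead of 1, because junk bits got shifted down into the low half. That is the kind of operation that needs a real 32-bit variant or a zero/sign extension first.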

HubertLamontagne

Forcing store operations to never be speculative simplifies a lot of these tasks (it kinda turns the cpu into a partially in-order CPU?) but I'm not sure I can come up with a kind of architecture that could do this without causing huge stalls... Assuming no other core is dependent on the data of a ...

HubertLamontagne

In most cases, yes. But not in all cases... You can always pass unaligned pointers to other functions... Though I guess that kind of case can always be handled with a Memory Aliasing Fault, and letting the OS do the split memory access for you and cobbling together the result... Slow, but should hap...

HubertLamontagne

Perhaps the way unaligned loads/stores could be handled is through an alignment predictor... All memory loads/stores are initially predicted to be aligned, and if a memory operation ends up being unaligned, it triggers a branch prediction fail and the load/store operation is recorded in the alignmen...
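A toy model of that predictor idea, just to pin down the behaviour. Everything here is an assumption for illustration: the class name, using an unbounded set keyed by instruction address instead of a finite table, treating "unaligned" as "address not a multiple of the access size" (a real core would more likely care about crossing a cache line), and the three outcome labels.

```python
class AlignmentPredictor:
    """Predict every load/store as aligned until it is seen to be unaligned once."""

    def __init__(self):
        self.unaligned_pcs = set()   # instruction addresses recorded as unaligned
        self.mispredictions = 0

    def access(self, pc, address, size):
        """Return how the pipeline would handle this memory operation.

        "fast"   - predicted aligned, issued as a single access
        "replay" - predicted aligned but was not: flush like a branch mispredict
        "split"  - predicted unaligned, issued up front as two accesses
        """
        predicted_unaligned = pc in self.unaligned_pcs
        actually_unaligned = address % size != 0
        if actually_unaligned and not predicted_unaligned:
            self.mispredictions += 1
            self.unaligned_pcs.add(pc)   # remember this instruction for next time
            return "replay"
        return "split" if predicted_unaligned else "fast"
```

The first unaligned access from a given instruction pays the replay cost once; after that the same instruction is issued as a split access. This sketch never un-learns an entry, so an instruction that goes back to aligned addresses keeps paying for the split.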

HubertLamontagne

Looking at the problem of how to build fast CPU cores, I've come to the conclusion that the key component that differentiates the big boys is having an L1 data cache that supports rolling back operations that haven't graduated/committed. The simpler CPUs that have this kind of L1 cache (such as the ...


Search found 80 matches

Re: Fused push with bounds check

Re: Proposals for next version

Re: Proposals for next version

Re: Proposals for next version

Re: Integer division by zero

Re: Proposals for next version

Casey Muratori stream about Vector lane masking

Re: Bit addressing

Re: Bit addressing

Re: Memory safety enforcement using CHERI

Re: Nonlocal control flow

Re: How to avoid memory fragmentation

Timer, DMA controller, Interrupt controller built into the architecture?

Re: Universal boolean instruction

Re: input/output instructions

Re: Macro-op fusion as an intentional instruction set design choice

Re: Default integer size 32 or 64 bits?

Re: Load From Const Array Instruction

Re: Macro-op fusion as an intentional instruction set design choice

Macro-op fusion as an intentional instruction set design choice

Re: Default integer size 32 or 64 bits?

Re: Rollbackable L1 Data Cache Design?

Re: Rollbackable L1 Data Cache Design?

Re: Rollbackable L1 Data Cache Design?

Rollbackable L1 Data Cache Design?