shifted truth_tab3 (vpternlog)

Post by **James** » 2025-01-09, 7:50:44

Motivation:
Sequences like "(x & 0x4) | ((x & 0x2) << 2) | ((x & 0x80) >> 7)" occur frequently in high-performance code that permutes bits, e.g. rearranging masks, encoding x86 instructions, allocating from a bitset with a permutation, FFT without bitrev, etc.

Idea:
truth_tab's immediate can be extended from 8 bits to 32 bits to encode operand shifts.
Since the hardware can use the shift counts as offsets into the bitstrings and avoid performing the shifts at all, I expect this to fit in a single cycle and save up to 3 shift instructions.

Variations on this theme to consider:
On 64-bit operands, shift counts require 6 bits; we can use the full 8 bits if we also encode the shift kind (shr, sar, rol, etc.?) in there.
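
To make the proposal concrete, here is a rough software model in Python. The immediate layout is purely my assumption for illustration, not something specified above: bits 7..0 hold the usual vpternlog-style truth table, and each operand gets one 8-bit field with a 6-bit count in the low bits and a 2-bit shift kind on top. The names `ternlog`, `apply_shift`, `field`, `encode` and `shifted_ternlog` are hypothetical helpers.

```python
MASK64 = (1 << 64) - 1

# Hypothetical 2-bit shift-kind codes (an assumption, not part of the proposal).
SHL, SHR, SAR, ROL = 0, 1, 2, 3


def ternlog(a, b, c, table):
    """Bitwise 3-input function; table bit (a<<2 | b<<1 | c) gives the output bit."""
    result = 0
    for idx in range(8):
        if (table >> idx) & 1:
            ta = a if idx & 4 else ~a
            tb = b if idx & 2 else ~b
            tc = c if idx & 1 else ~c
            result |= ta & tb & tc
    return result & MASK64


def apply_shift(value, fld):
    """Apply one 8-bit shift field to a 64-bit value."""
    count = fld & 0x3F
    kind = fld >> 6
    if kind == SHL:
        return (value << count) & MASK64
    if kind == SHR:
        return value >> count
    if kind == SAR:
        signed = value - (1 << 64) if value >> 63 else value
        return (signed >> count) & MASK64
    return ((value << count) | (value >> (64 - count))) & MASK64  # ROL


def field(kind, count):
    """Pack a shift kind and a 6-bit count into one 8-bit field."""
    return (kind << 6) | (count & 0x3F)


def encode(table, fa, fb, fc):
    """Pack the truth table and three shift fields into a 32-bit immediate."""
    return table | (fa << 8) | (fb << 16) | (fc << 24)


def shifted_ternlog(a, b, c, imm32):
    """Shift each operand as encoded in imm32, then apply the truth table."""
    a = apply_shift(a & MASK64, (imm32 >> 8) & 0xFF)
    b = apply_shift(b & MASK64, (imm32 >> 16) & 0xFF)
    c = apply_shift(c & MASK64, (imm32 >> 24) & 0xFF)
    return ternlog(a, b, c, imm32 & 0xFF)
```

With table 0xCA (a ? b : c), `shifted_ternlog(mask, x, x, encode(0xCA, field(SHL, 0), field(SHL, 2), field(SHR, 7)))` computes `(mask & (x << 2)) | (~mask & (x >> 7))` in one step. With `mask = 0x8`, bits 3 and 0 of that result are the two shifted terms of the motivating expression; the constant masks would still need their own operands.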

Post by **James** » 2025-01-09, 15:54:29

On the topic of GF(2) operations, we're missing two that I use frequently. They are relevant in cryptography and hashing but also have some other creative applications:

clmul (carryless multiply):
eg1. Prefix xor scan = clmul by ~0, a popular primitive in APL. simdjson uses it for branchless lexing, where it converts quote pairs into quote masks. The point is that the parser only branches on the 'ctlz's of the lexeme bitstring, and even gets the lexeme length for free for further branchless parsing (numbers, name hashing). There, the prefix xor masks out quoted lexemes.
eg2. clmul x x interleaves the bits of x with zeros: 0000000000000001111111111111 –> 0001010101010101010101010101
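
Both examples can be checked with a small bit-serial model in Python. This is only a sketch of the arithmetic, not how hardware would do it; `clmul`, `prefix_xor` and `spread_bits` are names I made up, and the product is truncated to the low 64 bits as an assumption.

```python
MASK64 = (1 << 64) - 1


def clmul(a, b):
    """Carry-less multiply: XOR shifted copies of a, one per set bit of b (low 64 bits)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result & MASK64


def prefix_xor(x):
    """Bit i of the result is the XOR of bits 0..i of x."""
    return clmul(x, MASK64)


def spread_bits(x):
    """Cross terms cancel in GF(2), so bit i of x lands at bit 2*i."""
    return clmul(x, x)
```

For quote bits at positions 2 and 5, `prefix_xor(0b100100)` gives `0b011100`: the mask covers the opening quote and the quoted content, and stops before the closing quote.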

gf2p8affineqb and inverse (matrix multiplication of an 8x8 bit matrix by an 8-bit vector in GF(2)):
The king of byte granularity bit manipulations: matrices can rotate, reverse, permute, shift, swizzle...
eg. It can implement the epi8 shift functions that x86 is missing (like srli_epi8).
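
A per-byte model of this in Python, following my reading of Intel's documented semantics for gf2p8affineqb (output bit i is the parity of matrix byte 7-i ANDed with the input byte, XORed with bit i of the immediate). The helper names `gf2p8affine_byte`, `shift_right_matrix` and `BIT_REVERSE` are hypothetical.

```python
def gf2p8affine_byte(x, matrix, imm8=0):
    """Apply a 64-bit 8x8 bit matrix to one byte, then XOR with imm8."""
    out = 0
    for i in range(8):
        row = (matrix >> (8 * (7 - i))) & 0xFF   # row for output bit i
        parity = bin(row & x).count("1") & 1     # GF(2) dot product
        out |= parity << i
    return out ^ imm8


def shift_right_matrix(n):
    """Matrix for a logical right shift of each byte by n (0 <= n <= 8)."""
    m = 0
    for i in range(8 - n):
        # Output bit i takes input bit i + n.
        m |= (1 << (i + n)) << (8 * (7 - i))
    return m


# Output bit i takes input bit 7 - i, so this matrix reverses the bits of a byte.
BIT_REVERSE = 0x8040201008040201
```

`shift_right_matrix(0)` comes out as the identity matrix 0x0102040810204080, and applying `shift_right_matrix(n)` to every byte gives the per-byte logical right shift that srli_epi8 would provide.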

It's not clear how relevant some of the other Intel cryptography intrinsics are, but I believe GF(2) operations should be primitives, especially clmul.

forwardcom forum
