shifted truth_tab3 (vpternlog)
Posted: 2025-01-09, 7:50:44
Motivation:
Sequences like "(x & 0x4) | ((x & 0x2) << 2) | ((x & 0x80) >> 7)" occur frequently in high-performance code that permutes bits, e.g. rearranging masks, encoding x86 instructions, allocating from a bitset with a permutation, FFT without bitrev, etc.
Idea:
truth_tab's immediate can be extended from 8 bits to 32 bits to encode operand shifts.
Since the hardware can use the shift counts as offsets into the bitstrings and avoid performing the shifts at all, I expect this to fit in a single cycle and save up to 3 shift instructions.
Considering variations of this theme:
On 64-bit operands, shift counts require 6 bits, so we can use the full 8 bits of each field if we also encode the shift kind (shr, sar, rol, etc.?) in there.
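To make the encoding concrete, here is a minimal software model in C. The field layout is purely my assumption for illustration: bits 0-7 hold the usual truth table, and each of the three following bytes holds a 6-bit count plus a 2-bit shift kind for one operand. The names apply_shift, ternlog, shifted_ternlog and the OP_* constants are hypothetical, not part of any existing ISA or library.

```c
#include <stdint.h>

/* Hypothetical shift kinds, stored in the top 2 bits of each shift byte. */
enum { OP_SHL = 0, OP_SHR = 1, OP_SAR = 2, OP_ROL = 3 };

/* Apply one 8-bit shift field: low 6 bits = count, high 2 bits = kind. */
static uint64_t apply_shift(uint64_t v, uint8_t field) {
    unsigned n = field & 63;
    switch (field >> 6) {
    case OP_SHL: return v << n;
    case OP_SHR: return v >> n;
    case OP_SAR: /* arithmetic right shift, written without signed shifts */
        return (v >> n) | ((n && (v >> 63)) ? (~UINT64_C(0) << (64 - n)) : 0);
    default:     /* OP_ROL */
        return n ? (v << n) | (v >> (64 - n)) : v;
    }
}

/* Plain 3-input truth table, same indexing as vpternlog:
   bit ((a << 2) | (b << 1) | c) of tt gives the result bit. */
static uint64_t ternlog(uint64_t a, uint64_t b, uint64_t c, uint8_t tt) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        if ((tt >> i) & 1) {
            uint64_t ma = (i & 4) ? a : ~a;
            uint64_t mb = (i & 2) ? b : ~b;
            uint64_t mc = (i & 1) ? c : ~c;
            r |= ma & mb & mc;
        }
    }
    return r;
}

/* Assumed imm32 layout: [31:24] shift for a, [23:16] shift for b,
   [15:8] shift for c, [7:0] truth table. */
static uint64_t shifted_ternlog(uint64_t a, uint64_t b, uint64_t c, uint32_t imm) {
    a = apply_shift(a, (uint8_t)(imm >> 24));
    b = apply_shift(b, (uint8_t)(imm >> 16));
    c = apply_shift(c, (uint8_t)(imm >> 8));
    return ternlog(a, b, c, (uint8_t)imm);
}

/* The sequence from the motivation, written with ordinary operators. */
static uint64_t permute_reference(uint64_t x) {
    return (x & 0x4) | ((x & 0x2) << 2) | ((x & 0x80) >> 7);
}

/* Same result with one modelled instruction: a unshifted, b shl 2,
   c shr 7, truth table 0xFE (a | b | c). */
static uint64_t permute_fused(uint64_t x) {
    const uint32_t imm = (0x00u << 24)                 /* a: shl 0 */
                       | ((OP_SHL << 6 | 2u) << 16)    /* b: shl 2 */
                       | ((OP_SHR << 6 | 7u) << 8)     /* c: shr 7 */
                       | 0xFEu;                        /* a | b | c */
    return shifted_ternlog(x & 0x4, x & 0x2, x & 0x80, imm);
}
```

In this model the three ANDs still run as separate operations; only the two shifts and two ORs collapse into the single shifted_ternlog call. Whether the masks could also be folded in depends on details the proposal above leaves open.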