Bug: x86 backend misses PEXT for perfect bit-unshuffle — 30→4 insns with BMI2
Title
x86: 4-step perfect unshuffle not recognized as 2x PEXT — 30 instructions vs 4 with BMI2
Summary
The 4-step bit-transpose that separates interleaved even/odd bits (the “perfect
unshuffle”) is not recognized as two pext instructions even when -mattr=+bmi2
is specified. LLVM generates 30 instructions in both the base and BMI2 cases.
This pattern appears in §16-2 of Hacker’s Delight (Figure 16-8) as the final step
of the parallel prefix algorithm for converting a Hilbert curve index s to
coordinates (x, y).
Reproducer
; Perfect unshuffle: separate even-index bits (→ x) and odd-index bits (→ y).
define { i32, i32 } @unshuffle(i32 %s) {
%t1a = lshr i32 %s, 1
%t1b = xor i32 %s, %t1a
%t1c = and i32 %t1b, 572662306 ; 0x22222222
%t1d = shl i32 %t1c, 1
%t1e = xor i32 %t1c, %t1d
%s1 = xor i32 %s, %t1e
; ... (3 more steps with masks 0x0C0C0C0C, 0x00F000F0, 0x0000FF00)
%x = lshr i32 %s4, 16
%y = and i32 %s4, 65535
ret { i32, i32 } { i32 %x, i32 %y }
}# Both generate 30 instructions:
llc -O2 -mtriple=x86_64-pc-linux-gnu
llc -O2 -mtriple=x86_64-pc-linux-gnu -mattr=+bmi2Current output (+bmi2, 30 instructions)
movl %edi, %eax
shrl %eax
xorl %edi, %eax
andl $572662306, %eax ; 0x22222222
leal (%rax,%rax,2), %eax ; t * 3 = t ^ (t<<1) (non-overlapping bits)
xorl %edi, %eax
; ... 24 more instructions for steps 2-4 + final extraction
retqExpected output (+bmi2, 4 instructions)
movl $1431655765, %eax ; 0x55555555
pextl %eax, %edi, %eax ; x = extract bits at positions 0,2,4,...,30
movl $-1431655766, %ecx ; 0xAAAAAAAA
pextl %ecx, %edi, %edx ; y = extract bits at positions 1,3,5,...,31
retq(Verified: explicit @llvm.x86.bmi.pext.32 intrinsics produce exactly this output.)
Why they are equivalent
The 4-step bit-transpose (§16-2 Figure 16-8) rearranges a 32-bit word s so that
the 16 even-index bits of s end up contiguous in the upper 16 bits, and the 16
odd-index bits end up contiguous in the lower 16 bits.
PEXT r32, r/m32, r32 (parallel bit extract, BMI2) extracts bits from the source
at positions indicated by the mask and packs them contiguously into the destination:
pext(s, 0x55555555): mask has 1s at positions 0,2,4,…,30 → extracts 16 bits from the even positions ofsand packs them into bits 0–15 of the result =x.pext(s, 0xAAAAAAAA): mask has 1s at positions 1,3,5,…,31 → extracts 16 bits from the odd positions ofsand packs them into bits 0–15 of the result =y.
This is exactly the output of the 4-step unshuffle (modulo upper/lower swap).
Root cause
The x86 backend (or DAGCombine) does not recognize the 4-step
(s ^ (s >> k)) & C; s ^= t ^ (t << k) pattern as a perfect unshuffle and does
not lower it to PEXT when BMI2 is available.
The pattern to match (for i32):
s1 = s ^ ((s ^ (s>>1 )) & 0x22222222) ^ (((s ^ (s>>1 )) & 0x22222222) << 1)
s2 = s1 ^ ((s1^ (s1>>2)) & 0x0C0C0C0C) ^ (((s1^ (s1>>2)) & 0x0C0C0C0C) << 2)
s3 = s2 ^ ((s2^ (s2>>4)) & 0x00F000F0) ^ (((s2^ (s2>>4)) & 0x00F000F0) << 4)
s4 = s3 ^ ((s3^ (s3>>8)) & 0x0000FF00) ^ (((s3^ (s3>>8)) & 0x0000FF00) << 8)
x = s4 >> 16
y = s4 & 0xFFFF
→ x = pext(s, 0x55555555); y = pext(s, 0xAAAAAAAA)
Inverse: perfect shuffle (PDEP opportunity)
The inverse operation (interleaving x and y into s) maps to:
s = pdep(x, 0x55555555) | pdep(y, 0xAAAAAAAA)5 instructions with BMI2. The 4-step shuffle sequence has a similar opportunity.
Where to fix
llvm/lib/Target/X86/X86ISelLowering.cpp or a DAGCombine in
X86ISelDAGToDAG.cpp: add a pattern that recognizes the perfect-unshuffle
sequence and lowers to PEXT when +bmi2 is available.
Alternatively, a generic InstCombine pass in
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp could canonicalize the
4-step transpose to @llvm.x86.bmi.pext.32 (or a target-independent
@llvm.bitreverse-style intrinsic) before instruction selection.