Bug: x86 backend misses PEXT for perfect bit-unshuffle — 30→4 insns with BMI2

Title

x86: 4-step perfect unshuffle not recognized as 2x PEXT — 30 instructions vs 4 with BMI2

Summary

The 4-step bit-transpose that separates interleaved even/odd bits (the “perfect unshuffle”) is not recognized as two pext instructions even when -mattr=+bmi2 is specified. LLVM generates 30 instructions in both the base and BMI2 cases.

This pattern appears in §16-2 of Hacker’s Delight (Figure 16-8) as the final step of the parallel prefix algorithm for converting a Hilbert curve index s to coordinates (x, y).

Reproducer

; Perfect unshuffle: separate even-index bits (→ x) and odd-index bits (→ y).
define { i32, i32 } @unshuffle(i32 %s) {
  %t1a = lshr i32 %s,  1
  %t1b = xor  i32 %s,  %t1a
  %t1c = and  i32 %t1b, 572662306    ; 0x22222222
  %t1d = shl  i32 %t1c, 1
  %t1e = xor  i32 %t1c, %t1d
  %s1  = xor  i32 %s,  %t1e
  ; ... (3 more steps with masks 0x0C0C0C0C, 0x00F000F0, 0x0000FF00)
  %x  = lshr i32 %s4, 16
  %y  = and  i32 %s4, 65535
  ret { i32, i32 } { i32 %x, i32 %y }
}
# Both generate 30 instructions:
llc -O2 -mtriple=x86_64-pc-linux-gnu
llc -O2 -mtriple=x86_64-pc-linux-gnu -mattr=+bmi2

Current output (+bmi2, 30 instructions)

movl  %edi, %eax
shrl  %eax
xorl  %edi, %eax
andl  $572662306, %eax      ; 0x22222222
leal  (%rax,%rax,2), %eax   ; t * 3 = t ^ (t<<1) (non-overlapping bits)
xorl  %edi, %eax
; ... 24 more instructions for steps 2-4 + final extraction
retq

Expected output (+bmi2, 4 instructions)

movl  $1431655765, %eax      ; 0x55555555
pextl %eax, %edi, %eax       ; x = extract bits at positions 0,2,4,...,30
movl  $-1431655766, %ecx     ; 0xAAAAAAAA
pextl %ecx, %edi, %edx       ; y = extract bits at positions 1,3,5,...,31
retq

(Verified: explicit @llvm.x86.bmi.pext.32 intrinsics produce exactly this output.)

Why they are equivalent

The 4-step bit-transpose (§16-2 Figure 16-8) rearranges a 32-bit word s so that the 16 even-index bits of s end up contiguous in the upper 16 bits, and the 16 odd-index bits end up contiguous in the lower 16 bits.

PEXT r32, r/m32, r32 (parallel bit extract, BMI2) extracts bits from the source at positions indicated by the mask and packs them contiguously into the destination:

  • pext(s, 0x55555555): mask has 1s at positions 0,2,4,…,30 → extracts 16 bits from the even positions of s and packs them into bits 0–15 of the result = x.
  • pext(s, 0xAAAAAAAA): mask has 1s at positions 1,3,5,…,31 → extracts 16 bits from the odd positions of s and packs them into bits 0–15 of the result = y.

This is exactly the output of the 4-step unshuffle (modulo upper/lower swap).

Root cause

The x86 backend (or DAGCombine) does not recognize the 4-step (s ^ (s >> k)) & C; s ^= t ^ (t << k) pattern as a perfect unshuffle and does not lower it to PEXT when BMI2 is available.

The pattern to match (for i32):

s1 = s  ^ ((s ^ (s>>1 )) & 0x22222222) ^ (((s ^ (s>>1 )) & 0x22222222) << 1)
s2 = s1 ^ ((s1^ (s1>>2)) & 0x0C0C0C0C) ^ (((s1^ (s1>>2)) & 0x0C0C0C0C) << 2)
s3 = s2 ^ ((s2^ (s2>>4)) & 0x00F000F0) ^ (((s2^ (s2>>4)) & 0x00F000F0) << 4)
s4 = s3 ^ ((s3^ (s3>>8)) & 0x0000FF00) ^ (((s3^ (s3>>8)) & 0x0000FF00) << 8)
x  = s4 >> 16
y  = s4 & 0xFFFF

x = pext(s, 0x55555555); y = pext(s, 0xAAAAAAAA)

Inverse: perfect shuffle (PDEP opportunity)

The inverse operation (interleaving x and y into s) maps to:

s = pdep(x, 0x55555555) | pdep(y, 0xAAAAAAAA)

5 instructions with BMI2. The 4-step shuffle sequence has a similar opportunity.

Where to fix

llvm/lib/Target/X86/X86ISelLowering.cpp or a DAGCombine in X86ISelDAGToDAG.cpp: add a pattern that recognizes the perfect-unshuffle sequence and lowers to PEXT when +bmi2 is available.

Alternatively, a generic InstCombine pass in llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp could canonicalize the 4-step transpose to @llvm.x86.bmi.pext.32 (or a target-independent @llvm.bitreverse-style intrinsic) before instruction selection.