Bug: InstCombine misses parallel prefix XOR reduction as parity (popcount & 1)

Title

InstCombine: parallel prefix XOR reduction not recognized as ctpop(x) & 1 — 16 insns, optimal 2 (+popcnt) or 6 (no popcnt)

Summary

The standard software idiom for computing word parity is a 5-step parallel prefix XOR reduction. LLVM does not recognize this as ctpop(x) & 1 and generates 16 instructions even when the target has POPCNT. The ctpop intrinsic itself is lowered correctly (2 insns with POPCNT, 6 insns without via SETNP). The miss is purely in the pattern recognizer — opt does not lift the XOR chain to ctpop.

This idiom appears in §15-3 of Hacker’s Delight (Figure 15-1) as the check-bit computation kernel for Hamming SEC-DED codes. Each of the six check bits is computed as the parity of a masked subset of the 32 information bits, using this exact XOR-prefix pattern.

Reproducer

define i32 @parity_prefix_xor(i32 %x) {
  %a = lshr i32 %x, 1
  %b = xor i32 %x, %a
  %c = lshr i32 %b, 2
  %d = xor i32 %b, %c
  %e = lshr i32 %d, 4
  %f = xor i32 %d, %e
  %g = lshr i32 %f, 8
  %h = xor i32 %f, %g
  %i = lshr i32 %h, 16
  %j = xor i32 %h, %i
  %r = and i32 %j, 1
  ret i32 %r
}
# Both commands produce 16 instructions:
llc -O2 -mtriple=x86_64-pc-linux-gnu
llc -O2 -mtriple=x86_64-pc-linux-gnu -mattr=+popcnt

Current output (+popcnt, 16 instructions)

movl  %edi, %eax
shrl  %eax
xorl  %edi, %eax
movl  %eax, %ecx
shrl  $2, %ecx
xorl  %eax, %ecx
movl  %ecx, %eax
shrl  $4, %eax
xorl  %ecx, %eax
movl  %eax, %ecx
shrl  $8, %ecx
xorl  %eax, %ecx
movl  %ecx, %eax
shrl  $16, %eax
xorl  %ecx, %eax
andl  $1, %eax
retq

Expected output (+popcnt, 2 instructions)

popcntl  %edi, %eax
andl     $1, %eax
retq

Expected output (no popcnt, 6 instructions)

movl   %edi, %ecx
shrl   $16, %ecx
xorl   %edi, %ecx
xorl   %eax, %eax
xorb   %ch, %cl
setnp  %al
retq

(This is exactly what llc generates when @llvm.ctpop.i32 is used directly.)

Why the XOR chain equals ctpop(x) & 1

The 5-step shift-XOR sequence is a parallel prefix XOR reduction. Each step folds the high half into the low half at every width boundary. After all 5 steps, bit 0 holds XOR of all 32 bits. The XOR of all bits equals popcount(x) mod 2 = parity(x).

Formally: (x ^ (x>>1) ^ (x>>2) ^ ... ^ (x>>31)) & 1 = popcount(x) & 1.

The prefix XOR formulation computes this in O(log n) XORs rather than n-1 by sharing intermediate values.

Proof via ctpop intrinsic

When written with the ctpop intrinsic directly, llc generates the correct output:

declare i32 @llvm.ctpop.i32(i32)
define i32 @parity_via_ctpop(i32 %x) {
  %pc = call i32 @llvm.ctpop.i32(i32 %x)
  %r  = and i32 %pc, 1
  ret i32 %r
}

→ 2 instructions with +popcnt, 6 instructions without. The miss is in opt, which does not canonicalize the XOR chain to ctpop.

Masked variant (Hamming check bits)

The same miss applies to the masked version:

define i32 @parity_p0(i32 %u) {
  %m = and i32 %u, -1431655766   ; 0xAAAAAAAA — bits 1,3,5,...,31
  ; ... same 5-step XOR chain on %m ...
  %r = and i32 %j, 1
  ret i32 %r
}

Optimal with +popcnt: 3 instructions (andl $mask; popcntl; andl $1). Current: 16 instructions.

Severity

Moderate. The 16-instruction XOR chain appears in every Hamming check-bit calculation and every software parity routine written in this idiom. With POPCNT available (all x86_64 CPUs since ~2008), this is an 8× regression in instruction count. Without POPCNT, the miss is still 16 vs 6 instructions (a 2.7× regression).

Where to fix

llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp or a dedicated pattern in InstCombineSelect.cpp: recognize the full XOR-prefix-reduction chain:

(XOR (XOR (XOR (XOR x (SRL x, 1)) (SHR ... 2)) (SHR ... 4)) ...) & 1

(CTPOP x) & 1

The pattern for i32 requires all five shifts (1, 2, 4, 8, 16) to be present; for i16 four shifts (1, 2, 4, 8); for i64 six shifts (1, 2, 4, 8, 16, 32).