Bug: InstCombine misses parallel prefix XOR reduction as parity (popcount & 1)
Title
InstCombine: parallel prefix XOR reduction not recognized as ctpop(x) & 1 — 16 insns, optimal 2 (+popcnt) or 6 (no popcnt)
Summary
The standard software idiom for computing word parity is a 5-step parallel prefix
XOR reduction. LLVM does not recognize this as ctpop(x) & 1 and generates 16
instructions even when the target has POPCNT. The ctpop intrinsic itself is
lowered correctly (2 insns with POPCNT, 6 insns without via SETNP). The miss is
purely in the pattern recognizer — opt does not lift the XOR chain to ctpop.
This idiom appears in §15-3 of Hacker’s Delight (Figure 15-1) as the check-bit computation kernel for Hamming SEC-DED codes. Each of the six check bits is computed as the parity of a masked subset of the 32 information bits, using this exact XOR-prefix pattern.
Reproducer
define i32 @parity_prefix_xor(i32 %x) {
%a = lshr i32 %x, 1
%b = xor i32 %x, %a
%c = lshr i32 %b, 2
%d = xor i32 %b, %c
%e = lshr i32 %d, 4
%f = xor i32 %d, %e
%g = lshr i32 %f, 8
%h = xor i32 %f, %g
%i = lshr i32 %h, 16
%j = xor i32 %h, %i
%r = and i32 %j, 1
ret i32 %r
}# Both commands produce 16 instructions:
llc -O2 -mtriple=x86_64-pc-linux-gnu
llc -O2 -mtriple=x86_64-pc-linux-gnu -mattr=+popcntCurrent output (+popcnt, 16 instructions)
movl %edi, %eax
shrl %eax
xorl %edi, %eax
movl %eax, %ecx
shrl $2, %ecx
xorl %eax, %ecx
movl %ecx, %eax
shrl $4, %eax
xorl %ecx, %eax
movl %eax, %ecx
shrl $8, %ecx
xorl %eax, %ecx
movl %ecx, %eax
shrl $16, %eax
xorl %ecx, %eax
andl $1, %eax
retqExpected output (+popcnt, 2 instructions)
popcntl %edi, %eax
andl $1, %eax
retqExpected output (no popcnt, 6 instructions)
movl %edi, %ecx
shrl $16, %ecx
xorl %edi, %ecx
xorl %eax, %eax
xorb %ch, %cl
setnp %al
retq(This is exactly what llc generates when @llvm.ctpop.i32 is used directly.)
Why the XOR chain equals ctpop(x) & 1
The 5-step shift-XOR sequence is a parallel prefix XOR reduction. Each step folds
the high half into the low half at every width boundary. After all 5 steps, bit 0
holds XOR of all 32 bits. The XOR of all bits equals popcount(x) mod 2 = parity(x).
Formally: (x ^ (x>>1) ^ (x>>2) ^ ... ^ (x>>31)) & 1 = popcount(x) & 1.
The prefix XOR formulation computes this in O(log n) XORs rather than n-1 by sharing intermediate values.
Proof via ctpop intrinsic
When written with the ctpop intrinsic directly, llc generates the correct output:
declare i32 @llvm.ctpop.i32(i32)
define i32 @parity_via_ctpop(i32 %x) {
%pc = call i32 @llvm.ctpop.i32(i32 %x)
%r = and i32 %pc, 1
ret i32 %r
}→ 2 instructions with +popcnt, 6 instructions without. The miss is in opt,
which does not canonicalize the XOR chain to ctpop.
Masked variant (Hamming check bits)
The same miss applies to the masked version:
define i32 @parity_p0(i32 %u) {
%m = and i32 %u, -1431655766 ; 0xAAAAAAAA — bits 1,3,5,...,31
; ... same 5-step XOR chain on %m ...
%r = and i32 %j, 1
ret i32 %r
}Optimal with +popcnt: 3 instructions (andl $mask; popcntl; andl $1).
Current: 16 instructions.
Severity
Moderate. The 16-instruction XOR chain appears in every Hamming check-bit calculation and every software parity routine written in this idiom. With POPCNT available (all x86_64 CPUs since ~2008), this is an 8× regression in instruction count. Without POPCNT, the miss is still 16 vs 6 instructions (a 2.7× regression).
Where to fix
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp or a dedicated pattern
in InstCombineSelect.cpp: recognize the full XOR-prefix-reduction chain:
(XOR (XOR (XOR (XOR x (SRL x, 1)) (SHR ... 2)) (SHR ... 4)) ...) & 1
→ (CTPOP x) & 1
The pattern for i32 requires all five shifts (1, 2, 4, 8, 16) to be present; for i16 four shifts (1, 2, 4, 8); for i64 six shifts (1, 2, 4, 8, 16, 32).