Summary
LLVM does not recognize the standard parallel-prefix XOR parity reduction as
ctpop(x) & 1.
As a result, x86 codegen emits the full 5-step XOR chain even when POPCNT is
available. If the value is written directly as ctpop(x) & 1, LLVM already
lowers it well.
This looks like a strong InstCombine / canonicalization miss.
Reproducer
define i32 @parity_prefix_xor(i32 %x) {
%a = lshr i32 %x, 1
%b = xor i32 %x, %a
%c = lshr i32 %b, 2
%d = xor i32 %b, %c
%e = lshr i32 %d, 4
%f = xor i32 %d, %e
%g = lshr i32 %f, 8
%h = xor i32 %f, %g
%i = lshr i32 %h, 16
%j = xor i32 %h, %i
%r = and i32 %j, 1
ret i32 %r
}Current output
With +popcnt, LLVM still emits the full XOR chain:
movl %edi, %eax
shrl %eax
xorl %edi, %eax
movl %eax, %ecx
shrl $2, %ecx
xorl %eax, %ecx
movl %ecx, %eax
shrl $4, %eax
xorl %ecx, %eax
movl %eax, %ecx
shrl $8, %ecx
xorl %eax, %ecx
movl %ecx, %eax
shrl $16, %eax
xorl %ecx, %eax
andl $1, %eax
retqBetter outputs
With POPCNT:
popcntl %edi, %eax
andl $1, %eax
retqWithout POPCNT, LLVM already knows a better parity lowering when starting from
ctpop(x) & 1:
movl %edi, %ecx
shrl $16, %ecx
xorl %edi, %ecx
xorl %eax, %eax
xorb %ch, %cl
setnp %al
retqWhy this is valid
The shift/XOR chain is a parallel prefix reduction. After the final step, bit 0
contains the XOR of all bits of x, which is exactly:
parity(x) = popcount(x) mod 2 = ctpop(x) & 1So this is a straight canonicalization opportunity.
Why it seems worth tracking
- there is already a better lowering path in LLVM for
ctpop & 1 - the missed optimization is entirely in the recognizer
- this pattern shows up in ECC/Hamming code kernels and parity routines
Likely fix area
llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
Local references
- Source note:
llvm-validation/ch15-ecc/bug-parity-prefix-xor.md - Harness:
llvm-validation/ch15-ecc/ch15_ecc.ll