issue

Summary

LLVM does not recognize the standard parallel-prefix XOR parity reduction as ctpop(x) & 1.

As a result, x86 codegen emits the full 5-step XOR chain even when POPCNT is available. If the value is written directly as ctpop(x) & 1, LLVM already lowers it well.

This looks like a strong InstCombine / canonicalization miss.

Reproducer

define i32 @parity_prefix_xor(i32 %x) {
  %a = lshr i32 %x, 1
  %b = xor i32 %x, %a
  %c = lshr i32 %b, 2
  %d = xor i32 %b, %c
  %e = lshr i32 %d, 4
  %f = xor i32 %d, %e
  %g = lshr i32 %f, 8
  %h = xor i32 %f, %g
  %i = lshr i32 %h, 16
  %j = xor i32 %h, %i
  %r = and i32 %j, 1
  ret i32 %r
}

Current output

With +popcnt, LLVM still emits the full XOR chain:

movl  %edi, %eax
shrl  %eax
xorl  %edi, %eax
movl  %eax, %ecx
shrl  $2, %ecx
xorl  %eax, %ecx
movl  %ecx, %eax
shrl  $4, %eax
xorl  %ecx, %eax
movl  %eax, %ecx
shrl  $8, %ecx
xorl  %eax, %ecx
movl  %ecx, %eax
shrl  $16, %eax
xorl  %ecx, %eax
andl  $1, %eax
retq

Better outputs

With POPCNT:

popcntl  %edi, %eax
andl     $1, %eax
retq

Without POPCNT, LLVM already knows a better parity lowering when starting from ctpop(x) & 1:

movl   %edi, %ecx
shrl   $16, %ecx
xorl   %edi, %ecx
xorl   %eax, %eax
xorb   %ch, %cl
setnp  %al
retq

Why this is valid

The shift/XOR chain is a parallel prefix reduction. After the final step, bit 0 contains the XOR of all bits of x, which is exactly:

parity(x) = popcount(x) mod 2 = ctpop(x) & 1

So this is a straight canonicalization opportunity.

Why it seems worth tracking

there is already a better lowering path in LLVM for ctpop & 1
the missed optimization is entirely in the recognizer
this pattern shows up in ECC/Hamming code kernels and parity routines

Likely fix area

llvm/lib/Transforms/InstCombine/InstCombineAndOrXor.cpp

Local references

Source note: llvm-validation/ch15-ecc/bug-parity-prefix-xor.md
Harness: llvm-validation/ch15-ecc/ch15_ecc.ll

Takashi's Notes

Explorer