Summary
LLVM partially optimizes the classic popcount loop
while (x != 0) {
x &= x - 1;
c++;
}by strength-reducing the loop body to blsr, but it does not recognize the
entire loop as a population count.
This is weaker than the parallel-popcount-network issue because loop idiom recognition is more debatable, but it still seems worth tracking privately as a possible canonicalization opportunity.
Reduced LLVM IR repro
target triple = "x86_64-unknown-linux-gnu"
define i32 @popcount_loop(i32 %x) {
entry:
br label %loop
loop:
%n = phi i32 [ %x, %entry ], [ %next, %loop ]
%c = phi i32 [ 0, %entry ], [ %c.next, %loop ]
%nz = icmp ne i32 %n, 0
br i1 %nz, label %body, label %exit
body:
%n1 = add i32 %n, -1
%next = and i32 %n, %n1
%c.next = add i32 %c, 1
br label %loop
exit:
ret i32 %c
}Current x86_64 output
LLVM current head improves the loop body but still keeps the loop:
xorl %eax, %eax
testl %edi, %edi
je .Lexit
.Lloop:
incl %eax
blsrl %edi, %edi
jne .Lloop
retqBetter output
With POPCNT available:
popcntl %edi, %eax
retqWithout POPCNT, a different non-loop expansion could still be profitable, depending on target and cost model.
Why this seems worth tracking
- LLVM already recognizes
x &= x - 1asblsr - the loop computes exactly the number of set bits
- it may be possible to treat this as a popcount idiom at the loop level
Caveat
This is not as strong as the direct Hacker’s Delight parallel-popcount-network issue. It is more of a “does LLVM want this loop idiom?” question than a clear missed peephole.
Local references
- Source note:
llvm-validation/reduced-ch5/bug-report-notes.md - IR repro:
llvm-validation/reduced-ch5/popcount_loop_to_ctpop.ll