Summary

LLVM partially optimizes the classic popcount loop

while (x != 0) {
  x &= x - 1;
  c++;
}

by strength-reducing the loop body to blsr, but it does not recognize the entire loop as a population count.

This is weaker than the parallel-popcount-network issue because loop idiom recognition is more debatable, but it still seems worth tracking privately as a possible canonicalization opportunity.

Reduced LLVM IR repro

target triple = "x86_64-unknown-linux-gnu"
 
define i32 @popcount_loop(i32 %x) {
entry:
  br label %loop
 
loop:
  %n = phi i32 [ %x, %entry ], [ %next, %loop ]
  %c = phi i32 [ 0, %entry ], [ %c.next, %loop ]
  %nz = icmp ne i32 %n, 0
  br i1 %nz, label %body, label %exit
 
body:
  %n1 = add i32 %n, -1
  %next = and i32 %n, %n1
  %c.next = add i32 %c, 1
  br label %loop
 
exit:
  ret i32 %c
}

Current x86_64 output

LLVM current head improves the loop body but still keeps the loop:

xorl  %eax, %eax
testl %edi, %edi
je    .Lexit
.Lloop:
  incl  %eax
  blsrl %edi, %edi
  jne   .Lloop
retq

Better output

With POPCNT available:

popcntl %edi, %eax
retq

Without POPCNT, a different non-loop expansion could still be profitable, depending on target and cost model.

Why this seems worth tracking

  • LLVM already recognizes x &= x - 1 as blsr
  • the loop computes exactly the number of set bits
  • it may be possible to treat this as a popcount idiom at the loop level

Caveat

This is not as strong as the direct Hacker’s Delight parallel-popcount-network issue. It is more of a “does LLVM want this loop idiom?” question than a clear missed peephole.

Local references

  • Source note: llvm-validation/reduced-ch5/bug-report-notes.md
  • IR repro: llvm-validation/reduced-ch5/popcount_loop_to_ctpop.ll