Summary
Without BMI1, x86_64 codegen for the classic lowest-zero-bit mask idiom
`~B & (B + 1)` uses 4 instructions where 3 should suffice.
This is a small backend/register-allocation quality issue. With BMI1 enabled,
LLVM already emits the optimal andn-based form.
Reduced LLVM IR repro

```llvm
target triple = "x86_64-pc-linux-gnu"

define i32 @lowest_zero_bit_mask(i32 %B) {
  %nb = xor i32 %B, -1
  %bp1 = add i32 %B, 1
  %r = and i32 %nb, %bp1
  ret i32 %r
}
```

Current output (no BMI1)
```asm
leal 1(%rdi), %eax
movl %edi, %ecx
notl %ecx
andl %ecx, %eax
retq
```

Better output (no BMI1)
```asm
leal 1(%rdi), %eax
notl %edi
andl %edi, %eax
retq
```

Why this is valid
`leal 1(%rdi), %eax` reads `%rdi` and writes `%eax`. After that instruction,
clobbering `%edi` with `notl %edi` is fine because the original value of `%rdi`
is no longer needed. So the extra copy to `%ecx` is avoidable.
With BMI1
LLVM already emits the good 2-instruction form:
```asm
leal 1(%rdi), %eax
andnl %eax, %edi, %eax
```

So this only matters for the no-BMI1 path.
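
Assuming the IR repro is saved as `ch13_blsi_miss.ll`, both variants can be regenerated with `llc` (the triple is already in the file; `-mattr=+bmi` enables BMI1):

```shell
# Baseline path: no BMI1, expect the 4-instruction sequence.
llc -O2 -o - ch13_blsi_miss.ll

# BMI1 path: expect the andn-based 2-instruction form.
llc -O2 -mattr=+bmi -o - ch13_blsi_miss.ll
```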
Why this seems worth tracking
- The optimization is small but real.
- It appears in Gray-code increment code.
- The better sequence is straightforward and local.
Caveat
This is a minor 4-to-3 instruction improvement, so it is weaker than most of the other issues in the project.
Local references
- Source note: llvm-validation/ch13-gray-code/README.md
- IR repro: llvm-validation/ch13-gray-code/ch13_blsi_miss.ll