Summary
Without BMI1, x86_64 codegen for the classic lowest-zero-bit mask idiom
`~B & (B + 1)` uses 4 instructions where 3 should suffice.
This is a small backend/register-allocation quality issue. With BMI1 enabled,
LLVM already emits the optimal andn-based form.
Reduced LLVM IR repro

```llvm
target triple = "x86_64-pc-linux-gnu"

define i32 @lowest_zero_bit_mask(i32 %B) {
  %nb = xor i32 %B, -1
  %bp1 = add i32 %B, 1
  %r = and i32 %nb, %bp1
  ret i32 %r
}
```

Current output (no BMI1)
```asm
leal 1(%rdi), %eax
movl %edi, %ecx
notl %ecx
andl %ecx, %eax
retq
```

Better output (no BMI1)
```asm
leal 1(%rdi), %eax
notl %edi
andl %edi, %eax
retq
```

Why this is valid
`leal 1(%rdi), %eax` reads `%rdi` and writes `%eax`. After that instruction,
clobbering `%edi` with `notl %edi` is fine because the original value of `%rdi`
is no longer needed. So the extra copy to `%ecx` is avoidable.
With BMI1
LLVM already emits the good 2-instruction form:
```asm
leal 1(%rdi), %eax
andnl %eax, %edi, %eax
```

So this only matters for the no-BMI1 path.
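
Assuming the IR repro is saved as `ch13_blsi_miss.ll`, both variants can be regenerated with `llc` (the triple is already in the file; `-mattr=+bmi` enables BMI1):

```shell
# Baseline path: no BMI1, expect the 4-instruction sequence.
llc -O2 -o - ch13_blsi_miss.ll

# BMI1 path: expect the andn-based 2-instruction form.
llc -O2 -mattr=+bmi -o - ch13_blsi_miss.ll
```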
Why this seems worth tracking
- The optimization is small but real.
- It appears in Gray-code increment code.
- The better sequence is straightforward and local.
Caveat
This is a minor 4-to-3 instruction improvement, so it is weaker than most of the other issues in the project.
Local references
- Source note: llvm-validation/ch13-gray-code/README.md
- IR repro: llvm-validation/ch13-gray-code/ch13_blsi_miss.ll