Bug: x86_64 (W-1) - ctlz(x, false) emits 4 insns instead of optimal 3
Title
x86_64: 31 - ctlz(i32, false) emits 4 instructions; bsr + movl $-1 + cmovz suffices
Summary
(W-1) - ctlz(x, i1 false), the “safe” integer log2 that returns −1 for x = 0
(is_zero_undef = false means ctlz(0) is defined and equals the bit width W),
generates four instructions on x86_64 without LZCNT, although three suffice
using bsr + movl $-1 + cmovz.
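The intended semantics can be modelled with a short Python sketch (ctlz32 and ilog2_safe are illustrative helper names, not part of LLVM):

```python
def ctlz32(x: int) -> int:
    """Count leading zeros of a 32-bit value.

    Returns 32 for x == 0, matching llvm.ctlz.i32 with is_zero_undef = false.
    """
    assert 0 <= x < 2**32
    return 32 - x.bit_length()

def ilog2_safe(x: int) -> int:
    """31 - ctlz(x): floor(log2(x)) for x != 0, and -1 for x == 0."""
    return 31 - ctlz32(x)
```

For example, ilog2_safe(0) is −1 and ilog2_safe(0x80000000) is 31, matching the IR reproducer.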
Reproducer
declare i32 @llvm.ctlz.i32(i32, i1 immarg)
define i32 @ilog2_safe(i32 %x) {
%lz = call i32 @llvm.ctlz.i32(i32 %x, i1 false)
%r = sub i32 31, %lz
ret i32 %r
}

llc -O2 -mtriple=x86_64-pc-linux-gnu

Current output (4 instructions)
ilog2_safe:
movl $63, %eax      ← pre-load 63 so that x = 0 yields −1 after the XOR+ADD correction
bsrl %edi, %eax     ← BSR overwrites %eax for x ≠ 0; leaves the pre-loaded 63 if x = 0 (de-facto Intel/AMD behaviour)
xorl $-32, %eax     ← combined correction: identity for values 0..31,
addl $32, %eax      ← maps the pre-loaded 63 to −1

Expected output (3 instructions)
ilog2_safe:
bsrl %edi, %eax ← ZF=1 if x=0; eax=bsr(x) for x≠0 (undef for x=0)
movl $-1, %ecx ← prepare fallback value
cmovzl %ecx, %eax   ← if ZF (x = 0): eax = −1; otherwise eax = bsr(x)

Why the 3-instruction form is correct
1. bsrl %edi, %eax:
   - x ≠ 0: writes bsr(x) = bit position of the highest set bit = 31 − ctlz(x) into %eax and clears ZF.
   - x = 0: sets ZF = 1. The result written to %eax is undefined per the x86 ISA, but that does not matter, because step 3 overwrites %eax in exactly this case.
2. movl $-1, %ecx: prepares the correct return value for x = 0 (31 − ctlz(0) = 31 − 32 = −1).
3. cmovzl %ecx, %eax:
   - ZF = 1 (x = 0): %eax ← −1. The undefined BSR result is discarded. ✓
   - ZF = 0 (x ≠ 0): %eax is unchanged = bsr(x) = 31 − ctlz(x). ✓
No undefined BSR destination behaviour is relied upon. CMOV is part of the x86_64 baseline ISA.
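The argument can be sanity-checked with a small Python model of the proposed sequence (an illustrative sketch, not generated code); it deliberately feeds arbitrary junk as the “undefined” BSR result for x = 0 to show that no undefined value leaks through:

```python
def emulate_bsr_cmov(x: int, undef: int) -> int:
    """Model of: bsrl %edi,%eax ; movl $-1,%ecx ; cmovzl %ecx,%eax."""
    assert 0 <= x < 2**32
    if x == 0:
        zf, eax = 1, undef                # BSR: ZF = 1, destination undefined
    else:
        zf, eax = 0, x.bit_length() - 1   # BSR: index of highest set bit
    ecx = -1                              # movl $-1, %ecx
    if zf:                                # cmovzl %ecx, %eax
        eax = ecx
    return eax

# Reference: 31 - ctlz(x), with ctlz(0) = 32
ref = lambda x: 31 - (32 - x.bit_length())

for x in [0, 1, 2, 3, 6, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]:
    for undef in (0, 63, 0xDEADBEEF):     # junk BSR results for x = 0
        assert emulate_bsr_cmov(x, undef) == ref(x)
```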
Why the current 4-instruction form works (but is suboptimal)
LLVM pre-loads %eax = 63, then uses BSR (which on all known Intel and AMD processors
leaves dest unchanged for src = 0, even though the ISA calls this “undefined”). The
subsequent xor $-32; add $32 is the identity for bsr values 0..31, and converts 63
to −1 for the x = 0 case. The approach is clever but uses one more instruction.
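The correction arithmetic is easy to verify exhaustively in Python with 32-bit wraparound (a quick sketch, not part of the report's reproducer):

```python
M = 0xFFFFFFFF

def fixup(v: int) -> int:
    """32-bit model of: xorl $-32, %eax ; addl $32, %eax."""
    v = ((v ^ (-32 & M)) + 32) & M
    return v - 2**32 if v >= 2**31 else v  # reinterpret as signed i32

# Identity on valid BSR results 0..31; maps the pre-loaded 63 to -1
assert all(fixup(v) == v for v in range(32))
assert fixup(63) == -1
```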
Affected widths
Same pattern applies to i64 (63 − ctlz.i64(x, false) → bsrq + movl $-1 + cmovzq)
and i16 (15 − ctlz.i16(x, false) → bsrw + movl $-1 + cmovzw).
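A width-parameterised sanity check (again a sketch, with an illustrative ctlz helper) confirms that (W−1) − ctlz collapses to the BSR result at every affected width:

```python
def ctlz(x: int, w: int) -> int:
    """Leading zeros at width w; returns w for x == 0 (is_zero_undef = false)."""
    return w - x.bit_length()

for w in (16, 32, 64):
    assert (w - 1) - ctlz(0, w) == -1                          # zero input -> -1
    for x in (1, 2, 1 << (w - 1), (1 << w) - 1):
        assert (w - 1) - ctlz(x, w) == x.bit_length() - 1      # == bsr(x)
```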
Where to fix
X86ISelDAGToDAG.cpp / X86ISelLowering.cpp: the lowering of ISD::CTLZ with
is_zero_undef = false. When the result is used in a (W-1) - ctlz subtraction,
the combine pass should recognise the pattern and emit bsr + cmovz($-1) instead
of the current movl $preload + bsr + xor + add sequence.