Bug: x86_64 (W-1) - ctlz(x, false) emits 4 insns instead of optimal 3

Title

x86_64: 31 - ctlz(i32, false) emits 4 instructions; bsr + movl $-1 + cmovz suffices

Summary

(W-1) - ctlz(x, i1 false) — the “safe” integer log2 that returns −1 for x = 0 — generates four instructions on x86_64 without LZCNT, but three instructions are sufficient using bsr + movl $-1 + cmovz.

Reproducer

declare i32 @llvm.ctlz.i32(i32, i1 immarg)
 
define i32 @ilog2_safe(i32 %x) {
  %lz = call i32 @llvm.ctlz.i32(i32 %x, i1 false)
  %r  = sub i32 31, %lz
  ret i32 %r
}
llc -O2 -mtriple=x86_64-pc-linux-gnu
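For reference, a small C model of the IR's semantics (our own sketch; `ctlz32` and `ilog2_safe` are illustrative names, not LLVM APIs). Because the intrinsic is called with `i1 false`, ctlz is fully defined at x = 0 (it returns 32), which is what makes 31 − ctlz the "safe" ilog2:

```c
#include <assert.h>

/* Portable stand-in for llvm.ctlz.i32(x, i1 false): defined at x = 0. */
static int ctlz32(unsigned x) {
    if (x == 0) return 32;              /* is_zero_undef = false case */
    int n = 0;
    while (!(x & 0x80000000u)) { x <<= 1; n++; }
    return n;
}

/* Matches @ilog2_safe in the reproducer: returns -1 for x = 0. */
static int ilog2_safe(unsigned x) {
    return 31 - ctlz32(x);
}
```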

Current output (4 instructions)

ilog2_safe:
    movl  $63, %eax          ← pre-load 63 so that x=0 gives -1 after XOR+ADD
    bsrl  %edi, %eax          ← BSR overwrites %eax for x≠0; leaves 63 if x=0 (de-facto Intel/AMD behaviour)
    xorl  $-32, %eax          ← combined correction
    addl  $32, %eax

Expected output (3 instructions)

ilog2_safe:
    bsrl  %edi, %eax          ← ZF=1 if x=0; eax=bsr(x) for x≠0 (undef for x=0)
    movl  $-1,  %ecx          ← prepare fallback value
    cmovzl %ecx, %eax         ← if ZF (x=0): eax=-1; otherwise eax=bsr(x)

Why the 3-instruction form is correct

  1. bsrl %edi, %eax:

    • x ≠ 0: writes bsr(x) = bit position of highest set bit = 31 − ctlz(x) into %eax and clears ZF.
    • x = 0: sets ZF = 1. The result written to %eax is undefined per the x86 ISA, but that does not matter because step 3 overwrites it unconditionally.
  2. movl $-1, %ecx: prepare the correct return value for x = 0 (31 − ctlz(0) = 31 − 32 = −1).

  3. cmovzl %ecx, %eax:

    • ZF = 1 (x = 0): %eax ← −1. The undefined BSR result is discarded. ✓
    • ZF = 0 (x ≠ 0): %eax is unchanged = bsr(x) = 31 − ctlz(x). ✓

No undefined BSR destination behaviour is relied upon. CMOV is part of the x86_64 baseline ISA.
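The case analysis above can be checked mechanically. The following C model (our own sketch; `lower3` and the 0xDEADBEEF poison value are illustrative, not LLVM output) simulates the three instructions, deliberately feeding an arbitrary value into %eax for the x = 0 case to confirm that cmovz discards it:

```c
#include <assert.h>

/* Instruction-level model of the proposed bsr + mov + cmovz lowering. */
static int lower3(unsigned x) {
    int zf = (x == 0);
    int eax = (int)0xDEADBEEF;      /* poison: bsr's dest is undefined for x = 0 */
    if (!zf) {                      /* bsrl %edi, %eax: index of highest set bit */
        eax = 31;
        while (!(x & (1u << eax))) eax--;
    }
    int ecx = -1;                   /* movl $-1, %ecx */
    if (zf) eax = ecx;              /* cmovzl %ecx, %eax */
    return eax;
}
```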

Why the current 4-instruction form works (but is suboptimal)

LLVM pre-loads %eax = 63, then uses BSR (which on all known Intel and AMD processors leaves the destination unchanged for src = 0, even though the ISA calls this “undefined”). The subsequent xorl $-32; addl $32 pair is the identity for BSR results 0..31 and maps the pre-loaded 63 to −1 for the x = 0 case. The approach is clever, but it both relies on non-architectural BSR behaviour and costs one more instruction.
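The correction arithmetic is easy to verify exhaustively (a quick check of our own, not from the report):

```c
#include <assert.h>

/* The xorl $-32; addl $32 correction from the current 4-instruction form:
 * identity on 0..31 (bits only in the low 5 positions, so v ^ -32 == v - 32),
 * and maps the pre-loaded 63 to -1. */
static int correct(int v) {
    return (v ^ -32) + 32;
}
```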

Affected widths

Same pattern applies to i64 (63 − ctlz.i64(x, false) → bsrq + movq $-1 + cmovzq; note movq, since movl $-1, %ecx would zero-extend to 0x00000000FFFFFFFF in %rcx) and i16 (15 − ctlz.i16(x, false) → bsrw + movl $-1 + cmovzw).
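The i64 variant can be modelled the same way (hypothetical helper name, our sketch; the fallback there must be a 64-bit −1, hence movq rather than movl):

```c
#include <assert.h>

/* Model of the i64 lowering: bsrq + movq $-1, %rcx + cmovzq. */
static long long ilog2_safe64(unsigned long long x) {
    if (x == 0) return -1;          /* cmovzq takes the sign-extended -1 */
    int n = 63;                     /* bsrq: index of highest set bit */
    while (!(x & (1ull << n))) n--;
    return n;
}
```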

Where to fix

X86ISelDAGToDAG.cpp / X86ISelLowering.cpp: the lowering of ISD::CTLZ with is_zero_undef = false. When the result is used in a (W-1) - ctlz subtraction, the combine pass should recognise the pattern and emit bsr + cmovz($-1) instead of the current movl $preload + bsr + xor + add sequence.