Chapter 11: Some Elementary Functions — LLVM Validation

LLVM version tested: 23.0.0git (build at ~/llvm-project/build/bin)

Already handled correctly

Pattern	IR	LLVM output	Notes
ilog2, x > 0 (§11-4)	`31 - ctlz.i32(x, true)`	`bsrl`	1 instruction
ilog2 i64, x > 0 (§11-4)	`63 - ctlz.i64(x, true)`	`bsrq`	1 instruction
ilog10 approx (§11-4, Fig 11-12)	`(9*(31-ctlz(x))) >> 5`	`bsrl + leal(*9) + shrl $5`	3 insns; LEA for ×9
ilog10 approx variant (§11-4, Fig 11-13)	`(19*(31-ctlz(x))) >> 6`	`bsrl + leal(9) + leal(2+1) + shrl $6`	4 insns; two LEAs for ×19
x^4 via squaring (§11-3)	`(xx)(x*x)`	2 `imull`	binary decomp
x^13 binary decomp (§11-3)	5-multiply chain	5 `imull`	optimal addition-chain length for 13

Missed optimization: `(W−1) − ctlz(x, false)` on x86_64 (without LZCNT)

Problem

(W-1) - ctlz(x, i1 false) — integer log2 that returns −1 for x = 0 — generates 4 instructions but 3 instructions suffice using bsr + cmovz.

Affected pattern: 31 - ctlz.i32(x, false), 63 - ctlz.i64(x, false), 15 - ctlz.i16(x, false).

What LLVM generates (`ilog2_safe`, i32)

movl  $63, %eax          ; pre-load 63 so x=0 case gives -1 after XOR+ADD
bsrl  %edi, %eax          ; BSR writes bsr(x) for x≠0; leaves 63 if x=0 (Intel/AMD behaviour)
xorl  $-32, %eax          ; correction: bsr XOR 0xFFFFFFE0 + 32 = bsr (identity for 0..31)
addl  $32, %eax           ;             63 XOR 0xFFFFFFE0 + 32 = -1 (x=0 case)

Optimal (3 instructions)

bsrl  %edi, %eax          ; ZF=1 if x=0; eax=bsr(x) for x≠0 (undefined for x=0, but overwritten below)
movl  $-1,  %ecx          ; fallback
cmovzl %ecx, %eax         ; if ZF=1 (x=0): eax=-1; else: eax=bsr(x)=31-ctlz(x)

Why the 3-instruction form is correct

bsr sets ZF = 1 iff src = 0. The destination is undefined for src = 0, but cmovz overwrites it immediately — no undefined value escapes.
For x ≠ 0: ZF = 0, cmovz is a no-op, eax = bsr(x) = 31 − ctlz(x). ✓
For x = 0: ZF = 1, cmovz writes −1. 31 − ctlz(0) = 31 − 32 = −1. ✓
cmovz is part of the x86_64 baseline ISA (no extra feature flags needed).

Why the current 4-instruction form is suboptimal

LLVM’s pre-load-63 + BSR + XOR + ADD trick relies on Intel/AMD processors leaving the BSR destination unchanged for src = 0 (called “undefined” in the ISA, but in practice never mutated). This costs one extra instruction.

Same issue for i64 and i16

; i64 — current (4 insns):
movl   $127, %eax
bsrq   %rdi, %rax
xorq   $-64, %rax
addq   $64, %rax
 
; i64 — optimal (3 insns):
bsrq   %rdi, %rax
movq   $-1,  %rcx
cmovzq %rcx, %rax

Note: with LZCNT extension

With -mattr=+lzcnt, LLVM already generates 3 instructions:

lzcntl  %edi, %ecx
movl    $31, %eax
subl    %ecx, %eax

The miss only affects the baseline x86_64 codegen (without LZCNT).

Test files

File	What it tests
`ch11_basic.ll`	Already-handled patterns (ilog2 nonzero, ilog10 approx, ipow)
`ch11_ilog2_safe.ll`	The safe ilog2 / `(W-1)-ctlz(x,false)` miss (i32, i64, i16)
`bug-ilog2-safe.md`	Bug report draft

Takashi's Notes

Explorer

README

Chapter 11: Some Elementary Functions — LLVM Validation

Already handled correctly

Missed optimization: `(W−1) − ctlz(x, false)` on x86_64 (without LZCNT)

Problem

What LLVM generates (`ilog2_safe`, i32)

Optimal (3 instructions)

Why the 3-instruction form is correct

Why the current 4-instruction form is suboptimal

Same issue for i64 and i16

Note: with LZCNT extension

Test files

Graph View

Table of Contents

Takashi's Notes

Explorer

README

Chapter 11: Some Elementary Functions — LLVM Validation

Already handled correctly

Missed optimization: (W−1) − ctlz(x, false) on x86_64 (without LZCNT)

Problem

What LLVM generates (ilog2_safe, i32)

Optimal (3 instructions)

Why the 3-instruction form is correct

Why the current 4-instruction form is suboptimal

Same issue for i64 and i16

Note: with LZCNT extension

Test files

Graph View

Table of Contents

Missed optimization: `(W−1) − ctlz(x, false)` on x86_64 (without LZCNT)

What LLVM generates (`ilog2_safe`, i32)