llvm-exegesis Benchmark: SBB form vs. current LLVM output

Setup

  • CPU: AMD Ryzen 5 7600 (znver4, Zen 4 microarchitecture)
  • LLVM: 23.0.0git
  • Tool: ~/llvm-project/build/bin/llvm-exegesis
  • Mode: latency (serial loop — each step feeds the next) and inverse_throughput
  • Snippet: one CRC step with output fed back into the input register (models real loop)

All measurements in cycles per CRC step (one bit processed).


Right-shift CRC (crc = (crc >> 1) ^ (-(crc & 1) & 0xEDB88320))

FormInsnsLatency (cycles)Inv. Throughput (cycles)
Current LLVM (mov+and+neg+and+shr+xor)64.064.07
SBB form (shr+sbb+and+xor)44.064.05

Verdict: identical performance. Both forms are bottlenecked by the same 4-cycle carried-dependency chain through the CRC register:

crc → mov(0,elim) → andl(1) → negl(1) → andl(1) → xorl(1) → crc   ; current: 4 cycles
crc → shrl(1) → [CF] → sbbl(1) → andl(1) → xorl(1) → crc          ; SBB: 4 cycles

The SBB form saves 2 instructions with no latency or throughput improvement on Zen 4. Code-size benefit only: 2 fewer µops to decode/issue per loop iteration.


Left-shift CRC (crc = (crc << 1) ^ ((crc < 0) ? 0x04C11DB7 : 0))

FormInsnsLatency (cycles)Inv. Throughput (cycles)
Current LLVM (leal+mov+sar+and+xor+mov)63.063.29
SBB form (add+sbb+and+xor)44.054.04

Verdict: current LLVM output is 1 cycle faster (25% better) despite 2 more instructions.

Why the current LLVM lshift form wins

The current form uses leal (%rdi,%rdi), %eax which:

  1. Writes the doubled value to a separate register (eax), leaving edi intact
  2. Does not touch flags, so the sign-mask computation runs in parallel on a separate chain

With Zen 4 mov-elimination (latency ≈ 0 for movl %reg, %reg), the two chains overlap:

edi → movl(0,elim) → sarl(1) → andl(1) → ┐
edi → leal(1)                             → xorl(1) → movl(0,elim) → edi
      ↑ ready at cycle 1                    ↑ waits for andl@2
                                            xorl starts cycle 2, finishes cycle 3

The critical path depth is 3 cycles (edi → sarl → andl → xorl, with leal parallel).

Why the SBB lshift form is slower

addl %edi, %edi (= crc << 1) overwrites edi and sets CF, forcing a strictly serial chain through CF → sbblandlxorl:

edi → addl(1) → edi_new + CF → sbbl(1) → andl(1) → xorl(edi_new@1, eax@3) → edi
      chain: 1          +1         +1          [waits until cycle 3]   = 4 cycles

The CF dependency serializes what the leal form computes in parallel.

Key insight: LLVM’s leal choice is correct

The bug report (bug-crc-shr-sbb.md) states that LLVM uses leal because it avoids clobbering the source register, and that this “prevents the SBB optimization”. The measurements show that LLVM is making the right trade-off: preserving edi for the parallel sar $31 computation achieves 3 cycles vs the SBB form’s 4 cycles.

The “missed optimization” for the left-shift case should be closed as WONTFIX or reclassified: the current 5-insn form is 1 cycle faster per loop iteration than the “optimal” 4-insn SBB form.


Summary

CRC variantCurrent insnsSBB insnsCurrent latencySBB latencyWinner
Right-shift644.06 cycles4.06 cyclesTie (SBB smaller)
Left-shift643.06 cycles4.05 cyclesCurrent LLVM

The right-shift case: SBB form saves 2 instructions at no performance cost on modern out-of-order CPUs (Zen 4, Intel with mov-elimination). Still worth implementing for code density.

The left-shift case: do not implement the SBB form. The current leal-based output is measurably faster (1 cycle/iteration = 25% for the CRC step alone).


Caveat: CPUs without mov-elimination

On CPUs that do not implement mov-elimination (in-order cores, older microarchitectures), the movl %edi, %ecx in the lshift current form costs 1 cycle, making the critical path 4 cycles — same as the SBB form. On those CPUs the SBB form would be marginally better (4 insns vs 6). But all modern x86-64 server/desktop CPUs (Intel Ivy Bridge+, AMD Zen+) support GP register mov-elimination.