llvm-exegesis Benchmark: SBB form vs. current LLVM output
Setup
- CPU: AMD Ryzen 5 7600 (znver4, Zen 4 microarchitecture)
- LLVM: 23.0.0git
- Tool:
~/llvm-project/build/bin/llvm-exegesis - Mode:
latency(serial loop — each step feeds the next) andinverse_throughput - Snippet: one CRC step with output fed back into the input register (models real loop)
All measurements in cycles per CRC step (one bit processed).
Right-shift CRC (crc = (crc >> 1) ^ (-(crc & 1) & 0xEDB88320))
| Form | Insns | Latency (cycles) | Inv. Throughput (cycles) |
|---|---|---|---|
| Current LLVM (mov+and+neg+and+shr+xor) | 6 | 4.06 | 4.07 |
| SBB form (shr+sbb+and+xor) | 4 | 4.06 | 4.05 |
Verdict: identical performance. Both forms are bottlenecked by the same 4-cycle carried-dependency chain through the CRC register:
crc → mov(0,elim) → andl(1) → negl(1) → andl(1) → xorl(1) → crc ; current: 4 cycles
crc → shrl(1) → [CF] → sbbl(1) → andl(1) → xorl(1) → crc ; SBB: 4 cycles
The SBB form saves 2 instructions with no latency or throughput improvement on Zen 4. Code-size benefit only: 2 fewer µops to decode/issue per loop iteration.
Left-shift CRC (crc = (crc << 1) ^ ((crc < 0) ? 0x04C11DB7 : 0))
| Form | Insns | Latency (cycles) | Inv. Throughput (cycles) |
|---|---|---|---|
| Current LLVM (leal+mov+sar+and+xor+mov) | 6 | 3.06 | 3.29 |
| SBB form (add+sbb+and+xor) | 4 | 4.05 | 4.04 |
Verdict: current LLVM output is 1 cycle faster (25% better) despite 2 more instructions.
Why the current LLVM lshift form wins
The current form uses leal (%rdi,%rdi), %eax which:
- Writes the doubled value to a separate register (
eax), leavingediintact - Does not touch flags, so the sign-mask computation runs in parallel on a separate chain
With Zen 4 mov-elimination (latency ≈ 0 for movl %reg, %reg), the two chains overlap:
edi → movl(0,elim) → sarl(1) → andl(1) → ┐
edi → leal(1) → xorl(1) → movl(0,elim) → edi
↑ ready at cycle 1 ↑ waits for andl@2
xorl starts cycle 2, finishes cycle 3
The critical path depth is 3 cycles (edi → sarl → andl → xorl, with leal parallel).
Why the SBB lshift form is slower
addl %edi, %edi (= crc << 1) overwrites edi and sets CF, forcing a strictly serial
chain through CF → sbbl → andl → xorl:
edi → addl(1) → edi_new + CF → sbbl(1) → andl(1) → xorl(edi_new@1, eax@3) → edi
chain: 1 +1 +1 [waits until cycle 3] = 4 cycles
The CF dependency serializes what the leal form computes in parallel.
Key insight: LLVM’s leal choice is correct
The bug report (bug-crc-shr-sbb.md) states that LLVM uses leal because it avoids
clobbering the source register, and that this “prevents the SBB optimization”. The
measurements show that LLVM is making the right trade-off: preserving edi for the
parallel sar $31 computation achieves 3 cycles vs the SBB form’s 4 cycles.
The “missed optimization” for the left-shift case should be closed as WONTFIX or reclassified: the current 5-insn form is 1 cycle faster per loop iteration than the “optimal” 4-insn SBB form.
Summary
| CRC variant | Current insns | SBB insns | Current latency | SBB latency | Winner |
|---|---|---|---|---|---|
| Right-shift | 6 | 4 | 4.06 cycles | 4.06 cycles | Tie (SBB smaller) |
| Left-shift | 6 | 4 | 3.06 cycles | 4.05 cycles | Current LLVM |
The right-shift case: SBB form saves 2 instructions at no performance cost on modern out-of-order CPUs (Zen 4, Intel with mov-elimination). Still worth implementing for code density.
The left-shift case: do not implement the SBB form. The current leal-based output is measurably faster (1 cycle/iteration = 25% for the CRC step alone).
Caveat: CPUs without mov-elimination
On CPUs that do not implement mov-elimination (in-order cores, older microarchitectures),
the movl %edi, %ecx in the lshift current form costs 1 cycle, making the critical path
4 cycles — same as the SBB form. On those CPUs the SBB form would be marginally better
(4 insns vs 6). But all modern x86-64 server/desktop CPUs (Intel Ivy Bridge+, AMD Zen+)
support GP register mov-elimination.