Bug: x86_64 sdiv i32 by constants with positive magic wastes one instruction

Title

x86_64: sdiv i32 by constant with positive magic number emits 6 insns instead of 5

Summary

For sdiv i32 %n, d where the Hacker’s Delight magic number M < 2^31 (positive), LLVM emits a redundant movq + shrq $63 pair to extract the sign bit of the 64-bit product n*M. Because M > 0, sign(n*M) = sign(n), so the sign bit can be extracted from the input register before the multiply, saving one instruction.

Affected divisors (W=32)

All d where the signed magic fits in a positive 32-bit value. From Hacker’s Delight Table 10-1: 3, 5, 6, 9, 10, 11, 12, 25, 125, 625, and others.

Reproducer

define i32 @sdiv3(i32 %n) {
  %r = sdiv i32 %n, 3
  ret i32 %r
}
llc -O2 -mtriple=x86_64-pc-linux-gnu

Current output (6 instructions)

sdiv3:
    movslq  %edi, %rax
    imulq   $1431655766, %rax, %rax   ; M = 0x55555556
    movq    %rax, %rcx                ; ← redundant copy
    shrq    $63, %rcx                 ; ← sign of product (2 insns for 1 bit)
    shrq    $32, %rax
    addl    %ecx, %eax
    retq

Expected output (5 instructions)

sdiv3:
    movslq  %edi, %rax                ; reads %edi, does NOT modify it
    shrl    $31, %edi                 ; sign of n (0 or 1)
    imulq   $1431655766, %rax, %rax
    shrq    $32, %rax
    addl    %edi, %eax
    retq

Why it is correct

  1. movslq %edi, %rax sign-extends %edi into %rax but leaves %edi unchanged.
  2. M = 1431655766 = 0x55555556 < 2^31, so M is positive when sign-interpreted.
  3. For all i32 n: |n| < 2^31, M < 2^31, so |n*M| < 2^62 — no 64-bit overflow.
  4. Therefore sign(n*M) = sign(n), and (n*M) >> 63 == n >> 31 (as 0/1 unsigned).
  5. Extracting the sign correction from %edi (shrl $31) is equivalent to extracting it from the product (movq + shrq $63) but uses one fewer instruction.

The same argument applies to all divisors where M < 2^31.

Why this does not apply to sdiv i64

The 64-bit path uses the 1-operand imulq %rcx form (128-bit product in rdx:rax), which always requires a final movq %rdx, %rax to place the result in the return register. This extra move offsets any saving, so 6 instructions remains optimal.

Where to fix

The sign-correction extraction logic in X86ISelLowering.cpp / DAGCombine for the SMULH-based division-by-constant lowering on 32-bit operands. When selecting the sign bit source, check if the magic number is positive and, if so, use SRL(n, 31) instead of SRL(SMULH(n, M), 63).