Generic llvm.clmul vector lowering on X86

Summary

This is a code-quality / optimization request, not a correctness bug.

Current-head LLVM lowers llvm.clmul.v4i32 and llvm.clmul.v8i32 correctly on X86 with pclmulqdq / vpclmulqdq available, but the generated code is fully unrolled and shuffle-heavy.

The current X86 lowering code explicitly says:

  • “Only PCLMUL required as we always unroll clmul vectors.”

That policy is visible in the generated assembly for v4i32 and v8i32.

Why this might be worth filing

  • The generic llvm.clmul intrinsic is now upstream and intended to be used by frontends.
  • Scalar i64 and vector v2i64 / v4i64 lower well on X86.
  • The v4i32 / v8i32 cases work, but the backend currently handles them by fully unrolling: one carry-less multiply per element, plus the lane replication and reinsertion shuffles needed to feed and reassemble each lane.
  • This looks like a reasonable target for better legalization or DAG combines, even if the exact optimal sequence is target- and cost-model-dependent.

Minimal IR repro

File: vector-clmul.ll
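
A minimal version of that file looks roughly like the sketch below. The function names match the assembly labels shown later; the declarations assume the intrinsic's usual single-type overload, where both operands and the result share the vector type.

define <4 x i32> @clmul_v4(<4 x i32> %a, <4 x i32> %b) {
  %r = call <4 x i32> @llvm.clmul.v4i32(<4 x i32> %a, <4 x i32> %b)
  ret <4 x i32> %r
}

define <8 x i32> @clmul_v8(<8 x i32> %a, <8 x i32> %b) {
  %r = call <8 x i32> @llvm.clmul.v8i32(<8 x i32> %a, <8 x i32> %b)
  ret <8 x i32> %r
}

declare <4 x i32> @llvm.clmul.v4i32(<4 x i32>, <4 x i32>)
declare <8 x i32> @llvm.clmul.v8i32(<8 x i32>, <8 x i32>)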

Run:

~/llvm-project/build/bin/llc -O2 \
  -mtriple=x86_64-unknown-linux-gnu \
  -mattr=+avx,+pclmul,+vpclmulqdq \
  vector-clmul.ll -o - | clean_asm -

Frontend repro

File: rust-clmul-vectors.rs

This uses nightly Rust and links directly against the LLVM intrinsics:

rustc rust-clmul-vectors.rs \
  --crate-type=lib -O --emit=llvm-ir \
  -C target-feature=+avx,+pclmulqdq,+vpclmulqdq

That emits calls to:

  • @llvm.clmul.v4i32
  • @llvm.clmul.v8i32
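
In the emitted IR, those appear as ordinary overloaded intrinsic calls, roughly of this shape (a sketch; the actual value names and attributes will differ):

  %r4 = call <4 x i32> @llvm.clmul.v4i32(<4 x i32> %a, <4 x i32> %b)
  %r8 = call <8 x i32> @llvm.clmul.v8i32(<8 x i32> %c, <8 x i32> %d)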

Notes:

  • This uses unstable Rust features:
    • repr_simd
    • simd_ffi
    • link_llvm_intrinsics
  • Because of that, the Rust file mainly serves as proof that a source language can reach these intrinsics today.

Observed current-head X86 output

clmul_v4:

vpclmulqdq $0,  %xmm1, %xmm0, %xmm2    # lane 0: clmul of the low qwords
vpshufd    $85, %xmm1, %xmm3           # splat lane 1 of each input ...
vpshufd    $85, %xmm0, %xmm4           # ...
vpclmulqdq $0,  %xmm3, %xmm4, %xmm3    # ... then clmul for lane 1
vpunpckldq %xmm3, %xmm2, %xmm2         # interleave lane 0/1 results
vpclmulqdq $17, %xmm1, %xmm0, %xmm3    # lane 2: clmul of the high qwords
vmovq      %xmm3, %rax                 # bounce lane 2 through a GPR ...
vpinsrd    $2, %eax, %xmm2, %xmm2      # ... and reinsert it
vpshufd    $255, %xmm1, %xmm1          # splat lane 3 of each input ...
vpshufd    $255, %xmm0, %xmm0          # ...
vpclmulqdq $0,  %xmm1, %xmm0, %xmm0    # ... then clmul for lane 3
vmovq      %xmm0, %rax                 # bounce lane 3 through a GPR ...
vpinsrd    $3, %eax, %xmm2, %xmm0      # ... and reinsert it
retq

clmul_v8:

vextractf128 $1, %ymm1, %xmm2        # split both operands into 128-bit halves
vextractf128 $1, %ymm0, %xmm3
vpclmulqdq   $0, %xmm2, %xmm3, %xmm4 # repeat the v4i32 pattern on the high half ...
...
vpclmulqdq   $0, %xmm1, %xmm0, %xmm3 # ... and again on the low half
...
vinsertf128  $1, %xmm2, %ymm0, %ymm0 # recombine the two halves
retq

Scope of the request

This is probably best filed as:

  • an X86 codegen improvement request, or
  • a question asking whether generic CLMUL legalization for v4i32 / v8i32 can be improved beyond the current unrolled, shuffle-heavy strategy.

Suggested issue title

  • X86: llvm.clmul.v4i32/v8i32 lowering is heavily unrolled and shuffle-heavy

Caveat

This is weaker than a classic “missed single-instruction idiom” report.

I have not demonstrated a specific better instruction sequence, only that:

  • the current lowering is structurally expensive, and
  • the X86 lowering contains an explicit “always unroll clmul vectors” policy that likely leaves room for improvement.