Generic llvm.clmul vector lowering on X86

Summary

This is a code-quality / optimization request, not a correctness bug.

Current-head LLVM lowers llvm.clmul.v4i32 and llvm.clmul.v8i32 correctly on X86 with pclmulqdq / vpclmulqdq available, but the generated code is fully unrolled and shuffle-heavy.

The current X86 lowering code explicitly says:

  • “Only PCLMUL required as we always unroll clmul vectors.”

That policy is visible in the generated assembly for v4i32 and v8i32.

Why this might be worth filing

  • The generic llvm.clmul intrinsic is now upstream and intended to be used by frontends.
  • Scalar i64 and vector v2i64 / v4i64 lower well on X86.
  • The v4i32 / v8i32 cases work, but the backend currently handles them by fully unrolling: one carry-less multiply per element, plus the lane replication and reinsertion shuffles needed to feed and reassemble each lane.
  • This looks like a reasonable target for better legalization or DAG combines, even if the exact optimal sequence is target- and cost-model-dependent.

Minimal IR repro

File: vector-clmul.ll
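
A minimal version of that file looks roughly like the sketch below. The function names match the assembly labels shown later; the declarations assume the intrinsic's usual single-type overload, where both operands and the result share the vector type.

define <4 x i32> @clmul_v4(<4 x i32> %a, <4 x i32> %b) {
  %r = call <4 x i32> @llvm.clmul.v4i32(<4 x i32> %a, <4 x i32> %b)
  ret <4 x i32> %r
}

define <8 x i32> @clmul_v8(<8 x i32> %a, <8 x i32> %b) {
  %r = call <8 x i32> @llvm.clmul.v8i32(<8 x i32> %a, <8 x i32> %b)
  ret <8 x i32> %r
}

declare <4 x i32> @llvm.clmul.v4i32(<4 x i32>, <4 x i32>)
declare <8 x i32> @llvm.clmul.v8i32(<8 x i32>, <8 x i32>)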

Run:

~/llvm-project/build/bin/llc -O2 \
  -mtriple=x86_64-unknown-linux-gnu \
  -mattr=+avx,+pclmul,+vpclmulqdq \
  vector-clmul.ll -o - | clean_asm -

Frontend repro

File: rust-clmul-vectors.rs

This uses nightly Rust and links directly against the LLVM intrinsics:

rustc rust-clmul-vectors.rs \
  --crate-type=lib -O --emit=llvm-ir \
  -C target-feature=+avx,+pclmulqdq,+vpclmulqdq

That emits calls to:

  • @llvm.clmul.v4i32
  • @llvm.clmul.v8i32
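
In the emitted IR, those appear as ordinary overloaded intrinsic calls, roughly of this shape (a sketch; the actual value names and attributes will differ):

  %r4 = call <4 x i32> @llvm.clmul.v4i32(<4 x i32> %a, <4 x i32> %b)
  %r8 = call <8 x i32> @llvm.clmul.v8i32(<8 x i32> %c, <8 x i32> %d)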

Notes:

  • This uses unstable Rust features:
    • repr_simd
    • simd_ffi
    • link_llvm_intrinsics
  • Because of that, the Rust file mainly serves as proof that a source language can reach these intrinsics today.

Observed current-head X86 output

clmul_v4:

vpclmulqdq $0,  %xmm1, %xmm0, %xmm2    # lane 0: clmul of the low qwords
vpshufd    $85, %xmm1, %xmm3           # splat lane 1 of each input ...
vpshufd    $85, %xmm0, %xmm4           # ...
vpclmulqdq $0,  %xmm3, %xmm4, %xmm3    # ... then clmul for lane 1
vpunpckldq %xmm3, %xmm2, %xmm2         # interleave lane 0/1 results
vpclmulqdq $17, %xmm1, %xmm0, %xmm3    # lane 2: clmul of the high qwords
vmovq      %xmm3, %rax                 # bounce lane 2 through a GPR ...
vpinsrd    $2, %eax, %xmm2, %xmm2      # ... and reinsert it
vpshufd    $255, %xmm1, %xmm1          # splat lane 3 of each input ...
vpshufd    $255, %xmm0, %xmm0          # ...
vpclmulqdq $0,  %xmm1, %xmm0, %xmm0    # ... then clmul for lane 3
vmovq      %xmm0, %rax                 # bounce lane 3 through a GPR ...
vpinsrd    $3, %eax, %xmm2, %xmm0      # ... and reinsert it
retq

clmul_v8:

vextractf128 $1, %ymm1, %xmm2        # split both operands into 128-bit halves
vextractf128 $1, %ymm0, %xmm3
vpclmulqdq   $0, %xmm2, %xmm3, %xmm4 # repeat the v4i32 pattern on the high half ...
...
vpclmulqdq   $0, %xmm1, %xmm0, %xmm3 # ... and again on the low half
...
vinsertf128  $1, %xmm2, %ymm0, %ymm0 # recombine the two halves
retq

Scope of the request

This is probably best filed as:

  • an X86 codegen improvement request, or
  • a question asking whether generic CLMUL legalization for v4i32 / v8i32 can be improved beyond the current unrolled, shuffle-heavy strategy.

Suggested issue title

  • X86: llvm.clmul.v4i32/v8i32 lowering is heavily unrolled and shuffle-heavy

Caveat

This is weaker than a classic “missed single-instruction idiom” report.

I have not demonstrated a specific better instruction sequence, only that:

  • the current lowering is structurally expensive, and
  • the X86 lowering contains an explicit “always unroll clmul vectors” policy that likely leaves room for improvement.