Generic llvm.clmul vector lowering on X86
Summary
This is a code-quality / optimization request, not a correctness bug.
Current-head LLVM lowers llvm.clmul.v4i32 and llvm.clmul.v8i32 correctly on
X86 with pclmulqdq / vpclmulqdq available, but the generated code is fully
unrolled and shuffle-heavy.
The current X86 lowering code explicitly says:
- “Only PCLMUL required as we always unroll clmul vectors.”
That policy is visible in the generated assembly for v4i32 and v8i32.
Why this might be worth filing
- The generic `llvm.clmul` intrinsic is now upstream and intended to be used by frontends.
- Scalar `i64` and vector `v2i64`/`v4i64` already lower well on X86 (a sketch follows this list).
- The `v4i32`/`v8i32` cases work, but the backend currently handles them via repeated unrolling, lane replication, and reconstruction.
- This looks like a reasonable target for better legalization or DAG combines, even if the exact optimal sequence is target- and cost-model-dependent.
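For comparison, the already-good cases are easy to reproduce. A minimal sketch, assuming the two-operand `llvm.clmul` overloads follow the usual mangled naming (`@llvm.clmul.i64`, `@llvm.clmul.v2i64`); the function names here are just for illustration:

```llvm
; Baseline forms that reportedly lower well on X86 (see summary above).
declare i64 @llvm.clmul.i64(i64, i64)
declare <2 x i64> @llvm.clmul.v2i64(<2 x i64>, <2 x i64>)

define i64 @clmul_i64(i64 %a, i64 %b) {
  %r = call i64 @llvm.clmul.i64(i64 %a, i64 %b)
  ret i64 %r
}

define <2 x i64> @clmul_v2i64(<2 x i64> %a, <2 x i64> %b) {
  %r = call <2 x i64> @llvm.clmul.v2i64(<2 x i64> %a, <2 x i64> %b)
  ret <2 x i64> %r
}
```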
Minimal IR repro
File: vector-clmul.ll
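The file's exact contents are not inlined here; a minimal sketch that would reproduce the output below, assuming the two-operand `llvm.clmul` overload (function names chosen to mirror the labels in the observed assembly):

```llvm
declare <4 x i32> @llvm.clmul.v4i32(<4 x i32>, <4 x i32>)
declare <8 x i32> @llvm.clmul.v8i32(<8 x i32>, <8 x i32>)

define <4 x i32> @clmul_v4(<4 x i32> %a, <4 x i32> %b) {
  %r = call <4 x i32> @llvm.clmul.v4i32(<4 x i32> %a, <4 x i32> %b)
  ret <4 x i32> %r
}

define <8 x i32> @clmul_v8(<8 x i32> %a, <8 x i32> %b) {
  %r = call <8 x i32> @llvm.clmul.v8i32(<8 x i32> %a, <8 x i32> %b)
  ret <8 x i32> %r
}
```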
Run:

```sh
~/llvm-project/build/bin/llc -O2 \
  -mtriple=x86_64-unknown-linux-gnu \
  -mattr=+avx,+pclmul,+vpclmulqdq \
  vector-clmul.ll -o - | clean_asm
```

Frontend repro
File: rust-clmul-vectors.rs
This uses nightly Rust and links directly to the LLVM intrinsics:
```sh
rustc rust-clmul-vectors.rs \
  --crate-type=lib -O --emit=llvm-ir \
  -C target-feature=+avx,+pclmulqdq,+vpclmulqdq
```

That emits calls to `@llvm.clmul.v4i32` and `@llvm.clmul.v8i32`.
Notes:
- This uses unstable Rust features: `repr_simd`, `simd_ffi`, and `link_llvm_intrinsics`.
- So the Rust file is mainly useful as a frontend proof that a source language can reach these intrinsics today.
Observed current-head X86 output
```asm
clmul_v4:
        vpclmulqdq $0, %xmm1, %xmm0, %xmm2    # lane 0: product of the low qwords
        vpshufd $85, %xmm1, %xmm3             # broadcast element 1 of %xmm1
        vpshufd $85, %xmm0, %xmm4             # broadcast element 1 of %xmm0
        vpclmulqdq $0, %xmm3, %xmm4, %xmm3    # lane 1 product
        vpunpckldq %xmm3, %xmm2, %xmm2        # pack lanes 0 and 1
        vpclmulqdq $17, %xmm1, %xmm0, %xmm3   # lane 2: product of the high qwords
        vmovq %xmm3, %rax
        vpinsrd $2, %eax, %xmm2, %xmm2        # insert lane 2 via a GPR round-trip
        vpshufd $255, %xmm1, %xmm1            # broadcast element 3 of %xmm1
        vpshufd $255, %xmm0, %xmm0            # broadcast element 3 of %xmm0
        vpclmulqdq $0, %xmm1, %xmm0, %xmm0    # lane 3 product
        vmovq %xmm0, %rax
        vpinsrd $3, %eax, %xmm2, %xmm0        # insert lane 3
        retq

clmul_v8:
        vextractf128 $1, %ymm1, %xmm2         # split both 256-bit inputs
        vextractf128 $1, %ymm0, %xmm3         #   into 128-bit halves
        vpclmulqdq $0, %xmm2, %xmm3, %xmm4    # same unrolled per-lane pattern
        ...                                   #   as clmul_v4, high half (elided)
        vpclmulqdq $0, %xmm1, %xmm0, %xmm3    # ... and low half (elided)
        ...
        vinsertf128 $1, %xmm2, %ymm0, %ymm0   # recombine the halves
        retq
```

Scope of the request
This is probably best filed as:
- an X86 codegen improvement request, or
- a question asking whether `v4i32`/`v8i32` generic CLMUL legalization can be improved beyond the current unrolled, shuffle-heavy strategy.
Suggested issue title
X86: llvm.clmul.v4i32/v8i32 lowering is heavily unrolled and shuffle-heavy
Caveat
This is weaker than a classic “missed single-instruction idiom” report.
I have not demonstrated a specific better instruction sequence, only that:
- the current lowering is structurally expensive (the `v4i32` case above spends four `vpclmulqdq` plus nine shuffle/pack/GPR-insert instructions on four lanes), and
- the X86 lowering contains an explicit “always unroll clmul vectors” policy that likely leaves room for improvement.