Summary
The generic llvm.clmul intrinsic now lowers correctly on x86, but the v4i32
and v8i32 cases are noticeably shuffle-heavy compared to the scalar i64 and
vector v2i64 / v4i64 forms.
This is not a correctness bug. It is a tracking issue for codegen quality.
The current X86 lowering code explicitly says:
- “Only PCLMUL required as we always unroll clmul vectors.”
That policy is visible in the generated assembly for llvm.clmul.v4i32 and
llvm.clmul.v8i32.
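For context, `llvm.clmul` computes a lane-wise carry-less multiply, returning the low half of each double-width product. A minimal software model of one 32-bit lane (the function name is illustrative, not from the LLVM sources):

```rust
/// Software model of one 32-bit lane of llvm.clmul.v4i32: XOR-accumulate
/// shifted partial products (no carries), keeping the low 32 bits.
fn clmul32(a: u32, b: u32) -> u32 {
    let mut acc: u32 = 0;
    for i in 0..32 {
        if (b >> i) & 1 == 1 {
            // In carry-less arithmetic, partial products combine with XOR.
            acc ^= a.wrapping_shl(i);
        }
    }
    acc
}
```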
Minimal IR repro

```llvm
target triple = "x86_64-unknown-linux-gnu"

declare <4 x i32> @llvm.clmul.v4i32(<4 x i32>, <4 x i32>)
declare <8 x i32> @llvm.clmul.v8i32(<8 x i32>, <8 x i32>)

define <4 x i32> @clmul_v4(<4 x i32> %a, <4 x i32> %b) {
entry:
  %r = call <4 x i32> @llvm.clmul.v4i32(<4 x i32> %a, <4 x i32> %b)
  ret <4 x i32> %r
}

define <8 x i32> @clmul_v8(<8 x i32> %a, <8 x i32> %b) {
entry:
  %r = call <8 x i32> @llvm.clmul.v8i32(<8 x i32> %a, <8 x i32> %b)
  ret <8 x i32> %r
}
```

Rust frontend repro
Nightly Rust can reach these directly by linking to the LLVM intrinsics:

```rust
#![feature(repr_simd, link_llvm_intrinsics, simd_ffi)]

#[repr(simd)]
#[derive(Copy, Clone)]
pub struct U32x4([u32; 4]);

#[repr(simd)]
#[derive(Copy, Clone)]
pub struct U32x8([u32; 8]);

unsafe extern "C" {
    #[link_name = "llvm.clmul.v4i32"]
    fn llvm_clmul_v4i32(a: U32x4, b: U32x4) -> U32x4;
    #[link_name = "llvm.clmul.v8i32"]
    fn llvm_clmul_v8i32(a: U32x8, b: U32x8) -> U32x8;
}
```

This is not meant as a stable user-facing API, just proof that frontends can reach the generic vector intrinsics.
Current x86_64 output

v4i32:

```asm
vpclmulqdq	$0, %xmm1, %xmm0, %xmm2
vpshufd	$85, %xmm1, %xmm3
vpshufd	$85, %xmm0, %xmm4
vpclmulqdq	$0, %xmm3, %xmm4, %xmm3
vpunpckldq	%xmm3, %xmm2, %xmm2
vpclmulqdq	$17, %xmm1, %xmm0, %xmm3
vmovq	%xmm3, %rax
vpinsrd	$2, %eax, %xmm2, %xmm2
vpshufd	$255, %xmm1, %xmm1
vpshufd	$255, %xmm0, %xmm0
vpclmulqdq	$0, %xmm1, %xmm0, %xmm0
vmovq	%xmm0, %rax
vpinsrd	$3, %eax, %xmm2, %xmm0
retq
```

v8i32 is essentially two copies of the same strategy plus vextractf128 / vinsertf128.
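The unrolled strategy above amounts to zero-extending each pair of 32-bit lanes into qword positions, doing a full 64-bit carry-less multiply with `pclmulqdq`, and keeping the low 32 bits of each result. A sketch of that per-lane equivalence (helper names are mine, not LLVM's):

```rust
/// Full 64-bit carry-less multiply, low 64 bits: models one pclmulqdq
/// result truncated to a qword.
fn clmul64(a: u64, b: u64) -> u64 {
    let mut acc: u64 = 0;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            acc ^= a.wrapping_shl(i);
        }
    }
    acc
}

/// One lane of the current lowering: widen to 64 bits, multiply
/// carry-less, truncate back to 32 bits.
fn clmul32_via_widening(a: u32, b: u32) -> u32 {
    clmul64(a as u64, b as u64) as u32
}
```

Because a 32×32 carry-less product fits in 63 bits, the widened multiply is exact before truncation; the cost in the generated code is the shuffles needed to marshal lanes into qword positions for `pclmulqdq`, not the multiplies themselves.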
Why this is worth tracking
- scalar i64 and v2i64/v4i64 look good
- v4i32/v8i32 are correct, but visibly expensive
- there may be room for better legalization or DAG combines now that generic llvm.clmul is upstream
Caveat
This is a weaker report than the sharper x86 issues in the project.
I do not yet have a single, obviously canonical better sequence; only the following observations:
- current lowering is structurally expensive
- the implementation explicitly chooses unrolling
- this may be revisit-worthy as CLMUL frontend usage expands
Local references
- Source note: llvm-validation/reduced-clmul-vectors/bug-report-notes.md
- IR repro: llvm-validation/reduced-clmul-vectors/vector-clmul.ll
- Rust repro: llvm-validation/reduced-clmul-vectors/rust-clmul-vectors.rs