Summary

The generic llvm.clmul intrinsic now lowers correctly on x86, but the v4i32 and v8i32 cases are noticeably shuffle-heavy compared to the scalar i64 and vector v2i64 / v4i64 forms.

This is not a correctness bug. It is a tracking issue for codegen quality.

The current X86 lowering code explicitly says:

  • “Only PCLMUL required as we always unroll clmul vectors.”

That policy is visible in the generated assembly for llvm.clmul.v4i32 and llvm.clmul.v8i32.
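
For reference, "unroll" here means each 32-bit lane is scalarized and multiplied independently. A minimal Rust sketch of that shape (clmul32 is an illustrative bitwise reference, not the actual lowering code):

fn clmul32(a: u32, b: u32) -> u32 {
    // Carry-less multiply: XOR of shifted partial products. Bits shifted
    // past bit 31 fall off, matching the same-width return type of
    // llvm.clmul.v4i32.
    let mut acc = 0u32;
    for i in 0..32 {
        if (b >> i) & 1 != 0 {
            acc ^= a << i;
        }
    }
    acc
}

fn clmul_v4_unrolled(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    // One independent multiply per lane; the extract/insert traffic in the
    // assembly below is the vector-register form of this loop.
    [
        clmul32(a[0], b[0]),
        clmul32(a[1], b[1]),
        clmul32(a[2], b[2]),
        clmul32(a[3], b[3]),
    ]
}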

Minimal IR repro

target triple = "x86_64-unknown-linux-gnu"
 
declare <4 x i32> @llvm.clmul.v4i32(<4 x i32>, <4 x i32>)
declare <8 x i32> @llvm.clmul.v8i32(<8 x i32>, <8 x i32>)
 
define <4 x i32> @clmul_v4(<4 x i32> %a, <4 x i32> %b) {
entry:
  %r = call <4 x i32> @llvm.clmul.v4i32(<4 x i32> %a, <4 x i32> %b)
  ret <4 x i32> %r
}
 
define <8 x i32> @clmul_v8(<8 x i32> %a, <8 x i32> %b) {
entry:
  %r = call <8 x i32> @llvm.clmul.v8i32(<8 x i32> %a, <8 x i32> %b)
  ret <8 x i32> %r
}

Rust frontend repro

Nightly Rust can reach these directly by linking to the LLVM intrinsics:

#![feature(repr_simd, link_llvm_intrinsics, simd_ffi)]
 
#[repr(simd)]
#[derive(Copy, Clone)]
pub struct U32x4([u32; 4]);
 
#[repr(simd)]
#[derive(Copy, Clone)]
pub struct U32x8([u32; 8]);
 
unsafe extern "C" {
    #[link_name = "llvm.clmul.v4i32"]
    fn llvm_clmul_v4i32(a: U32x4, b: U32x4) -> U32x4;
 
    #[link_name = "llvm.clmul.v8i32"]
    fn llvm_clmul_v8i32(a: U32x8, b: U32x8) -> U32x8;
}

This is not meant as a stable user-facing API; it is just proof that frontends can already reach the generic vector intrinsics.
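
A hypothetical smoke test on top of the declarations above (the transmute readback and the expected lane values are mine, not part of the repro file):

fn main() {
    let a = U32x4([1, 2, 3, 4]);
    let b = U32x4([3, 3, 3, 3]);
    let r = unsafe { llvm_clmul_v4i32(a, b) };
    // U32x4 is a single-field repr(simd) wrapper over [u32; 4], so a
    // by-value transmute back to the plain array is size-compatible.
    let lanes: [u32; 4] = unsafe { std::mem::transmute(r) };
    // Hand-computed carry-less products against 3 (0b11):
    // 1*3 = 3, 2*3 = 6, 3*3 = 5, 4*3 = 12.
    assert_eq!(lanes, [3, 6, 5, 12]);
}

Building with rustc +nightly -O -C target-feature=+avx,+pclmulqdq --emit asm should show the same shuffle-heavy sequence as the llc output below.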

Current x86_64 output
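
The assembly below is llc output; a command along these lines reproduces it (the VEX encodings imply at least AVX together with PCLMUL):

llc -O3 -mattr=+avx,+pclmul vector-clmul.ll -o -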

v4i32:

vpclmulqdq $0,  %xmm1, %xmm0, %xmm2   # low qwords (lanes 0,1 packed); low 32 bits = lane 0
vpshufd    $85, %xmm1, %xmm3          # broadcast lane 1 of b
vpshufd    $85, %xmm0, %xmm4          # broadcast lane 1 of a
vpclmulqdq $0,  %xmm3, %xmm4, %xmm3   # low 32 bits = lane 1
vpunpckldq %xmm3, %xmm2, %xmm2        # pack lanes 0 and 1
vpclmulqdq $17, %xmm1, %xmm0, %xmm3   # high qwords (lanes 2,3 packed); low 32 bits = lane 2
vmovq      %xmm3, %rax                # bounce lane 2 through a GPR
vpinsrd    $2, %eax, %xmm2, %xmm2     # insert lane 2
vpshufd    $255, %xmm1, %xmm1         # broadcast lane 3 of b
vpshufd    $255, %xmm0, %xmm0         # broadcast lane 3 of a
vpclmulqdq $0,  %xmm1, %xmm0, %xmm0   # low 32 bits = lane 3
vmovq      %xmm0, %rax                # bounce lane 3 through a GPR
vpinsrd    $3, %eax, %xmm2, %xmm0     # insert lane 3
retq

v8i32 is essentially two copies of the same strategy, plus vextractf128 / vinsertf128 to split and rejoin the 256-bit value.

Why this is worth tracking

  • scalar i64 and v2i64 / v4i64 look good
  • v4i32 / v8i32 are correct, but visibly expensive: the v4i32 sequence above spends four vpclmulqdq plus nine shuffle/move/insert instructions on four 32-bit lanes
  • there may be room for better legalization or DAG combines now that generic llvm.clmul is upstream

Caveat

This report is softer than the other, sharper x86 issues in the project.

I do not yet have an obviously better canonical sequence to propose, only the following observations:

  • the current lowering is structurally expensive
  • the implementation explicitly chooses unrolling
  • this may be worth revisiting as frontend use of CLMUL expands

Local references

  • Source note: llvm-validation/reduced-clmul-vectors/bug-report-notes.md
  • IR repro: llvm-validation/reduced-clmul-vectors/vector-clmul.ll
  • Rust repro: llvm-validation/reduced-clmul-vectors/rust-clmul-vectors.rs