Summary

The generic llvm.clmul intrinsic now lowers correctly on x86, but the v4i32 and v8i32 cases are noticeably shuffle-heavy compared to the scalar i64 and vector v2i64 / v4i64 forms.

This is not a correctness bug. It is a tracking issue for codegen quality.

The current X86 lowering code explicitly says:

  • “Only PCLMUL required as we always unroll clmul vectors.”

That policy is visible in the generated assembly for llvm.clmul.v4i32 and llvm.clmul.v8i32.
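
For reference, "unroll" here means each 32-bit lane is scalarized and multiplied independently. A minimal Rust sketch of that shape (clmul32 is an illustrative bitwise reference, not the actual lowering code):

fn clmul32(a: u32, b: u32) -> u32 {
    // Carry-less multiply: XOR of shifted partial products. Bits shifted
    // past bit 31 fall off, matching the same-width return type of
    // llvm.clmul.v4i32.
    let mut acc = 0u32;
    for i in 0..32 {
        if (b >> i) & 1 != 0 {
            acc ^= a << i;
        }
    }
    acc
}

fn clmul_v4_unrolled(a: [u32; 4], b: [u32; 4]) -> [u32; 4] {
    // One independent multiply per lane; the extract/insert traffic in the
    // assembly below is the vector-register form of this loop.
    [
        clmul32(a[0], b[0]),
        clmul32(a[1], b[1]),
        clmul32(a[2], b[2]),
        clmul32(a[3], b[3]),
    ]
}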

Minimal IR repro

target triple = "x86_64-unknown-linux-gnu"
 
declare <4 x i32> @llvm.clmul.v4i32(<4 x i32>, <4 x i32>)
declare <8 x i32> @llvm.clmul.v8i32(<8 x i32>, <8 x i32>)
 
define <4 x i32> @clmul_v4(<4 x i32> %a, <4 x i32> %b) {
entry:
  %r = call <4 x i32> @llvm.clmul.v4i32(<4 x i32> %a, <4 x i32> %b)
  ret <4 x i32> %r
}
 
define <8 x i32> @clmul_v8(<8 x i32> %a, <8 x i32> %b) {
entry:
  %r = call <8 x i32> @llvm.clmul.v8i32(<8 x i32> %a, <8 x i32> %b)
  ret <8 x i32> %r
}

Rust frontend repro

Nightly Rust can reach these directly by linking to the LLVM intrinsics:

#![feature(repr_simd, link_llvm_intrinsics, simd_ffi)]
 
#[repr(simd)]
#[derive(Copy, Clone)]
pub struct U32x4([u32; 4]);
 
#[repr(simd)]
#[derive(Copy, Clone)]
pub struct U32x8([u32; 8]);
 
unsafe extern "C" {
    #[link_name = "llvm.clmul.v4i32"]
    fn llvm_clmul_v4i32(a: U32x4, b: U32x4) -> U32x4;
 
    #[link_name = "llvm.clmul.v8i32"]
    fn llvm_clmul_v8i32(a: U32x8, b: U32x8) -> U32x8;
}

This is not meant as a stable user-facing API; it is just proof that frontends can already reach the generic vector intrinsics.
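
A hypothetical smoke test on top of the declarations above (the transmute readback and the expected lane values are mine, not part of the repro file):

fn main() {
    let a = U32x4([1, 2, 3, 4]);
    let b = U32x4([3, 3, 3, 3]);
    let r = unsafe { llvm_clmul_v4i32(a, b) };
    // U32x4 is a single-field repr(simd) wrapper over [u32; 4], so a
    // by-value transmute back to the plain array is size-compatible.
    let lanes: [u32; 4] = unsafe { std::mem::transmute(r) };
    // Hand-computed carry-less products against 3 (0b11):
    // 1*3 = 3, 2*3 = 6, 3*3 = 5, 4*3 = 12.
    assert_eq!(lanes, [3, 6, 5, 12]);
}

Building with rustc +nightly -O -C target-feature=+avx,+pclmulqdq --emit asm should show the same shuffle-heavy sequence as the llc output below.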

Current x86_64 output
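
The assembly below is llc output; a command along these lines reproduces it (the VEX encodings imply at least AVX together with PCLMUL):

llc -O3 -mattr=+avx,+pclmul vector-clmul.ll -o -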

v4i32:

vpclmulqdq $0,  %xmm1, %xmm0, %xmm2   # low qwords (lanes 0,1 packed); low 32 bits = lane 0
vpshufd    $85, %xmm1, %xmm3          # broadcast lane 1 of b
vpshufd    $85, %xmm0, %xmm4          # broadcast lane 1 of a
vpclmulqdq $0,  %xmm3, %xmm4, %xmm3   # low 32 bits = lane 1
vpunpckldq %xmm3, %xmm2, %xmm2        # pack lanes 0 and 1
vpclmulqdq $17, %xmm1, %xmm0, %xmm3   # high qwords (lanes 2,3 packed); low 32 bits = lane 2
vmovq      %xmm3, %rax                # bounce lane 2 through a GPR
vpinsrd    $2, %eax, %xmm2, %xmm2     # insert lane 2
vpshufd    $255, %xmm1, %xmm1         # broadcast lane 3 of b
vpshufd    $255, %xmm0, %xmm0         # broadcast lane 3 of a
vpclmulqdq $0,  %xmm1, %xmm0, %xmm0   # low 32 bits = lane 3
vmovq      %xmm0, %rax                # bounce lane 3 through a GPR
vpinsrd    $3, %eax, %xmm2, %xmm0     # insert lane 3
retq

v8i32 is essentially two copies of the same strategy, plus vextractf128 / vinsertf128 to split and rejoin the 256-bit value.

Why this is worth tracking

  • scalar i64 and v2i64 / v4i64 look good
  • v4i32 / v8i32 are correct, but visibly expensive: the v4i32 sequence above spends four vpclmulqdq plus nine shuffle/move/insert instructions on four 32-bit lanes
  • there may be room for better legalization or DAG combines now that generic llvm.clmul is upstream

Caveat

This report is softer than the other, sharper x86 issues in the project.

I do not yet have an obviously better canonical sequence to propose, only the following observations:

  • the current lowering is structurally expensive
  • the implementation explicitly chooses unrolling
  • this may be worth revisiting as frontend use of CLMUL expands

Local references

  • Source note: llvm-validation/reduced-clmul-vectors/bug-report-notes.md
  • IR repro: llvm-validation/reduced-clmul-vectors/vector-clmul.ll
  • Rust repro: llvm-validation/reduced-clmul-vectors/rust-clmul-vectors.rs