In the discussion about waste, the point was to only run the instructions you actually need for what you're trying to do. The naive C version produced about 0.8 adds per clock.
But looking at the loop again:

```c
for (int i = 0; i < count; i++)
{
    sum += input[i];
}
```

There are four things that need to happen each iteration:

- `i` is checked to be less than `count`.
- `input[i]` is loaded.
- `input[i]` is added to `sum`.
- `i` is incremented.
So a chip that executed exactly one instruction per clock would manage at most 0.25 adds per clock. But chips can execute several instructions at the same time: since this naive code managed ~0.8 adds per clock, and each add actually involves ~4 instructions, the chip was really running at ~3.2 instructions per clock (IPC).
This post discusses unrolling the loop, but that's something a compiler can do for us anyway. What it focuses on is the real bottleneck to parallelism: `sum` is both a source and a destination for every add, so each addition has to wait for the previous one to finish. We can break that dependency chain for the compiler by splitting the accumulation into four independent sums:
```c
typedef unsigned int u32;

// Four independent dependency chains; assumes Count is a multiple of 4.
u32 QuadScalar(u32 Count, u32 *Input)
{
    u32 SumA = 0;
    u32 SumB = 0;
    u32 SumC = 0;
    u32 SumD = 0;
    for(u32 Index = 0; Index < Count; Index += 4)
    {
        SumA += Input[Index + 0];
        SumB += Input[Index + 1];
        SumC += Input[Index + 2];
        SumD += Input[Index + 3];
    }

    u32 Sum = SumA + SumB + SumC + SumD;
    return Sum;
}
```
This about doubles the throughput, to ~1.8 adds per clock. Zen 5 chips (the current AMD Ryzens) can sustain around 8 IPC, so this could potentially run even faster.