Last post I showed how to do multiplication for a vector of integers using ARM64 instructions. Lots of use cases require these kinds of operations to be performed in bulk. The Neon coprocessor has instructions that allow for the parallel loading and multiplication of numbers. Here’s my simplistic test of these instructions.

Before I start changing the assembly, I modified the C code to set and send 4 values per vector instead of the 3 I had last time.

```diff
diff --git a/math.c b/math.c
index 4b44a56..9b1aa3d 100644
--- a/math.c
+++ b/math.c
@@ -4,18 +4,18 @@
 #include "vecmult.h"

-long V1[] = {1,2,3};
-long V2[] = {4,5,6};
+long V1[] = {1,2,3,4};
+long V2[] = {5,6,7,8};

 int main(){
-    long V3[] = {0,0,0};
+    long V3[] = {0,0,0,0};
     printf("Hello.\n");
-    int r = vecmult(V1, V2, 3, V3);
+    int r = vecmult(V1, V2, 4, V3);
-    for (int i = 0; i < 3; i++){
+    for (int i = 0; i < 4; i++){
         printf ("%d * %d = %d\n", V1[i], V2[i], V3[i]);
     }
```

Here is the new assembly code in its entirety:

```
.global vecmult
vecmult:
    cmp  x2, #0x0              // if the count is <= 0, skip the loop
    b.le fini
    mov  x4, #0x0              // x4 is the loop counter
    nop
loop:
    ldp  q5, q6, [x0]          // load four longs from the first vector into q5/q6
    ldp  q7, q8, [x1]          // load four longs from the second vector into q7/q8
    mul  v5.4s, v5.4s, v7.4s   // lane-wise multiply, in parallel
    mul  v6.4s, v6.4s, v8.4s
    str  q5, [x3]              // store the products to the output vector
    add  x5, x3, #16
    str  q6, [x5]
    add  x4, x4, #0x4          // four elements processed per iteration
    cmp  x2, x4
    b.ne loop
fini:
    mov  w0, #0x0              // return 0
    ret
```

Let me start by pointing out the obvious: `add x4, x4, #0x4` controls the loop. Instead of processing one element each time through the loop, we process four, so the counter advances by four to keep the loop count in step.

The **ldp** instructions each load a pair of 128-bit quadword registers: the first fills q5 and q6 from the first vector, and the second fills q7 and q8 from the second.

There are two **mul** instructions; each one takes a pair of source registers and multiplies their lanes in parallel. Yep, we only get double the performance here.

The registers in the Neon coprocessor are 128 bits long. Thus, for the 64-bit values we have, we can only operate on two at the same time. If we were to use 32-bit values, we would double that (4 at a time), and for 16-bit values we'd double it again, 8 at a time. The question is, for parallelized operations over large numbers of multiplications like these, what size values are you going to need? While my knee-jerk reaction is that 32-bit seems like the sweet spot, it really depends on the workload.

More to follow on this.