Vector Multiplication Using the Neon Coprocessor Instructions on ARM64

Last post I showed how to do multiplication for a vector of integers using ARM64 instructions. Lots of use cases require these kinds of operations to be performed in bulk. The Neon coprocessor has instructions that allow for the parallel loading and multiplication of numbers. Here’s my simplistic test of these instructions.

Before changing the assembly, I modified the C code to set and send 4 values per vector instead of the 3 I had last time.

diff --git a/math.c b/math.c
index 4b44a56..9b1aa3d 100644
--- a/math.c
+++ b/math.c
@@ -4,18 +4,18 @@
 #include "vecmult.h"
-long V1[] = {1,2,3};
-long V2[] = {4,5,6};
+long V1[] = {1,2,3,4};
+long V2[] = {5,6,7,8};
 int main(){
-       long V3[] = {0,0,0};
+       long V3[] = {0,0,0,0};
-       int  r = vecmult(V1, V2, 3, V3);
+       int  r = vecmult(V1, V2, 4, V3);
-       for (int i = 0; i < 3; i++){
+       for (int i = 0; i < 4; i++){
                printf ("%d   * %d  = %d\n", V1[i], V2[i], V3[i]);

Here is the new assembly code in its entirety:

.global vecmult
vecmult: cmp    x2, #0x0
        b.le    fini
        mov     x4, #0x0                        // counter = 0
loop:   ldp     q5, q6, [x0]                    // load 4 longs from the first vector
        ldp     q7, q8, [x1]                    // load 4 longs from the second vector
        mul     v5.4s, v5.4s, v7.4s
        mul     v6.4s, v6.4s, v8.4s
        str     q5, [x3]                        // store the first 2 results
        add     x5, x3, #16
        str     q6, [x5]                        // store the next 2 results
        add     x4, x4, #0x4                    // 4 elements processed per pass
        cmp     x2, x4
        b.ne    loop                            // note: the pointers are never advanced,
                                                // so this only handles one 4-element block
fini:   mov     w0, #0x0                        // return 0
        ret

Let me start by pointing out the obvious: add x4, x4, #0x4 controls the loop. Instead of processing one element each time through the loop, we now process four, so the counter advances by four to match.

The ldp (load pair) operations each load two 128-bit values: the first fills q5 and q6 from the first vector, the second fills q7 and q8 from the second.

There are two mul operations; each one multiplies a pair of registers, lane by lane, in parallel. Yep, we only get double the performance here.

The registers in the Neon coprocessor are 128 bits long. Thus, for the 64-bit values we have, we can only operate on two at the same time. If we were to use 32-bit values, we could double that (4 at a time), and with 16-bit values we'd double it again (8 at a time). The question is: for parallelized operations on large numbers of multiplications like these, what size values do you need? My knee-jerk reaction is that 32 bits is the sweet spot, but it really depends on the workload.

More to follow on this.
