Vector Multiplication in ARM64 assembly

Let's start with the basics: we can multiply the cells of two vectors in C code and disassemble the resulting binary. This is the trivial operation of multiplying each cell in the first vector by the corresponding cell in the second vector (an element-wise product). It is not a cross product. This will be a naive implementation, but it should get us started.

Here is the C code:


int main(){
	int M1[] = {1,2,3};
	int M2[] = {4,5,6};
	int M3[] = {0, 0, 0};
	for (int i=0; i< 3; i++){
		M3[i] = M1[i] * M2[i];
	}
	return 0;
}

To compile, I take advantage of the built-in implicit rules of Make:

make math.o

To disassemble the binary, we can use objdump. The -d switch gives us the simple version that maps to the code, disassembling only the executable sections. The -D (capital) switch gives the complete disassembly of all sections. Let's look at the vector definitions from the complete disassembly:


Disassembly of section .rodata:

0000000000000000 <.rodata>:
   0:	00000001 	udf	#1
   4:	00000002 	udf	#2
   8:	00000003 	udf	#3
   c:	00000000 	udf	#0
  10:	00000004 	udf	#4
  14:	00000005 	udf	#5
  18:	00000006 	udf	#6

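Since these values are never modified, the initial data for M1 and M2 is put into a read-only section (.rodata) and copied onto the stack at the start of main, unlike our result vector M3, which gets written directly. Thus we only see the M1 and M2 vectors. The udf mnemonic means "permanently undefined". Why is it here? The -D switch makes objdump decode every section as if it held instructions, and a 32-bit word such as 0x00000001 happens to decode as udf #1. These are our data values, not real code. The zero word at offset 0xc appears to be alignment padding between the two arrays.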

Here is the disassembly of the main function:


Disassembly of section .text:

0000000000000000 <main>:
   0:	d100c3ff 	sub	sp, sp, #0x30
   4:	90000000 	adrp	x0, 0 <main>
   8:	91000001 	add	x1, x0, #0x0
   c:	910083e0 	add	x0, sp, #0x20
  10:	f9400022 	ldr	x2, [x1]
  14:	f9000002 	str	x2, [x0]
  18:	b9400821 	ldr	w1, [x1, #8]
  1c:	b9000801 	str	w1, [x0, #8]
  20:	90000000 	adrp	x0, 0 <main>
  24:	91000001 	add	x1, x0, #0x0
  28:	910043e0 	add	x0, sp, #0x10
  2c:	f9400022 	ldr	x2, [x1]
  30:	f9000002 	str	x2, [x0]
  34:	b9400821 	ldr	w1, [x1, #8]
  38:	b9000801 	str	w1, [x0, #8]
  3c:	b90003ff 	str	wzr, [sp]
  40:	b90007ff 	str	wzr, [sp, #4]
  44:	b9000bff 	str	wzr, [sp, #8]
  48:	b9002fff 	str	wzr, [sp, #44]
  4c:	14000011 	b	90 <main+0x90>
  50:	b9802fe0 	ldrsw	x0, [sp, #44]
  54:	d37ef400 	lsl	x0, x0, #2
  58:	910083e1 	add	x1, sp, #0x20
  5c:	b8606821 	ldr	w1, [x1, x0]
  60:	b9802fe0 	ldrsw	x0, [sp, #44]
  64:	d37ef400 	lsl	x0, x0, #2
  68:	910043e2 	add	x2, sp, #0x10
  6c:	b8606840 	ldr	w0, [x2, x0]
  70:	1b007c22 	mul	w2, w1, w0
  74:	b9802fe0 	ldrsw	x0, [sp, #44]
  78:	d37ef400 	lsl	x0, x0, #2
  7c:	910003e1 	mov	x1, sp
  80:	b8206822 	str	w2, [x1, x0]
  84:	b9402fe0 	ldr	w0, [sp, #44]
  88:	11000400 	add	w0, w0, #0x1
  8c:	b9002fe0 	str	w0, [sp, #44]
  90:	b9402fe0 	ldr	w0, [sp, #44]
  94:	7100081f 	cmp	w0, #0x2
  98:	54fffdcd 	b.le	50 <main+0x50>
  9c:	52800000 	mov	w0, #0x0                   	// #0
  a0:	9100c3ff 	add	sp, sp, #0x30
  a4:	d65f03c0 	ret

Let's start in the middle. The one instruction that we can key in on is the multiplication, and we see that at offset 0x70:

1b007c22 	mul	w2, w1, w0

Since this is ARM64 assembly, we need to read this as:

operation target input1 input2

or in C style code:

w2 = w1 * w0

The w prefix selects the 32-bit view of a register. With an x prefix, the same register is accessed as a full 64-bit value. Changing the code from int to long should show the difference:

9b007c22 	mul	x2, x1, x0
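
To make the width difference concrete, here is a minimal sketch in C. The helper names mul32 and mul64 are hypothetical, and the sketch uses unsigned fixed-width types rather than the int and long from the post so that the 32-bit wraparound is well defined.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* 32-bit multiply: only the low 32 bits of the product are kept,
 * comparable to mul w2, w1, w0. */
static uint32_t mul32(uint32_t a, uint32_t b) {
    return a * b;
}

/* 64-bit multiply: the low 64 bits of the product are kept,
 * comparable to mul x2, x1, x0. */
static uint64_t mul64(uint64_t a, uint64_t b) {
    return a * b;
}

int main(void) {
    /* 100000 * 100000 does not fit in 32 bits, so the two results differ. */
    printf("32-bit: %" PRIu32 "\n", mul32(100000u, 100000u));
    printf("64-bit: %" PRIu64 "\n", mul64(100000u, 100000u));
    return 0;
}

The 32-bit version should print 1410065408 (the product modulo 2^32), while the 64-bit version prints the full 10000000000.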

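Note the corresponding change in .rodata. Each value now occupies 8 bytes, so every value is followed by a zero word that holds its upper 32 bits. This is not really padding: objdump simply decodes each 32-bit word separately and does not attempt to interpret the values: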


Disassembly of section .rodata:

0000000000000000 <.rodata>:
   0:	00000001 	udf	#1
   4:	00000000 	udf	#0
   8:	00000002 	udf	#2
   c:	00000000 	udf	#0
  10:	00000003 	udf	#3
  14:	00000000 	udf	#0
  18:	00000004 	udf	#4
  1c:	00000000 	udf	#0
  20:	00000005 	udf	#5
  24:	00000000 	udf	#0
  28:	00000006 	udf	#6
  2c:	00000000 	udf	#0
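
The pairs of words in that dump can be reproduced with a small sketch. It assumes a little-endian machine (the usual configuration for ARM64 Linux) and uses int64_t in place of long to pin the size at 8 bytes; split_words is a hypothetical helper name.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Copy one 64-bit value into two 32-bit words, in memory order. */
static void split_words(int64_t value, uint32_t words[2]) {
    memcpy(words, &value, sizeof value);
}

int main(void) {
    const int64_t v[] = {1, 2, 3};
    for (size_t i = 0; i < 3; i++) {
        uint32_t w[2];
        split_words(v[i], w);
        /* On little-endian, w[0] is the low half and w[1] the high half. */
        printf("%08" PRIx32 " %08" PRIx32 "\n", w[0], w[1]);
    }
    return 0;
}

On a little-endian machine each line shows the value followed by 00000000, the same pattern objdump prints as udf #1 followed by udf #0.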

Where does that x2 or w2 value get stored? At offset 0x80:

 80:	b8206822 	str	w2, [x1, x0]

To figure out what is meant by [x1, x0], let's first figure out what is in those registers... which is kind of a mess. An ARM64 CPU has 31 general-purpose registers (x0 through x30), but our code only uses 3 of them. What if we optimize? Well, before we can really do that, let's isolate the code in its own function. Here it is as a function, which we can compile in its own file and from which we can get an isolated disassembly.

int vecmult(long * M1, long * M2, long len, long * M3){
	for (long i=0; i < len; i++){
		M3[i] = M1[i] * M2[i];
	}
	return 0;
}

Let's build it unoptimized first, which gives this disassembly:


Disassembly of section .text:

0000000000000000 <vecmult>:
   0:	d100c3ff 	sub	sp, sp, #0x30
   4:	f9000fe0 	str	x0, [sp, #24]
   8:	f9000be1 	str	x1, [sp, #16]
   c:	f90007e2 	str	x2, [sp, #8]
  10:	f90003e3 	str	x3, [sp]
  14:	f90017ff 	str	xzr, [sp, #40]
  18:	14000014 	b	68 <vecmult+0x68>
  1c:	f94017e0 	ldr	x0, [sp, #40]
  20:	d37df000 	lsl	x0, x0, #3
  24:	f9400fe1 	ldr	x1, [sp, #24]
  28:	8b000020 	add	x0, x1, x0
  2c:	f9400002 	ldr	x2, [x0]
  30:	f94017e0 	ldr	x0, [sp, #40]
  34:	d37df000 	lsl	x0, x0, #3
  38:	f9400be1 	ldr	x1, [sp, #16]
  3c:	8b000020 	add	x0, x1, x0
  40:	f9400001 	ldr	x1, [x0]
  44:	f94017e0 	ldr	x0, [sp, #40]
  48:	d37df000 	lsl	x0, x0, #3
  4c:	f94003e3 	ldr	x3, [sp]
  50:	8b000060 	add	x0, x3, x0
  54:	9b017c41 	mul	x1, x2, x1
  58:	f9000001 	str	x1, [x0]
  5c:	f94017e0 	ldr	x0, [sp, #40]
  60:	91000400 	add	x0, x0, #0x1
  64:	f90017e0 	str	x0, [sp, #40]
  68:	f94017e1 	ldr	x1, [sp, #40]
  6c:	f94007e0 	ldr	x0, [sp, #8]
  70:	eb00003f 	cmp	x1, x0
  74:	54fffd4b 	b.lt	1c <vecmult+0x1c>  // b.tstop
  78:	52800000 	mov	w0, #0x0                   	// #0
  7c:	9100c3ff 	add	sp, sp, #0x30
  80:	d65f03c0 	ret

This still only uses 4 registers. Here is an optimized version:


Disassembly of section .text:

0000000000000000 <vecmult>:
   0:	f100005f 	cmp	x2, #0x0
   4:	5400014d 	b.le	2c <vecmult+0x2c>
   8:	d2800004 	mov	x4, #0x0                   	// #0
   c:	d503201f 	nop
  10:	f8647805 	ldr	x5, [x0, x4, lsl #3]
  14:	f8647826 	ldr	x6, [x1, x4, lsl #3]
  18:	9b067ca5 	mul	x5, x5, x6
  1c:	f8247865 	str	x5, [x3, x4, lsl #3]
  20:	91000484 	add	x4, x4, #0x1
  24:	eb04005f 	cmp	x2, x4
  28:	54ffff41 	b.ne	10 <vecmult+0x10>  // b.any
  2c:	52800000 	mov	w0, #0x0                   	// #0
  30:	d65f03c0 	ret

That is a little better. We use registers 0 through 6. With this we can map back to our original variables.

X5 is the temporary variable that holds the product.
Before that, X5 holds the value from one of the two arrays. Since it is loaded from an address based on X0, and we know that X0 is the first parameter passed in to the function, we can deduce this is M1[i].
X6 is loaded the same way from X1, the second parameter, so it is M2[i].
Both are indexed by X4, which implies that it is i. This is reinforced by offset 0x20: add x4, x4, #0x1 looks a lot like i = i + 1.
Offset 0x1c stores X5, the product of the multiplication, into a memory location calculated from X3 and X4. This can be roughly translated as M3[i] = X5.
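
The mapping above can be sketched as C in which each variable is named after the register it stands for. This is a hand translation for illustration, not compiler output; vecmult_regs is a hypothetical name. The lsl #3 in the address operands scales the index by 8, the size of a long, which C array indexing does implicitly.

#include <stdio.h>

/* Hypothetical C rendering of the optimized loop. Variable names
 * mirror the registers in the disassembly. Assumes long is 64 bits. */
static int vecmult_regs(long *x0, long *x1, long x2, long *x3) {
    if (x2 <= 0)              /* cmp x2, #0 ; b.le 2c */
        return 0;
    long x4 = 0;              /* mov x4, #0 */
    do {
        long x5 = x0[x4];     /* ldr x5, [x0, x4, lsl #3] */
        long x6 = x1[x4];     /* ldr x6, [x1, x4, lsl #3] */
        x5 = x5 * x6;         /* mul x5, x5, x6 */
        x3[x4] = x5;          /* str x5, [x3, x4, lsl #3] */
        x4 = x4 + 1;          /* add x4, x4, #1 */
    } while (x2 != x4);       /* cmp x2, x4 ; b.ne 10 */
    return 0;                 /* mov w0, #0 ; ret */
}

int main(void) {
    long m1[] = {1, 2, 3};
    long m2[] = {4, 5, 6};
    long m3[] = {0, 0, 0};
    vecmult_regs(m1, m2, 3, m3);
    printf("%ld %ld %ld\n", m3[0], m3[1], m3[2]);
    return 0;
}

Note that the optimizer checks len once up front and then loops until the counter equals len, instead of re-testing i < len at the top of every iteration.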

Let's add some C code to main in order to display the inputs and outputs. Here's math.c, which holds the main function of the program:


#include <stdio.h>


#include "vecmult.h"


long V1[] = {1,2,3};
long V2[] = {4,5,6};


int main(){
        long V3[] = {0,0,0};

        printf("Hello.\n");
        int  r = vecmult(V1, V2, 3, V3);


        for (int i = 0; i < 3; i++){
                printf ("%ld   * %ld  = %ld\n", V1[i], V2[i], V3[i]);
        }

        return 0;
}
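
math.c includes vecmult.h, which is not shown here. A minimal sketch, assuming the header only declares the function with the signature used earlier, might look like this (the include guard name is an arbitrary choice):

/* vecmult.h: hypothetical sketch; the original header is not shown. */
#ifndef VECMULT_H
#define VECMULT_H

/* Multiplies M1[i] by M2[i] for each i in [0, len) and stores the
 * product in M3[i]. Returns 0, matching the C definition above. */
int vecmult(long *M1, long *M2, long len, long *M3);

#endif /* VECMULT_H */

The same declaration serves both the C implementation and an assembly one, since the caller only needs to know the argument types and the return type.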

Now we can create an assembly version of our function. Here's the complete vecmult.S file

.global vecmult

vecmult:
 	cmp	x2, #0x0
 	b.le	fini 
 	mov	x4, #0x0                   	// #0
 	nop
loop: 	ldr	x5, [x0, x4, lsl #3]
 	ldr	x6, [x1, x4, lsl #3]
 	mul	x5, x5, x6
 	str	x5, [x3, x4, lsl #3]
 	add	x4, x4, #0x1
 	cmp	x2, x4
 	b.ne	loop  // b.any
fini: 	mov	w0, #0x0                   	// #0
 	ret

And we can write a minimal Makefile, just enough to build the whole thing, relying on Make's implicit rules for the .c and .S files. Here's the Makefile:

math: math.o vecmult.o

Manually clean with rm *.o and then run make:

make 
cc    -c -o math.o math.c
cc    -c -o vecmult.o vecmult.S
cc   math.o vecmult.o   -o math
> ./math 
Hello.
1   * 4  = 4
2   * 5  = 10
3   * 6  = 18

With this as a basis, we can now investigate how to do these operations using the ARM64 matrix operations, which are a bit more efficient. Coming next....

Adam Young's Web Log
