Let's start with the basics: we can multiply the cells of two vectors in C code and disassemble the resulting binary. This is the trivial operation of multiplying each cell in the first vector by the corresponding cell in the second vector. It is not a cross product. This will be a naive implementation, but it should get us started.
Here is the C code:
int main(){
    int M1[] = {1,2,3};
    int M2[] = {4,5,6};
    int M3[] = {0, 0, 0};
    for (int i=0; i< 3; i++){
        M3[i] = M1[i] * M2[i];
    }
    return 0;
}
To compile, I take advantage of the built-in implicit rules of Make, which already know how to turn math.c into math.o:
make math.o
To disassemble the binary, we can use objdump. The -d switch gives us the simple version that maps to the code: it disassembles only the sections expected to contain instructions. The -D (capital) switch disassembles all sections, data included. Let's look at the vector definitions from the complete disassembly:
Disassembly of section .rodata:
0000000000000000 <.rodata>:
0: 00000001 udf #1
4: 00000002 udf #2
8: 00000003 udf #3
c: 00000000 udf #0
10: 00000004 udf #4
14: 00000005 udf #5
18: 00000006 udf #6
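Since these values are never modified, they are put into a read-only section, unlike our result vector M3, which does get written. Thus we only see the M1 and M2 vectors; the zero word at offset c sits between them and is presumably alignment padding. The udf mnemonic means "permanently undefined". It is here because -D disassembles every section, so objdump tries to decode each 32-bit data word as an instruction. A word whose upper 16 bits are zero happens to match the udf encoding, and the low 16 bits are printed as its immediate.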
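The udf lines can be reproduced with a few lines of C. This is a minimal sketch, assuming the A64 encoding of udf (upper 16 bits zero, low 16 bits the immediate); the helper names decode_udf and dump_words are made up for illustration and are not part of objdump.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Returns 1 if the 32-bit word matches the A64 UDF encoding
 * (upper 16 bits all zero) and stores the 16-bit immediate in *imm.
 * Hypothetical helper, for illustration only. */
int decode_udf(uint32_t word, unsigned *imm)
{
    if ((word & 0xFFFF0000u) != 0)
        return 0;
    *imm = word & 0xFFFFu;
    return 1;
}

/* Print data words roughly the way a disassembler shows them
 * when it treats data as code. Offsets advance by 4 bytes per word. */
void dump_words(const uint32_t *words, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        unsigned imm;
        if (decode_udf(words[i], &imm))
            printf("%4zx: %08x udf #%u\n", i * 4, (unsigned)words[i], imm);
        else
            printf("%4zx: %08x (some other encoding)\n", i * 4, (unsigned)words[i]);
    }
}
```

Calling dump_words on the seven words 1, 2, 3, 0, 4, 5, 6 should give lines of the same shape as the .rodata listing, while a real instruction word such as 0x1b007c22 is rejected by decode_udf.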
Here is the disassembly of the main function:
Disassembly of section .text:
0000000000000000 <main>:
0: d100c3ff sub sp, sp, #0x30
4: 90000000 adrp x0, 0 <main>
8: 91000001 add x1, x0, #0x0
c: 910083e0 add x0, sp, #0x20
10: f9400022 ldr x2, [x1]
14: f9000002 str x2, [x0]
18: b9400821 ldr w1, [x1, #8]
1c: b9000801 str w1, [x0, #8]
20: 90000000 adrp x0, 0 <main>
24: 91000001 add x1, x0, #0x0
28: 910043e0 add x0, sp, #0x10
2c: f9400022 ldr x2, [x1]
30: f9000002 str x2, [x0]
34: b9400821 ldr w1, [x1, #8]
38: b9000801 str w1, [x0, #8]
3c: b90003ff str wzr, [sp]
40: b90007ff str wzr, [sp, #4]
44: b9000bff str wzr, [sp, #8]
48: b9002fff str wzr, [sp, #44]
4c: 14000011 b 90 <main+0x90>
50: b9802fe0 ldrsw x0, [sp, #44]
54: d37ef400 lsl x0, x0, #2
58: 910083e1 add x1, sp, #0x20
5c: b8606821 ldr w1, [x1, x0]
60: b9802fe0 ldrsw x0, [sp, #44]
64: d37ef400 lsl x0, x0, #2
68: 910043e2 add x2, sp, #0x10
6c: b8606840 ldr w0, [x2, x0]
70: 1b007c22 mul w2, w1, w0
74: b9802fe0 ldrsw x0, [sp, #44]
78: d37ef400 lsl x0, x0, #2
7c: 910003e1 mov x1, sp
80: b8206822 str w2, [x1, x0]
84: b9402fe0 ldr w0, [sp, #44]
88: 11000400 add w0, w0, #0x1
8c: b9002fe0 str w0, [sp, #44]
90: b9402fe0 ldr w0, [sp, #44]
94: 7100081f cmp w0, #0x2
98: 54fffdcd b.le 50 <main+0x50>
9c: 52800000 mov w0, #0x0 // #0
a0: 9100c3ff add sp, sp, #0x30
a4: d65f03c0 ret
Let's start in the middle. The one instruction that we can key in on is the multiplication, and we see that at offset 0x70:
1b007c22 mul w2, w1, w0
Since this is ARM64 assembly, we need to read this as:
operation target input1 input2
or in C style code:
w2 = w1 * w0
The w registers are the 32-bit views of the general purpose registers. The x registers are the full 64-bit versions. Changing the code from int to long should show the difference:
9b007c22 mul x2, x1, x0
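The difference between the two forms can be modelled in C. This is a rough sketch, not compiler output: the function names mul_w and mul_x are invented here, and unsigned types are used so that the wrap-around is well defined in C (the low bits of the product are the same for signed operands).

```c
#include <stdint.h>

/* Models `mul w2, w1, w0`: 32-bit operands, result truncated to 32 bits. */
uint32_t mul_w(uint32_t w1, uint32_t w0)
{
    return w1 * w0;
}

/* Models `mul x2, x1, x0`: 64-bit operands, result truncated to 64 bits. */
uint64_t mul_x(uint64_t x1, uint64_t x0)
{
    return x1 * x0;
}
```

For small inputs like ours the two agree. They diverge once the product no longer fits in 32 bits: 0x10000 * 0x10000 wraps to 0 in the w form but is 0x100000000 in the x form.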
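Note the corresponding change in .rodata. Each long now takes 8 bytes, but the disassembler still decodes the data as 32-bit instruction words, so every value shows up as two lines: the low word holding the value, followed by a zero high word. It does not attempt to interpret the values as 64-bit integers: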
Disassembly of section .rodata:
0000000000000000 <.rodata>:
0: 00000001 udf #1
4: 00000000 udf #0
8: 00000002 udf #2
c: 00000000 udf #0
10: 00000003 udf #3
14: 00000000 udf #0
18: 00000004 udf #4
1c: 00000000 udf #0
20: 00000005 udf #5
24: 00000000 udf #0
28: 00000006 udf #6
2c: 00000000 udf #0
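The pairs of lines in that listing are the two halves of each 64-bit value. A small sketch of the split, assuming a little-endian target where long is 64 bits (which is what the listing suggests); low_word and high_word are illustrative names only.

```c
#include <stdint.h>

/* Low 32 bits of a 64-bit value. On a little-endian target this
 * word sits at the lower address, so the disassembly shows it first. */
uint32_t low_word(uint64_t value)
{
    return (uint32_t)(value & 0xFFFFFFFFu);
}

/* High 32 bits of a 64-bit value. Shown second in the disassembly,
 * and zero for any value that fits in 32 bits. */
uint32_t high_word(uint64_t value)
{
    return (uint32_t)(value >> 32);
}
```

For the value 4, low_word gives 4 and high_word gives 0, matching the lines at offsets 18 and 1c.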
Where does that x2 or w2 value get stored? At offset 0x80 (shown here from the int version):
80: b8206822 str w2, [x1, x0]
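To figure out what is meant by [x1, x0], let's first figure out what is in those registers, which is kind of a mess. The bracketed pair is a base register plus an offset register, so the store goes to the address x1 + x0. Reading backwards from the store, x1 was just set to sp, where the result vector lives, and x0 is i shifted left by 2, that is i * 4, the byte offset of element i in an int array. An ARM64 CPU has 31 general purpose registers (x0 through x30), but our code only uses 3 of them. What if we optimize? Well, before we can really do that, let's isolate the code in its own function. Here it is as a function, which we can compile in its own file and from which we can get an isolated disassembly.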
int vecmult(long * M1, long * M2, long len, long * M3){
    for (long i=0; i < len; i++){
        M3[i] = M1[i] * M2[i];
    }
    return 0;
}
Let's build it unoptimized first, which gives this disassembly:
Disassembly of section .text:
0000000000000000 <vecmult>:
0: d100c3ff sub sp, sp, #0x30
4: f9000fe0 str x0, [sp, #24]
8: f9000be1 str x1, [sp, #16]
c: f90007e2 str x2, [sp, #8]
10: f90003e3 str x3, [sp]
14: f90017ff str xzr, [sp, #40]
18: 14000014 b 68 <vecmult+0x68>
1c: f94017e0 ldr x0, [sp, #40]
20: d37df000 lsl x0, x0, #3
24: f9400fe1 ldr x1, [sp, #24]
28: 8b000020 add x0, x1, x0
2c: f9400002 ldr x2, [x0]
30: f94017e0 ldr x0, [sp, #40]
34: d37df000 lsl x0, x0, #3
38: f9400be1 ldr x1, [sp, #16]
3c: 8b000020 add x0, x1, x0
40: f9400001 ldr x1, [x0]
44: f94017e0 ldr x0, [sp, #40]
48: d37df000 lsl x0, x0, #3
4c: f94003e3 ldr x3, [sp]
50: 8b000060 add x0, x3, x0
54: 9b017c41 mul x1, x2, x1
58: f9000001 str x1, [x0]
5c: f94017e0 ldr x0, [sp, #40]
60: 91000400 add x0, x0, #0x1
64: f90017e0 str x0, [sp, #40]
68: f94017e1 ldr x1, [sp, #40]
6c: f94007e0 ldr x0, [sp, #8]
70: eb00003f cmp x1, x0
74: 54fffd4b b.lt 1c <vecmult+0x1c> // b.tstop
78: 52800000 mov w0, #0x0 // #0
7c: 9100c3ff add sp, sp, #0x30
80: d65f03c0 ret
The unoptimized build still only uses 4 registers, x0 through x3. Here is an optimized version:
Disassembly of section .text:
0000000000000000 <vecmult>:
0: f100005f cmp x2, #0x0
4: 5400014d b.le 2c <vecmult+0x2c>
8: d2800004 mov x4, #0x0 // #0
c: d503201f nop
10: f8647805 ldr x5, [x0, x4, lsl #3]
14: f8647826 ldr x6, [x1, x4, lsl #3]
18: 9b067ca5 mul x5, x5, x6
1c: f8247865 str x5, [x3, x4, lsl #3]
20: 91000484 add x4, x4, #0x1
24: eb04005f cmp x2, x4
28: 54ffff41 b.ne 10 <vecmult+0x10> // b.any
2c: 52800000 mov w0, #0x0 // #0
30: d65f03c0 ret
That is a little better. We use registers 0 through 6. With this we can map back to our original variables.
- X5 is the temporary variable that holds the product
- Before that, X5 holds the value from one of the two arrays. Based on the fact that it is calculated from X0, and we know that X0 is the first parameter passed in to the function, we can deduce this is M1[i]
- X6 is therefore M2[i]
- Both are indexed by x4, which implies that it is i. This is reinforced by offset 0x20: add x4, x4, #0x1 looks a lot like i = i + 1.
- Offset 0x1c stores X5, the product of the multiplication, into a memory location calculated from x3 and x4. This can be roughly translated as M3[i] = X5.
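The mapping above can be written back out as C, one statement per instruction. This is a hand-made sketch, not something the compiler produced: the function name vecmult_regs is invented, and the variables are named after the registers they stand for. The lsl #3 in the addressing mode multiplies the index by 8, the size of a long on this target, which C pointer indexing does implicitly.

```c
/* Hypothetical C rendering of the optimized loop. Variable names
 * mirror the registers in the disassembly. */
int vecmult_regs(long *x0, long *x1, long x2, long *x3)
{
    if (x2 <= 0)               /* cmp x2, #0x0 ; b.le 2c */
        return 0;
    long x4 = 0;               /* mov x4, #0x0 */
    do {
        long x5 = x0[x4];      /* ldr x5, [x0, x4, lsl #3] */
        long x6 = x1[x4];      /* ldr x6, [x1, x4, lsl #3] */
        x5 = x5 * x6;          /* mul x5, x5, x6 */
        x3[x4] = x5;           /* str x5, [x3, x4, lsl #3] */
        x4 = x4 + 1;           /* add x4, x4, #0x1 */
    } while (x2 != x4);        /* cmp x2, x4 ; b.ne 10 */
    return 0;                  /* mov w0, #0x0 ; ret */
}
```

Note how the compiler turned the for loop into a guard followed by a do/while: the length is checked once up front, and after that the loop only tests whether the counter has reached it.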
Let's add some C code to main in order to display the inputs and outputs. Here's math.c, which holds the main function of the program:
#include <stdio.h>
#include "vecmult.h"

long V1[] = {1,2,3};
long V2[] = {4,5,6};

int main(){
    long V3[] = {0,0,0};
    printf("Hello.\n");
    int r = vecmult(V1, V2, 3, V3);
    for (int i = 0; i < 3; i++){
        /* %ld, not %d: the vectors hold long values */
        printf ("%ld * %ld = %ld\n", V1[i], V2[i], V3[i]);
    }
    return 0;
}
Now we can create an assembly version of our function. Here's the complete vecmult.S file
.global vecmult
vecmult:
cmp x2, #0x0
b.le fini
mov x4, #0x0 // #0
nop
loop: ldr x5, [x0, x4, lsl #3]
ldr x6, [x1, x4, lsl #3]
mul x5, x5, x6
str x5, [x3, x4, lsl #3]
add x4, x4, #0x1
cmp x2, x4
b.ne loop // b.any
fini: mov w0, #0x0 // #0
ret
And we can write a minimal Makefile, just enough to build the whole thing. Here's the Makefile:
math: math.o vecmult.o
Manually clean with rm *.o and then run make, which prints the commands it executes:
make
cc -c -o math.o math.c
cc -c -o vecmult.o vecmult.S
cc math.o vecmult.o -o math
> ./math
Hello.
1 * 4 = 4
2 * 5 = 10
3 * 6 = 18
With this as a basis, we can now investigate how to do these operations using the ARM64 matrix operations, which are a bit more efficient. Coming next....