How big a matrix can we fit?

Many big scientific computing problems can be solved using matrix mathematics. One of my favorite problems to tackle is implementing vector times matrix = vector, which has utility in many places, one of which is inference in neural networks.
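
To make the problem concrete, here is the plain scalar version of the operation in C. This is just a reference sketch; the function name and the row-major layout are my choices, not from any particular library.

```c
#include <stddef.h>

/* Reference: y (1 x n) = x (1 x m) * A (m x n), with A stored row-major. */
void vec_times_mat(const float *x, const float *A, float *y,
                   size_t m, size_t n)
{
    for (size_t j = 0; j < n; j++)
        y[j] = 0.0f;
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            y[j] += x[i] * A[i * n + j];
}
```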

Since an ARM processor has some support for vector and matrix mathematics, I started wondering what size limitations we would hit when trying to solve big problems. Here are some notes:


The basic ARMv8 architecture has 32 vector-capable registers. Each of these registers can be divided into lanes of different sizes, from a single byte up to the full 128 bits. A common word size for neural network applications is 32 bits, so we are going to work with that. Thus, a single vector register can hold 4 words of 32 bits each. If we use one vector for input and one for output, we would expect to work with a square 4x4 matrix, whose four rows fill four more vector registers. This would use up 6 vectors. That means we can set up 5 matrices in registers and take up 30 out of the 32 vector registers.
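
As a concrete sketch of that 6-register building block, here it is with NEON intrinsics; the name vec_times_mat4 and the row-major layout are my own assumptions. The four matrix rows occupy four vector registers, and the input and output take one each.

```c
#include <arm_neon.h>

/* One building block: out (1x4) = in (1x4) * M (4x4), M row-major.
 * The four rows of M occupy 4 vector registers; in and out take 1 each. */
static inline float32x4_t vec_times_mat4(float32x4_t in, const float *M)
{
    float32x4_t row0 = vld1q_f32(M + 0);
    float32x4_t row1 = vld1q_f32(M + 4);
    float32x4_t row2 = vld1q_f32(M + 8);
    float32x4_t row3 = vld1q_f32(M + 12);

    /* out[j] = in[0]*M[0][j] + in[1]*M[1][j] + in[2]*M[2][j] + in[3]*M[3][j] */
    float32x4_t out = vmulq_laneq_f32(row0, in, 0);
    out = vfmaq_laneq_f32(out, row1, in, 1);
    out = vfmaq_laneq_f32(out, row2, in, 2);
    out = vfmaq_laneq_f32(out, row3, in, 3);
    return out;
}
```

The row loads are shown for completeness; the point of the register-counting exercise is that in a hot loop the compiler keeps those four rows resident, so only the i/o vectors move.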

We could be slightly more efficient if we do double duty with the input/output vectors. If we only ever had a single input vector and a single output vector, we could have 7 matrices of size 4x4, 2 i/o vectors, and 2 vectors left over for other operations.
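
In that scheme the output of one block becomes the input of the next, so the whole chain of 7 resident matrices needs only the two i/o registers. A sketch, reusing the hypothetical vec_times_mat4 from above (input, output, and m0 through m6 are placeholder float pointers):

```c
/* Chain the 7 resident 4x4 matrices m0..m6 through a single register. */
float32x4_t v = vld1q_f32(input);   /* the one input vector */
v = vec_times_mat4(v, m0);          /* each output becomes the next input */
v = vec_times_mat4(v, m1);
/* ... and so on through ... */
v = vec_times_mat4(v, m6);
vst1q_f32(output, v);               /* the one output vector */
```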

However, we know that ARM processors can have 2 parallel execution pipelines, and we want to keep both of them going at the same time. This means an i/o overhead of 4 vectors (2 inputs, 2 outputs). That could still work with 7 matrices, using all 32 registers, assuming we stored the output after each iteration.
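
To actually keep both pipelines busy, the two multiplies must form independent dependency chains, roughly like this (same placeholder names as before):

```c
/* Two independent chains, one per execution pipeline. */
float32x4_t a = vld1q_f32(input_a);
float32x4_t b = vld1q_f32(input_b);
a = vec_times_mat4(a, m0);   /* chain 1 */
b = vec_times_mat4(b, m1);   /* chain 2: no dependency on chain 1,
                                so the two can issue in parallel */
vst1q_f32(output_a, a);
vst1q_f32(output_b, b);
```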

However, storing the output vectors requires pipeline time as well. If we make the bold assumption that storing a vector and performing the operation take about the same amount of time, we would want at least 2 extra output vectors so that a second operation can proceed while the first results are being stored. That makes the i/o overhead 6 vectors, which leaves room for only 6 matrices, with 2 vector registers left over for management or calculation.

It also means that we have to block once we are done with a given input register. Again, we would like to pipeline, so we can load one set of input registers while we calculate with a second, and then swap. This gives an overhead of 8 i/o registers (2 active input, 2 active output, 2 loading input, 2 storing output) at any given time. That leaves 24 registers: still 6 matrices of size 4x4, but now with no registers to spare.
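
Put together, a double-buffered loop would look something like this sketch. The names and flat array layout are mine, and a real kernel would count on the out-of-order hardware (or careful instruction scheduling) to overlap the loads and stores with the arithmetic:

```c
/* Double buffering: prefetch the next pair of inputs while computing on the
 * current pair. Assumes `in` holds n+1 input pairs (8 floats each) so the
 * final prefetch stays in bounds, and m0/m1 are the resident matrices. */
void run_blocks(const float *in, float *out, int n,
                const float *m0, const float *m1)
{
    float32x4_t cur_a = vld1q_f32(in);
    float32x4_t cur_b = vld1q_f32(in + 4);
    for (int i = 0; i < n; i++) {
        /* the 2 "loading" registers fill while we compute */
        float32x4_t next_a = vld1q_f32(in + 8 * (i + 1));
        float32x4_t next_b = vld1q_f32(in + 8 * (i + 1) + 4);

        /* the 2 "active" inputs feed both pipelines */
        float32x4_t out_a = vec_times_mat4(cur_a, m0);
        float32x4_t out_b = vec_times_mat4(cur_b, m1);

        /* the 2 "storing" registers drain to memory, then we swap */
        vst1q_f32(out + 8 * i, out_a);
        vst1q_f32(out + 8 * i + 4, out_b);
        cur_a = next_a;
        cur_b = next_b;
    }
}
```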

This is our building block. If we can decompose a large matrix multiplication problem into a series of smaller ones, our problem will then likely be limited by how many operations we can perform on a single CPU without having to swap out the subset of matrices resident on that CPU.
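
The decomposition itself is straightforward: walk the big matrix in 4x4 tiles, multiply the matching 4-element slice of the input vector into each tile, and accumulate into the matching slice of the output. A sketch showing only the block structure, ignoring register residency and assuming both dimensions are multiples of 4:

```c
/* Blocked version: y (1 x n) = x (1 x m) * A (m x n), A row-major,
 * with m and n assumed to be multiples of 4. */
void vec_times_mat_blocked(const float *x, const float *A, float *y,
                           int m, int n)
{
    for (int j = 0; j < n; j += 4)
        vst1q_f32(y + j, vdupq_n_f32(0.0f));      /* clear output slice */

    for (int i = 0; i < m; i += 4) {              /* walk 4x4 tiles */
        float32x4_t xs = vld1q_f32(x + i);        /* input slice for this tile row */
        for (int j = 0; j < n; j += 4) {
            float32x4_t acc = vld1q_f32(y + j);
            /* accumulate xs times the 4x4 tile at (i, j) */
            acc = vfmaq_laneq_f32(acc, vld1q_f32(A + (i + 0) * n + j), xs, 0);
            acc = vfmaq_laneq_f32(acc, vld1q_f32(A + (i + 1) * n + j), xs, 1);
            acc = vfmaq_laneq_f32(acc, vld1q_f32(A + (i + 2) * n + j), xs, 2);
            acc = vfmaq_laneq_f32(acc, vld1q_f32(A + (i + 3) * n + j), xs, 3);
            vst1q_f32(y + j, acc);
        }
    }
}
```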

Currently, Ampere machines are at 192 cores, hovering just under the 200-core mark, so the numbers work out nicely. If each core can hold 6 matrices, and we can build a single large matrix out of these building blocks, what size matrix do we get? We want a square number of blocks no greater than 192 × 6 = 1152, and 32 × 32 = 1024 fits.
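
Checking that arithmetic: the largest square that actually fits is 33 × 33 = 1089 blocks; 32 × 32 = 1024 is the nearest power of two below the limit. And since each block is 4 words on a side, a 32 × 32 grid of blocks corresponds to a 128 × 128 matrix of 32-bit words.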

Each of the little squares is a 32-bit value. Each 4x4 square with alternating green and red is a matrix. There are 32 of those squares per row and column. Thus each row represents 5 CPUs plus 1/3 of a 6th one; 5 rows of this matrix represent 27 CPUs (160 blocks at 6 per CPU, rounded up).

Check my math. I don’t trust myself on this, but I think it is right.

A follow-on question is how to most efficiently move the data around. But that is a tale for another day.
