Cache oblivious matrix multiplication with Hilbert space-filling curves
A well known problem in computer science is how to do matrix multiplication. Usually, matrices are stored in memory in either row-major or column-major format, i.e., matrices are stored sequentially in either row order or column order. The conventions vary a little, for example Fortran uses column-major and C uses row-major, but the differences between the two are mostly notational and they both share the same disadvantage: either column or row spatial locality is minimised in one dimension but not both.
This has no impact if the only matrix operations used access the elements in one order only, in fact it’s optimal, but it is a problem for operations that stride across both dimensions simultaneously such as matrix multiplication done in the naïve way to calculate C = AB:
for i in 1:rows(A)
for j in 1:cols(B)
for k in 1:cols(A)
C[i, j] += A[i, k] * B[k, j]
For small matrices this works fine; the three matrices fit into the processor cache so the pipeline can be kept full. The problem is for sufficiently large matrices, each iteration causes a cache miss which slows things down considerably.
The traditional method for addressing this problem is blocking. Instead of calculating the matrix C naïvely as above by iterating through entire rows and columns, you calculate it in blocks. This is the approach taken by BLAS. Unfortunately, the optimal block sizes are very processor dependent as the cache architecture varies quite considerably between processors, so it must be tuned very carefully for each machine to achieve the best performance.
A more radical and interesting idea is to change the storage order of the matrix elements. That is, instead of optimising in one dimension only, the elements are stored to preserve spatial locality in both dimensions. This is done by storing elements along a space filling curve. Indeed, if a fractal space filling curve is chosen then matrix multiplication becomes cache oblivious, that is it uses the cache well at all levels.
In particular, the Hilbert fractal space filling curve has been proposed recently as a matrix storage order, in particular as a sparse matrix storage order (Hasse et al. 2007). I therefore thought I’d have a quick play with it and hacked up the following primitive benchmark.
The implementation stores matrices in a dense array indexed using the Hilbert curve. Upon matrix creation, I calculate two arrays for iterating through a matrix by rows or columns (these are the Δi and Δj arrays in the matrix structure). These arrays contain an offset to the next element, so iterating through the elements is as simple as adding the offset to the current pointer. The functions for converting ordinary indices to Hilbert index were copied from the Wikipedia article.
Benchmarking was done by counting elapsed CPU cycles with increasing matrix size. All benchmarking was done on an Intel Atom machine running the Plan 9 operating system. I timed matrix creation (for the Hilbert storage order) as the time it takes to calculate the offset arrays is important.

Here’s the results. The x-axis is the matrix size and the y-axis the clock cycles. The red plot shows the cycles required to create the Hilbert matrix array (and delta arrays), the blue shows the naïve matrix product, in green is the cycles for Hilbert ordering, and finally in purple is the combined init + matrix product for Hilbert ordering. Clearly the Hilbert ordering is much better than the naïve method, though the scale chosen makes it look particularly bad. Here is is with a log axis:

As expected, the naïve method is faster for small matrices that fit nicely in the cache, but the Hilbert ordering is much better for larger matrices where it doesn’t. All in all it’s a pretty nice speed up (4.7×) for pretty minimal coding effort and zero tuning. Bear in mind also that my code is just hacked together with no optimisation.
So I quite like it, it was easy and has appreciable benefit. However, the conversion between indexing systems is a bit heavy. In some ways, Morton ordering is more appealing because the translation is much faster, requiring only bit interleaving by simple bit operations (Wise et al. 2001) replacing the delta arrays I used to traverse by row/col. Also, addressing a Morton ordered matrix as a quadtree is actually quite easy (Wise 2000) so implementing smarter matrix multiplication algorithms such as Strassen’s algorithm is simpler.
There is also an interesting article (Bader 2006) that uses a Pino curve instead of either Morton or Hilbert. They propose an algorithm that does matrix multiplication without any jumps at all. I haven’t had the time to read the article in more detail.
Bugs
The dense storage is wasteful. For square matrices of power 2 it is optimal, for anything otherwise it wastes space. A lot of space if it’s a thin matrix.
References
- Wise et al. 2001: Language support for Morton-order matrices
- Wise 2000: Ahnentafel Indexing into Morton-Ordered Arrays, or Matrix Locality for Free
- Haase et al. 2007: A Hilbert-order multiplication scheme for unstructured sparse matrices
- Bader and Zenger 2006: Cache oblivious matrix multiplication using an element ordering based on a Peano curve