cat /dev/brain/rand > /srv/tumblr

Nov 30 2011

Cache oblivious matrix multiplication with Hilbert space-filling curves

A well known problem in computer science is how to do matrix multiplication. Usually, matrices are stored in memory in either row-major or column-major format, i.e., matrices are stored sequentially in either row order or column order. The conventions vary a little, for example Fortran uses column-major and C uses row-major, but the differences between the two are mostly notational and they both share the same disadvantage: either column or row spatial locality is minimised in one dimension but not both.

This has no impact if the only matrix operations used access the elements in one order only, in fact it’s optimal, but it is a problem for operations that stride across both dimensions simultaneously such as matrix multiplication done in the naïve way to calculate C = AB:

for i in 1:rows(A)
    for j in 1:cols(B)
        for k in 1:cols(A)
            C[i, j] += A[i, k] * B[k, j]

For small matrices this works fine; the three matrices fit into the processor cache so the pipeline can be kept full. The problem is for sufficiently large matrices, each iteration causes a cache miss which slows things down considerably.

The traditional method for addressing this problem is blocking. Instead of calculating the matrix C naïvely as above by iterating through entire rows and columns, you calculate it in blocks. This is the approach taken by BLAS. Unfortunately, the optimal block sizes are very processor dependent as the cache architecture varies quite considerably between processors, so it must be tuned very carefully for each machine to achieve the best performance.

A more radical and interesting idea is to change the storage order of the matrix elements. That is, instead of optimising in one dimension only, the elements are stored to preserve spatial locality in both dimensions. This is done by storing elements along a space filling curve. Indeed, if a fractal space filling curve is chosen then matrix multiplication becomes cache oblivious, that is it uses the cache well at all levels.

In particular, the Hilbert fractal space filling curve has been proposed recently as a matrix storage order, in particular as a sparse matrix storage order (Hasse et al. 2007). I therefore thought I’d have a quick play with it and hacked up the following primitive benchmark.

The implementation stores matrices in a dense array indexed using the Hilbert curve. Upon matrix creation, I calculate two arrays for iterating through a matrix by rows or columns (these are the Δi and Δj arrays in the matrix structure). These arrays contain an offset to the next element, so iterating through the elements is as simple as adding the offset to the current pointer. The functions for converting ordinary indices to Hilbert index were copied from the Wikipedia article.

Benchmarking was done by counting elapsed CPU cycles with increasing matrix size. All benchmarking was done on an Intel Atom machine running the Plan 9 operating system. I timed matrix creation (for the Hilbert storage order) as the time it takes to calculate the offset arrays is important.

Here’s the results. The x-axis is the matrix size and the y-axis the clock cycles. The red plot shows the cycles required to create the Hilbert matrix array (and delta arrays), the blue shows the naïve matrix product, in green is the cycles for Hilbert ordering, and finally in purple is the combined init + matrix product for Hilbert ordering. Clearly the Hilbert ordering is much better than the naïve method, though the scale chosen makes it look particularly bad. Here is is with a log axis:

As expected, the naïve method is faster for small matrices that fit nicely in the cache, but the Hilbert ordering is much better for larger matrices where it doesn’t. All in all it’s a pretty nice speed up (4.7×) for pretty minimal coding effort and zero tuning. Bear in mind also that my code is just hacked together with no optimisation.

So I quite like it, it was easy and has appreciable benefit. However, the conversion between indexing systems is a bit heavy. In some ways, Morton ordering is more appealing because the translation is much faster, requiring only bit interleaving by simple bit operations (Wise et al. 2001) replacing the delta arrays I used to traverse by row/col. Also, addressing a Morton ordered matrix as a quadtree is actually quite easy (Wise 2000) so implementing smarter matrix multiplication algorithms such as Strassen’s algorithm is simpler.

There is also an interesting article (Bader 2006) that uses a Pino curve instead of either Morton or Hilbert. They propose an algorithm that does matrix multiplication without any jumps at all. I haven’t had the time to read the article in more detail.

Bugs

The dense storage is wasteful. For square matrices of power 2 it is optimal, for anything otherwise it wastes space. A lot of space if it’s a thin matrix.

References

Sep 20 2011

Installing cyanogenmod on HTC Desire

To install inferno on android you need a rooted phone. Since the inferno guys use cyanogenmod I thought I’d load that up and go from there. In any case, I was running Frozen Yoghurt and not Ginger Bread so my phone was out on date anyway.

I found some overly complicated instructions on the cyanogenmod wiki that provided a starting point. The process is much simpler if you load up the zip files and then run Revolutionary to install and launch ClockworkMod. Of course all this voids the warranty. Here’s what I did:

  1. Downloaded the cyanogenmod image and the Google apps and place them at the root of the SD card.
  2. Ran Revolutionary to set S-OFF and install ClockworkMod. No need to note down your serial number first, Revolutionary tells it to you.
  3. Revolutionary left the phone in the bootloader, so I launched ClockworkMod recovery directly.
  4. Wiped the data and installed both zip files from the sd card

That was it, my phone is now running Cyanogenmod. First impressions are pretty good, it feels snappier despite all the useless animations (which can be disabled). There’s also a nice guide outlining all the features.

Sep 19 2011

Inferno for android

It’s been a while since my last blog post, time to get back into it with something small. Over the weekend there was a lot of traffic on the 9fans mailing list about a port of Inferno to android. The port replaces the whole Java machinery instead of running on top of it. The video demonstration looks quite snappy, I’m tempted to install it on my HTC desire and have a play.

Jul 01 2011

Controlling GPU temperatures

I’ve recently taken possession of two ATI 6990 graphics cards for the purposes of GPU computation. A few days ago, Paris had a “heat wave” where the temperature reached around 37°. As my apartment does not have any airconditioning, I was worried for the life of my GPUs, especially as they’re overclocked.

My solution was to run the fans at 100% and to hack up a quick program to monitor the core temperatures and adjust the clock speeds down accordingly. This is done using proportional feedback control. As changing the clock speeds flushes the pipeline, I introduced hysteresis around the target temperature. Furthermore, current loads are also monitored, and clock speeds are only incremented if the load is above a threshold. This is to prevent turning up the clock speeds while idle and then burning the GPUs as soon as they’re loaded. The clock speed range is also limited.

It seems to work ok, but the code is quite hacky and I haven’t tested it with any other cards. Moreover, the P constant for the controller was not really tuned that carefully. Use at own risk. Amazingly, it’s written for linux not plan 9.

Mar 26 2011

S3Venti port

In an earlier post I discussed using Amazon S3 as a venti block server for backup using the s3venti code of Richard Bilson. This code was written for plan 9 port and not for a native plan 9 installation. Now that I use plan 9 almost exclusively, I wanted to run the s3venti server from plan 9. Thus I ported it:

The patch is for Richard’s code found on sources in contrib/rcbilson/s3venti. I’ve tested it lightly with the vac tools and also venti/copy and it seems to work fine.

2 notes

Mar 20 2011

Adjusting image contrast

I wanted to do some simple image manipulations under Plan 9 for my photographs. I don’t like to edit my images much and prefer to keep them close to what the camera produces. Thus, I don’t really need many operations, only resizing and image contrast enhancement. The former is already available in Plan 9 with the resample program, but I couldn’t find anything to do the latter.

Here’s a program to do image contrast enhancement through histogram stretching:

This is pretty similar to the autolevels command available in photoshop/gimp, with a default of 1% clipping on the shadows and 0.5% on the highlights. Input/output is via stdin/stdout, and the program reads/writes Plan 9 image format so it forms part of a pipeline. Here’s an example invocation:

    jpg -t9 input.jpg | resample -x 1000 | relevel -h 0.01 | topng > output.png

or with djpeg/cjpeg:

    djpeg -scale 1/4 input.jpg | ppm -t9 | relevel | toppm | cjpeg > output.jpg

Mar 13 2011

Plan 9 CPU server on a floppy

One of the nice things about Plan 9 is that it has separate cpu, auth, and file servers. This means it’s possible to have a diskless CPU server for running numerical stuff that connects to a remote auth and file server. This quite a useful case for me, as I can for example quickly setup any machine as a CPU server and use it to run my numerical code off my file server. I’m going to outline the steps needed here to create a boot floppy (yes floppy) for launching a CPU server. In particular, I’ve been looking at doing this using virtualisation via QEMU.

The first step is to create a kernel that is small enough to fit on a floppy. This can be accomplished by removing fossil and venti from the kernel — as we’re creating a pure CPU server there’s no need for these file systems anyway.

    cd /sys/src/9/pc
    cat pccpuf | sed 's,^.*fossil/fossil.*,#&,' | sed 's,^.*venti/venti.*,#&,' >pccpufu

I’m using KVM on multicore machines, so I want to have SMP capabilities. Unfortunately, I had some trouble with the APIC code in the Plan 9 kernel with KVM virtualised processors. To get it to boot with multiple processors I had to disable the LAPIC frequency check (not a good solution) in /sys/src/9/pc/apic.c:

    344c344
    <           if(lapictimer.hz > hz+(hz/10))
    ---
    >           /*if(lapictimer.hz > hz+(hz/10))
    346c346
    <                   lapictimer.hz, hz);
    ---
    >                   lapictimer.hz, hz);*/

Now the kernel can be compiled:

    mk 'CONF=pccpufu'

The next step is to create a plan9.ini boot file specifying which kernel to load and what fileserver and authservers to use. This is fairly straight forward:

    ramfs
    cat > /tmp/plan9.ini << EOF
    nvr=fd!0!plan9.nvr
    bootfile=fd0!dos!9pccpufu.gz
    bootargs=tcp
    nobootprompt=tcp
    mouseport=ps2
    sysname=rubis
    fs=192.168.0.2
    auth=192.168.0.2
    EOF

The last three lines should be changed depending on your network configuration. If fs and auth are not specified it will be prompted for during boot. Now the boot floppy can be created, mounted, and the kernel copied across:

    pc/bootfloppy /tmp/cpusrv.img /tmp/plan9.ini
    dossrv
    mount -c /srv/dos /n/a: /tmp/cpusrv.img
    gzip < /sys/src/9/pc/9pccpufu > /n/a:/9pccpufu.gz
    unmount /n/a:

That’s it, /tmp/cpusrv.img should be a boot floppy. Launching it using KVM should be as simple as

    qemu -no-kvm-irqchip -fda cpusrv.img -net nic -net user -redir tcp:17010::17010

Of couse you may wish to use more than one cpu (-smp) and more ram (-m).

Bugs

  • The hack I used to get it to boot under QEMU with SMP is obviously not a good solution. The problem may well lie with QEMU not with Plan 9 and I need to investigate more thoroughly.
  • Probably related, but it does not boot under QEMU with more than 8 CPUs.

References

Mar 06 2011

Dealing bridge hands

Last night I wrote a little program to deal random bridge hands. It does a Knuth shuffle using /dev/random as the RNG, sorts each player’s hand, and then prints them out. It can either pretty print the hands or output in PBN. It’s not terribly fast due to /dev/random, which is limited to a few hundred bits a second, but consequently it should generate pretty good deals.

It’s written for plan 9 but it should compile easily under plan 9 port using 9c and 9l.

1 note

Feb 26 2011

Café à l’eau froide

Je viens d’écrire un article sur le café à l’eau froide, une nouvelle technique pour préparer le café qui a des caractéristiques intéressantes. L’article explique ce que j’ai fait pour faire le café en utilisant l’aeropress et présente d’ailleurs mes réfections sur la méthode en général.

2 notes

Feb 17 2011

Hosting venti arenas on Amazon S3

I’ve been using Amazon’s S3 storage service for some critical backups. Namely these are things I think are worth having an off-site backup for in case of fire/theft/stupidity, such as my collection of photos (I don’t backup scans as they are too big, and I can scan them again probably). I started off using s3fuse for accessing s3 buckets, and this worked ok. Sadly, it does have some disadvantages such as leaving the management of incrementals etc. completely up to you. There are other bits of software such as duplicity that bundle things into tar filed, do incremental backups, and upload to s3 buckets, but this style of backup is far too painful and error prone (think Bacula). Also I had some weird problems with rsync.

What I really wanted was a venti server backed by s3, and lo and behold somebody else already had the idea. Thanks to this marvellous work we can simply setup the s3 backed arenas and start vac(1)ing.

After fetching the s3venti from sources I found it didn’t compile on my p9p installation. Luckily the patch was trivial:

ventidat.h:
134a135
>   VtMaxLumpSize = 65536,

After that everything compiled and seems to be working mostly perfectly. I did discover however that EU hosted buckets don’t work. It’s not too critical but if I have time I’ll dig into it a bit further and write a patch.

25 notes

Page 1 of 2