We're about to enter a fairly technical discussion on how to use OpenGL to do general computing on the GPU from VisualWorks. If you're an expert with OpenGL or CUDA, this is all old hat for you, but you may find it an interesting read anyway. The following discussion takes place between 9:00pm and 10:00pm (for you 24 fans out there!)
We'll be referring back to the following results many times over the course of this discussion, so here they are:

The runs column is there simply so you can judge how much measurement error to allow for in the numbers. The vectors column gives the number of four-dimensional floating point vectors we feed through our mathematical operation. The math in question is this:
result = matrix * vector
The matrix in question is a 4x4 perspective projection matrix.
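Written out, that's one 4x4 matrix-by-vector product per input vector. Here it is as a small C function, assuming the matrix is stored column-major the way OpenGL keeps it; the function name and layout are mine, purely for illustration:

void mat4_mul_vec4(const float m[16], const float v[4], float result[4])
{
    /* result = matrix * vector, with the matrix stored column-major
       (OpenGL order): column c starts at m[4 * c]. */
    for (int row = 0; row < 4; row++) {
        result[row] = m[row]      * v[0]
                    + m[row + 4]  * v[1]
                    + m[row + 8]  * v[2]
                    + m[row + 12] * v[3];
    }
}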
The next three columns describe the three different techniques we use to do our simple math equation. The first is straight Smalltalk code which uses VM primitives to do the computations. We use a FloatArray for each of the three input and output buffers, so the number of objects we generate doing the floating point math in the VM is minimal compared to having millions of Float objects instantiated at the same time. For those of you who do not know - the VisualWorks VM does not tag or unbox floats; it keeps them as plain-jane objects all the time, which can make for an interesting garbage collection mess at times.
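To show the shape of what the VM column is actually timing, here is the loop sketched in C over flat float buffers, since that's the closest analogue to the three FloatArrays; the real version is Smalltalk looping over FloatArrays with the float primitives, not this code:

void transform_vectors(const float m[16], const float *in,
                       float *out, int count)
{
    /* Apply one 4x4 matrix (column-major) to every 4D vector in 'in',
       writing into 'out'. Matrix, input and output mirror the three
       FloatArray buffers used from Smalltalk. */
    for (int i = 0; i < count; i++) {
        const float *v = in  + 4 * i;
        float       *r = out + 4 * i;
        for (int row = 0; row < 4; row++) {
            r[row] = m[row]      * v[0]
                   + m[row + 4]  * v[1]
                   + m[row + 8]  * v[2]
                   + m[row + 12] * v[3];
        }
    }
}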
The next technique is "texture", which is best described at http://www.gpgpu.org, a site that documents it along with many other interesting things people have been doing with GPUs. The technique involves rendering onto a buffer that is attached as a texture; you can then download that texture into main memory as an array of floats. You upload data by creating a texture and putting the floats into it, then sampling it inside a fragment shader to do the math. If you didn't follow all that, don't sweat the details too much - this blog post is more about understanding the results than the fine details of how to do it. For the how-to, read the gpgpu website or look at Lesson05 in OpenGL-Lessons (public store).
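For those who want a little more concreteness, here is a rough sketch of that texture round trip, written against the raw OpenGL C API rather than the VisualWorks bindings that Lesson05 uses. It assumes float textures and framebuffer objects are available (ARB_texture_float, EXT_framebuffer_object), a current GL context with identity modelview/projection matrices, and a program 'prog' already compiled and linked from the fragment shader shown; all names are illustrative and error handling is left out:

#include <GL/glew.h>   /* any loader that exposes the extension entry points */

static const char *fragment_src =
    "#version 110\n"
    "uniform sampler2D input_vectors;   /* one input vec4 per texel   */\n"
    "uniform mat4      projection;      /* the 4x4 perspective matrix */\n"
    "void main() {\n"
    "    vec4 v = texture2D(input_vectors, gl_TexCoord[0].st);\n"
    "    gl_FragColor = projection * v;  /* result = matrix * vector  */\n"
    "}\n";

void texture_gpgpu_pass(GLuint prog, const float *matrix,
                        const float *in, float *out, int w, int h)
{
    GLuint tex[2], fbo;

    /* Upload: the input vectors become a w x h RGBA 32-bit float texture. */
    glGenTextures(2, tex);
    glBindTexture(GL_TEXTURE_2D, tex[0]);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, w, h, 0,
                 GL_RGBA, GL_FLOAT, in);

    /* The output texture, same size, attached to a framebuffer object. */
    glBindTexture(GL_TEXTURE_2D, tex[1]);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, w, h, 0,
                 GL_RGBA, GL_FLOAT, NULL);

    glGenFramebuffersEXT(1, &fbo);
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
    glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                              GL_TEXTURE_2D, tex[1], 0);
    glViewport(0, 0, w, h);

    /* Compute: draw one quad covering the target so every texel gets a
       fragment, and let the fragment shader do the multiply. */
    glUseProgram(prog);
    glUniformMatrix4fv(glGetUniformLocation(prog, "projection"),
                       1, GL_FALSE, matrix);
    glUniform1i(glGetUniformLocation(prog, "input_vectors"), 0);
    glBindTexture(GL_TEXTURE_2D, tex[0]);
    glBegin(GL_QUADS);
    glTexCoord2f(0, 0); glVertex2f(-1, -1);
    glTexCoord2f(1, 0); glVertex2f( 1, -1);
    glTexCoord2f(1, 1); glVertex2f( 1,  1);
    glTexCoord2f(0, 1); glVertex2f(-1,  1);
    glEnd();

    /* Transfer: download the rendered results back into main memory. */
    glReadPixels(0, 0, w, h, GL_RGBA, GL_FLOAT, out);

    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, 0);
    glDeleteFramebuffersEXT(1, &fbo);
    glDeleteTextures(2, tex);
}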
The final technique is the new kid on the block. It's an OpenGL extension, also included in OpenGL 3.0, that lets you capture vertex shader outputs straight into a vertex buffer and skip fragment shading altogether. Instead of doing the computation using textures and the fragment shader, we use array buffers and the vertex shader. The drawback is that very few graphics cards support EXT_transform_feedback yet. The results in this case come from Apple's software renderer, so they run entirely on the CPU, not on the GPU.
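Again for the curious, here is roughly what that looks like against the raw C API, using the EXT_transform_feedback entry points (OpenGL 3.0 has the same calls without the EXT suffix). It assumes a linked program 'prog' whose vertex shader reads an attribute 'v', multiplies it by a 'projection' uniform and writes the result to a varying registered for capture with glTransformFeedbackVaryingsEXT before linking; names are illustrative and error handling is omitted:

#include <GL/glew.h>

void transform_feedback_pass(GLuint prog, const float *matrix,
                             const float *in, float *out, int count)
{
    GLuint buf[2];
    glGenBuffers(2, buf);

    /* The input vectors are an ordinary vertex attribute array. */
    glBindBuffer(GL_ARRAY_BUFFER, buf[0]);
    glBufferData(GL_ARRAY_BUFFER, (GLsizeiptr)count * 4 * sizeof(float),
                 in, GL_STATIC_DRAW);
    GLint loc = glGetAttribLocation(prog, "v");
    glEnableVertexAttribArray(loc);
    glVertexAttribPointer(loc, 4, GL_FLOAT, GL_FALSE, 0, 0);

    /* The output buffer the vertex shader's varying is captured into. */
    glBindBuffer(GL_TRANSFORM_FEEDBACK_BUFFER_EXT, buf[1]);
    glBufferData(GL_TRANSFORM_FEEDBACK_BUFFER_EXT,
                 (GLsizeiptr)count * 4 * sizeof(float), NULL, GL_STREAM_READ);
    glBindBufferBaseEXT(GL_TRANSFORM_FEEDBACK_BUFFER_EXT, 0, buf[1]);

    glUseProgram(prog);
    glUniformMatrix4fv(glGetUniformLocation(prog, "projection"),
                       1, GL_FALSE, matrix);

    /* No fragments are wanted, so switch rasterization off entirely. */
    glEnable(GL_RASTERIZER_DISCARD_EXT);
    glBeginTransformFeedbackEXT(GL_POINTS);
    glDrawArrays(GL_POINTS, 0, count);
    glEndTransformFeedbackEXT();
    glDisable(GL_RASTERIZER_DISCARD_EXT);

    /* Transfer: pull the captured results back into main memory. */
    glGetBufferSubData(GL_TRANSFORM_FEEDBACK_BUFFER_EXT, 0,
                       (GLsizeiptr)count * 4 * sizeof(float), out);

    glDeleteBuffers(2, buf);
}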
The numbers against the three techniques are broken up between time to compute and time to transfer the data back to Smalltalk memory. We can see from the VM column that the time to compute doubles every time we double the number of vectors. The time to transfer is 0ms because all the data is already in Smalltalk memory. It is worth noting here that if we didn't use FloatArray for this, the time to compute would grow exponentially as more and more garbage was generated. The time to compute is semi-respectable so long as we stay under 64 vectors. After that, the time doubling really starts to be something that could cramp your style if you're trying to achieve a steady 60fps in a real time rendering engine (e.g. a game).
Next, let's analyse the transform feedback numbers. These are important, because they're also running on the CPU, but not with the VM primitives - they're running GPU simulation code written in C and optimized by people with too much time on their hands. The numbers in this column are fairly representative of how fast floating point operations could go if our VM tagged floats, or had a way to box/unbox them around the math loop. The time to transfer data is negligible because it's just a copy from main memory to main memory. We see the same doubling effect as the VM, just at a reduced magnitude, and it only kicks in at around 16384 vectors. Up until that point, we sit at a solid, unshakeable 1ms to do any math at all, which is sort of unreasonable in its own way.
Finally, we take a look at the texture column, the middle technique, the one advocated by gpgpu.org. This is the technique we should pay the most attention to, as it is supported by the largest number of GPUs in the marketplace today. Up until 524288 vectors, the cost of doing the math itself is negligible. Where we spend our time is copying data between video memory and main memory, which is the second number in each entry. We can see that at around 16384 vectors we hit the DMA buffer threshold, the point at which the GPU must make a round trip back to main memory for more data. From there, the cost of transferring data into main memory doubles every time we double the number of vectors we're inputting. This continues until we get to 4mb of vectors, where we blow out both the DMA and my GPU's physical memory. At that point, the whole operation has to buffer back and forth from main memory to work.
It's interesting to see the asynchronous data transfer of the DMA at work here. If we wanted, we could do other computations while it moved data between main memory and video memory. In this case we did not - we waited, to see how long the transfer takes for the purposes of the benchmark. Because of its asynchronous nature, it's possible to do a "ping pong" sort of effect, where you do the computations of frame #1, start to download them, and then use the results for frame #3. At frame #2 you compute against a different set and use its results at frame #4, and so on and so forth.
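One common way to realize that kind of pipelining with the texture technique is a pair of pixel buffer objects (ARB_pixel_buffer_object): glReadPixels into a bound PBO returns immediately and the DMA finishes in the background, so you only pay for the transfer when you map the buffer on a later frame. This is not the benchmark code, just a C sketch of the pattern, with illustrative names and no error handling:

#include <string.h>
#include <GL/glew.h>

static GLuint pbo[2];

void async_readback_init(GLsizeiptr bytes)
{
    /* Two PBOs alternate: one is being filled by the GPU while the
       other, filled on the previous frame, is read by the CPU. */
    glGenBuffers(2, pbo);
    for (int i = 0; i < 2; i++) {
        glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo[i]);
        glBufferData(GL_PIXEL_PACK_BUFFER_ARB, bytes, NULL, GL_STREAM_READ);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);
}

/* Call after rendering frame N's computation. Starts the asynchronous
   download of frame N and copies out the previous frame's results (the
   very first call hands back an uninitialized buffer, so ignore it). */
void async_readback(int frame, int w, int h, float *previous_results)
{
    GLuint fill = pbo[frame & 1];        /* receives this frame's output */
    GLuint read = pbo[(frame + 1) & 1];  /* holds last frame's output    */

    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, fill);
    glReadPixels(0, 0, w, h, GL_RGBA, GL_FLOAT, 0);  /* 0 = offset in PBO */

    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, read);
    void *mapped = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY);
    if (mapped) {
        memcpy(previous_results, mapped,
               (size_t)w * h * 4 * sizeof(float));
        glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
    }
    glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);
}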
Let's be realistic here, though. If we're going to talk about simulating physics using the GPU on a MacBookPro, we're not talking millions of vectors - that would literally mean hundreds of thousands of objects bouncing and crashing about wildly all at the same time. Very impressive to look at, I'm sure, but probably not a realistic way to build a game. If you're really doing physics, you don't need it in real time, in which case you can use the ping-pong technique to stream larger amounts of data through the GPU. This is described in detail at www.gpgpu.org. In a real time scenario, such as a game, you'd simulate the physics for just the nearby objects, which is a kind of streaming in a way as well - or you might distribute the physics across all the GPUs of the players connected... the options are open at this point if you really need a very large amount of computing power. As others have shown with CUDA, the approach scales very, very well. For those interested - the CUDA technique uses the texture technique under the hood for most of its operations. CUDA's job is to hide all the complexities of how it works and let you get on with writing the math routine. It does a really good job of this, but I believe it's good to see how it's done... and also to do it from Smalltalk.