OpenGL 3.2 pack #1

25.01.2010 00:34 in OpenGL

Finally I've fully switched my engine code to OpenGL 3.2 (core profile). I have some experiences that I would like to share.

Performance

Well... damn, it's fast! Although the GPU itself can't be sped up much, the CPU/driver side is much faster. Most of all, the number of API calls dropped significantly. A few examples:

  • setting up a material: instead of 20-30 uniforms and about 5 texture changes, I can now upload 1 uniform buffer and bind 1 texture array (diffuse/normal/specular/...)
  • drawing a mesh: before: a few enables/disables, a few glXxxPointer calls, 1 glDrawElements. Now: 1 bind of a vertex array object, 1 glDrawElements.
  • updating buffers: previously bind, update, unbind. Now (thanks to EXT_direct_state_access [spec]) just glNamedBufferDataEXT(buffer_id, ...). The number of calls needed to set up textures, framebuffers and other stuff can also be reduced with DSA. Drawback: no ATI support at the moment.

This may seem like "just a little optimization". But it's not -- especially if you are CPU-bound. On my main development machine, with a powerful GPU and a rather weak CPU, the difference was huge: up to around 10 ms per frame! That's roughly 60 FPS -> 200 FPS. On average, there is a 10-40% boost.
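As a sanity check on those numbers, here is a minimal sketch in C that converts between frame rate and frame time. The function names are made up for this example. It shows that going from 60 FPS to 200 FPS corresponds to roughly 11.7 ms saved per frame, while saving exactly 10 ms at 60 FPS would land at 150 FPS.

```c
#include <assert.h>

/* Frame time in milliseconds for a given frame rate. */
double frame_ms(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

/* Frame rate after shaving `saved_ms` milliseconds off every frame. */
double fps_after_saving(double fps, double saved_ms) {
    double ms = frame_ms(fps) - saved_ms;
    assert(ms > 0.0);
    return 1000.0 / ms;
}

/* Milliseconds saved per frame when going from one frame rate to another. */
double ms_saved(double fps_before, double fps_after) {
    return frame_ms(fps_before) - frame_ms(fps_after);
}
```

The same 10 ms matters far less at low frame rates: at 20 FPS (50 ms per frame) it only gets you to 25 FPS.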

Uniform Buffer Objects [spec]

I've been using bindable uniforms for some time. However, there was a problem: the specification gave no standard layout for the data, and not even a way to determine the layout. In the end, I used float4 for everything and packed the data manually. That was quite cumbersome, so I've switched to the new OpenGL 3.1 UBO (uniform buffer objects). There are 2 big differences between UBO and bindable uniforms [spec].

First of all, you have 3 different layouts in UBO:

  • std140 -- probably the most useful. Basically you align float3/float4/structs/arrays/matrices to 16 bytes, float2 to 8 and floats to 4. That is quite OK if you sort your data from biggest to smallest. You would do so in CPU code anyway, right?
  • shared -- data using this layout can be shared across programs, but the layout is not guaranteed to be the same across GPU vendors. Well, I don't think that messing with structs is worth it; it will probably be the same as std140.
  • packed -- this is an optimised layout: unused variables are stripped, members may be reordered, and so on. But you can't share such a buffer across programs. And if you can't share it, why bother declaring unused variables? :) That's a mystery, and a rather useless feature for me.
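The std140 rules quoted above are easy to check with a few lines of C. This is a minimal sketch that only covers float, vec2, vec3, vec4 and mat4 members (the GLSL names for what the post calls float, float2, ...); arrays and nested structs are left out, and the type and function names are invented for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Member kinds covered by this sketch. */
typedef enum {
    STD140_FLOAT, STD140_VEC2, STD140_VEC3, STD140_VEC4, STD140_MAT4
} Std140Type;

/* Base alignment in bytes under std140. */
size_t std140_align(Std140Type t) {
    switch (t) {
    case STD140_FLOAT: return 4;
    case STD140_VEC2:  return 8;
    default:           return 16;  /* vec3, vec4, mat4 */
    }
}

/* Size in bytes actually occupied by the member. */
size_t std140_size(Std140Type t) {
    switch (t) {
    case STD140_FLOAT: return 4;
    case STD140_VEC2:  return 8;
    case STD140_VEC3:  return 12;
    case STD140_VEC4:  return 16;
    default:           return 64;  /* mat4: four vec4 columns */
    }
}

/* Places one member at the next suitably aligned offset,
   advances *cursor past it and returns the member's offset. */
size_t std140_place(size_t *cursor, Std140Type t) {
    assert(cursor != NULL);
    size_t a = std140_align(t);
    size_t offset = (*cursor + a - 1) / a * a;  /* round up to alignment */
    *cursor = offset + std140_size(t);
    return offset;
}
```

Placing float, vec3, float, vec4 in that order ends at byte 48, while the sorted order vec4, vec3, float, float ends at byte 36, which is the point of sorting from biggest to smallest.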

The other difference is quite minor: with bindable uniforms you bind buffers directly to uniforms, while with UBO you bind them like textures. So you bind buffers to "slots" (binding points), and bind those slots to uniform blocks. OpenGL makers seem to like this a lot.
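The two-step indirection can be pictured with a toy model in plain C. This is only an illustration of the idea, not GL code: the Toy* types and functions are hypothetical, and the comments name the real GL calls that set the corresponding state.

```c
#include <assert.h>

#define MAX_BINDING_POINTS 4  /* arbitrary for this sketch */
#define MAX_BLOCKS 4          /* arbitrary for this sketch */

/* Context state: which buffer sits at each binding point ("slot").
   In GL: glBindBufferBase(GL_UNIFORM_BUFFER, point, buffer). */
typedef struct { unsigned buffer_at[MAX_BINDING_POINTS]; } ToyContext;

/* Program state: which binding point each uniform block reads from.
   In GL: glUniformBlockBinding(program, block, point). */
typedef struct { unsigned point_of[MAX_BLOCKS]; } ToyProgram;

void toy_bind_buffer(ToyContext *ctx, unsigned point, unsigned buffer) {
    assert(point < MAX_BINDING_POINTS);
    ctx->buffer_at[point] = buffer;
}

void toy_block_binding(ToyProgram *prog, unsigned block, unsigned point) {
    assert(block < MAX_BLOCKS && point < MAX_BINDING_POINTS);
    prog->point_of[block] = point;
}

/* The buffer a block ends up reading: block -> binding point -> buffer. */
unsigned toy_resolve(const ToyContext *ctx, const ToyProgram *prog,
                     unsigned block) {
    assert(block < MAX_BLOCKS);
    return ctx->buffer_at[prog->point_of[block]];
}
```

If two programs point a block at the same slot, rebinding the buffer in that slot once changes what both of them read.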

Setting buffer data & bugs (?)

In my particle code I found a very annoying bug called "random mess shows up on screen". What was wrong? To track it down, I finally ended up with this piece of code:

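GLuint id;
glGenBuffers(1, &id);
glBindBuffer(GL_ARRAY_BUFFER, id);
// upload the data, then immediately read the same bytes back
glBufferData(GL_ARRAY_BUFFER, size_of_data, data, GL_STREAM_COPY);
glGetBufferSubData(GL_ARRAY_BUFFER, 0, size_of_data, data2);
// compare the contents, not the pointers (needs <string.h>)
if (memcmp(data, data2, size_of_data) != 0) panic();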

Of course, initially there was BufferSubData on an existing buffer instead of generating a new one. The result was a mess. No GL errors were raised, but the data was quite random in a non-random manner -- every time I ran the app the data was the same. What was wrong? I have absolutely no idea. I worked around this bug by using MapBuffer instead of BufferSubData and it worked like a charm. But at least I've learned about...

...Transform Feedback [spec]

The equivalent of DirectX's Stream Out. Basically this lets you stream data from the vertex shader into a buffer. So you can, for example:

  • debug your vertex shader. You can hook in between the vertex and fragment shaders and see what the VS outputs
  • save vertex results for future use, for example to do skinning only once per frame (in case you have shadows/reflections etc.)
  • do some GPGPU calculations without OpenCL -- this way you can store structures more easily than by doing the job in a fragment shader

You can also disable rasterization of the generated vertices by enabling GL_RASTERIZER_DISCARD (pure streaming into a buffer).

Timer Query [spec]

This is very useful for profiling your OpenGL application. Because the GPU and CPU do their jobs asynchronously, something like this is bad:

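double t0 = get_current_timestamp();
glRenderFancyThings();  // only queues commands; the GPU may not have run them yet
double t1 = get_current_timestamp();
Log("Time elapsed: %f", t1 - t0);  // CPU submission time, not GPU time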

Another bad example. This time we wait for the GPU to complete, so the result makes no sense (compared to real-world usage, where nobody stalls the pipeline every frame):

double t0 = get_current_timestamp();
glRenderFancyThings();
glFinish();  // stalls the CPU until the GPU has finished everything queued
double t1 = get_current_timestamp();
Log("Time elapsed: %f", t1 - t0);

However, we can use Timer Query to measure any group of GL commands and grab the result a few frames later, when the commands have finished.
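One common way to arrange the "grab the result a few frames later" part is a small ring of query objects. The sketch below is plain C bookkeeping only, under the assumption that 3 frames of latency is enough; TimerRing and timer_ring_next are hypothetical names. The GL calls it would drive are listed in the trailing comment, written in their ARB_timer_query form (with EXT_timer_query the names carry an EXT suffix).

```c
#include <assert.h>

#define TIMER_SLOTS 3  /* frames of latency before reading; an assumption */

typedef struct {
    unsigned ids[TIMER_SLOTS];  /* query names, e.g. filled by glGenQueries */
    unsigned long frame;        /* number of frames started so far */
} TimerRing;

/* Call once per frame. Returns the slot to time this frame with and sets
   *has_old_result to 1 when that slot still holds a query issued
   TIMER_SLOTS frames ago, which should be read before the slot is reused. */
unsigned timer_ring_next(TimerRing *ring, int *has_old_result) {
    assert(ring != 0 && has_old_result != 0);
    unsigned slot = (unsigned)(ring->frame % TIMER_SLOTS);
    *has_old_result = ring->frame >= TIMER_SLOTS;
    ring->frame++;
    return slot;
}

/* Per-frame use with real GL calls would look roughly like this:
     slot = timer_ring_next(&ring, &old);
     if (old) glGetQueryObjectui64v(ring.ids[slot], GL_QUERY_RESULT, &ns);
     glBeginQuery(GL_TIME_ELAPSED, ring.ids[slot]);
     ... draw calls ...
     glEndQuery(GL_TIME_ELAPSED);                                        */
```

If a result might still be pending after TIMER_SLOTS frames, checking GL_QUERY_RESULT_AVAILABLE with glGetQueryObjectiv first avoids stalling on the read.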

Bonus: Fermi GPU Architecture

NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf. Interesting, I wonder about its performance in OpenGL/DirectX.

Comments:

  1. Riddlemaster

    I'm waiting for Fermi with some hopes (it can be quite a step forward) but it seems it's still not ready and that it could be delayed.

    25.01.2010 09:12:06

  2. pixelmager

    Interesting post - thanks for sharing!

    I tried out the timer_query on amd a couple of days ago, but didn't get consistent results - I still needed to do a complete pipeline-flush (glFinish) to get correct results. Commands before and after the timing still affect the timing.

    25.01.2010 11:21:20

  3. Gynvael Coldwind

    Thanks for sharing!
    I'm lately kinda out of the gfx/game industry, and I really like that you post things like that ;> It keeps me informed hehe ;> Thanks!

    26.01.2010 23:22:33
