OpenGL 3.2 pack #1

25.01.2010 00:34 in OpenGL

I've finally switched my engine code fully to OpenGL 3.2 (core profile), and I have some experiences that I would like to share.

Performance

Well... damn, it's fast! Although the GPU itself can't be accelerated much, the CPU/driver part is much faster. Most of all, the number of API calls dropped significantly. A few examples:

  • setting up a material: instead of 20-30 uniforms and about 5 texture changes I can now upload 1 uniform buffer and 1 texture array (diffuse/normal/specular/...)
  • drawing a mesh: previously a few enables/disables, a few glXxxPointers, 1 glDrawElements. Now: 1 bind of a vertex array object, 1 glDrawElements.
  • updating buffers: previously bind, update, unbind. Now (thanks to EXT_direct_state_access [spec]) just NamedBufferData(buffer_id, ...). The number of calls needed to set up textures, framebuffers and other stuff could also be reduced with DSA. Drawback: no ATI support at the moment.

This may seem like "just a little optimization". But it's not -- especially if you are CPU-bound. On my main development machine, with a powerful GPU and a rather weak CPU, the difference was huge. Even up to 10ms per frame! That's 60 FPS -> 200 FPS. On average, there is a 10-40% boost.

Uniform Buffer Objects [spec]

I've been using bindable uniforms for some time. However, there was a problem: the specification gave no standard layout for the data and not even a method to determine the layout. In the end, I used float4 for everything and packed the data manually. That was quite cumbersome, so I've switched to the new OpenGL 3.1 UBO (uniform buffer objects). There are 2 big differences between UBO and bindable uniforms [spec].

First of all, you have 3 different layouts in UBO:

  • std140 -- probably the most useful. Basically you align float3/float4/structs/arrays/matrices to 16 bytes, float2 to 8 and floats to 4. That is quite OK if you sort your data from biggest to smallest. You would do so in CPU code, right?
  • shared -- data using this layout can be shared across programs, but not across GPU vendors. Well, I don't think that messing with structs is worth it; the layout will probably be the same as std140.
  • packed -- this is an optimised layout, stripping unused variables, rearranging order and so on. But you can't share such a buffer with other programs. And if you can't share it, why bother to create unused variables? :) That's a mystery, and a rather useless feature for me.
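To make the std140 rules above a bit more concrete, here is a minimal sketch of a CPU-side struct that mirrors a std140 block. The block name Material and all member names are made up for illustration; only the alignment rules come from the layout itself. Note that a single float is allowed to sit in the padding right after a float3.

```cpp
#include <cstddef>

// Hypothetical material block, written to match a GLSL declaration like:
//   layout(std140) uniform Material {
//       vec4  diffuse;    // offset  0
//       vec3  specular;   // offset 16
//       float shininess;  // offset 28, fills the hole after the vec3
//       vec2  uv_scale;   // offset 32
//       float alpha;      // offset 40
//   };
struct MaterialBlock {
    alignas(16) float diffuse[4];   // float4: 16-byte aligned
    alignas(16) float specular[3];  // float3: 16-byte aligned, 12 bytes used
    float shininess;                // float: 4-byte aligned
    alignas(8) float uv_scale[2];   // float2: 8-byte aligned
    float alpha;                    // float: 4-byte aligned
};

// Compile-time checks that the C++ compiler agrees with the offsets above.
static_assert(offsetof(MaterialBlock, diffuse)   == 0,  "diffuse offset");
static_assert(offsetof(MaterialBlock, specular)  == 16, "specular offset");
static_assert(offsetof(MaterialBlock, shininess) == 28, "shininess offset");
static_assert(offsetof(MaterialBlock, uv_scale)  == 32, "uv_scale offset");
static_assert(offsetof(MaterialBlock, alpha)     == 40, "alpha offset");
```

With the offsets pinned down like this, the whole struct can be uploaded in one go, e.g. with glBufferData(GL_UNIFORM_BUFFER, sizeof(MaterialBlock), &block, GL_DYNAMIC_DRAW), assuming the buffer is bound to GL_UNIFORM_BUFFER.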

And the other difference is quite minor: with bindable uniforms you bind buffers directly to uniforms, with UBO you bind them like textures. So you bind buffers to "slots", and bind those slots to uniform blocks. OpenGL makers seem to like it a lot.

Setting buffer data & bugs (?)

In my particles code I found a very annoying bug called "random mess shows up on screen". What was wrong? In the end I reduced it to this piece of code:

GLuint id;
glGenBuffers(1, &id);
glBindBuffer(GL_ARRAY_BUFFER, id);
glBufferData(GL_ARRAY_BUFFER, size_of_data, data, GL_STREAM_COPY);
glGetBufferSubData(GL_ARRAY_BUFFER, 0, size_of_data, data2);
if (memcmp(data, data2, size_of_data) != 0) panic(); // compare contents, not pointers

Of course, initially there was BufferSubData instead of generating a new buffer. The result was a mess. No GL errors were raised, but the data was quite random in a non-random manner -- every time I ran the app the data was the same. What was wrong? I have absolutely no idea. I worked around this bug by using MapBuffer instead of BufferSubData and it worked like a charm. But at least I've learned about...

...Transform Feedback [spec]

The equivalent of DirectX's Stream Out. Basically, this lets you stream some data from the vertex shader into a buffer. So you can, for example:

  • debug your vertex shader. You can hook in between the vertex and fragment shader and see what the VS outputs
  • save vertex results for future use, for example to do skinning only once per frame (in case you have shadows/reflections etc.)
  • do some GPGPU calculations without OpenCL -- this way you can store structures more easily than by doing the job in a fragment shader

You can also disable rasterization of generated vertices (pure streaming into buffer).

Timer Query [spec]

This is very useful for profiling your OpenGL application. Because the GPU & CPU do their jobs asynchronously, something like this is bad:

var t0 = get_current_timestamp();
glRenderFancyThings();
var t1 = get_current_timestamp();
Log("Time elapsed: %f", t1 - t0);

Another bad example. This time we wait for the GPU to complete, so the result makes no sense (compared to real-world usage, where the CPU never stalls like this):

var t0 = get_current_timestamp();
glRenderFancyThings();
glFinish();
var t1 = get_current_timestamp();
Log("Time elapsed: %f", t1 - t0);

However, we can use Timer Query to measure any GL command and grab the result a few frames later, when the commands have finished.
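A minimal sketch of the "grab the result a few frames later" idea: a small ring of query objects, where each frame starts a new query and reads back only the oldest one. Everything here is an assumption for illustration. GpuFrameTimer and the Backend interface are hypothetical names, and the GL calls are hidden behind Backend so the logic can be tried without a context. With ARB_timer_query the four backend calls would map to glBeginQuery(GL_TIME_ELAPSED, id), glEndQuery(GL_TIME_ELAPSED), glGetQueryObjectiv(id, GL_QUERY_RESULT_AVAILABLE, ...) and glGetQueryObjectui64v(id, GL_QUERY_RESULT, ...), with the ids coming from glGenQueries.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: measures GPU time per frame without stalling the CPU.
// Backend must provide begin(id), end(), ready(id) and result_ns(id).
template <class Backend, std::size_t N = 4>
class GpuFrameTimer {
public:
    GpuFrameTimer(Backend& backend, const std::array<std::uint32_t, N>& query_ids)
        : backend_(backend), ids_(query_ids) {}

    // Call before the GL commands you want to measure.
    void begin_frame() { backend_.begin(ids_[frame_ % N]); }

    // Call after them. Writes the GPU time (in nanoseconds) of the frame
    // recorded N-1 frames ago and returns true if that result is available.
    // Never waits: a result that is still pending is simply skipped.
    bool end_frame(std::uint64_t& out_ns) {
        backend_.end();
        ++frame_;
        if (frame_ < N) return false;                   // ring not filled yet
        const std::uint32_t oldest = ids_[frame_ % N];  // slot reused next frame
        if (!backend_.ready(oldest)) return false;
        out_ns = backend_.result_ns(oldest);
        return true;
    }

private:
    Backend& backend_;
    std::array<std::uint32_t, N> ids_;
    std::uint64_t frame_ = 0;
};
```

The ring size trades latency for safety: with N = 4 the number you log is three frames old, but the query has almost certainly finished by then, so the CPU never has to wait for the GPU.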

Bonus: Fermi GPU Architecture

NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf. Interesting, I wonder about its performance in OpenGL/DirectX.

OpenGL extensions on NVidia & ATI

16.01.2010 19:54 in OpenGL, extensions

I've made a simple diff of the OpenGL extensions supported on NVidia & ATI cards. The lists were made in GPU Caps Viewer with the 3.2 compatibility (default) profile. Vendor-specific extensions are marked with a colored background.

The good news: bindable uniforms.

The bad news: no direct state access on ATI. I hope this is going to be implemented soon!

Thx Krajek for ATI extensions. :)

See it here:


glslDevil: PIX for OpenGL?

26.11.2009 15:13 in OpenGL

If you use OpenGL you must check out this free debugger: glslDevil.

Quoting the authors, glslDevil is a tool for debugging the OpenGL shader pipeline, supporting GLSL vertex and fragment programs plus the recent geometry shader extension. By transparently instrumenting the host application it allows for debugging GLSL shaders in arbitrary OpenGL programs without the need to recompile or even having the source code of the host program available. The debug data is directly retrieved from the hardware pipeline and can be used for visual debugging and program analysis.

I'm going to test it in the near future, hoping it can handle OGL 3.x code (gDEBugger fails to do so and doesn't support Win7).

OpenGL 3.1 explicit_multisample VS OpenGL 3.2 multisample textures

23.09.2009 12:04 in OpenGL

I wanted to compare NV_explicit_multisample with the new multisample textures that are now part of OpenGL 3.2 core.

Results:

  • both methods give exactly the same results (pixel accurate)
  • FPS is also exactly the same
  • multisample textures are a little easier to use

To use multisample textures, you need to change just a few lines in your FBO code:

Non-MS:

glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, 1024, 768, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, tex, 0);

BTW - here's a tip: the 7th and 8th parameters of glTexImage2D (GL_RGBA, GL_UNSIGNED_BYTE in this case) don't matter if you pass a NULL pointer as data. They are used only to convert the data to the texture format (so you can upload integer data to a float texture). However, they must be valid enums, otherwise the function will fail.

And now MS version:

glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D_MULTISAMPLE, tex);
glTexImage2DMultisample(GL_TEXTURE_2D_MULTISAMPLE, 4, GL_RGBA32F, 1024, 768, false);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D_MULTISAMPLE, tex, 0);

Pretty easy, huh? And to bind this texture to a sampler, you only need to change GL_TEXTURE_2D to GL_TEXTURE_2D_MULTISAMPLE:

glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D_MULTISAMPLE, tex);

So now the fragment shader part. Take the code from my previous post and make the following changes:

  • #extension GL_NV_explicit_multisample : enable is obviously unnecessary
  • change all samplerRenderbuffer to sampler2DMS
  • change texelFetchRenderbuffer to texelFetch
  • change textureSizeRenderbuffer to textureSize

That's it. I think it's quite easy to use.

BTW. I've made various versions of my GI demo. I've also made an OpenGL 3.1 version that should run on all DX10 cards (NV & ATI). I'd be glad if you tested it (and told me whether it actually works).

What is and why do we need explicit_multisample? (or how to do real antialiasing in deferred shading)

16.09.2009 12:08 in 3D graphics, OpenGL

Deferred shading has lately become extremely popular. I’m not a huge fan of it, but depending on the typical scene in a game (preferably indoor, lots of lights) it can be a great advantage. However, antialiasing is a real pain in the DS case. Most games use an edge filter combined with blur, but the result is visually horrible (especially in low resolutions, where AA is a must). But why can’t we use multisampling (MSAA/CSAA) with deferred shading?

Let’s see how multisampling works. Up to now, we:

  • render the scene
  • downsample AA buffer to texture
  • render full-screen quad with texture (and probably some postprocess)

This of course won’t work properly with deferred shading. Why? Because it will downsample each G-buffer individually, before lighting is computed. See the following picture.

p1.png

We have 4 pixels, 4 samples each (I won’t go into multisample details, let’s keep it simple) - a normal vector is stored in each sample. We downsample the AA buffer and poof! The normals have gone wrong. Everything else will follow the same routine, so at edges we will have blurred normals/diffuse values and other data. Using AA this way will probably only amplify visual artifacts.

But OpenGL 3.0 and DirectX 10 have a new feature called explicit multisample (or custom resolve). It allows us to access each sample in a multisample buffer. In this scenario, we don’t downsample the AA buffer - we use it like a texture, so in the lighting shader we have access to every normal/diffuse sample, and our computations look like the second picture.

p2.png
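The difference between the two pictures can be checked with plain numbers on the CPU. Below is a small sketch (not taken from my renderer; Vec3, lambert and both function names are made up for this example) that uses a clamped Lambert term as a stand-in for compute_lighting. One function resolves the normals first and lights once, the other lights every sample and averages afterwards.

```cpp
#include <cstddef>

struct Vec3 { float x, y, z; };

inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Clamped Lambert term: a simple stand-in for the real lighting function.
inline float lambert(Vec3 normal, Vec3 light_dir) {
    const float d = dot(normal, light_dir);
    return d > 0.0f ? d : 0.0f;
}

// First picture: average (downsample) the normals, then light the result once.
inline float resolve_then_light(const Vec3* normals, std::size_t count, Vec3 light_dir) {
    Vec3 sum = {0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < count; ++i) {
        sum.x += normals[i].x;
        sum.y += normals[i].y;
        sum.z += normals[i].z;
    }
    const float inv = 1.0f / static_cast<float>(count);
    const Vec3 averaged = {sum.x * inv, sum.y * inv, sum.z * inv};
    return lambert(averaged, light_dir);
}

// Second picture: light every sample, then average the lit results.
inline float light_then_resolve(const Vec3* normals, std::size_t count, Vec3 light_dir) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < count; ++i)
        sum += lambert(normals[i], light_dir);
    return sum / static_cast<float>(count);
}
```

Take an edge pixel where two samples face the light, normal (1,0,0), and two face away, normal (-1,0,0), with the light direction (1,0,0). The averaged normal is the zero vector, so resolve_then_light gives 0, while light_then_resolve gives 0.5, which is what a half-covered edge pixel should look like.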

And we still benefit from multisampling (instead of supersampling). Time for some C++.

What do we need to do to upgrade our rendering? First, creating the buffers:

glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_RENDERBUFFER_NV, tex);
glGenRenderbuffers(1, &buffer);
glBindRenderbuffer(GL_RENDERBUFFER, buffer);
glRenderbufferStorageMultisample(GL_RENDERBUFFER, 8, GL_RGBA32F, 1024, 768);
glTexRenderbufferNV(GL_TEXTURE_RENDERBUFFER_NV, buffer);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, buffer);

And then, binding the texture for the full-screen quad (FSQ):

glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_RENDERBUFFER_NV, tex);
glTexRenderbufferNV(GL_TEXTURE_RENDERBUFFER_NV, buffer);
glUniform1i(sampler, 0);

Finally, let’s fix the shader code. Assume we have the following code:

#version 150
uniform sampler2D sampler_diffuse, sampler_position, sampler_normal;
in vec2 texcoord; // [0,1]x[0,1]
out vec4 result;
vec4 compute_lighting(vec3 diffuse, vec3 position, vec3 normal)
{
  ...
}
void main()
{
  // texture() replaces the deprecated texture2D() in #version 150
  vec3 diffuse = texture(sampler_diffuse, texcoord).rgb;
  vec3 position = texture(sampler_position, texcoord).xyz;
  vec3 normal = texture(sampler_normal, texcoord).xyz;
  result = compute_lighting(diffuse, position, normal);
} 

We upgrade it to:

#version 150 
#extension GL_EXT_gpu_shader4 : enable
#extension GL_NV_explicit_multisample : enable
uniform samplerRenderbuffer sampler_diffuse, sampler_position, sampler_normal;
in vec2 texcoord; // [0,1]x[0,1]
out vec4 result;
vec4 compute_lighting(vec3 diffuse, vec3 position, vec3 normal)
{
  ...
}
void main()
{
  const int samples = 8;
  result = vec4(0); 
  ivec2 texcoord2 = ivec2(textureSizeRenderbuffer(sampler_diffuse) * texcoord);
  for (int i = 0; i < samples; i++)
  {
    // AA renderbuffers are addressed with integers
    vec3 diffuse = texelFetchRenderbuffer(sampler_diffuse, texcoord2, i).rgb;
    vec3 position = texelFetchRenderbuffer(sampler_position, texcoord2, i).xyz;
    vec3 normal = texelFetchRenderbuffer(sampler_normal, texcoord2, i).xyz;
    result += compute_lighting(diffuse, position, normal);
  }
  result /= float(samples); // GLSL uses constructor syntax, not C-style casts
} 

That’s it! There are various improvements we can make. For example, if we use shadow mapping, we can calculate the shadow term per pixel and then apply it to all samples. And we must hope that ATI will implement OpenGL 3.2 (and explicit multisample) soon.

Update: there is ARB_texture_multisample (now part of OpenGL core) that should do the same thing and be more portable. I'm going to check differences between this and nv_explicit_multisample soon!

OpenGL 3.2 functions loading on Win32

14.09.2009 11:55 in OpenGL

I've made some simple headers that make it possible to use OpenGL 3.2 core functions. Here they are:

Usage is quite simple. First, download gl3.h. Then, in your context creation source file insert

#include <gl3.h>
#define EXTERNFLAG 
#include "gl3header.h"

among your includes, and after the context has been created, do something like this (this one is for SDL)

#define GERPROCADDRESSPROC SDL_GL_GetProcAddress
#include "gl3loading.h"

Everywhere else, just include this:

#include <gl3.h>
#define EXTERNFLAG extern 
#include "gl3header.h"

I think you get the idea. Enjoy.