Keep data close, part 1

17.05.2011 02:52 in programming, C++

I would like to present a very cool technique, widely used in C libraries, yet almost completely forgotten in C++.

Suppose we have a string class: a reference-counted, immutable string. Typical C++ data structures for it would look like this:

struct string_data;   // defined below

struct string
{
  string_data * data;
};

struct string_data
{
  int refcount;
  int length;
  char * data;
};

So string is merely a reference to the actual string_data. Creating a new string object (string foo = "hello world") involves these steps:

  1. string_constructor(const char *)
  2. allocate string_data
  3. string_data_constructor(const char *)
  4. allocate char[]
  5. copy data, set all other members
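The steps above can be sketched roughly as follows. This is only an illustration, not the post's actual class: the sketch namespace, the constructors and the reference-counting details are assumptions added to make the two dynamic allocations visible.

```cpp
#include <cassert>
#include <cstring>

namespace sketch {

struct string_data
{
  int refcount;
  int length;
  char * data;

  // Steps 3-5: construct string_data, allocate char[], copy the characters
  explicit string_data(const char * text)
    : refcount(1),
      length(static_cast<int>(std::strlen(text))),
      data(new char[length + 1])             // dynamic allocation no. 2
  {
    assert(text != nullptr);
    std::memcpy(data, text, length + 1);     // copies the terminating '\0' too
  }

  ~string_data() { delete[] data; }

  string_data(const string_data &) = delete;
  string_data & operator=(const string_data &) = delete;
};

struct string
{
  string_data * data;

  // Steps 1-2: string constructor allocates string_data
  string(const char * text) : data(new string_data(text)) {}   // dynamic allocation no. 1

  // Copies share the same string_data and only bump the counter
  string(const string & other) : data(other.data) { ++data->refcount; }

  ~string() { if (--data->refcount == 0) delete data; }

  string & operator=(const string &) = delete;   // omitted to keep the sketch short
};

} // namespace sketch
```

With this in place, sketch::string foo = "hello world"; performs exactly the two heap allocations counted in the list.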

In memory, it would actually look like:

So we have one stack allocation and two dynamic allocations, and our data sits in three different places in memory. In idiomatic C++ terms it seems impossible to improve on this. However, we can go the C way and change the definition of string_data:

struct string_data
{
  int refcount;
  int length;

  char data[1];
};

Array of one character? I must be crazy, right?

Most C++ courses are not very precise about pointers and arrays. In particular, what is the difference between the following two declarations?

const char * string1 = "hello world";

const char string2[] = "hello, C++!";

And what is the difference between these two?

void foo(const char *) { printf("*\n"); }

void foo(const char[]) { printf("[]\n"); }

Unfortunately, the answers differ. In the first case, string1 and string2 are different things: one is a pointer, the other is an array. In the second case there is no difference at all, because an array parameter is adjusted to a pointer; in fact, that code will not compile, since the second foo redefines the first.

Time for a little investigation. Let’s print some data:

printf("%p %p %s\n", (const void *)string1, (const void *)&string1, string1);
printf("%p %p %s\n", (const void *)string2, (const void *)&string2, string2);

For each variable we print the expression itself as a pointer, the address of the variable as a second pointer, and the null-terminated string the expression points to. Results:

0x08048650 0xbfddbfa4 hello world
0xbfddbf98 0xbfddbf98 hello, C++!

Now you can see the magic behind it. The 0xbf…… values are stack addresses, while 0x08048650 comes from somewhere else (likely the static, read-only data section). So a pointer holds the address of its data, while an array is its data: the address of the array and the address of its first element are the same. And we can abuse that!
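The same observation can be checked without reading addresses by eye. The sketch below is only an illustration; the helper names are made up for this example, and the two variables are globals here, so the addresses will not be stack addresses.

```cpp
#include <cassert>
#include <cstdio>

const char * string1 = "hello world";    // a pointer variable that stores the literal's address
const char   string2[] = "hello, C++!";  // an array: the characters are the variable

// sizeof sees the whole array: 11 characters plus the terminating '\0'
static_assert(sizeof(string2) == 12, "array size includes the terminator");

// True when the variable's own address equals the address of the characters
bool array_is_its_own_storage()
{
  return static_cast<const void *>(&string2) == static_cast<const void *>(string2);
}

bool pointer_is_its_own_storage()
{
  return static_cast<const void *>(&string1) == static_cast<const void *>(string1);
}

void demo_pointer_vs_array()
{
  assert(array_is_its_own_storage());      // array: both addresses match
  assert(!pointer_is_its_own_storage());   // pointer: the variable lives elsewhere
  std::printf("sizeof pointer: %zu, sizeof array: %zu\n",
              sizeof(string1), sizeof(string2));
}
```

Calling demo_pointer_vs_array() passes both assertions and prints the pointer size of the platform next to the array size of 12.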

“Typical” C++ allocation of string_data would look like this:

string_data * foo = new string_data;

It would allocate sizeof(string_data) bytes (9 + padding, so probably 12, assuming 4-byte ints). But in fact we can allocate as much as we want! For example:

string_data * q = (string_data*)malloc(2 * sizeof(int) + 12);

new (q) string_data;

In this case that is 2 ints and 12 bytes of string data. After a successful allocation we need to invoke the constructor manually, using placement new. Similarly, a regular delete q won't work, but the following code will:

q->~string_data();
free(q);

Of course, if the data type is simple (built-in types, relaxed POD, whatever) there is no need to call the constructor and destructor (although those calls would probably be optimized away anyway). And malloc/free can be replaced with custom allocation routines.

After such an allocation, we can use data as if it were not a char[1] array but a char[as much as we have allocated] array! What does memory look like with this data structure?

And there are numerous benefits to this approach:

  • one fewer dynamic allocation
  • data is less scattered in memory
  • we saved some memory (how much? size of the pointer + size of allocation metadata)
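Putting the pieces together, a single-allocation string_data could be sketched like this. The helper names string_data_create and string_data_destroy are hypothetical, and offsetof is used instead of 2 * sizeof(int) so that any padding before data is accounted for. Note that indexing past data[0] relies on the compiler tolerating this C idiom; it is widely supported, but not something the C++ standard promises.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>

struct string_data
{
  int refcount;
  int length;

  char data[1];   // first byte of the characters; the rest follows in the same block
};

// Hypothetical helper: one malloc holds both the header and the characters
string_data * string_data_create(const char * text)
{
  assert(text != nullptr);
  const std::size_t length = std::strlen(text);

  // header size (including padding before data) + characters + '\0'
  void * memory = std::malloc(offsetof(string_data, data) + length + 1);
  if (memory == nullptr)
    return nullptr;

  string_data * result = new (memory) string_data;   // placement new, as in the post
  result->refcount = 1;
  result->length = static_cast<int>(length);
  std::memcpy(result->data, text, length + 1);        // copies the terminating '\0' too
  return result;
}

// Hypothetical helper: manual destructor call followed by free
void string_data_destroy(string_data * s)
{
  if (s == nullptr)
    return;
  s->~string_data();
  std::free(s);
}
```

A string object then only needs to hold the string_data pointer: string_data_create("hello world") returns a block whose data member can be read as an ordinary C string.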

In the next part, I’ll show how we can save even more allocations/memory.

(from #altdevblogaday)

Cross-platform system info and why Windows rocks

13.05.2011 14:59 in sysinfo, programming

Recently I've been building a base library for all my gamedev adventures. It contains both very low-level features (allocator, string, vector, hashmap, errors, etc.) and reasonably mid-level features (like HTTP requests). Grabbing system information is rather high-level but very important, especially for an OpenGL developer. It's very helpful if you can tell the user that they just need to update their drivers.

Typical output:

[23:12:26] CPU : 8 x 2806 (2806) MHz (Intel(R) Core(TM) i7 CPU         930  @ 2.80GHz, FPU MMX SSE SSE2 SSE3 SSSE3 EST SSE4.1 SSE4.2 POPCNT HTT)
[23:12:26] RAM : 7956MB/12278MB
[23:12:26] GPU : NVIDIA GeForce GTX 460 [8.17.12.6658, 1-7-2011] :: NVIDIA Corporation GeForce GTX 460/PCI/SSE2 using OpenGL 4.1.0
[23:12:26] OS  : Windows 7 (6.1 Service Pack 1) 64 bit

Many folks complain about the Windows API, its backward compatibility and overall misery. But while working on the sysinfo program, I've actually learned how powerful it is. A few examples:

Memory

On Windows there is the GlobalMemoryStatusEx function, which returns accurate results in 64-bit format, even for 32-bit applications.

On Linux there is /proc/meminfo, but its results are very hard to interpret and differ from the output of the free command. In the end, I've used free -mo | head -n 2 | tail -n 1. Pure magic (and pure hope that the free output format won't change).
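For comparison, reading a single field out of /proc/meminfo-style text could look like the sketch below. This is not what the sysinfo program does (it shells out to free); the function name meminfo_field_kb is made up for this example, and the code assumes the usual "Key:   value kB" line layout. It also does not attempt to reproduce the numbers that free reports.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Returns the value (in kB) of the field named `key`, e.g. "MemTotal",
// or -1 if the field is missing or malformed.
long long meminfo_field_kb(std::istream & input, const std::string & key)
{
  assert(!key.empty());

  std::string line;
  while (std::getline(input, line))
  {
    // Expected line shape: "MemTotal:       12573184 kB"
    const bool starts_with_key =
      line.size() > key.size() &&
      line.compare(0, key.size(), key) == 0 &&
      line[key.size()] == ':';
    if (!starts_with_key)
      continue;

    std::istringstream fields(line.substr(key.size() + 1));
    long long value = -1;
    if (fields >> value)
      return value;
    return -1;
  }
  return -1;
}
```

On a Linux machine the function can be fed a std::ifstream opened on /proc/meminfo; any other stream with the same layout works as well.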

Mac OSX amused me. As you may or may not know, I'm an iOS developer. I thought the Mac would be quite similar to its iOS counterpart, maybe with an API like [[NSSystem sharedSystem] freeMemory]. Hell no! To obtain memory statistics, you need to access the Mach (Mac OSX kernel) layer, like this:

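// Requires <mach/mach.h> and <stdint.h>.
// mem_free and mem_total are declared by the surrounding code (not shown).
mach_port_t host_port = mach_host_self();
mach_msg_type_number_t host_size = sizeof(vm_statistics_data_t) / sizeof(integer_t);
vm_size_t pagesize;
vm_statistics_data_t vm_stat;

host_page_size(host_port, &pagesize);

// Ask the kernel for virtual memory statistics; all counts are in pages
if (host_statistics(host_port, HOST_VM_INFO, (host_info_t)&vm_stat, &host_size) != KERN_SUCCESS) return;

// Widen to 64 bits before multiplying: natural_t is 32-bit and overflows above 4 GB
uint64_t mem_used = (uint64_t)(vm_stat.active_count + vm_stat.inactive_count + vm_stat.wire_count) * pagesize;
mem_free = (uint64_t)vm_stat.free_count * pagesize;
mem_total = mem_used + mem_free;

// Convert bytes to megabytes
mem_free /= 1024 * 1024;
mem_total /= 1024 * 1024;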

GPU

On all platforms I query OpenGL for the vendor, renderer and both (OGL and GLSL) version strings. But this doesn't include driver information, and each vendor/platform has its own vision of those strings.

So on Windows I use SetupAPI to obtain precise GPU information. With SetupDiEnumDeviceInfo I search for FILE_DEVICE_VIDEO devices. This is useful because it makes it possible to check multi-GPU configurations or to bypass additional layers of (OpenGL) emulation (for example, screen capturing applications). And with SetupAPI there is standardized driver information available.

But on Linux it's not so easy. I have no Linux-specific programming knowledge (not that I'd used SetupAPI before). I didn't even search for a kernel API for this. I've just used the lspci command and searched for VGA compatible controller. For NVIDIA cards I've found that I can obtain driver info from /proc/driver/nvidia/version. For ATI or other vendors -- I have no idea. Fortunately, ATI seems to include driver information in the OpenGL version string (like 3.3.10237 Compatibility Profile Context), so this should not be a problem. But if you know a better solution, please let me know.

On Mac OSX I didn't even try. In fact you can't install or update GPU drivers on a Mac, because they are not provided by NVIDIA or another IHV but by Great Apple itself -- and that also means that OpenGL stopped at version 2.1. Hate you, Apple. There is a magic number in the OpenGL version string: 2.1 NVIDIA-1.6.26, but I have no idea how I could compare this to the Windows/Linux driver version.

Conclusions

There is other data as well, and it is just as complex as the examples above. For example, to obtain OS data there are GetVersionEx and GetNativeSystemInfo in WinAPI (the Native flavour returns 64-bit information for 32-bit apps on 64-bit Windows). On Linux/OSX I just use uname -a (Mac OSX has no variations, and on Linux it would be hard because each distribution has its own vision of versioning). CPU was fortunately mostly cross-platform, because I focus on __cpuid data (BTW -- do you know how to obtain HTT and/or number of cores from __cpuid?).
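As an illustration of the __cpuid part, the feature flags in the sample output (FPU, MMX, SSE and so on, including HTT) can be decoded from the ECX and EDX registers returned by CPUID leaf 1. The sketch below takes the two register values as plain arguments so it stays portable; the names cpu_feature and describe_leaf1 are made up for this example and are not taken from the sysinfo program. It does not answer the number-of-cores question.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// One feature flag of CPUID leaf 1: its name, which register it lives in, and the bit
struct cpu_feature
{
  const char * name;
  bool in_ecx;      // true: bit is in ECX, false: bit is in EDX
  unsigned bit;
};

static const cpu_feature leaf1_features[] = {
  { "FPU",    false,  0 },
  { "MMX",    false, 23 },
  { "SSE",    false, 25 },
  { "SSE2",   false, 26 },
  { "SSE3",   true,   0 },
  { "SSSE3",  true,   9 },
  { "EST",    true,   7 },
  { "SSE4.1", true,  19 },
  { "SSE4.2", true,  20 },
  { "POPCNT", true,  23 },
  { "HTT",    false, 28 },
};

// Builds a space-separated list of the features whose bits are set
std::string describe_leaf1(std::uint32_t ecx, std::uint32_t edx)
{
  std::string result;
  for (const cpu_feature & feature : leaf1_features)
  {
    assert(feature.bit < 32);
    const std::uint32_t reg = feature.in_ecx ? ecx : edx;
    if ((reg >> feature.bit) & 1u)
    {
      if (!result.empty())
        result += ' ';
      result += feature.name;
    }
  }
  return result;
}
```

The register values would come from __cpuid(info, 1) in <intrin.h> on MSVC (info[2] is ECX, info[3] is EDX) or from __get_cpuid(1, &eax, &ebx, &ecx, &edx) in <cpuid.h> on GCC and Clang.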

So my verdict is: Windows rocks, other platforms are pure chaos. At least when it comes to grabbing system information. Really, WinAPI may be awkward or "too backward compatible", but at least it is simple and powerful.

BTW, if you want to test the sysinfo program on your PC, you can grab the Windows, Linux or Mac OSX version and share the results in the comments. This is not the most recent version, but the results would be helpful.

Is programming an art?

07.05.2011 11:35 in rant

I've found an interesting post on Twitter: You are NOT a Software Engineer. While I don't agree in general, it brings us to a question: is programming an art?

Let's start by defining what art is. For me, art is a process of creating something where, at any time, you have full control of the output. For example, designing bridges is not an art, because you can't just remove a pillar here and add an arch there.

In more IT terms, subjectively:

  • Doing 2D in Photoshop is an art. Even with multiple layers involved, you can always add another one on top and paint over it to achieve a pleasing effect.
  • Sculpting in Z-brush is an art. At any moment you can add another tentacle to your model, even though most of the time you shouldn't.
  • Modelling in 3DS-max is engineering. You need to maintain good topology, correct edge loops and proper smoothing groups. You can't easily modify a finished model.

Of course I'm not saying that 3D artists are not artists. :) Art can (and usually does) involve engineering, and vice versa. It's just a subtle yet important matter of balance. And we can observe that, thanks to the reduced technical complexity of doing "art", it's more accessible to traditional artists. I mean, give a talented painter a tablet and Photoshop and he will succeed. And Z-brush sculpting just adds another dimension to it. On the other hand, there is no "off-line" version of polygonal modelling.

So how can we treat programming with such definitions?

Unfortunately, programming is not an art. Most of the time you can't change features without effort. It's very apparent in gamedev, where the cost of code changes (esp. last-minute code changes) is much higher than that of design/art changes.

I would love programming to be an art. But we need much better languages and tools to achieve it.

Binary shaders -- not that big of a deal

03.05.2011 01:12 in opengl

People complain that OpenGL lacks some important features. One of them is precompiled binary shaders. Recently OpenGL gained the ability to reuse a compiled shader -- but you still have to ship the source version of your shaders.

But this isn't really a problem. Why? Let's compile this simple shader with Nvidia Cg using different profiles.

in float var;
float4 main() : COLOR
{
  if (var > 0.5)
    return float4(1 - var, 0, 0, 1);
  else
    return float4(0, sin(var), 0, 1);
}

First, arbfp1 assembly (roughly SM 2.0 equivalent):

#const c[0] = 0 1 0.5
PARAM c[1] = { { 0, 1, 0.5 } };
TEMP R0;
TEMP R1;
TEMP R2;
SLT R0.x, c[0].z, fragment.texcoord[0];
ABS R0.x, R0;
CMP R2.x, -R0, c[0], c[0].y;
MOV R1.xzw, c[0].xyxy;
MOV R0.yzw, c[0].xxxy;
SIN R1.y, fragment.texcoord[0].x;
ADD R0.x, -fragment.texcoord[0], c[0].y;
CMP result.color, -R2.x, R1, R0;
END

As the ARB shader assembly has no branching, both paths are executed and the correct result is selected at the end. Now the gp4fp version (~SM 4.0):

TEMP R0;
TEMP RC, HC;
OUTPUT result_color0 = result.color;
SGT.F R0.x, fragment.attrib[0], {0.5};
TRUNC.U.CC HC.x, R0;
IF    NE.x;
  MOV.F result_color0.yzw, {0, 1}.xxxy;
  ADD.F result_color0.x, -fragment.attrib[0], {1};
ELSE;
  MOV.F result_color0.xzw, {0, 1}.xyxy;
  SIN.F result_color0.y, fragment.attrib[0].x;
ENDIF;
END

This time the generated assembly is very similar to the original code. If you play with compiling shaders for different profiles, you'll also notice that in newer profiles like gp4*p optimizations are not pushed as hard. So my point is that:

  1. newer shader assembly profiles are way more complex than old SM 2.0
  2. they feature quite complex control flow instructions (branches, loops etc.)
  3. many optimizations are left to the driver
  4. all advanced GLSL features (like UBO) would have to be implemented in assembler
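To make the difference between the two listings concrete, here is a rough C++ model of the same shader in both styles. It is only an illustration of the idea, not generated code: float4, arb_cmp and the function names are assumptions made up for this sketch, with arb_cmp standing in for the per-component behaviour of the CMP instruction.

```cpp
#include <cassert>
#include <cmath>

struct float4 { float x, y, z, w; };

// Branching form, as in the gp4fp listing: only one side is evaluated
float4 shade_branching(float var)
{
  if (var > 0.5f)
    return float4{1.0f - var, 0.0f, 0.0f, 1.0f};
  return float4{0.0f, std::sin(var), 0.0f, 1.0f};
}

// Models CMP for a single component: b when a < 0, otherwise c
float arb_cmp(float a, float b, float c) { return a < 0.0f ? b : c; }

// Select form, as in the arbfp1 listing: both sides are evaluated, then one is picked
float4 shade_select(float var)
{
  const float4 if_greater = {1.0f - var, 0.0f, 0.0f, 1.0f};
  const float4 otherwise  = {0.0f, std::sin(var), 0.0f, 1.0f};

  // 1 when var <= 0.5, 0 when var > 0.5 (the role of SLT plus the first CMP)
  const float pick_otherwise = (0.5f < var) ? 0.0f : 1.0f;

  return float4{
    arb_cmp(-pick_otherwise, otherwise.x, if_greater.x),
    arb_cmp(-pick_otherwise, otherwise.y, if_greater.y),
    arb_cmp(-pick_otherwise, otherwise.z, if_greater.z),
    arb_cmp(-pick_otherwise, otherwise.w, if_greater.w)};
}

// Both forms should agree for every input in the sampled range
void check_equivalence()
{
  for (int i = 0; i <= 100; ++i)
  {
    const float var = i / 100.0f;
    const float4 a = shade_branching(var);
    const float4 b = shade_select(var);
    assert(a.x == b.x && a.y == b.y && a.z == b.z && a.w == b.w);
  }
}
```

The select form always pays for the sin, which is the cost the older profile accepts in exchange for having no branch instructions.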

Basically, GLSL assembler would look much like optimized GLSL source code with control flow slightly modified and variables renamed to r0...rN. But you can do that yourself! :) I've seen at least one game that used GLSL pre-optimization (the same shader codebase was used for desktop and mobile OpenGL ES versions). And I think there are bigger issues for Khronos to address than precompiled shaders.

On the other hand, the ability to dump a compiled shader and reuse it later is a real killer feature. Also, in the current deferred days the number of shader permutations is way lower than it used to be...

(from #altdevblogaday)