std::vector or C-style array?

I recently overheard a rather interesting statement concerning programming and thought I’d share it with the world via a Tweet and a small counter example, which started an interesting discussion.

What I believed would go by unnoticed spawned a series of interesting views on this topic – some of them missing my original point. I therefore figured it’d be a good idea to expand my thought into something longer than 140 characters. Note, however, that this is not a full dissection of the pros and cons of one solution over the other but a general opinion!

Original statement and code example

So the original statement claims that you should always use std::vector instead of a C-style array for data containment. The keyword “always” – the claim that in every instance it’s the desired and better solution – is what triggered my Tweet. To see how this works out, here’s an example: a short program that creates a static container of 10 integers, assigns them values and returns. First, let’s use the “dogmatic” approach:

// approach 1: using std::vector
#include <vector>

int main() { 
  std::vector<int> foo;
  foo.resize(10);
  
  for(int i = 0; i < 10; i++)
    foo[i] = i;
  
  return 0;
}

Alternatively, let’s do the same thing with a classic C-style array.

// approach 2: using C array
int main() {
  int foo[10];
  
  for(int i = 0; i < 10; i++)
    foo[i] = i;
  
  return 0;
}

Running both pieces of code through a compiler (unoptimized – I’ll explain why shortly) produces interesting results when looking at the assembly. Approach 1 generates output too long to include in this post; with approach 2, however, we get this neat code:

main:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], 0
.L3:
        cmp     DWORD PTR [rbp-4], 9
        jg      .L2
        mov     eax, DWORD PTR [rbp-4]
        cdqe
        mov     edx, DWORD PTR [rbp-4]
        mov     DWORD PTR [rbp-48+rax*4], edx
        add     DWORD PTR [rbp-4], 1
        jmp     .L3
.L2:
        mov     eax, 0
        pop     rbp
        ret

“You forgot the -O flag!”

It’s true that an optimization flag can be passed to the compiler and, in all likelihood, this is what you’ll be working with most of the time. There are, however, a few specific reasons why I deliberately chose not to demonstrate code with optimizations enabled:

1. Most developers work primarily with debug code during the development process. While your mileage may vary, chances are that most of your debugging sessions will be done on unoptimized builds with debug symbols.
2. Optimizations (even for debug builds) cost you build time.
3. DON’T RELY ON A COMPILER IF YOU CAN WRITE BETTER CODE YOURSELF.

Why should you care?

The reason I put emphasis on the last sentence is that a lot of modern-day programmers are taught to delegate all the heavy work to the compiler with the assumption that it will always turn out fine. What this belief leads to is, in the best case, crawling debug code. Take, for example, the first approach compiled with -O3 and the counter example, also with -O3. My personal belief is that you should rely on the -O flag only as a last resort (and even then take it with a grain of salt). To that extent, complex structures such as std::vector, std::array and others should only be used when it’s justified and doesn’t incur unnecessary performance penalties on the executed code. It is not, in fact, the high count of assembly lines that’s the main problem, but rather the heap allocations that take place when using STL containers. In many scenarios these are more penalizing than anything else, so unless a dynamically sized array is really needed, it’s not worth following established dogmas (whether the STL is good for game programming is a completely different story).
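For completeness: if what you want is a fixed-size container with container-like semantics but no heap allocation, std::array is a middle ground worth considering. A minimal sketch of the same program – with stack storage, this typically compiles to code very close to the raw C array version:

// approach 3: using std::array - container interface, stack storage, no heap allocation
#include <array>

int main() {
  std::array<int, 10> foo;

  for(int i = 0; i < 10; i++)
    foo[i] = i;

  return 0;
}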

What should I do?

Don’t get too obsessive about the topic! I think it’s good practice to get your code to be as fast as possible without relying on excessive compiler optimizations but on the other hand – the compiler is, after all, a tool that is handed to the programmer. And always be a bit cautious and sceptical when somebody tells you that solution A is “always” better than B – chances are, there’s a solution C that bests them all! 😉


Rendering in VR using OpenGL instancing

TL;DR: download the code sample from GitHub!

In all of my VR applications thus far, I’ve been using separate eye buffers for rendering, seeing it as a convenience. Recently, however, I started wondering how I could improve drawing times and reduce unnecessary overhead, so my attention turned toward a single render target solution and how it could take advantage of instanced rendering. Here’s a short summary of my results.

To briefly recap, there are two distinct ways to render to the HMD (in this particular case I’ll be focusing on the Oculus Rift):

1. Create two render targets (one per eye) and draw the scene to each one of them accordingly.
2. Create a single, large render target and use proper viewports to draw each eye to it.

The details of how either of these can be achieved are not specified, so it’s up to the programmer to figure out how to produce both images. Usually, the first idea that comes to mind is to simply recalculate the MVP matrix for each eye every frame and render the scene twice, which may look like this in C++ pseudocode:

for (int eyeIndex = 0; eyeIndex < ovrEye_Count; eyeIndex++)
{
    // recalculate ModelViewProjection matrix for current eye
    OVR::Matrix4f MVPMatrix = g_oculusVR.OnEyeRender(eyeIndex); 

    // setup scene's shaders and positions using MVPMatrix
    // setup of HMD viewports and buffers goes here
    (...)

    // final image ends up in correct viewport/render buffer of the HMD
    glDrawArrays(GL_TRIANGLE_STRIP, 0, num_verts);
}

This works fine but what we’re essentially doing is doubling the amount of draw calls by rendering everything twice. With modern GPUs this may not necessarily be that big of a deal; however, the CPU <-> GPU communication quickly becomes the bottleneck as scene complexity goes up. During my tests, trying to render a scene with 2500 quads and no culling resulted in a drastic framerate drop and an increase in GPU rendering time. With Oculus SDK 1.3 this can, in fact, go unnoticed due to asynchronous timewarp, but we don’t want to deal with performance losses! This is where instancing can play a big role in gaining a significant boost.

In a nutshell, with instancing we can render multiple instances (hence the name) of the same geometry with only a single draw call. What this means is that we can draw the entire scene multiple times as if we were doing it only once (not entirely true, but for our purposes we can assume it works that way). The amount of draw calls is thus reduced by half in our case and we end up with code that may look like this:

// MVP matrices for left and right eye
GLfloat mvps[32];

// fetch location of MVP UBO in shader
GLuint mvpBinding = 0;
GLint blockIdx = glGetUniformBlockIndex(shader_id, "EyeMVPs");
glUniformBlockBinding(shader_id, blockIdx, mvpBinding);

// fetch MVP matrices for both eyes
for (int i = 0; i < 2; i++)
{
    OVR::Matrix4f MVPMatrix = g_oculusVR.OnEyeRender(i);
    memcpy(&mvps[i * 16], &MVPMatrix.Transposed().M[0][0], sizeof(GLfloat) * 16);
}

// update MVP UBO with new eye matrices
glBindBuffer(GL_UNIFORM_BUFFER, mvpUBO);
glBufferData(GL_UNIFORM_BUFFER, 2 * sizeof(GLfloat) * 16, mvps, GL_STREAM_DRAW);
glBindBufferRange(GL_UNIFORM_BUFFER, mvpBinding, mvpUBO, 0, 2 * sizeof(GLfloat) * 16);

// at this point we have both viewports calculated by the SDK, fetch them
ovrRecti viewPortL = g_oculusVR.GetEyeViewport(0);
ovrRecti viewPortR = g_oculusVR.GetEyeViewport(1);

// create viewport array for geometry shader
GLfloat viewports[] = { (GLfloat)viewPortL.Pos.x, (GLfloat)viewPortL.Pos.y, 
                        (GLfloat)viewPortL.Size.w, (GLfloat)viewPortL.Size.h,
                        (GLfloat)viewPortR.Pos.x, (GLfloat)viewPortR.Pos.y, 
                        (GLfloat)viewPortR.Size.w, (GLfloat)viewPortR.Size.h };
glViewportArrayv(0, 2, viewports);

// setup the scene and perform instanced render - half the drawcalls!
(...)
glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, num_verts, 2);

There’s a bit more going on now, so let’s go through the pseudocode step by step:

// MVP matrices for left and right eye
GLfloat mvps[32];

// fetch location of MVP UBO in shader
GLuint mvpBinding = 0;
GLint blockIdx = glGetUniformBlockIndex(shader_id, "EyeMVPs");
glUniformBlockBinding(shader_id, blockIdx, mvpBinding);

// fetch MVP matrices for both eyes
for (int i = 0; i < 2; i++)
{
    OVR::Matrix4f MVPMatrix = g_oculusVR.OnEyeRender(i);
    memcpy(&mvps[i * 16], &MVPMatrix.Transposed().M[0][0], sizeof(GLfloat) * 16);
}

Starting each frame, we recalculate the MVP matrix for each eye just as before. This time, however, it is the only thing we do in the loop. The results are stored in a GLfloat array, since this will be the shader input when drawing both eyes (a 4×4 matrix is 16 floats, so we need a 32-element array to store both). The matrices will be stored in a uniform buffer object, so we need to fetch the location of the uniform block before we can perform the update.

// update MVP UBO with new eye matrices
glBindBuffer(GL_UNIFORM_BUFFER, mvpUBO);
glBufferData(GL_UNIFORM_BUFFER, 2 * sizeof(GLfloat) * 16, mvps, GL_STREAM_DRAW);
glBindBufferRange(GL_UNIFORM_BUFFER, mvpBinding, mvpUBO, 0, 2 * sizeof(GLfloat) * 16);

// at this point we have both viewports calculated by the SDK, fetch them
ovrRecti viewPortL = g_oculusVR.GetEyeViewport(0);
ovrRecti viewPortR = g_oculusVR.GetEyeViewport(1);

// create viewport array for geometry shader
GLfloat viewports[] = { (GLfloat)viewPortL.Pos.x, (GLfloat)viewPortL.Pos.y, 
                        (GLfloat)viewPortL.Size.w, (GLfloat)viewPortL.Size.h,
                        (GLfloat)viewPortR.Pos.x, (GLfloat)viewPortR.Pos.y, 
                        (GLfloat)viewPortR.Size.w, (GLfloat)viewPortR.Size.h };
glViewportArrayv(0, 2, viewports);

// setup the scene and perform instanced render - half the drawcalls!
(...)
glDrawArraysInstanced(GL_TRIANGLE_STRIP, 0, num_verts, 2);

First, we update the UBO storing both MVPs with the newly calculated values, after which we get to the rendering part. Contrary to DirectX, there’s no trivial way in OpenGL to draw to multiple viewports with a single draw call, so we’re taking advantage of a (relatively) new feature: viewport arrays. This, combined with the gl_ViewportIndex built-in variable in a geometry shader, allows us to tell glDrawArraysInstanced() which rendered instance goes into which eye.
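To give an idea of what happens on the GPU side, here’s a minimal sketch of how the shader pair could be wired up (my own sketch – the actual shaders are in the GitHub sample). The vertex shader picks the MVP matrix using gl_InstanceID and forwards the index; the geometry shader then routes each instance to the matching viewport:

// vertex shader (sketch)
#version 410 core

layout(location = 0) in vec3 inVertexPos;

// matches the "EyeMVPs" uniform block bound in the C++ code above
layout(std140) uniform EyeMVPs
{
    mat4 mvp[2]; // left and right eye
};

flat out int eyeIndex;

void main()
{
    eyeIndex    = gl_InstanceID; // 0 = left eye, 1 = right eye
    gl_Position = mvp[gl_InstanceID] * vec4(inVertexPos, 1.0);
}

// geometry shader (sketch)
#version 410 core

layout(triangles) in;
layout(triangle_strip, max_vertices = 3) out;

flat in int eyeIndex[];

void main()
{
    for (int i = 0; i < 3; i++)
    {
        // pick the viewport set up earlier with glViewportArrayv()
        gl_ViewportIndex = eyeIndex[0];
        gl_Position      = gl_in[i].gl_Position;
        EmitVertex();
    }
    EndPrimitive();
}

The final result and performance graphs can be seen in the following screenshot: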


Test application rendering 2500 unculled, textured quads. Left: rendering scene twice, once per viewport. Right: using instancing.

Full source code of the test application can be downloaded from GitHub.


My experiences going Rust from C++

I’ve been experimenting with Rust for over 6 months now. I spent most of that time playing around with a C64 emulator I wrote as a first project, and initially I thought about creating a series on that topic. However, since there’s so much reading material about it on the Internet already, I figured it might be a good idea to write an intro to Rust for C/C++ programmers instead. But then I found this article, so I decided to take a completely different route.

In this post I wanted to outline the problems/quirks I ran into when transitioning from C++ to Rust. This by no means indicates that the language is poorly constructed – it’s just a matter of putting yourself in a completely different mindset, since Rust is really more than it seems at first glance and it has traps of its own if you try to code in it “C-style”. At the time of writing this, the latest stable version of the compiler is 1.7.0, so some things might get outdated with time. If you’ve been programming for a while and are considering trying out Rust, here are some things worth being wary of as you start:

1. Data ownership model

The first thing I had to learn is that variables in Rust are not variables at all but rather bindings to specific data. As such, the language introduces the concept of ownership, in the sense that data can be bound to one and only one “variable” at a time. There are good examples in the link above of how that works, so I won’t go into details here. The reason this caused me so much trouble is that references to other struct members and repeated function calls have to be thought through very carefully when writing a program in Rust. Gone is the idea of throwing pointers everywhere and reusing them as you see fit – once data in Rust is borrowed, you have to finish doing whatever you wanted with it before you can use it again somewhere else in the code. It’s an interesting concept, one that surely provides some extra safety measures that other languages lack; nevertheless, it takes a while to get accustomed to.
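A minimal illustration of both moves and borrows (my own example, not from the emulator):

fn main() {
    let s = String::from("C64");
    let t = s;             // ownership of the string moves from s to t
    // println!("{}", s);  // compile error: use of moved value `s`

    let mut v = String::from("VIC");
    let r = &v;            // immutable borrow of v
    // v.push('!');        // compile error: v cannot be mutated while borrowed
    println!("{} {}", t, r);
}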

2. No inheritance

The lack of a basic inheritance model in Rust forced me to duplicate some parts of the code. To give an example, the C64 has two timer chips which are essentially the same thing – for emulation purposes differing in only one function. The natural instinct here is to create a single struct and just overload that particular function, but Rust has no mechanism for it. The closest thing that met my needs was a trait, but what I really needed was “a trait with properties”. If your design relies heavily on OOP you should either rethink it or choose a different language.
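To illustrate (a hypothetical sketch, not the emulator’s actual code): shared logic can live in a trait’s default method, but the data fields still have to be duplicated in every struct, since a trait cannot carry them:

trait Timer {
    fn counter(&mut self) -> &mut u16; // field access has to go through a method
    fn on_underflow(&mut self);        // the one function that differs per chip

    // the shared part: counting down lives in a default method
    fn tick(&mut self) {
        let c = self.counter().wrapping_sub(1);
        *self.counter() = c;
        if c == 0xffff { self.on_underflow(); } // counter wrapped past zero
    }
}

struct TimerA { counter: u16 } // the field is duplicated...
struct TimerB { counter: u16 } // ...and so are the accessors below

impl Timer for TimerA {
    fn counter(&mut self) -> &mut u16 { &mut self.counter }
    fn on_underflow(&mut self) { /* e.g. raise an IRQ */ }
}

impl Timer for TimerB {
    fn counter(&mut self) -> &mut u16 { &mut self.counter }
    fn on_underflow(&mut self) { /* e.g. raise an NMI */ }
}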

3. No cyclic references

Having started to code my emulator “C-style”, I decided to go for a clear structure that would define the entire computer:

struct C64
{
   sid_chip: SID,  // audio   
   vic_chip: VIC,  // graphics
   cpu_chip: CPU,  // processor
   cia1_chip: CIA, // timer 1
   (...)
}

Each member variable is a simple struct type. The design was clean and satisfying, so I happily started hacking at the code implementing each chip in turn.

Halfway in my work I realized I made a terrible mistake.

It turned out that all components of the C64 struct would have to communicate with each other directly in certain situations, so I needed some sort of a “bus” component. I really wanted to avoid creating an artificial structure for that purpose, which would eventually introduce annoyances of its own. Global variables spread over all modules were not an option I wanted to use either.

It was a problem I couldn’t initially solve, for a couple of reasons: Rust doesn’t provide any mechanism for struct field objects to communicate with their parent, and you can’t just pass a reference to the parent since that breaks Rust’s ownership rules. Eventually I found a solution by embedding each chip structure in a RefCell nested inside an Rc. With this I was able to use cloned instances as separate references which I would then pass during each chip’s construction. In the simplest terms, this solution provides behavior similar to smart pointers: even though a clone of the first instance is being referenced, it still deals with the same data as the original. Once all instances are destroyed (or dropped, using Rust terminology) the memory is freed completely, so it’s safe from memory leaks.
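A minimal sketch of the idea (simplified, not the emulator’s actual code):

use std::cell::RefCell;
use std::rc::Rc;

struct CPU { cycles: u64 }

fn main() {
    // a single CPU instance behind a reference-counted RefCell
    let cpu = Rc::new(RefCell::new(CPU { cycles: 0 }));

    // cloning copies the handle, not the data - both refer to the same CPU
    let cpu_for_vic = cpu.clone();

    // either handle can mutate the chip; RefCell checks the borrow rules at runtime
    cpu_for_vic.borrow_mut().cycles += 1;
    assert_eq!(cpu.borrow().cycles, 1);
}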

4. No explicit NULL value

Being a language set on safety, Rust disallows creating an object without initializing all of its member variables. This means no NULL pointers that you can set at a later time, so I was stuck with a new problem after introducing RefCells:

struct VIC
{
   cpu_ref: Rc<RefCell<CPU>>,  // reference to the CPU object   
   (...)
}

impl VIC
{
    // construction of VIC
    pub fn new_shared() -> Rc<RefCell<VIC>> {
            Rc::new(RefCell::new(VIC {
                cpu_ref: CPU::new_shared(), // creating a shared instance of CPU - because we have to
                (...)
                }))
    }
}


struct CPU
{
   vic_ref: Rc<RefCell<VIC>>,  // reference to the VIC chip object   
   (...)
}

impl CPU
{
    // construction of CPU
    pub fn new_shared() -> Rc<RefCell<CPU>> {
            Rc::new(RefCell::new(CPU {
                vic_ref: VIC::new_shared(),  // this causes a problem!
                (...)
                }))
    }
}

What’s happening above is that once the CPU is constructed, it will force the creation of a VIC, which will in turn create another CPU and so on, resulting in infinite recursion. This is where Option comes into play, being the closest thing to a NULL value in Rust:

struct VIC
{
   cpu_ref: Option<Rc<RefCell<CPU>>>,  // now it's optional
   (...)
}

impl VIC
{
    // construction of VIC
    pub fn new_shared() -> Rc<RefCell<VIC>> {
            Rc::new(RefCell::new(VIC {
                cpu_ref: None,  // no infinite recursion - we'll set the reference later on
                (...)
                }))
    }
}

My only gripe with this approach was that I had to create a set_references() function specifically for each type, and since each struct held different references it couldn’t be neatly solved with a more generic trait.
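In practice each such function ended up looking something like this (a sketch of the pattern):

impl VIC
{
    // wire up the shared CPU handle once all the chips exist
    pub fn set_references(&mut self, cpu: Rc<RefCell<CPU>>)
    {
        self.cpu_ref = Some(cpu);
    }
}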

5. Rust macros are not what you think at first!

The natural way of thinking about macros when coming from a C background is “replace this expression with some elaborate piece of code”. Not surprisingly, Rust takes a completely different approach, deeming (quite rightfully) plain text substitution as unsafe and error prone. After switching to shared RefCell instances I faced the problem of obfuscated syntax when trying to access the actual underlying data:

// attempting to get inside the Rc<RefCell<CPU>> from within VIC struct.
// imagine typing that every single time when you need it!
self.cpu_ref.as_ref().unwrap().borrow_mut().set_vic_irq(true);

Unlike C, a macro in Rust is treated as a syntactic structure and as such has limitations of its own. You can’t access properties of an object nor can you use the self keyword to simplify your code further:

macro_rules! as_ref {
    ($x:expr) => ($x.as_ref().unwrap().borrow_mut())
}

(...)

// same code using a Rust macro - as short as it could get
as_ref!(self.cpu_ref).set_vic_irq(true);

(...)

While I understand the reasoning behind making macros the way they are, I still find it a bit disappointing not being able to use the less safe C-style variant.

6. Type wrapping is technically undefined

This is a language trait I’m a bit on the fence about. In C, once you go beyond a data type’s range, the value automatically wraps around – a feature that’s been used extensively on 8-bit computers as well. At the time of writing this, overflowing arithmetic in Rust is considered a program error and will cause a panic! in debug builds. Wrapping data is possible but requires additional boilerplate:

(...)

self.some_8bit_variable = self.some_8bit_variable.wrapping_add(1); // safely wraps 255 -> 0 when addition overflows

(...)

While it’s fine that Rust explicitly tells us where data wrapping is meant to happen, I’d still want to be able to manually turn that check off for the sake of more compact code.
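For what it’s worth, the standard library does offer a middle ground in std::num::Wrapping, which makes wrap-around the defined behavior for the ordinary operators – still more ceremony than plain C, though:

use std::num::Wrapping;

fn main() {
    // Wrapping<T> opts the value into wrap-around arithmetic
    let mut counter = Wrapping(255u8);
    counter = counter + Wrapping(1);
    assert_eq!(counter.0, 0); // wrapped 255 -> 0, no panic in debug builds
}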

7. Type explicitness everywhere

Depending on your point of view, one of the biggest flaws/merits of C and C++ is implicit type conversion when assigning variables to each other, so you can “safely” assign a char to an int and pretty much expect the code to work, as long as you know what you’re doing. Also, let’s disregard for a second that we’re pragmatic programmers who adhere to compiler warnings – my experience shows that when it comes to data precision they’re mostly ignored (or completely turned off!).

So the thing is, Rust disallows assigning different types of variables to each other unless you explicitly cast one type to the other. The syntax of such a cast, however, I found slightly cumbersome to use, especially when I had to perform several casts in one operation (adding bytes, casting them to words to perform a shift, then going back to bytes again etc.). This is something one has to get used to, but in my early code it was the major cause of bugs:

// EXAMPLE 1
// relative memory addressing mode in C64 code excerpt: fetching operand
// bugged code: wrong relative offset calculated
fn get_operand_rel() -> u8
{
    let offset = mem.next_byte() as i8;  // memory offset should be treated as a signed char
    let addr = cpu.prog_counter + offset as u16; // BUG!
    mem.read_byte(addr as u16) // address in mem is stored as u16, so have to cast it *again*
}

// correct code: (took quite a while to track the bug down!)
fn get_operand_rel() -> u8
{
    let offset = mem.next_byte() as i8;
    let addr: i16 = cpu.prog_counter as i16 + offset as i16; // Correct! Casting both to i16
    mem.read_byte(addr as u16)
}

// EXAMPLE 2
fn foo()
{
    let var = mem.next_byte() as u8;
    let var_word: u16 = (var as u16) << 8; // would probably look neater as (u16)var << 8
    (...)
    
    // this is legal Rust code!
    let var2 = 10 as usize as u16 as u8 as u32 as f64;
}

I admit – a lot of this is me being subjective about my own preferences, but the point is that if you’re used to extensive casting, you may run into trouble understanding your own code. On the other hand, this may encourage programmers to break more complicated operations into steps for the sake of clarity. Seeing how mixed current Rust codebases are, I’m not so sure this will happen any time soon, though.

8. Forget OOP – go functional

Disregarding OOP as the silver bullet of programming is not uncommon today, as people realize how many problems that model creates once you go deeper. Cache misses, convoluted relationships between different classes and sometimes over-the-top design patterns would be at the top of the list. As I got more experienced with Rust, it became clear that functional programming is its main focus. If you decide to write an application, you may have to forget quite a few things you know from C++. Use functions. Use modules. Use structs, but don’t rely heavily on OOP patterns to ensure communication between objects. In the end it will only make you happy, as the code becomes a lot more readable and easier to navigate – and this comes from a person who made his first Rust application entirely in Emacs!

Try Rust. You will enjoy it! 🙂


Sparse matrices and projection calculations

If you’ve ever worked with high performance 3D applications, you know that every cycle counts. One of the issues programmers try to solve is reducing computation time when dealing with matrices and vectors, especially if the calculations are performed very frequently each frame. Here’s a little trick that can save you some memory and cycles when applying the projection. Consider a typical projection matrix P:

[ A, 0, 0, 0 ]
[ 0, B, 0, 0 ]
[ 0, 0, C, D ]
[ 0, 0, E, 0 ]

Projection is essentially just a linear transformation in homogeneous coordinate space. If vertex coordinates are already expressed in view space as a 4-element vector (Vv):

Vv = [vx, vy, vz, 1]

then getting the clip space homogeneous coordinates (Vh) is simply:

Vh = P * Vv = [ a, b, c, w ]
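Multiplying it out shows where the savings come from – with w = 1 only four distinct terms remain:

Vh = [ A*vx, B*vy, C*vz + D, E*vz ]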

We then divide a, b and c by w to get the regular 3D coordinates. There’s a small optimization that can be done here, provided that:

1. Most elements of the object’s transformation/modelview matrix are 0 (this is called a sparse matrix). If this is the case, performing an additional projection transformation using a full matrix multiplication is unnecessary and creates computation and memory overhead.
2. The sparse transformation/modelview matrix has its last row equal to [ 0, 0, 0, 1 ].
3. The object’s transformation/modelview matrix is non-projective (we wouldn’t have to bother otherwise!).

If condition 1 is met, we only need 4 values to apply the projection (instead of a full matrix), as long as we divide them by a positive constant – here D: [ A/D, B/D, C/D, E/D ].

Condition 2 implies that for a proper model -> view vertex transformation we only need three 4-element vectors instead of a 4×4 modelview matrix. We can then get the view space coordinates of a vertex by performing dot products. Below is a simplified vertex shader showing how the optimization can be applied (in the actual codebase I used it to reduce the size of bone animation matrices, so the full code is slightly more complex):

#version 100

attribute highp vec3 inVertexPos;
uniform   highp vec4 inModelView[3]; // first three rows of modelview matrix
uniform   highp vec4 inProjection;   // projection matrix elements expressed as: [A/D, B/D, C/D, E/D]

void main()
{
    // transform from model space to view space
    highp vec3 viewSpacePos;
    viewSpacePos.x = dot(inModelView[0], vec4(inVertexPos, 1.0));
    viewSpacePos.y = dot(inModelView[1], vec4(inVertexPos, 1.0));
    viewSpacePos.z = dot(inModelView[2], vec4(inVertexPos, 1.0)); 

    // calculations using view space vertex position
    (...)

    // transform from view space to clip space
    gl_Position = viewSpacePos.xyzz * inProjection;
    gl_Position.z += 1.0;
}

Note that depending on how you define the projection matrix in your codebase, it might be necessary to flip signs, i.e.:

(...)

uniform   highp vec4 inProjection; // projection matrix elements expressed as: [-A/D, -B/D, -C/D, -E/D]

void main()
{
    (...)
    gl_Position.z -= 1.0;
}

This neat trick, as well as many others, came from the PowerVR Performance Recommendations, which are an excellent resource for mobile graphics developers.


Templates and C-style array size

If you deal with templates a lot in your C++ code, then you’re likely familiar with how template type deduction works. It’s an extensive topic which I’m not going to cover in detail here, but while reading this book I found one aspect of it quite useful in my work.

Remember how you sometimes need to know the size of a C-style array? One way to determine it is doing something like this:

const int cArray[] = { 1, 2, 3, 4, 5 };

std::size_t aSize = sizeof(cArray) / sizeof(cArray[0]);

Nothing wrong with doing it this way, but thanks to templates and the way they deduce data types we can get the array size in a much cleaner manner. To briefly recap on C-style arrays: even though you can declare a function using this syntax:

// how we may declare a function
void foo(int arr[])
{
}

What we essentially get is an implicit pointer conversion by the compiler:

// what compiler actually sees
void foo(int *arr)
{
}

This poses a problem, since there doesn’t seem to be an easy and obvious way to “extract” the array size from the function parameter. But there is a solution, one that employs templates. If you read up on how template type deduction works, you’ll learn that C-style arrays (and function pointers) are a special case, treated differently depending on how you declare a template function:

– If a template function takes its parameter by value (non-reference, non-pointer), the deduced type for a C-array is a pointer to its first element
– If a template function takes its parameter by reference, the deduced type for a C-array is the actual array type

So in practice, what happens is:

const int cArray[] = { 1, 2, 3, 4, 5 };

template<typename T>
void foo(T arg)
{
}

template<typename T>
void foo2(T& arg)
{
}

foo(cArray);  // T will be deduced as const int*
foo2(cArray); // T will be deduced as const int[5]

The T& case produces an interesting implication, since it allows us to directly access the size of an array from within the template function. For this, however, the argument has to be slightly modified:

// get C-style array size in a simple, painless way
template<typename T, std::size_t N>
constexpr std::size_t arraySize(T(&)[N]) noexcept
{
    return N;
} 

// Usage example:
const int cArray[] = { 1, 2, 3, 4, 5 };
std::size_t aSize = arraySize(cArray); // aSize is now 5

The constexpr keyword allows us to use the function’s result to initialize static array sizes, while noexcept gives the compiler a chance for additional optimization. While this solution doesn’t help much when dealing with dynamically allocated arrays, it’s fun to know that C++ templates, usually regarded as code obfuscators, can make programming cleaner in certain applications.
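As a quick illustration of the constexpr part: the returned size is a compile-time constant, so it can legally size another static array:

const int cArray[] = { 1, 2, 3, 4, 5 };
int mappedVals[arraySize(cArray)]; // a new array of 5 elements, sized at compile time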


Using C unions in high level code

I’ve always considered the C union to be the highly underappreciated ugly child that nobody cares about in high level programming. Not really meant for persistent storage (specific cases excluded), it’s difficult to see a good use for such a construct, especially for beginner programmers. Unless you deal with compilers or close-to-the-metal development, chances are you have barely used or spotted a union in an application’s runtime code. But unions can come in handy, sometimes in quite unexpected ways. I often forget about their applications, so hopefully this post will help me remember in the future.

So what can we do with them? Consider a situation where we would like to pack one data type into another. Specifically, assume that we want to pack low precision uint8_t 4D vector coordinates into a single uint32_t variable. What most people think of doing first is taking each respective coordinate and ORing it into the target using bit shifts. This would work fine, but thanks to unions we can write nicer, cleaner code for this purpose:

uint32_t packVector(uint8_t x, uint8_t y, uint8_t z, uint8_t w)
{
    union
    {
        uint32_t m_packed;
        uint8_t  m_unpacked[4];
    } u; // 4 bytes

    u.m_unpacked[0] = x;
    u.m_unpacked[1] = y;
    u.m_unpacked[2] = z;
    u.m_unpacked[3] = w;

    return u.m_packed;
}
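Going the other way is just as clean (a hypothetical counterpart to the function above) – though note that which coordinate lands in which byte of the packed value depends on the platform’s endianness:

void unpackVector(uint32_t packed, uint8_t out[4])
{
    union
    {
        uint32_t m_packed;
        uint8_t  m_unpacked[4];
    } u;

    u.m_packed = packed;

    for (int i = 0; i < 4; i++)
        out[i] = u.m_unpacked[i];
}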

A completely different problem that a union can easily solve is floating point endian swapping. You usually run into this when dealing with different CPU architectures and binary data stored in a file. Depending on the supported platforms you may need to juggle between little and big endian, and in the case of floats this might incur performance issues if you do the integer endian swap with “classic” type casting. But all those problems disappear if, instead of casting, you use a union:

// assuming that float is 32 bit for simplicity
float swapFloat(float value)
{
    union
    {
        uint32_t m_integer;
        float    m_float;
    } u;

    u.m_float   = value;
    u.m_integer = integerEndianSwap(u.m_integer); // no penalty

    return u.m_float;
}

See this article for a detailed explanation on why this is generally a better approach.


Finding an alternative to std::bitset

In one of my games I ran into a seemingly simple problem: saving a puzzle state (i.e. completed/not completed) for each of the 105 available levels. Naturally, the first thing that came to mind was a static bool array with the required number of entries – both easy to maintain and to write to disk without hassle:

bool m_levelComplete[105];  // array of 105 bools (in most cases: 8 * 105 bits)

That would have served my purpose well, but I felt like expanding on the problem a bit. In my particular case the array would take 105 bytes (or 840 bits). Not too much, but the excessive redundancy felt “dirty”, so I decided to tackle the issue and find a different way. The second solution that came to mind was a std::bitset object:

#include <bitset>

std::bitset<105> m_levelComplete;  // better in terms of size but still "not quite right"

A lot better in terms of storage, std::bitset felt like the thing I needed… with one exception (well, two, as it turned out later): it was not trivial to serialize and store to disk (which I didn’t like) and it produced erratic behavior on one of the target platforms (Nintendo 3DS). For various reasons I also needed to know the underlying type the bits were “stored in”, so this particular implementation didn’t fit my purpose. Eventually I started implementing a simple bitset/bit vector of my own:

#define BS_NUM_ELEM ((numBits + sizeof(storageType) * 8 - 1) / (sizeof(storageType) * 8)) // round up to a whole number of elements

template<typename storageType, size_t numBits> class BitSet
{
    storageType m_bits[BS_NUM_ELEM];
};

For starters, I encapsulate the m_bits array of the desired data type. To provide the requested number of bits, I calculate the required number of array elements using the BS_NUM_ELEM macro. As you can see, the element count is always rounded up, so we get nicely aligned data. For example: using unsigned long long as the underlying type (and assuming it’s 64 bits in size), 105 bits give us m_bits[2] – essentially 128 available bits. Since this is a regular C-style array it can easily be saved to a file, and the size redundancy stays low (depending on which data type you use, it can get even lower). Having this very basic structure, I needed a way to set, clear and access individual bits:

#define BS_NUM_ELEM ((numBits + sizeof(storageType) * 8 - 1) / (sizeof(storageType) * 8)) // round up to a whole number of elements

template<typename storageType, size_t numBits> class BitSet
{
public:
    // set bit value
    void Set(size_t bitNo)
    {
        for (size_t i = 0; i < BS_NUM_ELEM; ++i)
        {
            if (bitNo < (i + 1) * sizeof(storageType) * 8)
            {
                m_bits[i] |= ((storageType)0x01 << (bitNo - i * sizeof(storageType) * 8));
                break;
            }
        }
    }

    // clear bit value
    void Clear(size_t bitNo)
    {
        for (size_t i = 0; i < BS_NUM_ELEM; ++i)
        {
            if (bitNo < (i + 1) * sizeof(storageType) * 8)
            {
                m_bits[i] &= ~((storageType)0x01 << (bitNo - i * sizeof(storageType) * 8));
                break;
            }
        }
    }

    // access bit value
    bool operator[](size_t bitNo)
    {
        for (size_t i = 0; i < BS_NUM_ELEM; ++i)
        {
            if (bitNo < (i + 1) * sizeof(storageType) * 8)
            {
                return ((m_bits[i] >> (bitNo - i * sizeof(storageType) * 8)) & 0x01) != 0;
            }
        }

        return false;
    }

private:
    storageType m_bits[BS_NUM_ELEM];
};

The code should be fairly self-explanatory (I omitted constructors for clarity). First we determine which stored variable our desired bit resides in. Once it’s located, we set/clear the proper bit in that variable. Accessing a bit is done easily with operator[] – note how I return a boolean which simply says whether the desired bit is non-zero.

 // Usage
BitSet<unsigned long long, 104> bv; // create an "aligned" 128 bit set stored in two ULL variables.
printf("%d", sizeof(bv)); // 16 bytes/128 bits
bv.Set(24); // set bit 24
bv.Set(103); // set bit 103

...
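And since the bits live in a plain C-style array, saving the whole set to disk boils down to a single write. A sketch, assuming a trivial accessor pair (Data()/SizeInBytes(), not part of the class above) were added:

#include <cstdio>

// hypothetical save helper - assumes BitSet exposes its raw storage
bool SaveBitSet(const BitSet<unsigned long long, 104> &bv, FILE *fp)
{
    return fwrite(bv.Data(), 1, bv.SizeInBytes(), fp) == bv.SizeInBytes();
}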

Full template code can be downloaded from: https://github.com/kondrak/cpp_tools/tree/master/bitset
