Friday, November 4, 2011

My favorite C# micro-optimizations #2: 1D Arrays

This is my second post on my favorite C# micro-optimizations. Everything I said in the first post applies here as well, namely: don’t optimize until you know that you need to, and spend most of your optimization budget on better algorithms rather than on figuring out which way of accessing an array is fastest.

That said, this one surprised me: jagged arrays (i.e., arrays of arrays) are much faster than rectangular arrays, but indexed access into a 1D array is faster still.

In other words, this is relatively slow:

for (int x = 0; x < size1; x++)
{
    for (int y = 0; y < size2; y++)
    {
        result = twoDArray[x, y];
    }
}

This is significantly faster:

for (int x = 0; x < size1; x++)
{
    for (int y = 0; y < size2; y++)
    {
        result = jaggedArray[x][y];
    }
}

But this is the fastest:

for (int x = 0; x < size1; x++)
{
    for (int y = 0; y < size2; y++)
    {
        result = oneDArray[x * size2 + y];
    }
}

Benchmark results on my machine (1000 iterations through a 1000x1000 array):

2DArrayTest action completed: iteration = 5, completionTime = 3111, averageCompletionTime = 3101.400
JaggedArrayTest action completed: iteration = 5, completionTime = 2283, averageCompletionTime = 2256.000
1DArrayTest action completed: iteration = 5, completionTime = 1698, averageCompletionTime = 1638.800
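For what it’s worth, the numbers above came from a simple Stopwatch-based harness. The sketch below (the names and structure are mine, not the original test code) shows the general shape of such a test for the 1D case:

```csharp
using System;
using System.Diagnostics;

public static class ArrayBench
{
    // Times repeated sequential passes over a size1 x size2 grid stored in a
    // flat 1D array. A sketch of the kind of harness behind the numbers
    // above, not the original benchmark code.
    public static long TimeOneDArray(int size1, int size2, int iterations)
    {
        var oneDArray = new int[size1 * size2];
        long checksum = 0; // consume the reads so the JIT can't drop the loop
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            for (int x = 0; x < size1; x++)
            {
                for (int y = 0; y < size2; y++)
                {
                    checksum += oneDArray[x * size2 + y];
                }
            }
        }
        sw.Stop();
        Console.WriteLine("checksum = {0}", checksum);
        return sw.ElapsedMilliseconds;
    }
}
```

Accumulating a checksum (and printing it) matters: without some observable use of the reads, the JIT is free to eliminate the loop entirely and you end up timing nothing.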

It’s also very much worth noting that the order in which the iteration happens is important. Sequential memory access is much faster than non-sequential access, because data is fetched from main memory into the CPU’s cache a cache line at a time, so subsequent sequential reads are served from cache.

In other words, if instead of this at the center of each loop:

result = jaggedArray[x][y];
result = twoDArray[x, y];
result = oneDArray[x * size2 + y];

We switch the order of the indexes, so that memory is accessed out of order (the loops stride down columns rather than walking along rows):

result = jaggedArray[y][x];
result = twoDArray[y, x];
result = oneDArray[y * size1 + x];

Then the tests take more than twice as long:

JaggedArrayTest action completed: iteration = 5, completionTime = 7499, averageCompletionTime = 7676.400
2DArrayTest action completed: iteration = 5, completionTime = 6790, averageCompletionTime = 6818.600
1DArrayTest action completed: iteration = 5, completionTime = 5862, averageCompletionTime = 5883.200

It’s interesting that with this out-of-order access pattern, rectangular arrays are indeed faster than jagged arrays; but one-dimensional arrays are still the fastest.
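If you adopt the flat 1D layout but want to keep two-dimensional call sites readable, one option (a sketch of my own, not something from the post) is a thin wrapper that hides the index arithmetic behind an indexer:

```csharp
using System;

// A minimal grid backed by a flat 1D array, using the same
// x * size2 + y arithmetic as the fastest loop above.
public sealed class Grid<T>
{
    private readonly T[] _data;

    public int Width { get; private set; }   // number of x values (size1)
    public int Height { get; private set; }  // number of y values (size2)

    public Grid(int width, int height)
    {
        Width = width;
        Height = height;
        _data = new T[width * height];
    }

    public T this[int x, int y]
    {
        get { return _data[x * Height + y]; }
        set { _data[x * Height + y] = value; }
    }
}
```

Be aware that the JIT may or may not inline the indexer, so in your very hottest loops you may still want the raw multiply-and-add against the underlying array; as always, measure.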

Thursday, November 3, 2011

My favorite C# micro-optimization #1: Buffer.BlockCopy()

So, a caveat up front: micro-optimizations are evil. You almost never need them. You can live the vast majority of your life as a programmer and never use them; and most of the time you do use them, you’ll use them incorrectly. If your program is running slowly, you’ll almost never find some tiny micro-optimization that will fix it.

But there are times when they’re handy, and with all the work I’ve been doing the last couple years on a custom Silverlight media stack, I’ve had a number of opportunities to learn about how to speed up code that gets called thousands of times each second. This series of articles is the result of that experience.

And my first suggestion is to use Buffer.BlockCopy() when you need to move data in and out of arrays. You don’t normally need to do this very heavily, but when you’re dealing with real-time multimedia, you’ll find yourself doing it all the time, so the faster you can do it, the better.

I mention this because a common way to move data from one array to another is through a for loop, like so:

for (int i = 0; i < sourceArray.Length; i++)
{
    destinationArray[i] = sourceArray[i];
}

Unfortunately, that’s very nearly the slowest way to do it. (Array.Clone() is actually slower, but that’s another story.) A much better way is to use Array.Copy(). Indeed, this is normally the preferred method, since you don’t have to worry about the size of the individual elements, which is an easy place to make mistakes. Moreover, Array.Copy() works for arrays of any type, not just primitive value types.

Array.Copy(sourceArray, 0, destinationArray, 0, sourceArray.Length);

But as it turns out, Buffer.BlockCopy() is slightly faster for many operations, since it skips some of the checks that Array.Copy() performs at the start of the copy. (Note that it works only on arrays of primitive types, and that its offsets and count are specified in bytes, hence the sizeof(short) below.)

Buffer.BlockCopy(sourceArray, 0, destinationArray, 0, sourceArray.Length * sizeof(short));
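One easy mistake: Buffer.BlockCopy() counts bytes, not elements, which is why the call above multiplies by sizeof(short). A quick illustration (my own example) with an int[]:

```csharp
using System;

int[] source = { 1, 2, 3, 4 };
int[] destination = new int[4];

// Offsets and count are in BYTES for Buffer.BlockCopy, so an int[]
// needs Length * sizeof(int), not just Length.
Buffer.BlockCopy(source, 0, destination, 0, source.Length * sizeof(int));

// Passing source.Length alone would copy only 4 bytes -- a single int.
Console.WriteLine(string.Join(", ", destination)); // 1, 2, 3, 4
```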

Benchmark results on my machine (10 million copies of a 64-element short[] array):

Buffer.BlockCopy action completed: iteration = 5, completionTime = 420, averageCompletionTime = 422.000
Array.Copy action completed: iteration = 5, completionTime = 478, averageCompletionTime = 482.600
forLoop action completed: iteration = 5, completionTime = 1092, averageCompletionTime = 1093.000

That said, the difference between Array.Copy() and Buffer.BlockCopy() tends to disappear as the amount of data to copy grows (while a for loop falls further and further behind). If you’re copying 64,000 elements instead of 64, these are the results:

Buffer.BlockCopy action completed: iteration = 5, completionTime = 1251, averageCompletionTime = 1229.000
Array.Copy action completed: iteration = 5, completionTime = 1254, averageCompletionTime = 1218.200
forLoop action completed: iteration = 5, completionTime = 11063, averageCompletionTime = 11082.200

But there’s one significant caveat: if you only need to move a few elements, even the minimal overhead of Buffer.BlockCopy() can be too much. If you need to move fewer than 32 elements, you’ll probably find that a straightforward for loop (or perhaps an unrolled version of it) is fastest. Of course, you’ll want to do your own testing to find the break-even point on your hardware. But here are my results with 1,000,000 copies of a 16-element array:

Buffer.BlockCopy action completed: iteration = 5, completionTime = 367, averageCompletionTime = 363.800
Array.Copy action completed: iteration = 5, completionTime = 418, averageCompletionTime = 420.000
forLoop action completed: iteration = 5, completionTime = 291, averageCompletionTime = 290.000
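Putting that break-even observation into code, you could imagine a small helper that picks a strategy based on length. This is a sketch under my own assumptions: the threshold of 32 comes from the measurements above and will vary by machine and element type.

```csharp
using System;

public static class FastCopy
{
    // Below this length, a plain loop beat Buffer.BlockCopy in my tests;
    // measure on your own hardware to find the real break-even point.
    private const int BlockCopyThreshold = 32;

    public static void Copy(short[] source, short[] destination, int count)
    {
        if (count < BlockCopyThreshold)
        {
            // Small copies: loop overhead is lower than BlockCopy's setup cost.
            for (int i = 0; i < count; i++)
            {
                destination[i] = source[i];
            }
        }
        else
        {
            // Large copies: BlockCopy wins. Count is in bytes, hence sizeof(short).
            Buffer.BlockCopy(source, 0, destination, 0, count * sizeof(short));
        }
    }
}
```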

Tuesday, November 1, 2011

Fast (approximate) Sqrt method in C#

As part of the video codec that Alanta uses, we need to calculate the variation between blocks in one frame and the equivalent blocks in the next frame. I’ve been using an algorithm proposed by Thiadmer Riemersma that looks something like this:

public static double GetColorDistance(byte r1, byte g1, byte b1, byte r2, byte g2, byte b2)
{
    int rmean = (r1 + r2) / 2;
    int r = r1 - r2;
    int g = g1 - g2;
    int b = b1 - b2;
    // Weight the channels by perceptual sensitivity, shifting the red and
    // blue weights with the mean red level (integer division keeps this cheap).
    int weightR = 2 + rmean / 256;
    const int weightG = 4;
    int weightB = 2 + (255 - rmean) / 256;
    return Math.Sqrt(weightR * r * r + weightG * g * g + weightB * b * b);
}
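As a quick sanity check of the arithmetic (a worked example of my own, not from the original algorithm write-up), consider pure red against black:

```csharp
using System;

// Pure red (255, 0, 0) vs. black (0, 0, 0):
// rmean = (255 + 0) / 2 = 127, so weightR = 2 + 127 / 256 = 2 (integer division);
// the green and blue differences are zero, leaving only the red term:
double distance = Math.Sqrt(2 * 255 * 255);
Console.WriteLine(distance); // ≈ 360.62
```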

It works pretty well, but that last step depends on a square root calculation, which is relatively slow; and when this is something you need to run on every pixel in a frame, you want it to be as fast as possible. Consequently, I’ve been looking at ways to optimize it.

The important thing to note is that for my purposes, close is good enough: I don’t need IEEE precision. It turns out that there’s a pretty good approximation that’s available in languages like C or C++ which let you do unsafe casts back and forth between ints and floats:

float sqrt_approx(float z)
{
    union
    {
        int tmp;
        float f;
    } u;
    u.f = z;
    u.tmp -= 1 << 23; /* Subtract 2^m. */
    u.tmp >>= 1; /* Divide by 2. */
    u.tmp += 1 << 29; /* Add ((b + 1) / 2) * 2^m. */
    return u.f;
}

The problem with this approach is that C# doesn’t normally let you play this kind of magic.

The key word being “normally”, of course.

Turns out there’s one trick you can use to make C# treat the same memory address as either an int or a float, and that’s to create a struct with a [StructLayout(LayoutKind.Explicit)] attribute. (And surprisingly enough, the trick works in Silverlight as well.) The resulting class looks like this:

public class Approximate
{
    public static float Sqrt(float z)
    {
        if (z == 0) return 0;
        FloatIntUnion u;
        u.tmp = 0;
        u.f = z;
        u.tmp -= 1 << 23; /* Subtract 2^m. */
        u.tmp >>= 1; /* Divide by 2. */
        u.tmp += 1 << 29; /* Add ((b + 1) / 2) * 2^m. */
        return u.f;
    }

    [StructLayout(LayoutKind.Explicit)]
    private struct FloatIntUnion
    {
        [FieldOffset(0)]
        public float f;

        [FieldOffset(0)]
        public int tmp;
    }
}

The results are pretty good: it’s more than twice as fast, and the results tend to be within a few percent of the “real” answer:

MathSqrtTest: averageCompletionTime = 2424.000
ApproxSqrtTest: averageCompletionTime = 1058.000
Average variation: 0.0253587081881525

If you want a bit more accuracy at the cost of an additional CPU cycle or two, you can use this one, based on the famous inverse square root method in Quake 3:

public static float Sqrt2(float z)
{
    if (z == 0) return 0;
    FloatIntUnion u;
    u.tmp = 0;
    float xhalf = 0.5f * z;
    u.f = z;
    u.tmp = 0x5f375a86 - (u.tmp >> 1);
    u.f = u.f * (1.5f - xhalf * u.f * u.f);
    return u.f * z;
}

It’s a tad slower than the first (though still nearly twice as fast as Math.Sqrt()), but much more accurate:

MathSqrtTest: averageCompletionTime = 2428.400
ApproxSqrt2Test: averageCompletionTime = 1361.000
Average variation for method2: 0.000928551700594071
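To see the accuracy trade-off concretely, here’s a self-contained comparison of both approximations against Math.Sqrt(). The class name SqrtCheck and the sample inputs are my own; the two methods are the ones shown above.

```csharp
using System;
using System.Runtime.InteropServices;

public static class SqrtCheck
{
    [StructLayout(LayoutKind.Explicit)]
    private struct FloatIntUnion
    {
        [FieldOffset(0)] public float f;
        [FieldOffset(0)] public int tmp;
    }

    // First approximation: halve the exponent via integer bit tricks.
    public static float Sqrt(float z)
    {
        if (z == 0) return 0;
        FloatIntUnion u;
        u.tmp = 0;
        u.f = z;
        u.tmp -= 1 << 23; /* Subtract 2^m. */
        u.tmp >>= 1;      /* Divide by 2. */
        u.tmp += 1 << 29; /* Add ((b + 1) / 2) * 2^m. */
        return u.f;
    }

    // Second approximation: inverse-square-root trick plus one Newton step.
    public static float Sqrt2(float z)
    {
        if (z == 0) return 0;
        FloatIntUnion u;
        u.tmp = 0;
        float xhalf = 0.5f * z;
        u.f = z;
        u.tmp = 0x5f375a86 - (u.tmp >> 1);
        u.f = u.f * (1.5f - xhalf * u.f * u.f);
        return u.f * z;
    }

    public static void Main()
    {
        foreach (float z in new[] { 0.5f, 2f, 100f, 12345f })
        {
            double exact = Math.Sqrt(z);
            Console.WriteLine("sqrt({0}): error {1:P2} (Sqrt) / {2:P2} (Sqrt2)",
                z,
                Math.Abs(Sqrt(z) - exact) / exact,
                Math.Abs(Sqrt2(z) - exact) / exact);
        }
    }
}
```

On inputs like these, the first method’s error peaks around 6% (near powers of two with odd exponents), while the Newton-refined version stays well under 1%, which matches the average-variation numbers above.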