Thursday, September 16, 2010

Valid Reasons for Using Threads

We had an interesting debate this morning in our daily Alanta standup meeting.  We’re still in the process of implementing our proprietary media server (long story), and I’d noticed during a recent debugging session that with three clients connected, our media server had spawned something like 118 different threads.  That didn’t seem right to me, so we had a long discussion about whether we actually needed that many threads and whether it represented a significant problem for our architecture.  The answers that seemed obvious to me – no, we didn’t, and yes it was – weren’t as obvious to everyone else.  We were eventually able to come to a consensus about how to proceed, but I thought I’d jot down some thoughts about when it’s appropriate to spawn threads, when it’s not, and some best practices about how to achieve the right balance in your architecture.

Why you don’t want a lot of threads

First things first: you really do want to minimize the number of threads in your application.  One well-known reason, of course, is that there’s overhead every time the processor has to switch between threads: it has to save the state of the currently executing thread, then load the previously saved state of the next thread (and if the threads belong to different processes, it needs to reload the virtual address translation tables as well).  This results in a practical limit of about 30 active threads per process per processor.  Any more than that, and you’ll be spending more time dealing with thread overhead than getting anything done.

However, the argument advanced this morning was that if the threads weren’t active, i.e., were just waiting to be signaled, then they weren’t chewing up CPU cycles with context switches, and hence even very large numbers of them were harmless.  The problem with this argument is that threads have a significant virtual memory overhead even when they’re inactive.  As Raymond Chen has pointed out, each thread you spawn needs its own stack, and by default, the stack size on a Windows box is 1 MB.  This means that every time you spin up another thread, you’re adding another 1 MB to the virtual memory your process is consuming.  So on a 32-bit machine, since 2 GB of the address space is generally reserved for the kernel, the maximum number of threads that any one process could reasonably hope to create is ~2000.  (On Linux, the default stack size is 2 MB, which introduces a practical limit of ~1000 threads per process.)  Of course, this is only on 32-bit machines, and you can get away with quite a bit more on 64-bit machines, but the point remains: every thread chews up allocatable address space, and quite a lot of it.

One caveat: each thread chews up 1 MB of virtual memory, not physical memory (at least, not right away).  In other words, if your process creates 500 threads, you won’t necessarily see your process chewing up 500 MB of physical memory in Task Manager.  That’s because the pages of physical memory that back each stack aren’t allocated until they’re referenced (as described here).  So you can’t tell you’re running low on virtual memory until you’ve actually run out of address space, and the first indication you’ll have that you’re about to die is a strange and otherwise inexplicable Out of Memory error.
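If you really do find yourself needing a lot of mostly-idle threads, one partial mitigation is to shrink the stack reservation when you create each thread.  Here’s a minimal sketch in C#; the 256 KB figure is purely illustrative, and you’d need to be sure your deepest call chain actually fits in whatever size you pick.

    using System;
    using System.Threading;

    class StackSizeDemo
    {
        static void Main()
        {
            // Each of these threads reserves only 256 KB of address space for
            // its stack instead of the default 1 MB, so the same 2 GB of
            // 32-bit user address space stretches roughly four times further.
            const int stackBytes = 256 * 1024;

            for (int i = 0; i < 8; i++)
            {
                var t = new Thread(WaitForWork, stackBytes);
                t.IsBackground = true;
                t.Start();
            }

            Console.ReadLine();
        }

        static void WaitForWork()
        {
            // Blocks forever; an idle thread still holds its stack reservation.
            Thread.Sleep(Timeout.Infinite);
        }
    }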

I should note that it’s theoretically possible to create a threading model that doesn’t suffer quite so badly from the stack problem noted above.  Rob von Behren and some compatriots wrote a great paper on the topic, in which they demonstrated a system running with over 100,000 active threads.  Unfortunately, their recommendations haven’t been widely adopted by tool and framework vendors.  For instance, they point out that a compiler/linker which dynamically adjusted stack sizes would lower the memory overhead for each thread pretty substantially.  This is true, but I’m not aware of any mainstream compilers that actually do this.

So in brief, if you’re writing papers for a conference, create as many threads as you want. If you’re writing software with mainstream compilers, frameworks and operating systems, limit the number of threads you create.

Three reasons for creating threads

So it’s pretty clear why we don’t want a whole bunch of threads.  Similarly, there are a pretty limited number of reasons for ever creating threads in the first place:

  1. To take advantage of multiple processors.  This is the classic reason.  If your software will typically be running on a multiprocessor or multicore machine, and you really do have more work than one processor can handle in a timely fashion, it makes sense to split that processing up into multiple independent threads of execution.
  2. To move specific processing off of the UI thread.  This is a much more pragmatic reason.  In the normal Windows world, you typically create threads so that a given background task doesn’t choke your UI. In the Silverlight world where I’ve been living lately, this is less of a problem: Silverlight forces every sort of IO into an asynchronous pattern, and that tends to keep your UI thread from blocking.  But the opposite can sometimes be an issue: the UI thread can get so busy that you need to spin up background threads so that important tasks (like audio encoding or decoding) get dispatched quickly enough.  But the idea is the same: you want to keep your users from getting grumpy because important parts of their application appear to have ground to a halt.  (There’s a brief sketch of this pattern just below.)
  3. To simplify complicated asynchronous calling patterns.  Unless you actually need to take advantage of multiple processors, it’s almost always possible to get the effect of threads through a completely different mechanism, namely, using events to pass control from one part of the program to another.  (This is called the Lauer/Needham Duality.  In a famous 1979 paper, Lauer and Needham showed that “message-oriented” and “procedure-oriented” systems – read, event-driven vs. multithreaded – were duals of each other and hence were logically equivalent architectures.)  But although thread synchronization is difficult to get right, it can be even more painful to write a program using entirely asynchronous calls.  The APIs for doing so are often complicated and obscure, and they require that you split your program’s logic across various artificial boundaries.  Depending on the complexity of your application, they may also require that you implement a cooperative multitasking model, where any given function can be asked to yield to other functions and only later pick up where it left off.  These sorts of issues can make debugging and maintenance quite difficult.  Threads are plenty complicated, but apart from the places where you start a thread, wait for a thread to finish, or lock some resource, multithreaded code looks reasonably similar to synchronous single-threaded code.  And that’s almost always a good thing.

If none of these three reasons apply, you don’t need to create threads.  It’s as simple as that.
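To make the second reason concrete, here’s a minimal sketch of pushing a slow job onto a background thread and marshaling the result back through whatever SynchronizationContext the UI framework provides.  The encoder here is a made-up placeholder, not anything from our media server.

    using System;
    using System.Threading;

    class BackgroundEncodeSketch
    {
        static void Main()
        {
            // On a real UI thread (Silverlight/WPF/WinForms) this would be the
            // framework-supplied context; in this console sketch it may be null.
            SynchronizationContext ui = SynchronizationContext.Current;

            var worker = new Thread(() =>
            {
                byte[] encoded = EncodeBuffer(new byte[48000]);   // the slow part

                // Hand the result back to the UI thread, or run inline when
                // there is no synchronization context (as in this sketch).
                Action report = () => Console.WriteLine("Encoded {0} bytes", encoded.Length);
                if (ui != null) ui.Post(_ => report(), null);
                else report();
            });
            worker.IsBackground = true;
            worker.Start();

            worker.Join();   // a real UI app would just return and stay responsive
        }

        static byte[] EncodeBuffer(byte[] pcm)
        {
            Thread.Sleep(200);                  // pretend this is expensive
            return new byte[pcm.Length / 4];    // pretend 4:1 compression
        }
    }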

Maintaining the right balance

In the real world, you’re typically going to need to balance legitimate reasons for threads against the virtues and vices of asynchronous programming.  But the basic guidelines for scalable, high-performance systems are reasonably clear:

  1. Use asynchronous IO with callbacks or events when you need to interface with the outside world.  If a client connects to your server, you may be tempted to spin up a thread dedicated to that client, and then let everything within that thread happen synchronously.  In other words, you create a thread for each client that connects, and then in that thread, you send some data to the client, block until the client sends some back, and so forth, until finally the client disconnects, at which point you terminate the thread.  Certainly that’s the simplest model, and if you’re confident that you’ll never have more than a couple dozen clients connected simultaneously, it may be a reasonable approach.  But as I discussed above, it’s surprisingly memory intensive, and a load as small as a few hundred clients could bring your server to its knees (even if the server’s CPU was completely idle).  A better approach is to use the asynchronous version of whatever network API you’re using, and simply handle whatever processing needs to be done in the callback.  On the native Windows API, you do this using I/O completion ports; on Linux, you can use select() or poll() or (better) aio_read() and its compatriots.  If you’re using C++, the best way to do it is probably to use the boost::asio library; and if you’re using .NET, you should use the asynchronous versions of whatever piece of the framework is applicable.  (There’s a reasonable, if incomplete, example of how to use Socket.BeginAccept/Socket.EndAccept here on the MSDN site, for example, and there’s a minimal sketch of the same pattern just after this list.)  Depending on how your framework is implemented, this may also be a good place to use a threadpool: instead of handling the event on whatever thread it comes in on, hand it off to your threadpool, and go back to waiting for the next signal/callback/event.
  2. Don’t try to do everything via asynchronous message-passing. Threads are really a pretty helpful abstraction.  I’ve seen high-performance servers implemented entirely using asynchronous, event-driven code, but they’re a pain to troubleshoot and debug.  The vnc-reflector project on SourceForge is a good example of this.  It’s very performant, even though it runs on a single thread.  But it’s a pain to debug or modify, because so much of the logic consists of passing static function pointers around in a very complicated main() loop.  Asynchronous is good, but don’t bother trying to make everything asynchronous.  If you can simplify your code by kicking off a preemptively multitasked thread, rather than implementing a whole bunch of your own mechanisms for cooperative multitasking, by all means, kick off the thread.
  3. Use a threadpool whenever you don’t know precisely how many threads you’ll be creating.  If you want to dedicate a thread to some specific task that can run in the background, there’s not much point to using a threadpool.  But if you need to handle an unknown number of clients, don’t try to spawn a dedicated thread per client.  Use a threadpool instead, and service the requests out of the pool (there’s a short sketch of this after the list as well).  On a side note, I’ve known folks who were tempted to write their own threadpools.  Granted, in their simplest versions they’re not all that complicated, but still: unless you have a damned good reason to do so, don’t.  Recent versions of Windows have their own built-in threadpool APIs; .NET has a reasonable ThreadPool in the System.Threading namespace, or you can use Ami Bar’s SmartThreadPool; and although Boost strangely doesn’t have one, Philipp Henkel has made a boost::threadpool available that works cleanly with Boost threads.  And there are tons of other examples.  It’s just silly to try to reinvent the wheel when there are so many thorough implementations available for basically the cost of browsing to a website and looking at some sample code.
  4. A good compromise may be to use a small number of queues to move data through your system.  Queues are a classic asynchronous mechanism, but they can also be used effectively with threads and threadpools.  For instance, one of the functions of a media server might be to mix audio coming in from clients in a conference call, and then send the result back to each client.  One of many possible architectures would be to implement this as a series of queues, with a threadpool servicing each queue.  One threadpool would parse the incoming data and then place the compressed audio in a queue to be decompressed; a second set of threads would decompress the audio and then place it in the next queue to be mixed; a third set of threads would mix the decompressed audio and then place the results in a final queue; and a fourth threadpool would assemble the mixed audio into packets and deliver them back to the clients.  (The last sketch after this list shows the shape of this approach.)  You could also implement this same basic architecture with more or fewer queues, depending on other constraints and design goals.  And I should also note that the only reason for using threadpools rather than individual threads here would be to take advantage of multiple processors: if you had only four threads servicing the three main queues, but were running this on an eight-processor box, you’d effectively be leaving four of the processors unused.
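To make the first point above a bit more concrete, here’s a minimal sketch of the Begin/End pattern on a raw .NET Socket.  The port number and buffer size are arbitrary, and all the error handling you’d need in real life is omitted; it’s meant to show the shape of the approach, not to be production code.

    using System;
    using System.Net;
    using System.Net.Sockets;

    class AsyncEchoSketch
    {
        static Socket _listener;

        static void Main()
        {
            _listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
            _listener.Bind(new IPEndPoint(IPAddress.Any, 9000));
            _listener.Listen(100);

            _listener.BeginAccept(OnAccept, null);   // no thread blocks here
            Console.ReadLine();
        }

        static void OnAccept(IAsyncResult ar)
        {
            Socket client = _listener.EndAccept(ar);
            _listener.BeginAccept(OnAccept, null);   // immediately wait for the next client

            var buffer = new byte[4096];
            client.BeginReceive(buffer, 0, buffer.Length, SocketFlags.None,
                                result => OnReceive(result, client, buffer), null);
        }

        static void OnReceive(IAsyncResult ar, Socket client, byte[] buffer)
        {
            int read = client.EndReceive(ar);
            if (read == 0) { client.Close(); return; }

            // Do whatever per-packet work is needed here (or hand it off to a
            // threadpool, as described above), then post the next read.
            client.BeginReceive(buffer, 0, buffer.Length, SocketFlags.None,
                                result => OnReceive(result, client, buffer), null);
        }
    }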
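For the threadpool point, here’s about the simplest possible sketch using the System.Threading.ThreadPool mentioned above; the request-handling method is a made-up placeholder.

    using System;
    using System.Threading;

    class ThreadPoolSketch
    {
        static void Main()
        {
            for (int i = 0; i < 500; i++)
            {
                int requestId = i;   // capture a copy for the closure
                ThreadPool.QueueUserWorkItem(_ => HandleRequest(requestId));
            }

            Console.ReadLine();   // keep the process alive while work items drain
        }

        static void HandleRequest(int requestId)
        {
            // 500 queued items, but only a small, bounded number of pool
            // threads (roughly proportional to the processor count) ever exist.
            Console.WriteLine("Handled request {0} on pool thread {1}",
                              requestId, Thread.CurrentThread.ManagedThreadId);
        }
    }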
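And here’s a rough sketch of the queue-based compromise, assuming .NET 4’s BlockingCollection is available (any thread-safe blocking queue would do).  The decode and mix bodies are obviously stand-ins rather than anything resembling our actual pipeline.

    using System;
    using System.Collections.Concurrent;
    using System.Threading;

    class PipelineSketch
    {
        static readonly BlockingCollection<byte[]> ToDecode = new BlockingCollection<byte[]>();
        static readonly BlockingCollection<byte[]> ToMix    = new BlockingCollection<byte[]>();

        static void Main()
        {
            // Two threads per stage; scale these numbers to the processor
            // count, not to the number of connected clients.
            for (int i = 0; i < 2; i++) new Thread(DecodeLoop) { IsBackground = true }.Start();
            for (int i = 0; i < 2; i++) new Thread(MixLoop)    { IsBackground = true }.Start();

            // Pretend these arrived from the network-IO callbacks.
            for (int i = 0; i < 10; i++) ToDecode.Add(new byte[320]);

            Console.ReadLine();
        }

        static void DecodeLoop()
        {
            while (true)
            {
                byte[] compressed = ToDecode.Take();           // blocks until work arrives
                byte[] pcm = new byte[compressed.Length * 4];  // stand-in for a real codec
                ToMix.Add(pcm);
            }
        }

        static void MixLoop()
        {
            while (true)
            {
                byte[] pcm = ToMix.Take();
                Console.WriteLine("Mixed {0} bytes on thread {1}",
                                  pcm.Length, Thread.CurrentThread.ManagedThreadId);
            }
        }
    }

The important property here is that the number of queues and threads is fixed per process and scaled to the processor count, not multiplied by the number of connected clients.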

If you take this last approach, however, just make sure that your queues, and especially your threadpools, are limited in number.  The problem that kicked off this whole blog post is that our media server currently creates a separate set of queues for each connected client, and then creates a separate threadpool for each queue.  The result is that each connection spawns 37 new threads.  I don’t think it’ll be too painful to fix, but it does need to be fixed.

Monday, September 6, 2010

Rained out on Labor Day (again)

We tried to go camping for Labor Day (again), but like last Labor Day, we got rained out, gave up and came back home early.  Still, we had a reasonably good time while we were there.  

http://picasaweb.google.com/smithkl42/20100904LaborDayCamping#

Our weekend started on Friday afternoon, when Caedmon and I drove to the Middle Fork Campground, ahead of everyone else, so that we could be sure to get our spot.  

The next morning, Saturday, Caedmon and I tried our hand at fishing, but didn't have any luck.  We probably would have done better with flies, but it's difficult enough to get a three-year-old to handle a spinning rod without tangles and snags everywhere.  I'll leave the fly-fishing lessons for another day.

After fishing, but before lunch, Caedmon and I went on a hike that ended up being longer than we expected.  We got about a mile and a half out before finally turning around.  Caedmon did great, but was a little disappointed that we turned back before the end of the trail.  I explained to him that we needed to go back to camp so we could have some food, before we got all tired and grumpy.  He wanted to know why we might get tired and grumpy, so I said, "Well, it's our brain's way of telling us that we need to get more food."  A while later, when we were almost back to our campsite, I was carrying him on my shoulders, and I heard him whisper to himself, "It's all right, brain. You'll get some food in a couple minutes."

Later Saturday afternoon,  Galena, Brendan, Calista and Kirstin McGhee got there.  Kirstin's been helping Galena with the kids most of the summer, so she knows them very well, and when her boyfriend, Tony, decided to spend Labor Day weekend in Las Vegas, she asked if she could hang out with us.  That gave us the excuse we needed to try camping with three such small kids.  And she was very handy to have around -- you don't want young children outnumbering adults on a camping trip.

Brendan loved the ice-cold, freezing water.  Must help to have a layer of blubber for insulation.

This was Calista's first camping trip, and she seemed a little stunned by it all.

That evening, after dinner, we made s'mores, which were a substantial hit, not least because they gave Caedmon an excuse to wave a flaming torch around.

It started sprinkling just after we put the kids to bed, and by the time we were in bed ourselves, it was coming down pretty hard.  It rained all night, and it turns out that 3-season tents are really preferable to 2-season tents in situations like that.  I had Caedmon and Calista in my tent, and we were all dry, but Galena, Kirstin and Brendan were all pretty wet by the time morning rolled around.  So we made a quick breakfast, cleaned up as best we could in the rain, and headed home.  I like to think of it less as "giving up" than as "staying married".