We had an interesting debate this morning in our daily Alanta standup meeting. We’re still in the process of implementing our proprietary media server (long story), and I’d noticed during a recent debugging session that with three clients connected, our media server had spawned something like 118 different threads. That didn’t seem right to me, so we had a long discussion about whether we actually needed that many threads and whether it represented a significant problem for our architecture. The answers that seemed obvious to me – no, we didn’t, and yes it was – weren’t as obvious to everyone else. We were eventually able to come to a consensus about how to proceed, but I thought I’d jot down some thoughts about when it’s appropriate to spawn threads, when it’s not, and some best practices about how to achieve the right balance in your architecture.
Why you don’t want a lot of threads
First things first: you really do want to minimize the number of threads in your application. One well-known reason, of course, is that there’s overhead every time the processor has to switch between threads. The processor has to save the state of the currently executing thread, then load the previously saved state of the next thread (and if the threads are in different processes, it needs to reload the virtual address translation tables as well). As a rule of thumb, that caps you at about 30 active threads per process per processor: any more than that, and you’ll spend more time dealing with thread overhead than with getting anything done.
However, the argument advanced this morning was that if the threads weren’t active, i.e., were just waiting to be signaled, then they weren’t chewing up CPU cycles with context switches, and hence even very large numbers of them were harmless. The problem with this argument is that threads have a significant virtual memory overhead even when they’re inactive. As Raymond Chen has pointed out, each thread you spawn needs its own stack, and by default, the stack size on a Windows box is 1 MB. This means that every time you spin up another thread, you’re adding another 1 MB to the virtual memory your process is consuming. So on a 32-bit machine, since 2 GB is generally reserved for the kernel, the maximum number of threads that any one process could reasonably hope to create is ~2000. (On Linux, the default stack size is 2 MB, which introduces a practical limit of ~1000 threads per process.) Of course, this is only on 32-bit machines, and you can get away with quite a bit more on 64-bit machines, but the point remains: every thread chews up allocatable memory, and quite a lot of it.

One caveat: each thread chews up 1 MB of virtual memory, not physical memory (at least, not right away). In other words, if your process creates 500 threads, you won’t necessarily see it chewing up 500 MB of physical memory in Task Manager. That’s because the pages in physical memory that back your stack aren’t allocated until they’re referenced (as described here). The flip side is that you get no warning that you’re running low on virtual memory until you’ve actually exhausted the address space, and the first indication you’ll have that you’re about to die is a strange and otherwise inexplicable Out of Memory error.
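If you’re stuck creating more threads than you’d like, you can at least shrink each one’s stack reservation. Here’s a minimal C++ sketch using the pthreads API (on Windows, the stack-size parameter to CreateThread plus the STACK_SIZE_PARAM_IS_A_RESERVATION flag plays the same role); the 128 KB figure is purely illustrative, and every call path in the thread has to fit inside whatever size you pick:

```cpp
#include <pthread.h>
#include <limits.h>   // for PTHREAD_STACK_MIN
#include <cstdio>

void* worker(void*) {
    // Keep deep recursion and big stack-allocated buffers out of
    // threads created with a reduced stack like this one.
    std::puts("hello from a small-stack thread");
    return nullptr;
}

int main() {
    pthread_attr_t attr;
    pthread_attr_init(&attr);

    // Reserve 128 KB instead of the defaults discussed above.
    // The value must be at least PTHREAD_STACK_MIN.
    size_t stack_size = 128 * 1024;
    if (stack_size < (size_t)PTHREAD_STACK_MIN)
        stack_size = PTHREAD_STACK_MIN;
    pthread_attr_setstacksize(&attr, stack_size);

    pthread_t tid;
    pthread_create(&tid, &attr, worker, nullptr);
    pthread_join(tid, nullptr);
    pthread_attr_destroy(&attr);
    return 0;
}
```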
I should note that it’s theoretically possible to create a threading model that doesn’t suffer quite so badly from the stack problem noted above. Rob von Behren and some compatriots wrote a great paper on the topic, in which they demonstrated a system running with over 100,000 active threads. Unfortunately, their recommendations haven’t been widely adopted by tool and framework vendors. For instance, they point out that a compiler/linker that dynamically adjusted stack sizes would lower the memory overhead for each thread pretty substantially. That’s true, but I’m not aware of any mainstream compilers that actually do it.
So in brief, if you’re writing papers for a conference, create as many threads as you want. If you’re writing software with mainstream compilers, frameworks and operating systems, limit the number of threads you create.
Three reasons for creating threads
So it’s pretty clear why we don’t want a whole bunch of threads. By the same token, there’s a pretty short list of reasons for ever creating threads in the first place:
- To take advantage of multiple processors. This is the classic reason. If your software will typically be running on a multiprocessor or multicore machine, and you really do have more work than one processor can handle in a timely fashion, it makes sense to split that processing up into multiple independent threads of execution.
- To move specific processing off of the UI thread. This is a much more pragmatic reason. In the normal Windows world, you typically create threads so that a given background process doesn’t choke your UI. In the Silverlight world where I’ve been living lately, this is less of a problem: Silverlight forces every sort of IO into an asynchronous pattern, and that tends to keep your UI thread from blocking. But the opposite can sometimes be an issue: the UI thread can get so busy that you need to spin up background threads so important tasks (like audio encoding or decoding) can get dispatched quickly enough. But the idea is the same: you want to keep your users from getting grumpy because important parts of their application appear to have ground to a halt.
- To simplify complicated asynchronous calling patterns. Unless you actually need to take advantage of multiple processors, it’s almost always possible to get the effect of threads through a completely different mechanism, namely, using events to pass control from one part of the program to another. (This is called the Lauer/Needham Duality. In a famous 1979 paper, Lauer and Needham showed that “message-oriented” and “procedure-oriented” systems – read event-driven vs. multithreaded – were duals of each other and hence were logically equivalent architectures.) But although thread synchronization is difficult to get right, it can be even more painful to write a program using entirely asynchronous calls. The APIs for doing so are often complicated and obscure, and they require that you split your program’s logic across various artificial boundaries. Depending on the complexity of your application, they may also require that you implement a cooperative multitasking model, where any given function can be asked to yield to other functions and only later pick up where it left off. These sorts of issues can make debugging and maintenance quite difficult. Threads are plenty complicated, but apart from the places where you start a thread, wait for a thread to finish, or lock some resource, multithreaded code looks reasonably similar to synchronous single-threaded code (there’s a short sketch of the contrast just below). And that’s almost always a good thing.
If none of these three reasons apply, you don’t need to create threads. It’s as simple as that.
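To make that duality concrete, here’s a C++ sketch of the same “read a packet, process it, reply” logic written both ways. Everything in it (Connection, Packet, blocking_read, async_read, process, reply) is a hypothetical stand-in rather than any real API; the point is just the shape of the code:

```cpp
#include <functional>
#include <string>

// Hypothetical stand-ins, declared just to make the two shapes clear.
struct Packet { bool close = false; std::string payload; };
struct Connection { /* socket handle, buffers, etc. */ };

Packet blocking_read(Connection&);                                    // blocks the calling thread
void   async_read(Connection&, std::function<void(Packet)> on_read);  // returns immediately
std::string process(const Packet&);
void   reply(Connection&, const std::string&);

// Thread-per-client: the logic reads top to bottom, like ordinary
// synchronous code. The price is one stack per connected client.
void handle_client_threaded(Connection& conn) {
    for (;;) {
        Packet p = blocking_read(conn);
        if (p.close) return;
        reply(conn, process(p));
    }
}

// Event-driven: the same logic, but the loop has become a chain of
// callbacks, and any local state has to survive across each boundary.
void handle_client_async(Connection& conn) {
    async_read(conn, [&conn](Packet p) {
        if (p.close) return;
        reply(conn, process(p));
        handle_client_async(conn);  // re-arm for the next packet
    });
}
```

Multiply the second version by every blocking call in your program, and you can see why the equivalence Lauer and Needham proved is logical rather than practical.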
Maintaining the right balance
In the real world, you’re typically going to need to balance legitimate reasons for threads against the virtues and vices of asynchronous programming. But the basic guidelines for scalable, high-performance systems are reasonably clear:
- Use asynchronous IO with callbacks or events when you need to interface with the outside world. If a client connects to your server, you may be tempted to spin up a thread dedicated to that client, and then let everything within that thread happen synchronously. In other words, you create a thread for each client that connects, and then in that thread, you send some data to the client, block until the client sends some back, and so forth, until finally the client disconnects, at which point you terminate the thread. Certainly that’s the simplest model, and if you’re confident that you’ll never have more than a couple dozen clients connected simultaneously, it may be a reasonable approach. But as I discussed above, it’s surprisingly memory intensive, and a load as small as a few hundred clients could bring your server to its knees (even if the server’s CPU was completely idle). A better approach is to use the asynchronous version of whatever network API you’re using, and simply handle whatever processing needs to be done in the callback. On the native Windows API, you do this using I/O completion ports; on Linux, you can use select() or poll() or (better) aio_read() and its compatriots. If you’re using C++, the best way to do it is probably to use the boost::asio library; and if you’re using .NET, you should use the asynchronous versions of whatever piece of the framework is applicable. (There’s a reasonable, if incomplete, example of how to use Socket.BeginAccept/Socket.EndAccept here on the MSDN site, for example; and there’s a Boost.Asio sketch of the whole pattern after this list.) Depending on how your framework is implemented, this may also be a good place to use a threadpool: instead of handling the event on whatever thread it comes in on, hand it off to your threadpool, and go back to waiting for the next signal/callback/event.
- Don’t try to do everything via asynchronous message-passing. Threads are really a pretty helpful abstraction. I’ve seen high-performance servers implemented entirely using asynchronous, event-driven code, but they’re a pain to troubleshoot and debug. The vnc-reflector project on SourceForge is a good example of this. It’s very performant, even though it runs on a single thread. But it’s a pain to debug or modify, because so much of the logic consists of passing static function pointers around in a very complicated main() loop. Asynchronous is good, but don’t bother trying to make everything asynchronous. If you can simplify your code by kicking off a preemptively multitasked thread, rather than implementing a whole bunch of your own mechanisms for cooperative multitasking, by all means, kick off the thread.
- Use a threadpool whenever you don’t know precisely how many threads you’ll be creating. If you want to dedicate a thread to some specific task that can run in the background, there’s not much point to using a threadpool. But if you need to handle an unknown number of clients, don’t try to spawn a dedicated thread per client. Use a threadpool instead, and service the requests out of the pool (there’s a short sketch after this list). On a side note, I’ve known folks who were tempted to write their own threadpools. Granted, in their simplest versions they’re not all that complicated, but still: unless you have a damned good reason to do so, don’t. Recent versions of Windows have their own built-in threadpool APIs; .NET has a reasonable ThreadPool in the System.Threading namespace, or you can use Ami Bar’s SmartThreadPool; and although Boost strangely doesn’t have one, Philipp Henkel has made a boost::threadpool available that works cleanly with Boost threads. And there are tons of other examples. It’s just silly to reinvent the wheel when there are so many thorough implementations available for basically the cost of browsing to a website and looking at some sample code.
- A good compromise may be to use a small number of queues to move data through your system. Queues are a classic asynchronous mechanism, but they can also be used effectively with threads and threadpools. For instance, one of the functions of a media server might be to mix audio coming in from clients in a conference call, and then send the result back to each. One of many possible architectures would be to implement this as a series of queues with a threadpool servicing each of the queues. One threadpool would parse the incoming data and then place the compressed audio data in a queue to be decompressed; a second set of threads would decompress the audio and then place it in the next queue to be mixed; a third set of threads would mix the decompressed audio and then place the results in a final queue; a fourth threadpool would then assemble the mixed audio data into packets and deliver them back to the clients. (A sketch of one such pipeline stage follows this list.) You could also implement this same basic architecture with more or fewer queues, depending on other constraints and design goals. And I should also note that the only reason for using threadpools rather than individual threads would be to take advantage of multiple processors. If you only had four threads servicing the pipeline’s queues, but were running this on an eight-processor box, you’d effectively be leaving four of the processors unused.
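To make the first guideline above concrete, here’s a minimal Boost.Asio skeleton of the callback approach. It assumes a reasonably recent Boost (the io_context era of the API), and the port number and buffer size are arbitrary:

```cpp
#include <boost/asio.hpp>
#include <memory>

using boost::asio::ip::tcp;

// One small Session object per client instead of one thread per client.
struct Session : std::enable_shared_from_this<Session> {
    explicit Session(tcp::socket s) : socket(std::move(s)) {}

    void start() { read(); }

    void read() {
        auto self = shared_from_this();
        socket.async_read_some(boost::asio::buffer(data),
            [self](boost::system::error_code ec, std::size_t n) {
                if (ec) return;   // client went away; Session gets freed
                // ... process n bytes here, or hand them to a threadpool ...
                self->read();     // re-arm for the next chunk
            });
    }

    tcp::socket socket;
    char data[4096];
};

// Keep one asynchronous accept outstanding at all times.
void accept_loop(tcp::acceptor& acceptor) {
    acceptor.async_accept([&acceptor](boost::system::error_code ec, tcp::socket s) {
        if (!ec) std::make_shared<Session>(std::move(s))->start();
        accept_loop(acceptor);
    });
}

int main() {
    boost::asio::io_context io;
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), 9000));  // port is arbitrary
    accept_loop(acceptor);
    io.run();  // a single thread services every connection
}
```

Three clients or three hundred, it’s still one thread blocked in io.run(), plus whatever pool you hand the real work to.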
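The threadpool guideline is even shorter in code. Henkel’s boost::threadpool has a similar pool-and-schedule shape; the sketch below happens to use the thread_pool that newer versions of Boost.Asio ship with, just because it fits in a dozen lines:

```cpp
#include <boost/asio/post.hpp>
#include <boost/asio/thread_pool.hpp>
#include <cstdio>

int main() {
    // Size the pool to the hardware, not to the number of clients.
    boost::asio::thread_pool pool(4);

    for (int request = 0; request < 100; ++request) {
        boost::asio::post(pool, [request] {
            // Service one request; the pool reuses its four threads.
            std::printf("handled request %d\n", request);
        });
    }

    pool.join();  // wait for the queued work to drain
}
```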
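And finally, a sketch of one stage of the queue-based pipeline from the last bullet. BlockingQueue, AudioChunk, and decompress are hypothetical stand-ins; what matters is the shape: a small, fixed set of workers drains each queue, no matter how many clients are connected:

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A minimal blocking queue; one of these sits between each pipeline stage.
template <typename T>
class BlockingQueue {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lock(m); q.push(std::move(item)); }
        cv.notify_one();
    }
    T pop() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !q.empty(); });
        T item = std::move(q.front());
        q.pop();
        return item;
    }
private:
    std::mutex m;
    std::queue<T> q;
    std::condition_variable cv;
};

// Hypothetical stand-ins for the real codec work.
struct AudioChunk { bool stop = false; int id = 0; };
AudioChunk decompress(AudioChunk c) { return c; }  // pretend to decode

int main() {
    BlockingQueue<AudioChunk> to_decode, to_mix;

    // A fixed set of workers drains the decode queue. Each further stage
    // (mix, packetize) would follow the same shape: pop, work, push on.
    const int kWorkers = 2;
    std::vector<std::thread> decoders;
    for (int i = 0; i < kWorkers; ++i)
        decoders.emplace_back([&] {
            for (;;) {
                AudioChunk c = to_decode.pop();
                if (c.stop) break;            // sentinel shuts the worker down
                to_mix.push(decompress(c));
            }
        });

    // Network readers would normally feed this queue; fake a little traffic.
    for (int i = 0; i < 10; ++i) to_decode.push(AudioChunk{false, i});
    for (int i = 0; i < kWorkers; ++i) to_decode.push(AudioChunk{true, 0});

    for (auto& t : decoders) t.join();
    std::printf("pushed 10 decoded chunks into the mix queue\n");
}
```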
If you take this queue-based approach, however, just make sure that your queues and especially your threadpools are limited in number. The problem that kicked off this whole blog posting is that our media server currently creates a separate set of queues for each connected client, and then creates a separate threadpool for each queue. The result is that each connection spawns 37 new threads. I don’t think it’ll be too painful to fix, but it does need to be fixed.