Monday, October 6, 2014

Azure: What to use, what to avoid

Azure is clearly the second-tier choice for cloud services these days, well behind Amazon Web Services, but (so far as I can tell) still well ahead of Google Cloud and all the other players. But since I’ve been building Payboard’s infrastructure on the Microsoft stack – Visual Studio 2013 is pretty nice, and C# remains my favorite language by a significant margin – Azure was a natural choice.

Since making that choice, Payboard has processed some 30 million events from our customers, with several hundred thousand more coming in every day. That’s pretty small beer compared to some folks, but it’s not insignificant, and it’s given us a chance to stress test Azure in the real world. In the process, I’ve developed some strong opinions about what works well in Azure and what I would at all costs avoid the next time around. What follows here is just the experience of one team – so caveat developor.

Azure Websites: Thumbs Up

Azure Websites aren’t suited for every task; their main limitation is that they can’t scale up beyond 10 instances. But if you’re not going to bump up beyond that, they’re very nice. We haven’t had any reliability problems to speak of, and the rollout story is excellent. My favorite part is the git integration: once you get it set up, you just push to GitHub, and that’s it. Azure notices your push, builds it, runs all your unit tests, and if they pass, pushes the build to the website automatically. Very handy, and a nice workflow.

Azure SQL: Adequate

SQL Server is a great database, and I’m not at all sorry that we went with it. But Azure SQL starts getting spendy if you’re pushing any significant traffic to it, and it has some odd limitations that you won’t find in standalone SQL Server. (The one that’s bitten me most recently is that it doesn’t support NEWSEQUENTIALID() – good luck keeping those clustered indexes defragmented.) Like SQL Server in general, it doesn’t have a great scale-out story: you can do it, it’s just a significant PITA. And finally, Azure SQL seems to have a lot of transient connectivity errors. At least half a dozen times a day, we simply can’t connect to the DB, sometimes for upwards of several minutes. MS insists, quite correctly, that you need to wrap every attempt to write to the DB in a retry block. But sometimes the errors outlast any retry block that a busy server can reasonably be expected to keep running. My recommendation: if you’re building a financial application, or any application where you simply can’t afford to lose data, don’t use Azure SQL.
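
For what it’s worth, here’s a minimal sketch of the kind of retry block MS means, assuming plain ADO.NET. The error numbers treated as transient are illustrative rather than exhaustive, and libraries like EF6’s SqlAzureExecutionStrategy or the Transient Fault Handling Application Block will do roughly this for you.

    // Minimal sketch of a retry wrapper for transient Azure SQL errors (plain ADO.NET).
    // The error numbers below are illustrative, not exhaustive.
    using System;
    using System.Data.SqlClient;
    using System.Threading;

    public static class SqlRetry
    {
        // Error numbers commonly treated as transient on Azure SQL.
        private static readonly int[] TransientErrors = { 40197, 40501, 40613, 10053, 10054, 10060 };

        public static void Execute(string connectionString, Action<SqlConnection> action, int maxAttempts = 5)
        {
            for (int attempt = 1; ; attempt++)
            {
                try
                {
                    using (var conn = new SqlConnection(connectionString))
                    {
                        conn.Open();
                        action(conn);
                        return;
                    }
                }
                catch (SqlException ex)
                {
                    bool transient = Array.IndexOf(TransientErrors, ex.Number) >= 0;
                    if (!transient || attempt >= maxAttempts)
                        throw;
                    // Exponential backoff before trying again.
                    Thread.Sleep(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
                }
            }
        }
    }

Even with something like this in place, an outage that runs to several minutes will blow past any sane backoff schedule – which is exactly the problem above.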

Azure Table Storage: Adequate (barely)

Azure Table Storage is insanely cheap, unbelievably scalable, astonishingly reliable, and if you’re using it the way it was intended, blazingly fast. It’s also missing a whole host of vital features, and is extremely brittle and thus painful to use in the real world. It’s not quite as bad as the “write-only datastore” I initially dismissed it as, but it really needs some TLC from the Azure team. For a good sense of what it’s still missing after years of neglect, check out the UserVoice forums. Despite all that, if you’re willing to repeatedly pivot and re-import your data into ATS, it can be fairly effective. It’s basically a really cheap place to dump your log data. If you need to read that data back, you can, but every query pattern needs its own table: you copy the same data into a whole bunch of different tables, each with its own partition key/row key schema. That’s more-or-less acceptable when you have a few million rows of data; it’s a lot less workable when you’ve got a few billion. It would help if I’d been able to get the 5-20K rows/second import rates that Azure advertises; unfortunately, even after a lot of tuning, I haven’t been able to get more than (sometimes) 1,000 rows per second. (I’m sure there are ways to do it faster, but the fact that I haven’t been able to figure them out after a lot of effort goes right back to my point about brittleness.)
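
To make the pivot-per-query-pattern point concrete, here’s a rough sketch using the WindowsAzure.Storage client library: the same event gets written once to a table keyed for per-user lookups and once to a table keyed for per-day scans. The table, entity, and property names are made up for illustration.

    // Sketch: one event, two tables, two partition key / row key schemas.
    // Uses the WindowsAzure.Storage client library; names are hypothetical.
    using System;
    using Microsoft.WindowsAzure.Storage.Table;

    public class EventByUser : TableEntity
    {
        public string EventType { get; set; }
        public EventByUser() { }
        public EventByUser(string userId, DateTime timestamp, string eventType)
        {
            PartitionKey = userId;                          // query: all events for a user
            RowKey = timestamp.Ticks.ToString("d19");       // ordered by time within the user
            EventType = eventType;
        }
    }

    public class EventByDay : TableEntity
    {
        public string EventType { get; set; }
        public EventByDay() { }
        public EventByDay(string userId, DateTime timestamp, string eventType)
        {
            PartitionKey = timestamp.ToString("yyyyMMdd");  // query: all events for a day
            RowKey = userId + "_" + Guid.NewGuid();         // unique within the partition
            EventType = eventType;
        }
    }

    public static class EventWriter
    {
        public static void Write(CloudTableClient client, string userId, DateTime ts, string eventType)
        {
            var byUser = client.GetTableReference("EventsByUser");
            var byDay = client.GetTableReference("EventsByDay");
            byUser.CreateIfNotExists();
            byDay.CreateIfNotExists();

            byUser.Execute(TableOperation.Insert(new EventByUser(userId, ts, eventType)));
            byDay.Execute(TableOperation.Insert(new EventByDay(userId, ts, eventType)));
        }
    }

Bulk imports are similarly constrained: a TableBatchOperation tops out at 100 entities, and every entity in a batch has to share a partition key, which is part of why the advertised import rates take real work to hit.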

Azure Service Bus: Thumbs Down

Ugh. When we decided to switch over to an asynchronous queuing architecture for our event imports, we initially went with Azure Service Bus, mostly because it was newer (and presumably better) than Azure Storage Queues, and because it offered a notification-based approach to servicing its queues. Unfortunately, it was neither reliable nor scalable enough; we suffered through repeated outages before switching over to Azure Storage Queues. In addition, it basically doesn’t have a local development story: you have to use a real Azure instance, which is a PITA if you ever need to develop disconnected. (MS does have a Service Bus instance that you can install on your local machine, but at least as of this writing, it’s badly out of sync with the Azure implementation and doesn’t work with the latest client library off of NuGet.)
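
For reference, the notification-based model that drew us in is the OnMessage pump in the Service Bus client library. Roughly this, with a placeholder queue name and handler:

    // Sketch of the Service Bus "push" model via the OnMessage pump.
    // Connection string and queue name are placeholders.
    using System;
    using Microsoft.ServiceBus.Messaging;

    public static class ServiceBusListener
    {
        public static void Start(string connectionString)
        {
            var client = QueueClient.CreateFromConnectionString(connectionString, "events");
            client.OnMessage(message =>
            {
                try
                {
                    var body = message.GetBody<string>();   // deserialize the payload
                    Console.WriteLine("Processing: " + body);
                    message.Complete();                     // remove from the queue
                }
                catch (Exception)
                {
                    message.Abandon();                      // make it visible again for retry
                }
            });
        }
    }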

Azure Storage Queues: Thumbs Up

Very fast, very scalable, and rock solid. It’s poll-only, but that’s not hard to wrap. It has very large maximum queue sizes, which mostly makes up for the fact that its maximum message size is only 64K. On the whole, recommended.
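
Here’s a minimal sketch of the sort of poll loop I mean, using the storage client library; the queue name, backoff, and handler are illustrative.

    // Minimal polling loop over an Azure Storage Queue.
    // Queue name, backoff interval, and handler are illustrative.
    using System;
    using System.Threading;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Queue;

    public static class QueuePoller
    {
        public static void Run(string connectionString)
        {
            var account = CloudStorageAccount.Parse(connectionString);
            var queue = account.CreateCloudQueueClient().GetQueueReference("events");
            queue.CreateIfNotExists();

            while (true)
            {
                CloudQueueMessage message = queue.GetMessage();   // null when the queue is empty
                if (message == null)
                {
                    Thread.Sleep(TimeSpan.FromSeconds(1));        // back off briefly, then poll again
                    continue;
                }

                Console.WriteLine("Processing: " + message.AsString);
                queue.DeleteMessage(message);                     // only delete once processing succeeds
            }
        }
    }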

Azure Worker Roles: Thumbs Up

They do what we need them to do. I still think that they’re more difficult to use than they need to be – I wish I could use Kudu with them, to enable the same “push to git” workflow that works so nicely with Azure Websites – but I guess I don’t mind the flexibility that comes with requiring me to go through a separate publishing step. And once you get them configured, they’re easy to scale up and down. (I especially like the option to scale them up or down automatically based on queue size.)

Azure Managed Cache: Thumbs Down

Azure’s in-house cache implementation is slow and unreliable. When we finally abandoned it, we were experiencing multiple outages a day, and even when it was working, we were averaging about 300 ms / lookup, which was unacceptably slow for a cache. Not recommended.

Azure Redis Cache: Adequate

Redis is, indeed, as blazingly fast as you’ve heard. Our lookups often take less than 10 ms, which is kind of hard to believe when you consider network latency and everything else. Unfortunately, after an initial period of stability, we’ve lately been having several (brief) outages a day. It’s just a cache, and we’ve wrapped our Redis cache with a (shorter-lived) in-memory cache, so this hasn’t been crippling, but it’s not what you like to see. In addition, I have some gripes with the recommended StackExchange.Redis library – the basic problem being that it doesn’t automatically reconnect after a connection issue. Yes, you can wrap that, but it seems like the sort of thing that ought to be handled for you by the library itself.
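
In case it’s useful, here’s the rough shape of such a wrapper – a sketch rather than our production code – which simply throws away the ConnectionMultiplexer and builds a fresh one when a connection exception surfaces.

    // Sketch: rebuild the StackExchange.Redis connection when it drops.
    // Not production code; the retry policy and string-only API are simplified.
    using StackExchange.Redis;

    public class ReconnectingCache
    {
        private readonly string _configuration;
        private readonly object _lock = new object();
        private ConnectionMultiplexer _connection;

        public ReconnectingCache(string configuration)
        {
            _configuration = configuration;
            _connection = ConnectionMultiplexer.Connect(configuration);
        }

        public string GetString(string key)
        {
            try
            {
                return _connection.GetDatabase().StringGet(key);
            }
            catch (RedisConnectionException)
            {
                Reconnect();
                return _connection.GetDatabase().StringGet(key);  // one retry on a fresh connection
            }
        }

        private void Reconnect()
        {
            lock (_lock)
            {
                var old = _connection;
                _connection = ConnectionMultiplexer.Connect(_configuration);
                old.Dispose();                                    // drop the dead connection
            }
        }
    }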