Friday, July 25, 2014

“502 Bad Gateway” error on Azure Websites

I ran into a strange problem the other day. It made perfect sense as soon as I understood what was going on, but I was scratching my head about it for a while, so I thought I’d document it here in case anybody else runs into it.

I recently switched Payboard’s website over to use the new Azure Redis Cache, instead of the native Azure Managed Cache Service. We’d been running into all sorts of problems with the Azure Managed Cache, with requests to the cache averaging over 400 ms – way, way, way too slow for a cache system. Once we switched to Redis, our average request time dropped to under 25 ms. Not perfect, but much better.
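
For anyone who hasn’t used it, talking to the Azure Redis cache typically goes through a client library such as StackExchange.Redis. Here’s a minimal sketch of the sort of wrapper involved; the hostname, key handling, and class name are made up for illustration and aren’t our actual code:

using System;
using StackExchange.Redis;

// Illustrative Redis cache wrapper (hypothetical names, not Payboard's code).
public static class RedisCache
{
    // Reuse one multiplexer for the whole app; creating one per request is expensive.
    private static readonly Lazy<ConnectionMultiplexer> Connection =
        new Lazy<ConnectionMultiplexer>(() => ConnectionMultiplexer.Connect(
            // Replace with your cache's hostname and access key.
            "yourcache.redis.cache.windows.net,ssl=true,password=<access key>"));

    public static string GetString(string key)
    {
        return Connection.Value.GetDatabase().StringGet(key);
    }

    public static void SetString(string key, string value, TimeSpan expiry)
    {
        Connection.Value.GetDatabase().StringSet(key, value, expiry);
    }
}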

The only problem was that certain requests to the website started (randomly, it seemed) returning “502 Bad Gateway” errors. This had me quite puzzled. I was able to definitively track it down to our use of Redis: as soon as I switched back to a simple in-memory cache, the errors disappeared, and when I switched back to Redis, they started showing up again. However, none of these errors were showing up in our site’s error logs; they weren’t getting caught by any of the error-handling attributes that we decorate all our controllers with, like so:

/// <summary>
/// Log unhandled exceptions
/// </summary>
public class PayboardApiErrorHandlerAttribute : ExceptionFilterAttribute
{
    private static readonly Logger logger = LogManager.GetCurrentClassLogger();

    public override void OnException(HttpActionExecutedContext exceptionContext)
    {
        try
        {
            var exception = exceptionContext.Exception;
            var exceptionMessage = exception.CompleteMessage();

            var userName = HttpContext.Current == null
                ? ""
                : HttpContext.Current.User.Identity.Name;

            var ctx = exceptionContext.ActionContext;
            var origin = "";
            if (ctx.Request.Headers.Contains("Origin"))
            {
                origin = ctx.Request.Headers.GetValues("Origin").FirstOrDefault();
            }

            var content = ctx.Request.Content.ReadAsStringAsync().Result;
            var ipAddress = ctx.Request.GetClientIp();
            var message = string.Format(
                "Controller: {0}; Url: {1}; User: {2}; Origin: {3}; IpAddress: {4}; Error: {5}; Content: {6}",
                ctx.ControllerContext.Controller.GetType().Name,
                ctx.Request.RequestUri,
                userName, origin, ipAddress, exceptionMessage, content);
            var prefix = string.IsNullOrWhiteSpace(userName)
                ? "Anonymous WebApi Error: "
                : "User-visible WebApi Error: ";
            logger.Error(prefix + message);
        }
        catch (Exception ex)
        {
            logger.Error("Error logging error - and yes, that's as circular as it seems: " + ex.CompleteMessage());
        }
        base.OnException(exceptionContext);
    }
}
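
For context, a filter like that can be applied per controller or registered globally. It looks something like this (a typical Web API setup with a hypothetical controller name, not necessarily our exact configuration):

using System.Web.Http;

// Option 1: decorate an individual controller.
[PayboardApiErrorHandler]
public class EventsController : ApiController
{
    // ... actions ...
}

// Option 2: register the filter globally so every ApiController gets it.
public static class WebApiConfig
{
    public static void Register(HttpConfiguration config)
    {
        config.Filters.Add(new PayboardApiErrorHandlerAttribute());
        // ... route configuration, etc. ...
    }
}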

Because we weren’t seeing any errors in these controllers, I figured it must be something on Azure’s side. After the bad experience we’d had with the Azure Managed Cache, and given that Azure’s Redis cache offering is still in preview, that seemed like a real possibility. I just couldn’t figure out why an error in the Redis caching system would cause our site to return a “bad gateway” error. The only explanation I could come up with was that perhaps both our site and the Redis cache were using the same ARR instance, and maybe ARR was getting confused. That didn’t seem likely, but it was the only idea I had.

But after some playing around, I was able to reproduce the error – or at least, an error – on my dev machine, on the same pages. And it turned out that it was my fault (it usually is).

The problem was that I was using JSON.NET to serialize the objects I was sending to the cache. Unfortunately, I had it configured incorrectly. I was setting my JSON serializer settings like this:

var jsonSerializerSettings = new JsonSerializerSettings
{
    ReferenceLoopHandling = ReferenceLoopHandling.Serialize
};

And it should have been like this:

var jsonSerializerSettings = new JsonSerializerSettings
{
    ReferenceLoopHandling = ReferenceLoopHandling.Serialize,
    PreserveReferencesHandling = PreserveReferencesHandling.All
};

In other words, I was telling it to serialize reference loops, but I wasn’t telling it to include only one instance of each object in the graph. As a result, whenever I tried to serialize, say, an Entity Framework object graph that included reference loops (for instance, a Customer object containing a list of Events, each of which in turn contained a reference back to its parent Customer object), it would try to serialize the whole damn infinite loop, which resulted in a stack overflow.
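
To make that concrete, here’s a stripped-down repro. The Customer and Event classes are hypothetical stand-ins for our Entity Framework entities:

using System;
using System.Collections.Generic;
using Newtonsoft.Json;

// Hypothetical stand-ins for the EF entities described above.
public class Customer
{
    public string Name { get; set; }
    public List<Event> Events { get; set; }
}

public class Event
{
    public string Type { get; set; }
    public Customer Customer { get; set; }  // back-reference: creates a loop
}

public static class ReferenceLoopDemo
{
    public static void Main()
    {
        var customer = new Customer { Name = "Acme", Events = new List<Event>() };
        customer.Events.Add(new Event { Type = "PageView", Customer = customer });

        // Correct settings: each object is written once and the loop terminates.
        var safeSettings = new JsonSerializerSettings
        {
            ReferenceLoopHandling = ReferenceLoopHandling.Serialize,
            PreserveReferencesHandling = PreserveReferencesHandling.All
        };
        Console.WriteLine(JsonConvert.SerializeObject(customer, safeSettings));

        // Misconfigured settings: serialization chases Customer -> Event -> Customer -> ...
        // forever and the process dies with an uncatchable StackOverflowException.
        // var badSettings = new JsonSerializerSettings
        // {
        //     ReferenceLoopHandling = ReferenceLoopHandling.Serialize
        // };
        // JsonConvert.SerializeObject(customer, badSettings);  // boom
    }
}

With PreserveReferencesHandling turned on, JSON.NET writes each object once with an $id and refers back to it with $ref, so the back-reference doesn’t send it into infinite recursion.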

And here’s the key: you can’t catch a stack overflow exception. You just can’t. It takes down your whole worker process. You don’t even get a chance to log it; your app just goes bye-bye. Of course, IIS recovers from this and continues to serve more requests, but the request that was in flight is toast. And here’s the part that explains the symptom: if you’ve got a load balancer sitting in front of your web app, it isn’t going to see an HTTP 503, or any sort of HTTP error at all. It’s just going to see that its TCP connection to the web server dropped. And in that case, it has only one option: return a “502 Bad Gateway” error to the browser.
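
If you want to see that behavior for yourself, a throwaway console app like this demonstrates it. On .NET 2.0 and later the catch block never runs; the process simply dies:

using System;

// Demonstrates that a StackOverflowException can't be caught on .NET 2.0 and later.
public static class StackOverflowDemo
{
    // Deliberately non-tail-recursive so the JIT can't turn it into a loop.
    private static int Recurse(int n)
    {
        return Recurse(n + 1) + 1;
    }

    public static void Main()
    {
        try
        {
            Recurse(0);
        }
        catch (StackOverflowException)
        {
            // Never reached: the CLR tears the process down before any handler runs.
            Console.WriteLine("caught");
        }
    }
}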

As soon as I figured that out, I fixed my serialization code, moving it all into a separate class so that I could unit test it, and all was well.
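
The result looked roughly like this. The class and test names are made up (and the test uses NUnit and the hypothetical Customer and Event classes from the repro above), but it gives the flavor:

using System.Collections.Generic;
using Newtonsoft.Json;
using NUnit.Framework;

// A sketch of the extracted serialization helper; names are hypothetical.
public static class CacheSerializer
{
    private static readonly JsonSerializerSettings Settings = new JsonSerializerSettings
    {
        ReferenceLoopHandling = ReferenceLoopHandling.Serialize,
        PreserveReferencesHandling = PreserveReferencesHandling.All
    };

    public static string Serialize<T>(T value)
    {
        return JsonConvert.SerializeObject(value, Settings);
    }

    public static T Deserialize<T>(string json)
    {
        return JsonConvert.DeserializeObject<T>(json, Settings);
    }
}

[TestFixture]
public class CacheSerializerTests
{
    [Test]
    public void RoundTripsAnObjectGraphWithReferenceLoops()
    {
        var customer = new Customer { Name = "Acme", Events = new List<Event>() };
        customer.Events.Add(new Event { Type = "PageView", Customer = customer });

        var copy = CacheSerializer.Deserialize<Customer>(CacheSerializer.Serialize(customer));

        // The back-reference should resolve to the same deserialized instance, not a new copy.
        Assert.AreSame(copy, copy.Events[0].Customer);
    }
}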

At any rate, that’s a bit of a long story, but the key lesson is this: if your website is simply failing to return responses (without even a 5xx error), or if the load balancer in front of it is returning “502 Bad Gateway” errors, suspect a stack overflow exception somewhere in your code. That, and write more unit tests.

That’s all. Carry on.