Monday, January 19, 2015

R in Production Diaries: Timeouts

On Friday I received 150+ timeout notifications from our production Linux R box.

The timeout is set at a monstrous 15 seconds, and so far we are only running 20% of all transactions through the box, so the timeouts are a material, yet sporadic, issue.

The situation is even weirder in that the timeouts aren't occurring at the R box, but in an internal service layer that executes well before R is called.  In essence, the underlying issue has nothing to do with R in production.

I knew the issue wasn't in the R box two months ago, but was only recently able to convince IT to put the logging on their service layer that would prove it (I had logging in R from the beginning).  To quote one IT director:

"We know what your logs say, but still there's 99.9% this HAS to be in your process."

I feel like I should have given a lesson in statistical certainty, but that's for another time.
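
For illustration, the kind of per-request logging that settles an argument like this can be sketched in a few lines of R. This is a minimal sketch under stated assumptions, not the production code: `timed_call`, `score`, and the request id format are all hypothetical names, and the 15-second threshold is simply the timeout mentioned above.

```r
# Hypothetical helper: run one scoring call and log how long it took.
timed_call <- function(request_id, fn, ..., timeout_secs = 15) {
  started <- Sys.time()
  result <- fn(...)
  elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
  status <- if (elapsed > timeout_secs) "SLOW" else "OK"

  # One line per request. The request id lets these entries be matched
  # against the logs written by the upstream service layer.
  message(sprintf("%s request=%s elapsed=%.3fs status=%s",
                  format(started, "%Y-%m-%d %H:%M:%OS3"),
                  request_id, elapsed, status))

  list(result = result, elapsed = elapsed, status = status)
}

# Stand-in for the real model; the actual computation is not shown in this post.
score <- function(x) sum(x^2)

out <- timed_call("req-001", score, c(1, 2, 3))
```

If every layer writes a line like this with a shared request id, lining the logs up shows which layer was holding the request when the timeout fired.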

So.  Why did IT think it HAD to be a problem with R?  I have a few thoughts:

  1. They are all Microsoft/C# developers, and seem skeptical of the one Linux box in production (the OS doesn't even have a GUI, and it was FREE!).
  2. R is a new product to them, and as such, it's the first place they look for problems.
  3. There seemed to be an assumption that each call to the R box was conducting massive statistical computations (kind of true, but nothing that modern hardware can't handle in <1 second).

Reasons 1 and 2 are just how developers (especially Microsoft developers) view new, open-source products.  Reason 3 is actually a misconception about the process, and I think it's a great lesson learned:

It's important to educate IT and developers on the kind of computations the server is actually performing, especially if they are going to have to interface with our processes.
