The timeout is set at a monstrous 15 seconds, and so far we are only running 20% of all transactions through the box, so the timeouts are a material yet sporadic issue.
The situation is even weirder: the timeouts aren't occurring at the R box, but in an internal service layer that executes well before R is ever called. In essence, the underlying issue has nothing to do with R in production.
I knew two months ago that the issue wasn't in the R box, but was only recently able to convince IT to add the logging to their service layer that would prove it (I had logging in R from the beginning). To quote one IT director:
"We know what your logs say, but still there's 99.9% this HAS to be in your process."

I feel like I should have given a lesson in statistical certainty, but that's for another time.
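For what it's worth, the R-side logging that settled the question was nothing exotic: a timestamped line when each request arrives and another when it completes. A minimal sketch is below; the function names and the stand-in scoring step are illustrative, not the production code.

```r
# Illustrative sketch, not the production code: log a timestamped line on
# arrival and on completion, so the logs show exactly when R was (and
# wasn't) doing work.
log_line <- function(event, id) {
  cat(sprintf("%s | request %s | %s\n",
              format(Sys.time(), "%Y-%m-%d %H:%M:%OS3"), id, event),
      file = "r_service.log", append = TRUE)
}

score_payload <- function(payload) sum(unlist(payload))  # hypothetical stand-in for the real work

handle_request <- function(id, payload) {
  log_line("received", id)
  result <- score_payload(payload)
  log_line("completed", id)
  result
}

handle_request("42", list(x = rnorm(10)))
```

With arrival timestamps on both sides of the fence, a request that times out upstream simply never shows up in the R log, and there's not much left to argue about.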
So. Why did IT think it HAD to be a problem with R? I have a few thoughts:
1. They are all Microsoft/C# developers, and they seem skeptical of the one Linux box in production (the OS doesn't even have a GUI, and it was FREE!).
2. R is a new product to them, and as such, it's the first place they look for problems.
3. There seemed to be an assumption that each call to the R box was conducting massive statistical computations (kind of true, but nothing that modern hardware can't handle in under a second; see the sketch after this list).
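To put reason 3 in perspective, here's a rough sketch with a made-up workload (the actual model and data are different): the model gets fit once at startup, and each call only scores a small batch of new data.

```r
# Made-up workload: fit once at startup, score per request.
set.seed(42)
train <- data.frame(x1 = rnorm(1e5), x2 = rnorm(1e5))
train$y <- rbinom(1e5, 1, plogis(0.5 * train$x1 - 0.3 * train$x2))

fit <- glm(y ~ x1 + x2, data = train, family = binomial)  # startup cost, paid once

# What an individual request actually costs: scoring, not fitting.
new_batch <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
system.time(predict(fit, newdata = new_batch, type = "response"))
# elapsed time is a few milliseconds on commodity hardware,
# nowhere near a 15-second timeout
```

Even the fit itself takes well under a second at this size, and it isn't on the request path anyway.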
Reasons 1 and 2 are just how developers (especially Microsoft developers) view new, open source products. Reason 3 is actually a misconception about the process, and I think it's a great lesson learned:
It's important to educate IT and developers on the kind of computations the server is actually performing, especially if they are going to have to interface with our processes.