2008-04-11 02:16:08-0400 - processes (1),production (1),coding (1) - 20 comments
My girlfriend is into Operations Research. Like any boyfriend, I can’t help but get somewhat interested in what she’s doing. Over the summer, she let me read The Goal. Apart from getting very repetitive about halfway through, it was a good read. But then, they were trying to teach us something. (Repetitio est mater studiorum!)
In O.R. (Operations Research), one thing you learn a lot about is the factory. Factories are complicated entities, with complicated dependency chains, asynchronous processing, multiple failure points, and deadlines, just to mention a few of the problems. I learned about a buzzword-process called the Theory of Constraints about six months ago, and have been pondering the importance of factory analysis ever since.
Fast forward to today. At the company I work for, we have a problem with log processing. It’s complicated. There are at least 4 different sets of rules, each with its own system. Every night, every single entry in our >50 million rows of entries needs to be scanned and analyzed. This isn’t the difficult part, though. We recently moved to a second (and third) data center. So of course, our problems became exponentially worse. I had to start moving computation around (boy do I wish I had the infrastructure for Hadoop/MapReduce). Things started failing more and more often (hard disk errors, network failures, etc.). How do we make this system more robust?
I had a realization: we’re facing the same set of problems that factories face. A small failure in one spot propagates and ruins the rest of the system. Our initial approach (try to make everything “work” all the time; I know, naïve) is just starting to crumble. After trying to make errors more visible and to respond to them earlier, I finally realized what the problem was:
Our batch sizes are too large.
Maybe this is common sense, but I don’t think it is for most people. After all, this is a very fundamental realization in Operations Management (for stick figures, read the summary section of this page). Currently, we do all of the processing nightly. This is nice for a couple of reasons:
- Simple (thus less prone to breakage)
- Processing runs during hours when the machines aren’t under load [1].
However, there’s one thing this is not good for: recoverability [2]. As any system grows, and more and more dependencies get added on, the mean time between failures shrinks. This is a given. How we handle those failures is what matters. Do we let them propagate to the end user? Do we force them to repeatedly wake up our sysadmin at 2am? I believe we can answer both of these with a resounding NO. We should build the system to expect failure. To do this, though, we have to reduce the batch sizes. So my new system will contain two major changes:
I. Cut batch sizes from day-long to hour-long
I think this is obvious. If we need to recover, we need to have time to respond. Lowering batch sizes will help for at least three reasons:
- The delay before the end user sees the information is reduced: with nightly runs, an entry waits 12 hours on average before its processing starts; with hourly runs, it waits about 30 minutes.
- Any failure results in a worst-case rerun of 25 minutes’ worth of log processing, instead of 5 hours.
- During the 8-hour business day, there are 8 chances to get notified of code flakiness.
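The first and third points come down to simple arithmetic. Here is a minimal sketch, assuming log entries arrive uniformly over time and that a batch starts processing the moment its window closes. The function names are made up for illustration and are not part of our system.

```python
def average_wait_hours(batch_hours):
    """Average time an entry waits before its batch starts processing.

    Assumes entries arrive uniformly within the batch window, so the
    average entry waits half the window length.
    """
    return batch_hours / 2


def runs_in_period(period_hours, batch_hours):
    """Number of complete batch runs (chances to notice a failure) in a period."""
    return period_hours // batch_hours


nightly_wait = average_wait_hours(24)      # day-long batches
hourly_wait = average_wait_hours(1)        # hour-long batches
runs_per_workday = runs_in_period(8, 1)    # hourly runs in an 8-hour day

print(f"nightly: {nightly_wait} h, hourly: {hourly_wait * 60} min")
print(f"runs per workday: {runs_per_workday}")
```

Real traffic is rarely uniform, so treat these as rough averages rather than guarantees.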
II. Allow the system to be resilient to failure
This is a little tricky. The basic idea is that whenever a piece of the system fails, we:
- Notify a human of the error.
- Move on (eventual consistency)
- Make it easy (or automatic) to fill in the gaps in subsequent processes
- When reports are end-user facing, verify consistency
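The four steps above could be sketched roughly like this. It is only an illustration under assumptions: `process` and `notify` are hypothetical callables standing in for the real log processor and alerting, and the set of finished windows would live in a database rather than in memory.

```python
from datetime import datetime, timedelta


def hourly_windows(start, end):
    """Yield (window_start, window_end) pairs covering [start, end) in one-hour steps."""
    cursor = start
    while cursor < end:
        yield cursor, min(cursor + timedelta(hours=1), end)
        cursor += timedelta(hours=1)


def run_batches(windows, process, notify, done):
    """Process each window; on failure, notify a human and move on.

    `done` is a set of window starts that already succeeded, so running
    again over the same range only fills in the gaps.
    """
    for window_start, window_end in windows:
        if window_start in done:
            continue  # already processed by an earlier run
        try:
            process(window_start, window_end)
        except Exception as exc:  # broad on purpose: any failure leaves a gap
            notify(f"batch {window_start:%Y-%m-%d %H:%M} failed: {exc}")
            continue  # move on; a later run fills the gap
        done.add(window_start)


def is_consistent(start, end, done):
    """True if every hourly window in [start, end) has been processed."""
    return all(ws in done for ws, _ in hourly_windows(start, end))


# Demo: a hypothetical processor that fails once for the 02:00 window.
pending_failures = {datetime(2008, 4, 11, 2)}
alerts = []


def flaky_process(window_start, window_end):
    if window_start in pending_failures:
        pending_failures.discard(window_start)  # will succeed on the next attempt
        raise OSError("disk error")


day_start, day_end = datetime(2008, 4, 11), datetime(2008, 4, 12)
done = set()

run_batches(hourly_windows(day_start, day_end), flaky_process, alerts.append, done)
first_pass_ok = is_consistent(day_start, day_end, done)   # one gap remains

run_batches(hourly_windows(day_start, day_end), flaky_process, alerts.append, done)
second_pass_ok = is_consistent(day_start, day_end, done)  # gap filled in
```

The point of tracking finished windows separately is that a rerun is cheap and safe: it touches only the hours that failed, and end-user reports can check `is_consistent` before they are published.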
I think this is obvious too, but the prospect of adding complexity is scary (rightfully so). In this particular case, I think the benefits are too great to pass up. Also, from looking around, I think most other companies process their data hourly, but that’s based on a very quick survey.
[1] We don’t really take advantage of this, for fear of complexity.
[2] I know I made that word up. Is nounification really a crime?

