Batch Sizes

2008-04-11 02:16:08-0400 - processes (1),production (1),coding (1) - 20 comments

My girlfriend is into Operations Research. Like any boyfriend, I can’t help but get somewhat interested in what she’s doing. Over the summer, she let me read The Goal. Other than the fact that it got very repetitive about halfway through, it was a good read. Then again, the authors were trying to teach us something. (Repetitio est mater studiorum!)

In O.R. (Operations Research), one thing you learn a lot about is the factory. Factories are complicated entities, what with the complicated dependency chains, asynchronous processing, multiple failure points, and deadlines, just to mention a few of the problems. I learned about a buzzword process called the Theory of Constraints about six months ago, and have been pondering the importance of factory analysis ever since.

Fast forward to today. At the company I work for, we have a problem with log processing. It’s complicated. There are at least 4 different sets of rules, each with its own system. Every night, every single one of our >50 million rows of entries needs to be scanned and analyzed. This isn’t the difficult part, though. We recently moved to a second (and third) data center. So of course, our problems became exponentially worse. Now I have to start moving computation around (boy do I wish I had the infrastructure for Hadoop/MapReduce). Things fail more and more often (hard disk errors, network failures, etc.). How do we make this system more robust?

I made a realization: we’re going through the same set of problems that factories go through. A small failure in one spot propagates and ruins the rest of the system. Our initial approach (try to make everything “work” all the time; I know, naïve) is just starting to crumble. After trying to make errors more visible so that we could respond earlier, I finally realized what the problem was:

Our batch sizes are too large.

Maybe this is common sense, but I don’t think it is for most people, even though it is a very fundamental realization in Operations Management (for stick figures, read the summary section of this page). Currently, we do all of the processing nightly. This is nice for a couple of reasons:
  1. Simple (thus less prone to breakage)
  2. Processing runs during hours when the machines aren’t under load [1].

However, there’s one thing this is not good for: recoverability [2]. As any system grows, and as more and more dependencies get added on, the mean time between failures shrinks. This is a given. How we handle those failures is what matters. Do we let them propagate to the end user? Do we force them to wake up our sysadmin every night at 2am? I believe we can answer both of these with a resounding NO. We should build the system to expect failure. To do this, though, we have to reduce the batch sizes. So my new system will contain two major changes:

I. Cut batch sizes from day-long to hour-long

I think this is obvious. If we need to recover, we need to have time to respond. Lowering batch sizes will help for at least three reasons:
  1. The time between when the processing starts and when the end user sees the information is increased to 12 hours on average, which is that much more time to respond.
  2. Any failure results in a worst-case rerun of 25 minutes’ worth of log processing, instead of 5 hours.
  3. During the 8-hour business day, there are 8 chances to get notified of code flakiness.
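
The third point can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the 09:00 to 17:00 business day is an assumption (the post only says the day is 8 hours long). It splits one day into batch windows and counts how many of them close during business hours, since each one that closes is a chance to hear about a failure while someone is at their desk.

```python
from datetime import datetime, timedelta

def batch_windows(day_start, batch_length):
    """Split the 24 hours after day_start into consecutive (start, end) windows."""
    day_end = day_start + timedelta(days=1)
    windows = []
    start = day_start
    while start < day_end:
        end = min(start + batch_length, day_end)
        windows.append((start, end))
        start = end
    return windows

def chances_to_notice(windows, open_time, close_time):
    """Count the batches whose window closes during business hours."""
    return sum(1 for _, end in windows if open_time < end <= close_time)

day = datetime(2008, 4, 11)
open_time = day + timedelta(hours=9)    # assumed start of the business day
close_time = day + timedelta(hours=17)  # assumed end of the business day

hourly = batch_windows(day, timedelta(hours=1))
nightly = batch_windows(day, timedelta(days=1))

print(len(hourly), chances_to_notice(hourly, open_time, close_time))    # 24 8
print(len(nightly), chances_to_notice(nightly, open_time, close_time))  # 1 0
```

With one nightly batch, the only window closes at midnight, so nobody is around when it breaks.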

II. Allow the system to be resilient to failure

This is a little tricky. The basic idea is that whenever a piece of the system fails, we:
  • Notify a human of the error
  • Move on (eventual consistency)
  • Make it easy (or automatic) to fill in the gaps in subsequent processes
  • When reports are end-user facing, verify consistency
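
Here is roughly what that loop could look like in Python. It is a minimal sketch and not the system we actually run: `process_batches`, `flaky_process`, and the `alerts` list are hypothetical names, and a real version would page someone instead of appending to a list. A failed batch gets reported and remembered, the rest keep going, and a later pass fills in the gaps.

```python
def process_batches(batches, process, notify):
    """Run process(batch) for each batch. On failure, notify and move on.

    Returns the batches that failed, i.e. the gaps still to be filled in.
    """
    gaps = []
    for batch in batches:
        try:
            process(batch)
        except Exception as exc:
            notify(f"batch {batch} failed: {exc}")  # tell a human
            gaps.append(batch)                      # remember the gap, keep going
    return gaps

def is_consistent(gaps):
    """Reports are safe to show to end users only when no gaps remain."""
    return not gaps

# Demo: hour 3 hits a (simulated) disk error the first time it runs.
done = []
failed_once = set()

def flaky_process(hour):
    if hour == 3 and hour not in failed_once:
        failed_once.add(hour)
        raise OSError("disk error")
    done.append(hour)

alerts = []
gaps = process_batches(range(6), flaky_process, alerts.append)
print(gaps, is_consistent(gaps))  # [3] False

# A subsequent run retries only the gaps.
gaps = process_batches(gaps, flaky_process, alerts.append)
print(gaps, is_consistent(gaps))  # [] True
```

The point is that one bad hour costs one alert and one retry, and the other hours never notice.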

I think this is obvious, but the prospect of adding complexity is scary (rightfully so). In this particular case, I think the benefits are too great to pass up. Also, from looking around, I think most other companies process their data hourly, but that’s based on a very quick survey.

[1] We don’t really take advantage of this, for fear of complexity.
[2] I know I made that word up. Is nounification really a crime?

Comments on This Post:

Posted by yalu

United States

on April 14, 2008

OR is really useful in all sorts of companies. I'm proud ;-)

