Yet Another Business Case For Proactive IT Capacity Planning

Aligning IT and business goals is a widely understood mandate for today’s CIO. Toward that end, not just maintaining but increasing business process efficiency is de rigueur. To achieve this efficiency, ERP, CRM, and similar packages from vendors such as SAP, PeopleSoft, and Microsoft are pieced together with an array of tools and technologies with names like iViews, Web Parts, and so on.

The idea is, in large part, to save senior decision makers and/or their subordinates time by automating repetitive tasks that used to involve many manual operations supplemented by face-to-face meetings. So, anything that renders these new business process systems unavailable runs counter to attaining the efficiencies they’re meant to achieve.

If you’re lucky, your time-saving system might be unavailable for no more than a few seconds each day (e.g., because of a Web portal bottleneck caused by too many end users requesting its services at once, or a printer that’s simply out of paper) and felled by a real disaster (e.g., a flood, fire, or earthquake) for no more than a day or so once every decade or two.

The shorter, sometimes daily kind of interruption is seen as an inevitable annoyance but not a material threat to the core business. As often as not, these brief interruptions are due to inadequate investment in capacity planning and are only remedied from time to time, when upgrades to hardware or software are funded.

In contrast, business-continuity systems (which run in parallel with your business process systems) are maintained continuously to protect against the downtime that a disaster could deliver. The budgeted investment in these costly business-continuity systems is often justified with the help of calculations such as:

Availability = 100% x reached uptime / planned uptime

and

Reliability = 100% x MTBF / (MTBF + MTTR)

[MTBF = Mean Time Between Failures] [MTTR = Mean Time To Recover]
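
As a minimal sketch, the two formulas above can be written as small Python functions. The function and parameter names are illustrative assumptions, not taken from any particular tool, and the sample figures (one day of recovery per decade) are hypothetical values chosen to match the disaster scenario described earlier.

```python
def availability(actual_uptime: float, planned_uptime: float) -> float:
    """Availability as a percentage: 100 x reached uptime / planned uptime."""
    if planned_uptime <= 0:
        raise ValueError("planned_uptime must be positive")
    return 100.0 * actual_uptime / planned_uptime


def reliability(mtbf: float, mttr: float) -> float:
    """Reliability as a percentage: 100 x MTBF / (MTBF + MTTR)."""
    if mtbf < 0 or mttr < 0 or mtbf + mttr == 0:
        raise ValueError("mtbf and mttr must be non-negative and not both zero")
    return 100.0 * mtbf / (mtbf + mttr)


# Hypothetical example: one disaster per decade, one day (24 hours) to recover.
hours_per_decade = 10 * 365 * 24          # planned uptime, in hours
disaster_downtime_hours = 24              # assumed recovery time

decade_availability = availability(
    hours_per_decade - disaster_downtime_hours, hours_per_decade
)
decade_reliability = reliability(
    mtbf=hours_per_decade - disaster_downtime_hours,
    mttr=disaster_downtime_hours,
)

print(f"Availability over the decade: {decade_availability:.3f}%")
print(f"Reliability over the decade:  {decade_reliability:.3f}%")
```

Both functions use the same time unit for every argument (hours here), so the units cancel and only the ratio matters.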

When used in the normal budgeting process, these calculations rely on estimates of the substantial interruptions that could be caused by a rarely occurring disaster.

However, the more common loss that most users experience lasts for only a very brief period of time. So, in practice, the formulas are seldom applied to the aggregate of these brief interruptions over the same span, typically a decade or more, that separates one disaster from the next.

Let’s say, for purposes of discussion, that these frequent inconveniences occur, on average, for only 20 seconds a day, a number that’s a good deal smaller than is warranted by my personal experience. Assuming roughly 250 working days a year, this means something like 50,000 seconds (or 14 hours) over one decade and 100,000 seconds (or 28 hours) over two decades.
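
The arithmetic behind these totals can be checked with a short sketch. It assumes about 250 working days per year; the article does not state this figure, but it is the value that reproduces the 50,000-second total from 20 seconds a day. The function name is illustrative.

```python
SECONDS_PER_HOUR = 3600


def aggregate_downtime_seconds(
    seconds_per_day: float, working_days_per_year: int, years: int
) -> float:
    """Total downtime accumulated from brief daily interruptions."""
    return seconds_per_day * working_days_per_year * years


# Assumption: 250 working days per year (not stated in the article).
one_decade = aggregate_downtime_seconds(20, 250, 10)
two_decades = aggregate_downtime_seconds(20, 250, 20)

for label, seconds in (("one decade", one_decade), ("two decades", two_decades)):
    hours = seconds / SECONDS_PER_HOUR
    print(f"{label}: {seconds:,.0f} seconds (about {hours:.0f} hours)")
```

Counting all 365 calendar days instead of working days would raise the totals by nearly half, so the working-day assumption is the conservative one.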