High availability is as much a function of people and process as it is of technology.
It is not uncommon for companies to experience extended, annoying outages during peak periods or at critical points in time. Sometimes these are caused by uncontrollable external forces, but usually the downtime is the result of an internal flaw. More often than not, staff will attribute the cause to faulty hardware or software. There may be truth in the statement, but more likely the root cause was human error: an accidental mistake or a process problem. Murphy’s Law reigns in IT, as every day brings numerous opportunities for things to go wrong in ways that impact business operations.
The Main Components
In the IT world, everything is composed of one or more of three elements: people, process and technology. It is easy to blame the technology, as doing so absolves staff, and inanimate objects cannot defend themselves from verbal attacks. But applications, hardware, networks and other software that operate continuously rarely fail for no reason. Something unique happened for the failure to occur. Less than 1 percent of outages are the result of natural disasters, external events or data center relocations. Unscheduled unavailability events caused by system failures or operator error account for 15 percent of outages, with operator error the biggest culprit (40 percent of them). But the major cause of downtime is scheduled events (85 percent), such as backup/restores, maintenance, migrations, new applications going live or upgrades.
Thus, more than 90 percent of outages are the result of human error or faulty processes. This is fixable, and the fix might not require a capital investment; it can prevent revenue and productivity losses and can improve morale. The fix starts with management re-evaluating its operations handbook and its processes for making changes, and ensuring the processes are followed at all times.
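The "more than 90 percent" figure can be checked with a little arithmetic. The sketch below is one reading of the numbers quoted above, not a definitive model: it assumes that every scheduled event counts as a people-or-process outage and that the 40 percent operator-error figure applies to the unscheduled share only. The function and variable names are illustrative.

```python
# Shares of all outages, taken from the figures quoted in the text.
SCHEDULED_SHARE = 0.85          # backups/restores, maintenance, migrations, upgrades
UNSCHEDULED_SHARE = 0.15        # system failures plus operator error
OPERATOR_ERROR_FRACTION = 0.40  # assumed to be a fraction of the unscheduled share


def human_or_process_share(scheduled, unscheduled, operator_fraction):
    """Return the share of all outages attributable to people or process.

    Assumption: all scheduled downtime is process-driven, and only the
    operator-error slice of unscheduled downtime is counted as human error.
    """
    return scheduled + unscheduled * operator_fraction


share = human_or_process_share(
    SCHEDULED_SHARE, UNSCHEDULED_SHARE, OPERATOR_ERROR_FRACTION
)
print(f"People/process share of outages: {share:.0%}")
```

Under these assumptions the result is 85 percent plus 6 percent (40 percent of 15), or about 91 percent, which is where a claim of "more than 90 percent" would come from.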
There are two basic questions:
Are all the procedures fully documented?
Are they always followed?
If you are not getting availability in the 99.9 percent range (no more than about 43 minutes of downtime in a 30-day month) or better, then it is highly probable that the answer to one or both questions is “no.”
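An availability percentage translates directly into a monthly downtime budget. The following is a minimal sketch, assuming a 30-day month and counting all downtime (scheduled and unscheduled) against the target; the function name and the 30-day default are illustrative choices, not part of any standard.

```python
def downtime_budget_minutes(availability_percent, days=30):
    """Return the minutes of downtime allowed in a period of `days` days.

    Assumption: the period is a flat `days` x 24 hours and every minute of
    unavailability, planned or not, counts against the budget.
    """
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


# Compare a few common availability targets.
for target in (99.0, 99.9, 99.99):
    budget = downtime_budget_minutes(target)
    print(f"{target}% availability allows {budget:.1f} minutes of downtime per month")
```

For a 30-day month, 99.9 percent works out to roughly 43.2 minutes, in line with the figure quoted above; a shorter or longer month shifts the budget slightly.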
IT staff, in general, hate documentation; they do not like to write it, and they don’t like to read it, either. Also, many do not think the documented processes pertain to them. Outside of the staid mainframe shops, most IT staff learned their trade acting as “IT cowboys,” making changes whenever and however they wanted. Some matured, while others still act as they always did. They make updates or other changes to live systems, or to systems that could affect production, without regard to the impact on the rest of the organization should the change fail to go as expected. I would say “planned” instead of “expected,” but more often than not, cowboys don’t plan and use the “ready, fire, aim” method. Thus, daily operations are riddled with Murphy’s Law opportunities.
The way to address IT cowboys or other human error outages is strict adherence to well-documented processes. There need to be best-practice procedures for backup/recovery, changes, maintenance, upgrades and daily/monthly/quarterly/annual operations. These procedures need to address the process as it is expected to execute as well as what to do at each decision point and variant path.