On Aug. 8, 2016, Delta Air Lines experienced an extended six-hour outage, which some analysts and journalists falsely attributed to flaws in aging technology – i.e., TPF running on mainframe systems.
This trope failed to identify the real culprits, which are exposures common to all IT solutions – in public and private clouds and even more so in non-mainframe environments. IT executives should not alter their views about particular technologies or their decision-making based on biases and inaccurate reporting. Nor should business or IT executives assume that outages will become history if they move to the cloud.
Normally I do not write about a single system outage that impacts a company, but the Delta downtime became a cause célèbre.
On Aug. 8 at 2:38 a.m. EDT, a power outage hit the Delta data center, causing a global system failure that lasted six hours before business could begin to return to normal. Tens of thousands of passengers were stranded around the world, and all systems – check-in, flight scheduling and departures, airport screens, reservations, websites, etc. – were affected by the meltdown.
Getting all parts of the airline back to normal and all passengers to their final destinations actually took days, as hundreds of Delta flights and flight crews were out of position after the recovery.
So who’s to blame?
The True Story
The initial report that a power outage was the culprit was partially correct. According to Delta's COO, "a critical power control module at our Technology Command Center malfunctioned, causing a surge to the transformer and a loss of power. When this happened, critical systems and network equipment didn't switch over to backups. Other systems did. And now we're seeing instability in these systems." What the executive did not mention was that it all started when Delta's IT staff attempted a routine switch to its backup generator, which produced a spike that caused a fire in an Automatic Transfer Switch (ATS).
Thus, what Delta and its users experienced was in effect a two-step failure. First, the ATS fire and subsequent shutdown meant that a server farm of about 500 servers also shut down abnormally.
Second, Delta's staff then executed its standard failover process, switching over to the backup IT systems. But this process also failed, as critical systems and network equipment did not switch over to backup power. It was determined after the fact that about 300 of the approximately 7,000 data center components (of which the TPF mainframes were a very small fraction) had not been configured correctly for the available backup power and therefore remained offline.
Even before the details of the problem were made public, it was apparent that the power outage impacted only Delta: two separate power grids feed the site, and the local provider, the Atlanta utility Georgia Power, stated it was not responsible for the failure and had received no notifications of any outages in its territory.
In fact, Delta's passenger service system (PSS), like all major PSSs, is theoretically configured with no single point of failure – from the power supply through all equipment components and databases. But in Delta's case, either redundancy was lacking or the backup ATS failed to kick in as expected.
Next: Outage Track Records