As you are aware, Yipes Enterprise Services had an extended outage associated with a scheduled maintenance on August 9th. Due to the severity of the outage, I felt it was important to provide you with a summary of the Root Cause Analysis recently completed on this outage, to detail the procedural changes that have been implemented to avoid this type of outage in the future, and to communicate our commitment to the quality and reliability of our service to your company.
Summary of Root Cause Analysis
Yipes scheduled a maintenance window to upgrade our core network routers and edge switches to the latest Generally Available (GA) Extreme Operating System 7.0.1 Build 11. This code had been certified within our own network lab, and procedures were developed and documented for implementing the new code. As part of our process for implementing new code into our production network, we introduce the new code into an initial market and monitor it for a week or more. The code was successfully implemented into our first market on July 31st, utilizing the procedures developed between Extreme and our internal lab. The code soaked successfully for a week with no feature interaction, code failures, network failures, service failures or code-related error messages. As a result, a maintenance schedule was developed to implement the code throughout our production network. Prior to the failure in the Chicago market, three markets had been successfully upgraded to the new Extreme 7.0.1 Build 11 OS utilizing the procedures r
Following the same procedures used in previous markets, the load of the new Extreme OS began in Chicago on August 9th. The procedure to implement the new OS has 3 major steps: 1) run a script to download 6.22 build 68, then reboot devices; 2) run a script to download BootROM 7.8, then reboot devices; and 3) run a script to download 7.0.1 build 11, then reboot devices. Step 1 completed successfully. Step 2 failed halfway through the download process, and Step 3 was never attempted.
Analysis determined that the BootROM 7.8 code was corrupted in the download process, affecting 23 of the 60 devices in the network. The current Operating System within the Yipes network does not validate the code prior to accepting the download. The new 7.0.1 B11 OS has a CheckSum process that will ensure corrupted code cannot be loaded into the device. It was this corrupted code that disabled most of the Chicago network. As a result of the corruption, the Yipes NOC was unable to back out to the previous code, either remotely or locally. At this point, approx. 30 minutes after the start of the maintenance, Yipes contacted Extreme technical support for assistance.
Extreme technical support and Yipes Tier 4 technical support, along with the NOC, developed a course of action to recover the network. The recovery process required an Extreme Engineer with the BootROM code 6.22 build 68 loaded onto a PCMCIA card to locally upload the BootROM onto each affected device. Plans were made for the nearest engineer to fly into Chicago to perform the procedure. In the interim, Yipes used all available maintenance spares and working equipment not serving customers to replace affected devices. Extreme also worked with the local depot to send over all available spares in the Chicago market. Over the next several hours, devices were replaced, tested and placed into service. The Extreme Engineer arrived and the remaining devices were restored via the PCMCIA process. Because of the nature of the outage and the requirement to perform tasks locally to restore the devices, the outage lasted far longer than a typical maintenance failure.
Next Steps - Process Improvements
Our technical team, working with Extreme, has developed 2 additional procedures to ensure this type of failure does not occur again.
* The team will perform CheckSum and FileCompare procedures on the downloaded code to ensure it matches the GA code.
* The network will be further segmented to lessen the impact of loading code into an entire MSA.
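The CheckSum and FileCompare step in the first bullet could look roughly like the sketch below. This is an illustration only, not the procedure Yipes or Extreme actually use: the function names (`sha256_of`, `image_matches_reference`), the choice of SHA-256, and the idea of holding a trusted local copy of the GA image are all assumptions made for the example.

```python
import filecmp
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_matches_reference(downloaded: Path, reference: Path) -> bool:
    """Hypothetical pre-install check: checksum first, then byte-by-byte compare.

    `reference` is assumed to be a trusted copy of the GA image.
    """
    if sha256_of(downloaded) != sha256_of(reference):
        return False
    # shallow=False forces a comparison of file contents, not just metadata.
    return filecmp.cmp(downloaded, reference, shallow=False)
```

A rollout script built on this idea would refuse to install, and would skip the reboot, on any device where `image_matches_reference` returns `False`.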
I understand the impact an outage like this has on your business and believe we have taken the appropriate steps to substantially mitigate the risk of an outage from future loads of an OS into our network. This new code gives us significantly better monitoring, stability and check processes that will translate into a better service experience for you and your customers. While this communication is meant only to summarize the outage and its Root Cause, we are prepared to go into much more detail at your request. If that is something you would like, please contact your sales representative to request a meeting.
We thank you for your business and your patience. Yipes is committed to improving on your recent experience and demonstrating that we deserve your business.
Executive Vice President - Sales and Operations
Yipes Enterprise Services
You can read more about it there.
The downtime lasted around 6-7 hours.