FDC's response to unstable network issues from Yipes

  #1  
Old 08-13-2003, 06:59 PM
viastudio
New Member
 
Join Date: May 2003
Posts: 1



From: SALES - FDCservers.net
To: Chuck Hogg
Cc:
Subject: Yipes network downtime 08/10/2003 - Yipes explanation


Reason For Outage: August 10, 2003

Dear Customer,

As you are aware, Yipes Enterprise Services had an extended outage associated with a scheduled maintenance on August 9th. Due to the severity of the outage, I felt it was important to provide you with a summary of the Root Cause Analysis recently completed on this outage, detail the procedural changes that have been implemented to avoid this type of outage in the future, and communicate our commitment to the quality and reliability of our service to your company.

Summary of Root Cause Analysis

Yipes scheduled a maintenance window to upgrade our core network routers and edge switches to the latest Generally Available (GA) Extreme Operating System 7.0.1 Build 11. This code had been certified within our own network lab, and procedures were developed and documented for implementing the new code. As part of our process for implementing new code into our production network, we introduce the new code into an initial market and monitor it for a week or more. The code was successfully implemented into our first market on July 31st, utilizing the procedures developed between Extreme and our internal lab. The code soaked successfully for a week with no feature interaction, code failures, network failures, service failures or code-related error messages. As a result, a maintenance schedule was developed to implement the code throughout our production network. Prior to the failure in the Chicago market, three markets had been successfully upgraded to the new Extreme 7.0.1 Build 11 OS utilizing the procedures referenced above.

Following the same procedures used in previous markets, the load of the new Extreme OS began in Chicago on August 9th. The procedure to implement the new OS has 3 major steps: 1) Run a script to download 6.22 build 68, then reboot devices, 2) Run a script to download BootROM 7.8, then reboot devices, and 3) Run a script to download 7.0.1 build 11, then reboot devices. Step 1 completed successfully, Step 2 failed halfway through the download process, and Step 3 was never attempted.

Analysis determined the BootROM 7.8 code was corrupted in the download process, affecting 23 of the 60 devices in the network. The current Operating System within the Yipes network does not validate the code prior to accepting the download. The new 7.0.1 B11 OS has a CheckSum process that will ensure corrupted code cannot be loaded into the device. It was this corrupted code that disabled most of the Chicago network. As a result of the corruption, the Yipes NOC was unable to back out to the previous code, either remotely or locally. At this point, approx. 30 minutes after the start of the maintenance, Yipes contacted Extreme technical support for assistance.
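As an illustration of the kind of validation described above, here is a minimal sketch of a checksum test run before an image is accepted. It is not Extreme's CheckSum process, whose details are not given in the letter; the SHA-256 choice, the function names and the placeholder image bytes are all assumptions made for the example.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of an image held in memory."""
    return hashlib.sha256(data).hexdigest()


def is_image_intact(image: bytes, expected_sha256: str) -> bool:
    """Accept the image only if its digest matches the published one."""
    return sha256_of(image) == expected_sha256.lower()


# Placeholder bytes standing in for a firmware image (hypothetical data).
published = b"BootROM image bytes (placeholder)"
expected = sha256_of(published)  # digest that would ship with the GA release

# Simulate a download whose last byte was corrupted in transit.
corrupted = published[:-1] + b"\x00"

print(is_image_intact(published, expected))
print(is_image_intact(corrupted, expected))
```

A device or script that refuses to install any image failing this test would never reach the reboot step with damaged code.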

Extreme technical support and Yipes Tier 4 technical support, along with the NOC, developed a course of action to recover the network. The recovery process required an Extreme Engineer, with the BootROM code 6.22 build 68 loaded onto a PCMCIA card, to locally upload the BootROM onto each affected device. Plans were made for the nearest engineer to fly into Chicago to perform the procedure. In the interim, Yipes used all available maintenance spares and working equipment not serving customers to replace affected devices. Extreme also worked with the local depot to send over all available spares in the Chicago market. Over the next several hours, devices were replaced, tested and placed into service. The Extreme Engineer arrived and the remaining devices were restored via the PCMCIA process. Due to the nature of the outage and the requirement to perform tasks locally to restore the devices, the outage was very extended.

Next Steps - Process Improvements

Our technical team, working with Extreme, has developed 2 additional procedures to ensure this type of failure does not occur again.
* The team will perform CheckSum and FileCompare procedures on the downloaded code to ensure it matches the GA code.
* The network will be further segmented to lessen the impact of loading code into an entire MSA.
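
The segmentation idea in the second bullet can be sketched as a batched rollout that stops as soon as any device in a batch fails, so the remaining devices are never touched. This is only an illustration under assumptions: the device names, the batch size of 10 and the `upgrade` callback are hypothetical and do not come from the letter.

```python
from typing import Callable, Iterator, List, Tuple


def segments(devices: List[str], size: int) -> Iterator[List[str]]:
    """Split the device list into consecutive batches of at most `size`."""
    for start in range(0, len(devices), size):
        yield devices[start:start + size]


def staged_rollout(
    devices: List[str], size: int, upgrade: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Upgrade one batch at a time; halt after the first batch with a failure.

    Returns (attempted, failed) so the operator can see the blast radius.
    """
    attempted: List[str] = []
    failed: List[str] = []
    for batch in segments(devices, size):
        for device in batch:
            attempted.append(device)
            if not upgrade(device):
                failed.append(device)
        if failed:
            break  # leave every later batch on the old code
    return attempted, failed


# Hypothetical 60-device market where one upgrade goes wrong.
devices = [f"switch-{i:02d}" for i in range(60)]
attempted, failed = staged_rollout(devices, 10, lambda d: d != "switch-12")
print(len(attempted), failed)
```

In this run the failure lands in the second batch, so 20 devices are attempted and the other 40 stay untouched.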

Conclusion

I understand the impact an outage like this has on your business and believe we have taken the appropriate steps to substantially mitigate the risk of an outage from future loads of an OS into our network. This new code gives us significantly better monitoring, stability and check processes that will translate into a better service experience for you and your customers. While this communication was meant merely to summarize this outage and the Root Cause, we are prepared to go into much more detail at your request. If that is something you would like, please contact your sales representative to request a meeting.

We thank you for your business and your patience. Yipes is committed to improving on your recent experience and demonstrating that we deserve your business.


Sincerely,

Tim Mason
Executive Vice President - Sales and Operations
Yipes Enterprise Services



  #2  
Old 08-13-2003, 07:16 PM
wubwob
Web Hosting Master
 
Join Date: Apr 2003
Posts: 735
how long did the downtime last?

  #3  
Old 08-13-2003, 07:18 PM
Cirtex
WebHostingTalk Lover
 
Join Date: Mar 2003
Location: New York City
Posts: 7,392
Quote:
Originally posted by wubwob
how long did the downtime last?
http://www.fdcservers.net/vbulletin/...?s=&forumid=11

You can read more about it there.
the downtime lasted around 6-7 hrs

  #4  
Old 08-13-2003, 07:24 PM
WII-Aaron
Community Guide
 
Join Date: Apr 2002
Location: Kansas City, MO
Posts: 2,472
Quote:
Originally posted by Hoobastank68
http://www.fdcservers.net/vbulletin/...?s=&forumid=11

You can read more about it there.
the downtime lasted around 6-7 hrs
Ouch. I thought they had a backup.

Aaron
