Postmortem Report: lucy.mxrouting.net Server Outage
All times referenced are in US/Central time zone.
Deserved Credit
Jeff, the owner of our server provider QuickPacket, drove just over 800 miles and spent a night in Ashburn to help us recover from this event. Will your hosting provider do that for you? Because ours deserves a damn standing ovation.
Incident Overview
Initial Outage (September 24):
At approximately 2:00 PM on September 24, the Lucy server crashed and then failed to come back up on reboot. Investigation revealed a malfunction in the RAID controller. By 9:00 PM, we had transferred the drives into a new chassis with a new motherboard and RAID controller. However, the crash had caused file system damage, which delayed full restoration of the server until around 1:00 AM on September 25.
Subsequent Developments
Chassis Replacement (December 5):
Because the temporary chassis was not suitable for running Lucy long term, a chassis swap was scheduled and carried out on December 5 at around 10:30 PM. The swap was expected to be brief, but complications arose.
Complications During Swap:
After the swap, the server's operating system failed to boot, displaying numerous "input/output" errors. Despite repeated attempts and several variations on our repair approach, the system remained unbootable.
Disaster Recovery Efforts
Provider Intervention:
Our service provider, QuickPacket, dedicated significant hands-on effort to the recovery: team members Jeff and Paul worked to physically address the server issues.
Backup Restoration Challenges:
Concurrently, we initiated disaster recovery protocols, only to discover that a critical step had been missed in our backup software, JetBackup. This oversight necessitated a full rsync of 8TB of backups from the backup server to a new server, significantly prolonging the recovery process.
Critical Turning Points
December 6 Developments:
The original server was declared inoperable on the evening of December 6; we suspected a hardware issue that was being masked by file system errors.
Final Attempt to Revive Original Server (December 7):
On the morning of December 7, Jeff drove 400 miles to Ashburn for a last-ditch effort to revive Lucy. This involved building a new server and transferring the drives into it. The attempt initially appeared successful, but the server soon reverted to its previous "input/output error" state.
Implementation of Plan B
New Strategy:
Abandoning efforts to revive the original server, we shifted focus to rebuilding the Lucy server in Virginia, prioritizing user access over data recovery.
Successful Reconstruction (December 7):
By the night of December 7, we had reconstructed the basic framework of the Lucy server and restored user access. However, some data, including emails and domain passwords, was initially lost or misplaced.
Ongoing Recovery and Restoration
Data Recovery:
Efforts are ongoing to restore previous emails and remaining functionality. We are prioritizing data retrieval from the original server, followed by restores from the backup server.
Technical Challenges:
We are working through database migrations and compatibility issues between MySQL versions. We anticipate some unique, isolated issues and are prepared to handle them as they arise.
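For readers curious what a migration like that involves, below is a minimal sketch of the generic dump-and-restore approach for moving databases between MySQL versions. This is not our exact tooling: the hostnames and database names are hypothetical, the target databases are assumed to already exist, and credentials are assumed to come from a local ~/.my.cnf.

```python
#!/usr/bin/env python3
"""Rough sketch of a dump-and-restore migration between MySQL versions.

Hostnames and database names are made up for illustration; credentials are
assumed to come from ~/.my.cnf, and target databases are assumed to exist.
"""
import subprocess

SOURCE_HOST = "old-lucy.example.net"   # server running the older MySQL
TARGET_HOST = "new-lucy.example.net"   # server running the newer MySQL


def migrate_database(db_name: str) -> None:
    """Dump one database from the source and stream it into the target."""
    # --single-transaction gives a consistent snapshot of InnoDB tables
    # without locking; --routines/--triggers include stored code that a
    # plain table dump would omit.
    dump = subprocess.Popen(
        [
            "mysqldump",
            "--host", SOURCE_HOST,
            "--single-transaction",
            "--routines",
            "--triggers",
            db_name,
        ],
        stdout=subprocess.PIPE,
    )
    # Pipe the logical dump straight into the newer server. Restoring SQL
    # rather than copying raw data files sidesteps on-disk format changes
    # between MySQL versions.
    restore = subprocess.run(
        ["mysql", "--host", TARGET_HOST, db_name],
        stdin=dump.stdout,
    )
    dump.stdout.close()
    if dump.wait() != 0 or restore.returncode != 0:
        raise RuntimeError(f"migration of {db_name} failed")


if __name__ == "__main__":
    for db in ("roundcube", "user_example_db"):  # hypothetical databases
        migrate_database(db)
```

A logical dump-and-restore is slower than copying data files, but it avoids the binary incompatibilities that tend to surface when jumping between MySQL versions.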
Reflection and Future Measures
Disaster Recovery Plan Assessment:
This incident, the first of its magnitude since 2013, highlighted significant shortcomings in our disaster recovery plan. We are going back to our original plan, which is built around continuous rsync to a backup server. Backup software is nice and all, but at our scale and with the way we deploy, it's not the best choice for this type of event.

Our original plan was simply to rsync every server to a backup server. Had we done that here, we could have had users back online in 2-4 hours; the only longer task would have been syncing back the emails users already had in their mailboxes prior to the event, and they could have sent and received new email during that time. Our original plan was superior, and we're returning to it. Once we finish deploying the "new old" backup strategy, this series of events can never occur again.
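For the curious, here is a minimal sketch of what that plan looks like in practice, assuming a single backup server that pulls from each mail server over SSH. The hostnames, paths, and interval are illustrative, not our actual configuration, and in production this would run from cron or a systemd timer rather than a loop.

```python
#!/usr/bin/env python3
"""Rough sketch of the "rsync every server to a backup server" approach.

Hostnames, paths, and the interval are illustrative assumptions. This is
meant to run on the backup server, pulling over SSH as root.
"""
import os
import subprocess
import time

SERVERS = ["lucy.example.net", "another.example.net"]
# Directories that matter for rebuilding a mail server from scratch.
# Databases should be dumped separately; a raw copy of a live database
# directory is not crash-consistent.
PATHS = ["/etc/", "/home/", "/var/spool/cron/"]
BACKUP_ROOT = "/backups"


def mirror(server: str) -> None:
    """Keep a current copy of one server's important directories."""
    for path in PATHS:
        dest = f"{BACKUP_ROOT}/{server}{path}"
        os.makedirs(dest, exist_ok=True)
        # -a preserves permissions, ownership, and timestamps; --delete
        # keeps the mirror in step with the source; -z compresses in
        # transit. Only changed files are transferred on each pass.
        subprocess.run(
            ["rsync", "-az", "--delete", f"root@{server}:{path}", dest],
            check=True,
        )


if __name__ == "__main__":
    while True:
        for server in SERVERS:
            mirror(server)
        # A cron job or systemd timer would replace this sleep loop in
        # practice; the point is that the mirror stays nearly current.
        time.sleep(6 * 60 * 60)
```

The payoff is the one described above: if a server dies, a fresh machine can be loaded from the mirror and users brought back online quickly, while older mailbox contents sync back in the background.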
In Closing:
Our commitment to service reliability remains unwavering. We regret the inconvenience caused by this outage and are taking robust measures to prevent a recurrence of such an event. We appreciate your patience and understanding as we continue to enhance our systems for better resilience and reliability.