I failed, I'm sorry
Hey friends,
I come to you in a public space to do a few things, and to do them all without my beloved ChatGPT. Let me make myself an outline so I don't forget what I'm here to do:
1. I hope this message finds you well. Just kidding.
2. Explain what happened
3. Explain what we’re doing
4. Explain what was learned from it
5. Ask for your forgiveness
6. Ask for a second chance
So here's what happened
On September 24th the RAID controller failed on the lucy.mxrouting.net server. Through incredible late-night efforts, including such rockstars as Jeff & Fran, the OS was recovered and booted after a roughly 12 hour outage. However, its disks were transplanted into a 12 bay server. I didn't want to pay for that chassis, and Jeff didn't want to waste that chassis on a 4 disk server. We both knew it was a temporary home, but it was the fastest way to get the job done by remote hands. It's not that remote hands are incapable, it's that they're not experts on your hardware, and the simplest instructions have the least chance of mistakes. Anyway, next.
The chassis swap needed to be coordinated between myself and Jeff, and my schedule was pretty booked; we're talking about the days and weeks surrounding Black Friday, so I'm pretty busy. We managed to set a time for a chassis swap back to the repaired server (new RAID controller, at least) on December 5th. It's a disk swap scenario, which is barely worse than a reboot. In fact, we had rebooted beforehand just to test that it came back fine. So we went ahead with the maintenance, and Lucy wouldn't boot in the other chassis. You see, here's the big problem with rebooting this server: MySQL will shut down as it damn well pleases, in 6-8 hours. I may be exaggerating, but any way you spin it, if we wait for OS services to spin down normally this becomes a multi-hour outage. So both the test reboot and the power off for the chassis swap were hard power events. It's fine, a little InnoDB recovery happens sometimes; it's not a huge deal and we have backups. Little did we know, each one of those hard power events was taking more and more of a toll on an already barely stable file system (never again, XFS), and this last one was the straw that broke the camel's back.
So there we are on December 5th, around 10PM, with a Lucy that just isn't going to boot. I'm not great with boot stuff, so I ask for help. With the great minds of Jeff and Paul, and some advice from Fran along the way, we get it booted after an xfs_repair (after that segfaults a few times, and not due to a memory issue). Then the "input/output error" messages start flooding the console; it's fucked. Reboot, no go. Another xfs_repair, it boots. Input/output error. You see the pattern, and every time we fill the lost+found folder with new guests.
At this point I begin working to restore backups to a Hetzner server, because my backups are there and transferring several TB is best done within the same datacenter. Even so, because they're incremental backups and not archives, transferring them to a nearby server was going to take days (and it did take days).
So there I am having automated my current step, while the guys are poking around at the original server trying to help me save face. I'm transferring backups, not exactly a process that needs me to baby it. After so much time and effort, I become convinced that it's a hardware issue with the original server. It doesn't matter to me that Jeff double checked and tested every single thing in this server. I'm not sold. I wish I could tell you I didn't cry over this, but as soon as everyone else was asleep I was in tears. This is everything to me, this is how I have a roof over our heads and food on the table. If I fail, we fail. Jeff felt that pain and drove 400 miles to build me a brand new server and put those disks in. We got Lucy back up for about 30 minutes. I was there repairing what xfs_repair took, copying key system files from an identical system, reinstalling perl, rebuilding services. Just when I get it all working and Jeff is questioning his sanity, boom. Input/output error. He was right, I was wrong, but that beautiful man still did this for me. Remember QuickPacket when you need a dedicated server, but anyway.
I booted into a recovery ISO, turned on networking, ran xfs_repair again, mounted the file system, and began using that as the framework for restoring Lucy to a new server in Virginia. I had to rebuild domain password files for a few hundred domains. I had to rebuild alias files, carefully copy what was needed into passwd, shadow, and group, you know the job. I was restoring by hand, not from the multiple TBs of backups all the way in Germany. When I finished creating the accounts and their file/folder structures, I turned it on and told users that their emails would be coming back in two rsync jobs: one from the previous server's mounted FS, and one from the maildirs in the incremental JetBackup backups.
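For the curious, those two passes looked roughly like the sketch below. This is illustration only: the hostnames and paths are placeholders, and the --ignore-existing flag is just my way of showing "don't clobber mail that already landed in an earlier pass," not necessarily the exact flags used.

```python
# Rough sketch of the two restore passes, per account. Hostnames and paths are
# placeholders; --ignore-existing keeps a later pass from overwriting mail that
# an earlier pass already restored.
import subprocess

accounts = ["user1", "user2"]  # in reality, every account rebuilt on the new Lucy

# Pass 1: the old server's file system, mounted from the recovery ISO
OLD_FS = "root@old-lucy:/mnt/recovered/home"
# Pass 2: maildirs dug out of the incremental JetBackup data in Germany
JB_MAILDIRS = "root@backup-de:/backups/lucy/home"

for user in accounts:
    for source in (OLD_FS, JB_MAILDIRS):
        subprocess.run(
            ["rsync", "-a", "--ignore-existing",
             f"{source}/{user}/", f"/home/{user}/"],
            check=False,  # a maildir missing from one source shouldn't stop the run
        )
```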
We got it all settled and I declared Lucy fixed on the afternoon of December 10th. I wasn't done; I still had tickets to take care of for one-off fixes all over the box. But none of them were connected to each other or to globally shared issues, so at that point I considered it "done" but kept working.
Between traversing every directory in /home on a ~2800 user box, resellers frantically running the DirectAdmin backup system to gzip and transfer their backups for safekeeping, and the RAID still syncing, we had about 25% iowait on average with occasional spikes of up to 50%. So while I had come up with a better backup plan that would help me recover from this type of event faster, I wasn't going to bring the box to its knees to enact it. That was going to go into effect on the afternoon of December 16th, once iowait had dropped to 0% and things were settled enough to be worth taking a new backup. I went to bed the night before, set an alarm, and knew that I'd be doing backups after lunch. Murphy, unfortunately, had other plans.
On the morning of December 16th, bright and early (just after 5AM my time) I received a page from customers. Just a page, as Lucy was checking all of the boxes for monitoring (ping, SMTP). It seemed that parts of the server were inaccessible. In fact, I couldn’t SSH into it. There were permission errors across the IPMI console. I rebooted. After all, we’re on ext4 now, it’s not as sensitive as XFS seemed to be under our load/conditions. Then I get an error from the RAID controller about a “foreign configuration.” As with boot problems, hardware RAID lingo isn’t my strong point. I called in help. It was determined that at least 1 drive had been dropped from the RAID. We had remote hands reseat the drives, no dice. After a lot of work, again by Jeff who is a master at these things that I’m not, things weren’t looking much better. We still don’t know with 100% certainty what happened, and we’re still working to recover what we can from that server, but this is what we think happened:
With all of the repair operations digging through the folders on /home and with resellers doing so many backup jobs at once, we believe that one of the disks took so long to sync that the controller gave up on it. I wasn't monitoring the RAID yet, and I wasn't even done with my work on the box yet, so I wouldn't have noticed at the time. We then believe a second drive started failing, as it began showing SMART errors that weren't there when we provisioned the box, and as you probably know, that's the death moment for RAID10: you can survive one drive dropping out of sync, but lose its mirror partner too and that's the end.
Here’s what we’re doing
The backups that I had been transferring to a new nearby server in Germany, just in case we couldn't resurrect Lucy quickly enough in Virginia, had finished transferring days ago. However, since JetBackup requires that you check an additional box to enable a feature that makes restoring backups not absolute hell (and leaves it off by default), I still had to package them all up and restore them one by one, account by account. After scripting it, I broke the job into 12 tasks (each fed a list of 250 or fewer accounts) to perform the tar and restore on each account. Those restores are still running. As I'm writing this sentence, we've restored 1009 of them. I expect the rest to be done by the end of tomorrow (Monday, US/Central).
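If you're wondering what "12 tasks of 250 or fewer accounts" looks like in practice, here's a minimal sketch. The paths and the worker script are made up for illustration; the real tar/restore step runs against the JetBackup data and the panel's restore tooling, not a script named like this.

```python
# Minimal sketch of splitting ~2800 accounts into batches and handing each
# batch to its own worker. Paths and the worker command are placeholders.
import subprocess
from pathlib import Path

accounts = sorted(p.name for p in Path("/backups/lucy/accounts").iterdir())  # hypothetical layout

BATCH_SIZE = 250  # 12 batches of 250 or fewer covers ~2800 accounts
batches = [accounts[i:i + BATCH_SIZE] for i in range(0, len(accounts), BATCH_SIZE)]

for n, batch in enumerate(batches, start=1):
    list_file = Path(f"/root/restore-batch-{n}.txt")
    list_file.write_text("\n".join(batch) + "\n")
    # Each worker loops over its list, tars up the account's backup, and feeds
    # it to the restore tool. The command below is a placeholder, not the real one.
    subprocess.Popen(["/root/restore-worker.sh", str(list_file)])
```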
We're cloning all 4 disks from the RAID on the previous Lucy server in hopes that we might get around any possible disk issues and be able to focus on recovering whatever we can from its file system, because that's where the last week of email for those users resides. Because, again, the first backup of the new server was scheduled to take place after this outage occurred.
We're removing JetBackup on all systems and moving to an rsync backup. With this, we can rebuild a server's framework more quickly, get users back online, and then transfer email data over afterward with ease. Of course, an rsync of MySQL while it's running is a little shitty, but again, a little InnoDB recovery isn't too bad, and we're talking about a scenario that currently holds a "once in 10 years" stat for us.
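To give a sense of the direction (not the exact job), the idea is essentially a plain mirror of the important paths to a backup host. The hostname and destination below are placeholders, and which paths get included is my assumption for illustration.

```python
# Sketch of an rsync-style mirror backup: config, mail data, and a live copy of
# the MySQL data dir. Host and paths are placeholders, not the real setup.
import subprocess

DEST = "root@backup-de.example:/backups/lucy/"   # hypothetical backup host and path
PATHS = ["/etc", "/home", "/var/lib/mysql"]

for path in PATHS:
    subprocess.run(
        ["rsync", "-aH", "--delete", "--numeric-ids", path, DEST],
        check=True,
    )
# A copy of a running MySQL data dir is crash-consistent at best, which is the
# "a little InnoDB recovery" trade-off mentioned above.
```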
We also credited every user $25 because our most expensive reseller plan is $25/m, and while I wanted to make a grander gesture, I do need to keep the lights on and we are an intentionally low margin service.
Here’s what we learned
Servers of this size cannot be backed up using traditional, easy backup systems. A whole snapshot of the server would take weeks; a gzipped archive of each account would take a week to finish one run. I can recover from a copy of the server's file system within a few hours, and it's not like this is an everyday situation. Hell, it took 10 years to see one like it. Also, thank you, JetBackup, for making it optional and off by default to export the JB config to the backup server.
I ask for your forgiveness
As a long time provider in the hosting community, I’ve failed to be the best version of me that I could be. I wasn’t as ready for this as I thought I was. I’m sorry. I hope that you can forgive me.
A second chance?
I ask for a second chance. I’ve learned a lot from this. I will be more ready for this type of event in 1-2 weeks and I SWEAR THAT IS NOT A CHALLENGE, MURPHY.
Answering these questions in advance
I get it. Email storage should be replicated. All servers should have a failover. Everything should be WebHostingTalk's incorrect definition of "cloud" (high availability). I didn't set out to just make another email service; I set out to model it after shared web hosting, because that's what I knew and that's how I could outsource the frontend to a company like DirectAdmin (or previously, cPanel). I wanted to master outbound delivery, not remake Gmail and charge their prices. This is what I made, and this is what I'm committed to. I can do better without throwing the entire business plan in the garbage and starting from a clean slate with a set of investors and developers, only to be "yet another $5/m per email user" provider. I appreciate the thoughts and concern, and I do appreciate advice, but I'm really not in the mood to hear all of that right now. Say what you want, I just don't want (not that I can have what I want) to hear "Why is each server a standalone box, with no failover connected to a $400,000 storage array, for services often billed during promotions as low as $5/year?" We can't deliver revolutionary pricing, virtually limitless users per account, and true high availability all at the same time. But we do pretty damn good at availability, relatively speaking. I mean, it's not like those other companies never have issues either. We just have to be more creative.