What we learned this week

Over this past weekend and the days that followed, we learned some valuable things that we'd like to share with you.

How did we get here?

Originally I said that we had no intention of migrating cPanel servers to DirectAdmin, though we were provisioning new customers on DirectAdmin servers. As the weight of cPanel's dramatic price increase took hold, we found that we were unable to secure the licensing deals that would have allowed us to barely float above water. Those deals had been factored into our projected overhead increase of a bit over $10,000 per year, which means that figure was actually a fair bit lower than what we ended up paying.

What happened?

Because of the weight of the cPanel pricing increase, I decided to start experimenting with a migration from cPanel to DirectAdmin. I didn't know whether this was going to work well at all; I needed to go through the process and see what worked and what failed. I started with a server that desperately needed a hardware upgrade, to take care of two problems at once (that server migration was happening regardless). This was our London server (london.mxroute.com).

It took one day to perform the backups necessary to start the process, another day to transfer them to the new server, another to perform the restorations, and almost another to sync the changes that had occurred on the old server during those days.

After the restores were complete, I was impressed by how much had transferred perfectly; by my own accounting, nearly everything had come over flawlessly. This is where I faced a difficult decision, and it can be disputed whether or not I made the right one.

What did we learn?

It can be very difficult to gauge the level of failure in a particular change when it's viewed only through my own perspective. What seems to me like a flawless change can look very different to customers who use features I hadn't thought to test, or who have use cases I hadn't imagined. This was true with the migration of the London server, and I'd like to share what I've learned didn't go over so well.

  • Notification, or lack thereof.

I spent days testing to see if this could even work, and I didn't want to confuse customers by emailing them all to say "Hey, sometime in the next week this might change to that, or it might not, we'll have to see." Yet, at the end of it, I didn't want to waste the days I had spent working toward something I had deemed a success. The result was that the change landed on customers with little to no warning.

  • Catch-alls didn't migrate.

Catch-all accounts might be expendable to some, but they might also be vital to others. That they didn't migrate was an oversight, and one whose impact varies from customer to customer.

  • Domain forwarders didn't migrate.

Honestly, I didn't even realize customers were using these: forwarders for an entire domain, where mail to user@domaina is passed along to user@domainb, whatever "user" happens to be. This was another oversight.

  • Filters didn't migrate.

This one I was aware of, and they couldn't have been migrated, as the two filtering systems are incompatible. However, a clearer plan for how to handle this would have been ideal.

  • SSO in WHMCS was broken.

You couldn't just log in to DirectAdmin from portal.mxroute.com the way you could with cPanel before. More customers relied on this than I had realized, and it led to the discovery of the next issue. This has since been fixed.

  • Users don't know their cPanel passwords.

Many users came to us not knowing their cPanel password, and therefore not knowing their DirectAdmin password either, since the same passwords carried over. This was an oversight that could have been avoided.

  • CentOS 7 default limits were off.

Several system limits were way off at their defaults for this kind of workload. Most noteworthy was the max_user_watches value. I would never have anticipated this, and can't imagine what led to such an odd scenario, one I'd never faced before. Still, I'll know to look for it next time.
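
For anyone curious, here's a minimal sketch of what checking a limit like this can look like. It's only an illustration: max_user_watches is the value named above, while the second limit in the list is included purely as an example and isn't necessarily something we changed.

    # Minimal sketch: print a couple of kernel limits that can matter on a busy
    # mail server. Only max_user_watches is the one called out above; fs.file-max
    # is shown purely for illustration.
    from pathlib import Path

    LIMITS = {
        "fs.inotify.max_user_watches": "/proc/sys/fs/inotify/max_user_watches",
        "fs.file-max": "/proc/sys/fs/file-max",
    }

    for name, path in LIMITS.items():
        try:
            value = Path(path).read_text().strip()
        except OSError:
            value = "unreadable"
        print(f"{name} = {value}")

Any lasting change to values like these would then belong in a sysctl configuration file rather than being set by hand.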

  • Dovecot needed to be tweaked.

Dovecot performance quickly tanked after the migration went live. It turns out this was mostly due to the previous issue, but we still needed to tweak some settings for the level of traffic going through the server.

  • Roundcube contacts did not port over.

This has been an issue in most migrations. It was an easy fix when cPanel defaulted to using MySQL for the Roundcube database, but they now use SQLite, and it's not reasonable to port that data over in bulk. It impacted only the few customers who use contacts, but it could have been better planned for.
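
If we go through this again, one option would be to pull contacts out of the old SQLite files and hand them back to the customers who want them. The following is only a rough sketch of that idea, not a tool we ran: the database path is a placeholder, and the query assumes Roundcube's usual contacts table layout.

    # Rough sketch only: dump names and addresses from a Roundcube SQLite database.
    # DB_PATH is a placeholder, and the column names assume Roundcube's standard
    # contacts table; verify against the actual file before trusting the output.
    import sqlite3

    DB_PATH = "roundcube.sqlite"  # placeholder, not an actual server path

    conn = sqlite3.connect(DB_PATH)
    for name, email in conn.execute("SELECT name, email FROM contacts WHERE del = 0"):
        print(f"{name} <{email}>")
    conn.close()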

  • Autoresponders ported poorly.

Autoresponders ported over in such an odd way that they returned errors rather than the intended messages. Though that didn't actually stop the flow of email, it was handled poorly and could have been better prepared for.

  • No redirects for services like webmail and cPanel.

We could have redirected ports 2083 and 2096 for cPanel and webmail, and we would likely do that if we were to repeat these events.
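
To make that concrete, here's a minimal sketch of the sort of redirect we have in mind, assuming the new panel lives at an ordinary HTTPS URL. The target address here is made up, and a real deployment would need to terminate TLS on these ports, since browsers expect HTTPS on 2083 and 2096.

    # Minimal sketch: answer requests on an old cPanel port with a redirect to the
    # new panel URL. TARGET is a made-up example; in production this would need to
    # sit behind TLS, because clients expect HTTPS on ports 2083/2096.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    TARGET = "https://panel.example.com"  # hypothetical DirectAdmin URL

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            self.send_response(301)
            self.send_header("Location", TARGET)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 2083), RedirectHandler).serve_forever()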

So what now?

Now we re-evaluate where we stand. We dropped some overhead by getting rid of a cPanel server. That may be sufficient, but there are at least two more servers that could receive the same treatment (Acadia and Aus). If we decide to continue down this path, we now know from talking with our customers what went wrong, and we should be able to mitigate those issues in any future work. If we can't mitigate them reasonably, we won't proceed with any more migrations of this type.

I don't know what we'll do next, as I still need to evaluate how to mitigate each of the issues that occurred. I apologize sincerely, and I hope that I've made it up to everyone by offering a free 100GB upgrade (even to $5/year and lifetime 10GB customers) to everyone on the London server, as well as credits and/or refunds to those who suffered the worst of the issues. Whatever happens next, I know that we'll be better equipped for it, and if it involves changes, customers will be better informed about them.