Downtime-free Drupal Migration
In January we migrated a Drupal site that routinely sees 40k+ hits per day. We moved the site from servers in the Pacific Northwest to a datacenter in Virginia. As if that wasn't enough, we also switched the web servers from Apache to Nginx. But what makes this remarkable to me is that we managed to pull this off without so much as a minute of downtime. This blog post explains how we did it (and it uses lots of pretty diagrams, too!). <!--break-->
Step 1: The Initial Configuration
Let's get one of those diagrams up here right away. This is what our network looked like before we began the migration.
The image above shows a simplified network diagram. On the right side are the old servers, which we were decommissioning. On the left side (in the shaded box) are the new servers. We had already preconfigured these servers, run load tests, and done all of the usual pre-launch testing. In short, they were configured and ready to go.
However, before we could push the new servers onto the frontline, we needed to do one final synchronization from the old (but live) databases. Furthermore, we had to redirect the DNS entries, which meant dealing with the great unknown of DNS caching. How long would it take before all of our site visitors were redirected from the old IP addresses to the new ones? We had to plan for the slow ones.
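One easy way to keep an eye on this, both before and after the switch, is simply to ask a resolver what it is currently handing out for the record and how long the cached answer has left to live. This is just an illustration; the hostname and resolver below are placeholders, not our actual records:

```
# Ask a public resolver which address it currently returns for the site;
# the second column of the answer is the remaining TTL in seconds.
# The hostname and resolver address are placeholders.
dig @8.8.8.8 +noall +answer www.example.com A

# After the DNS change, watch the answer flip from the old address to
# the new one as cached records expire.
watch -n 60 'dig @8.8.8.8 +short www.example.com A'
```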
To accommodate these two factors, we devised a strategy that involved a few intermediary steps in the migration.
The process looked like this:
- Configure the new servers to reverse-proxy Drupal requests to the old servers
- Switch DNS from the old load balancer to the new one
- Put the site in read-only mode just long enough to transfer the database
- Do a final database load on the new DB server
In what follows, I explain how we accomplished this process in practice.
Step 2: Change the Front Line
To follow the process above, we put our servers in a sort of intermediate configuration. Here is what our network looked like during this intermediate step:
While we had to begin the four steps above in that order, most of them could overlap. For example, we didn't need to wait for DNS to propagate before continuing the migration: users still receiving the old DNS record would reach the old load balancer (on the right), while everyone else was directed straight to the new load balancer and the new web servers.
Yes, this proxy-pass configuration introduced some inefficiency, but it was a cost we could easily afford in exchange for zero downtime. Configuring the proxy pass was almost trivially easy with Nginx, which is well suited to acting as a reverse proxy. And since static assets like images, JavaScript, and CSS were already in place on the new servers, Nginx could serve those locally; only dynamic requests had to pass through the proxy to the old site.
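To give a feel for what that looked like, here is a minimal sketch of the kind of configuration involved. All of the file names, paths, and addresses below are placeholders for illustration, not our production settings:

```
# Hypothetical sketch: serve static assets from local disk, proxy everything
# else (the dynamic Drupal pages) back to the old site. All names and
# addresses are placeholders.
cat > /etc/nginx/conf.d/migration-proxy.conf <<'EOF'
upstream old_site {
    server 203.0.113.10;        # old web server (placeholder address)
}

server {
    listen 80;
    server_name example.com;

    # Static assets are already synced to the new servers,
    # so serve them locally.
    location ~* \.(css|js|png|jpe?g|gif|ico)$ {
        root /var/www/drupal;
        expires 7d;
    }

    # Dynamic requests are passed through to the old site for now.
    location / {
        proxy_pass http://old_site;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
EOF

nginx -t && nginx -s reload    # validate, then reload the configuration
```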
We did have to truncate session tables and remove permissions for login during the migration. Drupal would not have reacted well had we not done that. But most of our content does not require authentication, and our users were notified that for a brief window they would not be able to access their accounts, the forums, and other authenticated services. As far as we know, nobody experienced more than momentary inconvenience in this regard. They were really only blocked during the database dump and load.
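For the curious, the database side of that window amounted to something like the following. This is a rough sketch rather than our exact procedure: the hosts, credentials, and database name are placeholders, and it assumes an unprefixed Drupal sessions table.

```
# Clear active sessions on the live (old) database so nobody is mid-login
# during the move. Host, user, and database names are placeholders.
mysql -h old-db.internal -u drupal -p drupal -e 'TRUNCATE sessions;'

# Dump the live database from the old server...
mysqldump -h old-db.internal -u drupal -p --single-transaction drupal \
    | gzip > drupal-final.sql.gz

# ...and load it onto the new database server.
gunzip -c drupal-final.sql.gz | mysql -h new-db.internal -u drupal -p drupal
```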
Step 3: Switch to the New Database
Finally, once the database had been loaded onto the new server, we could begin turning off our temporary layer, gradually moving to the "new normal" configuration:
The new Nginx web servers took over their intended role, running Drupal locally. Authentication was immediately re-enabled for users, and the proxy-pass layer was retired.
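In Nginx terms, that amounted to swapping the temporary proxy configuration for an ordinary local Drupal server block. Again, this is only a sketch, under the assumption of PHP-FPM on a local socket; the paths and file names are invented:

```
# Replace the temporary proxy config with a normal local Drupal server
# block (placeholder paths; assumes PHP-FPM listening on a local socket).
cat > /etc/nginx/conf.d/drupal.conf <<'EOF'
server {
    listen 80;
    server_name example.com;
    root /var/www/drupal;
    index index.php;

    location / {
        # Clean URLs: fall back to Drupal's front controller.
        try_files $uri /index.php?q=$uri&$args;
    }

    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_pass unix:/var/run/php-fpm.sock;
    }
}
EOF

rm /etc/nginx/conf.d/migration-proxy.conf    # retire the proxy-pass layer
nginx -t && nginx -s reload
```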
We left the old site in read-only mode for the next few days, making sure that the last of the old users could still reach the site even when their DNS providers were slow to update. Of course, we displayed a big banner at the top of every page notifying them that they were seeing an old version. As it turned out, almost all users saw the new DNS records within an hour of the switch.
Once we were in the clear, we decommissioned the old servers.
The Bottom Line
We managed to do this entire migration without any downtime. Most of our users never noticed the changes, though a few got kicked out of the chat room and had to reconnect. (They were warned ahead of time, of course.) All in all, the process went about as flawlessly as a migration can. We have been on the new architecture for a month and have had to make only minor fixes and changes.
Would we recommend this or a similar strategy for others in the same position? Yes. It was definitely an expedient way to migrate. But we would also offer this advice: test as much as you possibly can before doing the migration. We ran close to three dozen tests on our new configuration before we switched (granted, most of these were due to our change from Apache to Nginx), and I feel confident that those tests are what made the migration go as smoothly as it did.
And, of course, the devil is in the details. Much of our migration strategy (including the minutiae I haven't explained above) was devised based on our knowledge of how visitors used the site and how our site worked under the hood. For a highly interactive community, it might have made more sense to mirror the databases for a brief period of time, rather than to do the rather abrupt cut-over that we did. But for us, the cut-over worked just fine.
Tons of thanks to warmnoise, my partner in crime on this project, who came up with most (maybe even all) of the good ideas.