In my last post (Part 5), I explained how we set up a Symmetrix Remote Data Facility (SRDF) bridge between our old and new datacenters that would allow us to use Storage VMotion to transfer VMs and data to our new Private Cloud datacenter. It worked very well. We could move VMs and data pretty effectively. However, setting them up and getting them to run an application was more of a challenge. We had to roll back one of the first three applications that we tried to migrate; the other two took us a long time to troubleshoot and configure.
The solution to minimize risk and downtime seemed obvious to me. It was just like a technology refresh in the physical world: build a new environment with all new components and test it. Once all the bugs were worked out, you could sync the data and cut over. Why did I need to move a VM when making another one was just as easy and would provide an opportunity to configure and test it?
Debating a New Approach
During the first 90 days (blog 3) we did exactly that for the foundation applications and it worked great. We needed core IT applications like Domain Name System (DNS), Active Directory, and Dynamic Host Configuration Protocol (DHCP) to be running before we moved any business applications. We couldn’t move them from the legacy data center without breaking everything running there. Therefore, we met with the app teams for those core applications, took an order for servers and spun them up. The support teams took the opportunity to upgrade their servers' vCPU, vRAM and OS. Once the servers were online, the application teams configured and tested them. Some were easy, but some weren’t, and those we needed to rebuild. In a little more than a month, all of those foundation applications were running on all new VMs.
The results made me wonder why we didn’t just build everything new. Nearly everybody on the team thought I was crazy. Some folks thought it would be too much work or wasn’t best practice, and others thought it didn’t leverage or showcase the new technology.
The debate raged on while we were working on the next move events. Those events included some physical servers, and there we had no choice. We weren’t using a truck to move any servers, so we had to build new VMs to replace the physical servers. Just like we did for the foundation applications, we met with the application teams to take their orders. And just like the foundation applications, the support teams took the opportunity to scale up, scale out, and upgrade. As soon as we handed over their new machines, they began configuring them, weeks before the migration was planned.
The next migration went off as expected. There were issues and problems with firewalls, DNS and application configuration. We stayed up all night Friday and all night Saturday, completing the last application Sunday afternoon. It was brutal; the team was exhausted and stressed out. While we didn’t have to roll any of them back, production was down that entire time. They weren’t mission-critical applications, so there wasn’t a significant impact to the business.
I didn’t let up. We needed to find a better way. I had program managed several mission-critical applications. There was no way those applications could be down all weekend without serious impact to our business.
Opting to Clone VMs
I was constantly probing the pros and cons. Building a VM took work. Installing the application took work. If we created new hostnames, the application teams might have significant work to do to reconfigure their application. We needed to determine some ways to minimize the impact.
Rather than building a new VM and installing the application code, we could clone the existing VM. If the VM was running Linux, it wouldn’t automatically update DNS the way a Windows machine would. Therefore, we could have two VMs with the same name and then cut over the hostname's DNS entry during the migration.
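A minimal sketch of how that DNS cutover could be checked, assuming the legacy VM and its clone sit at two known addresses. The hostname, the addresses, and the `cutover_state` helper are all hypothetical and not from our actual tooling. The resolver is passed in as a parameter so the check can be tried without touching real DNS; by default it falls back to Python's `socket.gethostbyname`.

```python
import socket
from typing import Callable


def cutover_state(hostname: str, legacy_ip: str, new_ip: str,
                  resolve: Callable[[str], str] = socket.gethostbyname) -> str:
    """Report which site a hostname currently resolves to."""
    try:
        address = resolve(hostname)
    except OSError:
        # socket.gaierror (name lookup failure) is a subclass of OSError.
        return "unresolved"
    if address == new_ip:
        return "new"
    if address == legacy_ip:
        return "legacy"
    return "unknown"


# Hypothetical record: before cutover the name still points at the legacy VM.
fake_dns = {"app01.example.com": "10.1.0.15"}
print(cutover_state("app01.example.com", "10.1.0.15", "10.2.0.15",
                    fake_dns.__getitem__))

# After the DNS entry is flipped, the same name points at the clone.
fake_dns["app01.example.com"] = "10.2.0.15"
print(cutover_state("app01.example.com", "10.1.0.15", "10.2.0.15",
                    fake_dns.__getitem__))
```

Running a check like this from a few different clients during the migration window would show whether cached DNS answers were still sending traffic to the legacy machine.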
There would be significant benefits as well. It would give the application teams an opportunity to pre-configure and run performance and disaster recovery testing. We could also practice the migration and develop detailed plans of each step we needed to do, down to the minute: taking an export of the database, copying it to the migration SRDF bridge, splitting the link, and importing it to its new virtual home. And starting up the new instance of the application was much more effective at finding unknown configuration points than sitting in a conference room brainstorming about what we thought might happen.
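To illustrate what a down-to-the-minute plan might look like, here is a small sketch that turns the steps named above into a timed schedule. The step durations, the start time, and the `Step` and `build_schedule` names are illustrative assumptions, not figures from our actual migrations; in practice the durations would come from timing the practice runs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Step:
    name: str
    minutes: int  # estimated duration; the values below are placeholders


def build_schedule(start: datetime, steps: list[Step]):
    """Lay the steps end to end and return (begin, end, name) tuples."""
    schedule = []
    cursor = start
    for step in steps:
        end = cursor + timedelta(minutes=step.minutes)
        schedule.append((cursor, end, step.name))
        cursor = end
    return schedule


steps = [
    Step("Export the database", 45),
    Step("Copy the export to the SRDF bridge", 30),
    Step("Split the SRDF link", 10),
    Step("Import the database in the new datacenter", 60),
]

# Arbitrary example start time for the migration window.
for begin, end, name in build_schedule(datetime(2000, 1, 1, 20, 0), steps):
    print(f"{begin:%H:%M}-{end:%H:%M}  {name}")
```

Comparing the planned times against what each rehearsal actually took is one way to tighten the estimates before the production event.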
We decided to compromise by pressing on with Storage VMotion over SRDF for the less critical applications and giving the mission-critical teams the ability to choose how they wanted to be migrated. I strongly lobbied them all to move to a whole new environment.
Turns out nearly all of those teams selected a new environment. They didn’t want to be stuck on a bridge line or in a conference room all weekend either. They also took the opportunity to upgrade and scale their environments. We upgraded the OS, vCPU, and vRAM, added more servers, and upgraded database versions. They were energized to fix issues that had been causing them pain for years. There was still a lot of work, and there were still issues that we had to fight through, but it wasn’t in the middle of the night. Most importantly, production was still up and running on the original machines.
Less Stress, Less Downtime
It was significantly less stressful for everyone involved. This approach spread the work out over a month or two rather than jamming everything into one weekend. We had a target date for each app. We knew if the testing and build weren’t complete, we could just slide the date a bit.
It was an incredibly busy time. Data and VMs were flying up and down the SRDF bridge. We were building/cloning VMs for this app, testing DR for that app, working on configuring another app and cutting over a non-production environment for another app. Of course we encountered some technical issues. A few application teams asked to upgrade their OS but couldn’t make it work so we had to revert and rebuild the servers a couple of times until we could get it running.
Once an application team had set up and migrated all of their non-production environments and had practiced migrating production a couple of times, the production migration event was nearly a non-issue. Flip DNS for some servers, change a URL or two, and export and import the database. We were typically done in a few hours. We even ventured to do some migrations during the week.
A few of the applications had active/active web and application tiers (meaning they were active in both locations). We activated the Durham tiers during peak production usage so we could monitor in real time whether the latency between the application tier and database servers caused an issue. For several weeks, we had one of our most mission-critical applications running active/active/active on the application and web tier. It was awesome: zero seconds of downtime to bring up the Durham instance, zero seconds of downtime to drop the legacy data center servers from the cluster.
Not every application team opted for the new parallel environment approach. One app team felt it would be too much work. That migration ran into the buzz saw of unknown configuration points and had a three-day extended downtime. It was ugly. The application team worked feverishly day and night all weekend and continued to have some functional issues for another week. They were cured. For another app they supported, they opted for a parallel environment.
On the less critical applications, we were still using the Storage VMotion over SRDF technique. We were getting much better at the communication and handoffs and had more success. The migration team was gelling and many of the application teams were on their second or third move. We still encountered some issues. In those cases, we either rolled back to try again with a pre-built parallel environment, or just kept at it during the day Saturday, Sunday and maybe into the week. I estimate we were getting 50 percent success within the first 12 hours, 30 percent success between 12 and 24 hours, and 15 percent success between 24 and 48 hours. The remaining 5 percent needed to be rolled back or took another day or two to re-configure.
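Read cumulatively, those estimates show how quickly the Storage VMotion migrations converged. The short sketch below only adds up the percentages quoted above; the variable names are mine.

```python
from itertools import accumulate

# Share of migrations that succeeded in each window (estimates from the text).
windows = [("within 12 hours", 50), ("within 24 hours", 30), ("within 48 hours", 15)]

# Running total of successful migrations by the end of each window.
cumulative = list(accumulate(pct for _, pct in windows))

for (label, _), total in zip(windows, cumulative):
    print(f"{label}: {total}% successful")

# Whatever is left was rolled back or took another day or two.
remaining = 100 - cumulative[-1]
print(f"rolled back or needing more time: {remaining}%")
```

In other words, roughly four out of five of these migrations were done within a day, and about one in twenty needed a rollback or extra days of reconfiguration.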
Each time we had an issue configuring an environment while trying to migrate it on the fly, it further validated that opting for a parallel pre-built, pre-configured environment was the best method to transform and migrate an application across the country or across the room.
In my upcoming blogs, I will discuss how we planned and upgraded capacity and decommissioned the legacy data center.