When you sit down with consultants, or listen at any AWS conference, you’ll hear that there is a right way and a wrong way to migrate. They’ll explain that it is vital to refactor your applications before moving them up to the cloud. They’ll tell you that trying to move your applications without changing them is futile, and that you’ll be setting yourself up to fail. They’ll tell you that it’s messy and drastically more difficult, nigh impossible, to migrate and then refactor. They’ll exclaim that the costs incurred by running your current platform as it stands are prohibitive. They will explain all this, and they mean it, and they are right… kind of.
The Forklift and the Pallets
“Lift and shift” and “shift and lift” are the two most common ways to describe public cloud migration practices. By the sound of them, they seem to describe the same basic thing, just in two different orders. However, this is not entirely the case. At any AWS conference, they will explain that it is necessary to shift (refactor) before you lift (migrate). While it is ideal to proceed in this manner, it isn’t always an option. Fortunately, most public cloud offerings grew out of the concept of hypervisors running virtual machines. This means that they all offer basic virtual servers and networking that can be created and used in much the same way as existing private infrastructure.
Refactoring is recommended before migrating primarily because migrating without it costs more. It’s hard to sell the concept of moving to the cloud when you have to tell clients it will cost them significantly more than they’re currently paying for infrastructure. By changing applications to use vendor offerings such as function as a service, database as a service, and object storage, the costs incurred are dramatically reduced. This level of refactoring requires company-wide buy-in, as in many cases it means completely rewriting applications that weren’t designed to leverage such services. Depending on where the drive to the cloud is coming from, that level of developer involvement may not be an option. In this case, the Infrastructure as a Service option is always there, but make sure your case shows the benefits outweighing the costs.
There are Always Trade-offs
Public cloud offerings are a great way for startups and small companies to get access to infrastructure without having to raise the capital required to purchase or lease the servers and networking components necessary to build it themselves. Moreover, most of the vendors offer the flexibility of paying on a metered or annual basis. This lets you adapt when a product suddenly, or only occasionally, requires a large amount of resources, because you pay only for the time those resources are in use.
While the technological benefits being offered are substantial, they do come at a price. As engineers, most of us expect a certain amount of authority over our environment, and by moving to someone else’s computers we give that up. Managers may think it’s a control issue, but it really comes down to accountability and predictability. We create environments with a level of redundancy and reliability known to us, and then set and manage expectations based on this known quality. When this is handled by a third party, we lose our insight into the workings of our environment, and with it our ability to anticipate failures and performance. We do eventually develop a level of understanding, but never the same level we have with a fully controlled environment. Additionally, our responsibility is not reduced: we are still the owners of the environment, but can no longer resolve issues in the low-level architecture. This leaves us at the mercy of the vendor when fixing or investigating outages.
What we get for the price of losing low-level access is, for all intents and purposes, infinite expandability. That is to say, a lack of computing resources can no longer be considered an issue, since you are no longer bound by the physical servers, storage, and CPU on hand, but have the resources of a company the size of Amazon, Google, or Microsoft at your disposal. Should a situation requiring a large amount of scaling arise, it can be done at a moment’s notice, and with hourly pricing the costs are kept to what is necessary to support the demand. With internal infrastructure, your options are limited: the application can fail (or run unacceptably slowly); more equipment can be purchased to support the high-load periods, but it would be wasted otherwise and might not arrive in time; or you can use a hybrid approach or service (which puts you in public cloud infrastructure anyway).
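To make the hourly-pricing argument concrete, here is a rough sketch in Python. Every number in it is invented for illustration (the demand curve, the cloud price, and the amortized price of owned hardware are all assumptions, not figures from our environment), and it ignores storage, bandwidth, and staffing entirely.

```python
def metered_cost(hourly_demand, price_per_server_hour):
    """Cloud model: pay only for the servers actually running in each hour."""
    return sum(hourly_demand) * price_per_server_hour


def owned_cost(hourly_demand, price_per_server_hour):
    """Private model: own enough servers for the busiest hour, all day long."""
    return max(hourly_demand) * len(hourly_demand) * price_per_server_hour


# Hypothetical day: 4 servers for 21 hours, then a 3-hour spike needing 20.
demand = [4] * 21 + [20] * 3

# Hypothetical prices: owned hardware is assumed cheaper per server-hour.
CLOUD_PRICE = 0.5
OWNED_PRICE = 0.25

print("metered:", metered_cost(demand, CLOUD_PRICE))  # metered: 72.0
print("owned:  ", owned_cost(demand, OWNED_PRICE))    # owned:   120.0
```

With a flat demand curve the comparison flips, since the owned hardware is never idle. That is the same reason an unrefactored migration running around the clock tends to cost more than what it replaces.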
Additional benefits of leveraging public cloud infrastructure are difficult to truly quantify, because one hopes never to find out how beneficial they really are. Case in point: at my current place of employment, my initiative from hire has been to move our operations to “the cloud.” The real benefit showed up in disaster recovery. While our backup system permitted the recreation of our environment fairly readily and quickly, purchasing, installing, and configuring the new equipment necessary to recreate our production environment would have been prohibitive. With near-instantaneous resources at our fingertips, a full recovery from a catastrophic loss has gone from weeks to hours.
Workarounds
One of the most valuable lessons of our migration was the realization that even when you think you have all of the requirements, there may be some that don’t even register. This was the case with our database migration. We have two mid-sized databases (in the 2-4TB range) that we needed to get into the cloud. Because of licensing, we were unable to use database as a service offerings, and had to create virtual machines that we manually configured for the task. What we learned was that despite the block storage being solid state, the IOPS available to servers at the compute and memory size we required were not even remotely high enough. Our findings showed that even though the storage was made up of SSDs with provisioned IOPS, the network limits placed on smaller VMs kept write speeds around 60MB/s, with bursts up to 120MB/s. While this is unnoticeable for the majority of tasks, it didn’t come close to the 300MB/s our database needed to keep up with our applications. The resolution we found was to increase the size of our VM until it qualified for 10Gbps networking, which relieved the bottleneck but presented its own problem. The minimum VM size with 10Gbps networking put us over the CPU core limit of our license, incurring new costs, because our vendor had no way of offering a compromise.
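The sizing check we wish we had run up front can be sketched in a few lines of Python. The VM catalog and the 8-core license limit below are hypothetical stand-ins, not real instance types or our actual license terms. Advertised network bandwidth is also only a ceiling on what network-attached storage can deliver, not a guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSize:
    name: str
    cores: int
    network_gbps: float  # advertised network bandwidth (a ceiling, not a promise)


def gbps_to_mb_per_s(gbps):
    """Convert gigabits per second to megabytes per second (decimal units)."""
    return gbps * 1000 / 8


def smallest_fit(sizes, required_mb_per_s, max_cores=None):
    """Return the smallest VM (by cores) whose network ceiling covers the
    required throughput, optionally staying within a licensed core count.
    Returns None when no size satisfies both constraints."""
    for size in sorted(sizes, key=lambda s: s.cores):
        if max_cores is not None and size.cores > max_cores:
            continue
        if gbps_to_mb_per_s(size.network_gbps) >= required_mb_per_s:
            return size
    return None


# Hypothetical catalog, invented for illustration only.
CATALOG = [
    VmSize("small", cores=4, network_gbps=0.5),
    VmSize("medium", cores=8, network_gbps=1.0),
    VmSize("large", cores=16, network_gbps=10.0),
]

print(gbps_to_mb_per_s(10))                      # 1250.0
print(smallest_fit(CATALOG, 300))                # the 16-core "large" size
print(smallest_fit(CATALOG, 300, max_cores=8))   # None: the license conflict
```

The last line is our situation in miniature: once the core limit is applied, nothing in the catalog satisfies both the throughput requirement and the license.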
File shares are a vital part of our environment. We use an enterprise storage appliance to make management of the mixed NFS and CIFS environment easier and to take advantage of the active/active failover that it offers. These devices are fairly common in datacenters, and a myriad of vendors provide them. Most of these vendors also offer virtual appliances in your major public cloud of choice. At the time of our migration, neither the appliance we preferred nor any of the others we investigated and trusted offered automatic failover between availability zones within AWS. As such, part of our migration required moving to a manual failover process. What this means: if the primary appliance fails, we have to manually change a CNAME in our DNS configuration and break the mirroring between the devices. This change, though inconvenient and not ideal, is not difficult and propagates quickly. We considered it an acceptable loss when weighed against the gains of being able to expand the appliance as necessary and having it span multiple datacenters.
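For illustration, the DNS half of that manual failover could be scripted roughly as follows. This is a sketch that assumes the zone is hosted in Amazon Route 53, which is an assumption made for the example. The hostnames are hypothetical, and the appliance-side step of breaking the mirror is vendor-specific and not shown.

```python
def cname_failover_change(record_name, standby_target, ttl=60):
    """Build a Route 53 ChangeBatch that repoints a CNAME at the standby.

    The returned dict is the shape accepted by boto3's
    route53 client method change_resource_record_sets(ChangeBatch=...).
    A short TTL keeps the propagation delay small.
    """
    return {
        "Comment": f"Manual failover of {record_name} to {standby_target}",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record or overwrite it
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": standby_target}],
                },
            }
        ],
    }


# Hypothetical names, for illustration only.
batch = cname_failover_change(
    "files.example.internal.", "nas-b.example.internal."
)
print(batch["Changes"][0]["ResourceRecordSet"]["ResourceRecords"])
```

Building the change as plain data keeps it easy to review before anyone applies it, which matters for a step that is only ever run during an outage.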
Unavoidable Caveats
Most of the drawbacks in migrating are tied to the abstraction from the hardware and the reliance upon a vendor. This seems simple enough to accept, because the resources available to these companies are obviously vastly greater than most. But what comes with this sprawling infrastructure is scale, and the thing we often forget to consider when scaling is that doing so doesn’t just increase available resources, it increases risk. The likelihood that something fails grows every time you add more devices to an environment. It’s an unavoidable fact. The more disks you have, the more likely it is that one will fail, and while these companies work to build redundancy into all of their systems, it ultimately comes down to probability. Moving into the infrastructure of a company that has to worry about bit-flipping from cosmic rays in servers that use ECC memory should speak volumes about the probabilities involved. At this level, it’s time to question what all those 9’s really mean.
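The arithmetic behind that is short enough to sketch. Assuming devices fail independently and each has the same chance of failing in a given year (both simplifications, and the 2% rate below is a made-up figure), the chance that at least one device fails is 1 - (1 - p)^n.

```python
def p_any_failure(p_single, n_devices):
    """Probability that at least one of n independent devices fails,
    given each fails with probability p_single over the same period."""
    return 1 - (1 - p_single) ** n_devices


def expected_failures(p_single, n_devices):
    """Average number of failures to expect over the same period."""
    return p_single * n_devices


# Hypothetical 2% annual failure rate per disk, for illustration only.
for n in (1, 10, 100, 1000):
    print(n, round(p_any_failure(0.02, n), 4), expected_failures(0.02, n))
```

At a single disk a failure is a rare event. At a hundred it is more likely than not, and at a thousand it is a certainty you plan around. A vendor running millions of devices is replacing hardware constantly, and some of that hardware will be yours.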
Again, this is something to consider within the context of a migration. Despite a claimed 99.999% availability on our block storage volumes, we have had several fail on us. This occurred on our most critical system, the database, on multiple occasions, at one point only a couple of weeks apart. In a private environment, this would be unheard of, and if it did happen it would indicate a malfunctioning piece of equipment that the manufacturer would replace. In the cloud, failures aren’t actually unusual, though we were told it is unusual for them to happen multiple times to a single company. Again, this comes down to probability. A company with that sheer size of infrastructure will have failures on a much more regular basis than a small private cloud. It’s simply the nature of reality.
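It helps to translate the nines into time. The sketch below treats an availability percentage as a simple uptime fraction over a 365-day year. That is an assumption: vendors define and measure these figures in their own ways, so read the actual service terms before relying on the result.

```python
def allowed_downtime_minutes(availability_percent, period_hours=365 * 24):
    """Minutes of downtime per period that still meet the stated availability."""
    return (1 - availability_percent / 100) * period_hours * 60


for nines in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(nines)
    print(f"{nines}% -> {minutes:.1f} minutes per year")
```

Five nines works out to a little over five minutes a year, averaged across everything the figure covers. It says nothing about whether your particular volume is one of the ones that fails.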
Once again, assessment and mitigation are key. In our particular case, we run redundant failover databases, which meant that each time we had to flip our primary and rebuild a secondary. The interference this causes in our environment is minimal, and as such is acceptable in comparison to the cost increase required to configure redundancy within the database or operating system. This would not be the case in all implementations, where the failures could instead be easily mitigated using operating system tools such as logical volume management in Linux.
Is it Worth Moving?
Shifting prior to lifting, though preferred, is not always necessary because of the nature of the cloud. The benefits of doing so are substantial, as they result in dramatic infrastructure cost savings, increased reliability, and minimal administration, but they require full buy-in from all sides and incur the additional cost of completely rewriting most if not all applications. Simply migrating to the cloud to leverage infrastructure as a service allows instant and near-infinite scaling, low-cost entry into multi-datacenter redundancy, and the ability to scale periodically during high-traffic or resource-intensive times. It’s important to weigh the costs against the benefits across the board and be ready to solve unexpected problems.
At the end of the day, infrastructure teams need to be able to work with whatever is given to them and make sure they have the ability to configure their environment to fit the needs of the teams they work for. Whether in the public cloud or on private servers, the priority should be making sure the business requirements are met. Public cloud infrastructure offers resources and an agility difficult to attain on private infrastructure, but the driving force should always be the direction of the business.