Get to the Cloud!!

When you sit down with consultants, or if you listen at any AWS conference, they’ll state that there is a right way and a wrong way to migrate. They’ll explain that it is vital to refactor your applications before moving them up to the cloud. They’ll tell you that trying to move your applications without changing them is futile, and that you’ll be setting yourself up to fail. They’ll tell you that it’s messy and drastically more difficult, nigh impossible, to migrate and then refactor. They’ll exclaim that the costs incurred by running your current platform as it stands are prohibitive. They will explain all this, and they mean it, and they are right… kind of.

The Forklift and the Pallets

“Lift and shift” and “shift and lift” are the two most common ways to describe public cloud migration practices. By the sound of them, it seems they describe the same basic thing, just in two different orders. However, this is not entirely the case. At any AWS conference, they will explain that it is necessary to shift (refactor) before you lift (migrate). While it is ideal to proceed in this manner, it isn’t always an option. Fortunately, most of the public cloud offerings are expanded from the concept of hypervisors running virtual machines. This means that they all offer basic virtual servers and networking that can be created and used in much the same way as existing private infrastructure.

The reason refactoring is recommended before migrating comes down primarily to cost. It’s hard to sell the concept of moving to the cloud when you have to tell clients it will cost them significantly more than they’re currently paying for infrastructure. By changing applications to use vendor offerings such as function as a service, database as a service, object storage, etc., the costs incurred are dramatically reduced. This level of refactoring requires company-wide buy-in, as in many cases it means completely rewriting applications that weren’t already designed to leverage such services. Depending on where the drive to the cloud is coming from, that level of developer involvement may not be an option. In that case, the infrastructure as a service option is always there, but make sure your story weighs the benefits against the costs.

There are Always Trade-offs

The public cloud offerings are a great way for startups and small companies to have access to infrastructure without having to raise the capital required to purchase or lease the servers and networking components necessary to build it themselves. Moreover, most of the vendors offer the flexibility of paying on a metered or annual basis. This allows for adaptation when a product suddenly or only occasionally requires a large amount of resources, because costs are incurred only for the time the extra capacity is actually needed.

While the technological benefits being offered are substantial, they do come at a price. As engineers, most of us expect a certain amount of authority over our environment, and by moving to someone else’s computers we give that up. Managers may think it’s a control issue, but it really comes down to accountability and predictability. We create environments with a level of redundancy and reliability known to us, and then set and manage expectations based on this known quality. When this is handled by a third party, we lose our insight into the functionality of our environment, and with it our ability to anticipate failures and performance. We do eventually develop a level of understanding, but never the same level we have with a fully controlled environment. Additionally, our responsibility is not reduced, as we are still the owners of the environment, but we can no longer resolve issues at the low-level architecture. This leaves us at the mercy of the vendor when fixing or investigating outages.

What we get for the price of losing low-level access is, for all intents and purposes, infinite expandability. That is to say, a lack of computing resources can no longer be considered an issue, since you are no longer bound by the physical servers, storage, and CPU on hand, but have the resources of a company the size of Amazon, Google, Microsoft, et al at your disposal. Should a situation requiring a large amount of scaling arise, it can be handled at a moment’s notice, and with hourly pricing the costs are kept to what is necessary to support the demand. With internal infrastructure, your options are limited: the application can fail (or run unacceptably slowly); more equipment can be purchased to support the high-load periods, but it would be wasted otherwise and might not arrive in time; or you can use a hybrid approach or service (which puts you in public cloud infrastructure anyway).

Additional benefits to leveraging public cloud infrastructure are difficult to truly quantify, because one hopes to never find out how beneficial it really is. Case in point: at my current place of employment, my initiative from hire has been to move our operations to “the cloud.” The real benefit was explained within the concept of disaster recovery. While our backup system permitted the recreation of our environment fairly readily and quickly, the need to purchase, install, and configure the new equipment necessary to recreate our production environment would have been prohibitive. By having near-instantaneous resources at our fingertips, a full recovery from a catastrophic loss has gone from weeks to hours.

Workarounds

One of the most valuable lessons learned from our migration was the realization that even when you think you have all of the requirements, there may be some that don’t even register. This was the case with our database migration. We have two mid-sized databases (in the 2-4TB range) that we needed to get into the cloud. Because of licensing, we were unable to utilize database as a service offerings, and had to create virtual machines that we manually configured for the task. What we learned was that despite the block storage being solid state, the storage throughput available to servers at the compute and memory level we required was not even remotely high enough. Our findings showed that even though the storage was made up of SSDs with provisioned IOPS, the network limits on smaller VMs kept write speeds around 60MB/s with bursts up to 120MB/s. While this is often unnoticeable on a majority of tasks, it didn’t even come close to the 300MB/s our database needed to keep up with our applications. The resolution we discovered was to increase the size of our VM until it qualified for 10Gbps networking, which relieved the bottleneck but presented its own problem: the minimum VM size required for 10Gbps networking put us over the CPU core limit of our database license, incurring new licensing costs, because our vendor had no way of offering a compromise.
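
For a sense of the gap there: 10Gbps of network bandwidth works out to roughly 1,250MB/s of theoretical throughput (10,000 megabits per second divided by 8 bits per byte), comfortably above the 300MB/s the database needed, whereas the smaller instance sizes capped us at a fraction of that.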

File shares are a vital part of our environment. We utilize an enterprise storage appliance to make management of the mixed NFS and CIFS environment easier and to take advantage of the active/active failover that it offers. These devices are fairly common in datacenters and there are a myriad of vendors that provide them. Most of these vendors also offer virtual appliances available in your major public cloud of choice. At the time of our migration, neither the appliance we preferred nor any of the others we investigated and trusted offered automatic failover between availability zones within AWS. As such, part of our migration required moving over to a manual failover process. What this means: if the primary appliance fails, we have to manually change a CNAME in our DNS configuration and break a mirroring protocol between devices. This change, though inconvenient and not ideal, is not difficult and propagates quickly, and it was considered an acceptable loss given the gains of being able to expand the storage as necessary and have it span multiple datacenters.

Unavoidable Caveats

Most of the drawbacks in migrating are tied to the abstraction from the hardware and the reliance upon a vendor. This seems simple enough, because obviously the resources available to these companies are vastly greater than most of us will ever have. What comes with this sprawling infrastructure is scale, and the thing we often forget to consider when scaling is that doing so doesn’t just increase available resources, it increases risk. The likelihood of a failure somewhere increases every time you add more devices to an environment. It’s an unavoidable fact. The more disks you have, the more likely it is that one will fail, and while these companies work to build redundancy into all of their systems, it ultimately comes down to probability. Moving into the infrastructure of a company that has to worry about bit-flipping from cosmic rays in servers that utilize ECC memory should speak volumes about the rates of probability. At this level, it’s time to question what all those 9’s really mean.

Again, this is something to consider within the constructs of a migration. Despite claims of 99.999% availability on our block storage volumes, we have had several fail on us. This occurred on our most critical of systems, the database, on multiple occasions, at one point only a couple of weeks apart. In a private environment this would be unheard of, and if it did happen it would be indicative of a malfunctioning piece of equipment that would be replaced by the manufacturer. In the cloud, failures aren’t actually unusual, though we were told it is unusual for them to happen multiple times to a single company. Again, this comes down to probability. A company with that sheer size of infrastructure will have failures on a much more regular basis than a small private cloud. It’s simply the nature of reality.

Once again, assessment and mitigation are key. In our particular case, we run redundant failover databases, which meant that when a volume failed we flipped to the secondary and rebuilt a new standby. The interference this causes in our environment is minimal, and as such it is acceptable in comparison to the cost increase required to configure redundancy within the database or operating system. This would not be the case in all implementations, but the risk could also be mitigated using operating system tools such as logical volume management within Linux.

Is it Worth Moving?

Shifting prior to lifting, though preferred, is not always necessary because of the nature of the cloud. The benefits of doing so are substantial, as it results in dramatic infrastructure cost savings, increased reliability, and minimal administration, but it requires full buy-in from all sides and incurs the additional cost of completely rewriting most if not all applications. Simply migrating to the cloud to leverage infrastructure as a service allows instant and near-infinite scaling, low-cost entry into multi-datacenter redundancy, and the ability to scale up during high-traffic or resource-intensive times. It’s important to weigh the cost-to-benefit ratio across the board and be ready to solve unexpected problems.

At the end of the day, infrastructure teams need to be able to work with whatever is given to them and make sure they have the ability to configure their environment to fit the needs of the teams they work for. Whether in the public cloud or on private servers, the priority should be making sure the business requirements are met. Public cloud infrastructure offers resources and an agility difficult to attain on private infrastructure, but the driving force should always be the direction of the business.

AWS Outage: We Need to Talk About These Nines

I walked out of a meeting, preparing to go to lunch. One of the guys on the database team grabs me as I pass and informs me that his AWS (Amazon Web Services) permissions are broken. He’s unable to see any of his S3 buckets. I walk to my workstation, sit down, log into the console and find that none of our S3 buckets seem to exist. First things first: let my boss, the director of technology, know, then run to the development directors to inform them. Grabbing my laptop, it’s back to the conference room with my director and a fellow sysadmin. Amazon’s status site asserted that everything was actually okay. Operations is doing what they do best – scrambling. Email notifications are going out to the technology department, the sales department, and the customer management and support teams. Two of the team members from Operations and Help Desk were at an AWS conference. They chime in on the email threads, digitally chortling about how the presenters had just finished explaining that eleven nines of availability meant the S3 service would only go down once every 10 million years before their presentation ground to a halt, because S3 was… unavailable.

This was a small outage for us. But an outage is an outage and they happen. Ultimately, they’re unavoidable because nothing is flawless. Mistakes will always happen eventually and Murphy has a law out there that’s still on the books. Within a few hours, the services were back up and our products recovered. It was a shock when it happened, but once we got our bearings, there was nothing we could do but accept it and wait. Notifications were out, and so it was time to monitor and send new ones when things were back up. This is both the benefit and burden of relying on someone else’s infrastructure. The aftermath seen in the headlines the next day is where it gets interesting. The tech world was awash in scolding sentiments about redundancy and proper architecture. There was a considerable amount of finger waving and condescension exclaiming that all those companies that suffered outages should have used multiple providers or at the very least multiple regions. But in all fairness, that’s not what these cloud service companies sell us.

Everyone in the industry knows that buzzwords are just that, words. Those of us in the trenches hear them day in and day out. They’re sometimes what gets a company to buy into a brilliant new project, and other times the thing that gets a company to push a futile, terrible and frustrating new project. The latest thing, for now, is to brag about the 9’s of uptime. Five-nines has become a misused claim so pervasive that very few people consider what it actually means anymore. Truly offering that level of uptime would mean that something is only unavailable a total of 5.26 minutes every year. At eleven nines, a five-hour outage on S3 would only be within budget if the service didn’t have another outage for about 57.04 million years. I don’t think there’s anyone who doesn’t realize this is hyperbolic and that the vendors are really just trying to express an extremely high level of confidence (read: hubris) in their product.
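
If you want to check that math rather than take my word for it, a quick back-of-the-envelope calculation from a shell (using a 365.25-day year) lands on the same numbers:

awk 'BEGIN { printf "five nines: %.2f minutes of downtime allowed per year\n", (1 - 0.99999) * 365.25 * 24 * 60 }'
awk 'BEGIN { printf "eleven nines: %.2f million years to absorb one 5-hour outage\n", 5 / 1e-11 / (365.25 * 24) / 1e6 }'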

Technologists who work in an enterprise environment understand how important high availability is and the consequences of not configuring the proper level of redundancy. Those of us who have made the shift to a public cloud infrastructure are cautious about how readily we believe the claims these providers make. Every one of us has been burned at some point or another, but we try to recognize why it happens and make sure we find a way to correct it or work around it. While we may balk at their exaggerations, at some point you have to trust your vendor. Companies aren’t in the habit of purchasing an EMC and then going out and buying a 3Par as a backup when getting new SAN storage, so why shouldn’t the same mindset apply to infrastructure as a service?

While I understand that companies like Netflix invest large sums of money into building suites of applications specifically made to cause outages so they can work to ensure their reliability, I also understand that not every company can afford the time, effort or assets to do this. The promise of services such as AWS, Azure, Google Cloud and DigitalOcean is that companies can have access to the necessities of technological infrastructure in order to create, grow and innovate without the unattainable initial capital required to do so just ten years ago. With these services comes a certain expectation from their customer base that is in no way unwarranted. Amazon themselves got burned by this outage, as their status page was incorrect because it relied on the very services that went down. Most of the people who work with these things already knew they were relying on something that could fail and were taking a calculated risk in doing so, and those who didn’t, know now. Unfortunately, mitigating those risks often takes a considerable effort by several teams, and the driving forces behind it don’t always have the ability to allocate the necessary resources outside of their central team. Moments like this can sometimes be leveraged as proof of value for spending those resources, and those of us who lead these public infrastructure migrations try to do just that.

However, I question the validity of the allegations that customers should “know better.” These services are boasted as being unrealistically reliable, and while reason dictates that they can’t actually live up to their declarations, it shouldn’t be as far a departure as it is. If a company states it has x-nines of reliability, it is asking to be relied upon. And when it’s a company like Amazon, an established technological powerhouse which not only embraces its position as the leader in public cloud offerings but redefined the market, those assertions should come with a level of accountability and expectation. We have to trust our vendors, so maybe it’s time they assess their claims and make them a bit more realistic.

Virtualization Cluster With CentOS 7

This article is an overview of the steps taken to create a small virtualization cluster built for fulfilling personal infrastructure requirements like file sharing, syncing, and trying new applications. Reasons for creating a cluster instead of a single server are to eliminate a single point of failure, and to allow hardware maintenance without interrupting services. The goal was to have something resilient that is easily maintained and simple to operate.

The Hardware

I’m a hardware guy. It’s not everyone’s cup of tea, but I’m the kind of guy that remembers the specs of every system he’s ever had. As such, this might be a section that can be skimmed by those who just want the list, or skipped by those who have no interest. Over the years, I’ve become accustomed to having a Windows desktop that I use for the rare things I find necessary and a Linux desktop that I use for everything else. Additionally, I’m an AMD fanboy. Though I am well aware of Intel’s superior performance, both in processing power and energy consumption, I just can’t seem to shake “the feels” I get when buying AMD. That said, when it came time to build, my sensibilities (I’m a middle of the road kind of guy when it comes to performance) and loyalties sent me down the path of purchasing a couple of motherboard/processor combos that included an FX-6300. This served me well for a year, and then a regional computer store had a sale on FX-8320s which came with free motherboards.

I quickly purchased the upgrades and ended up with a couple of combinations lying around. This was ripe for projects, but with the holiday season approaching, using some of the parts to make a Christmas gift became too tempting. Shortly after, I decided that I would consolidate into a single desktop that I would dual-boot, and this left me with an additional computer. Deciding to upgrade the processor in my main desktop (since it would be my only one) to an FX-8370 left me with the two FX-8320/motherboard combinations. For each of these, I purchased 16GB of RAM, two 120GB solid state drives, and two 3TB hard disks. With this equipment, a couple of cheap ATX power supplies, and a quick order of a couple of cheap 2U cases from a popular website, I was ready to go.

The Software

Though Ubuntu will likely always have a soft spot in my heart, being a Linux Systems Administrator in the US, I have a predilection for CentOS. With version 7 being available but still gaining usage in my production environment, this seemed like a solid project for getting familiar with the nuances in the new version, and it would provide a long supported and stable environment for my hypervisor. My method employs installing the base OS with OpenSSH and building from there.

Once the installation was complete, I logged in and proceeded to install packages using Package Groups, a feature of yum, Red Hat’s package manager. These are logical groupings of packages for completing specific tasks. The “Virtualization Host” group contains all the necessary packages for running a minimal hypervisor. The following command initiates the install:

yum groupinstall "Virtualization Host"
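
If you’re curious what a group actually pulls in before committing to it, yum can list the member packages without installing anything:

yum groupinfo "Virtualization Host"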

Once the required packages were installed, I moved on with the preparation and configuration of my cluster.

The Storage

Gluster is the storage solution I decided to go with, as it is versatile and allows for easy migration of stored images across cluster nodes. What is Gluster? It’s a clustered file system. Think of it as a distributed NAS. Storage nodes can be added and configured as mirrors, stripes, or combinations of the two (similar to RAID1, RAID0, and RAID10 respectively). Additionally, multiple nodes can be added to the storage cluster to expand storage without nodal redundancy. For this project, I decided to use a mirrored configuration. This creates a level of redundancy, since both nodes will contain the same data. One of the benefits of Gluster is that the protocol writes to both nodes at the same time, so there is no syncing delay.

An interesting aspect of Gluster is that it uses a configured file system as the basis of its storage. Logical Volume Management (LVM) is my preferred method of configuring Linux storage, and the following is an overview of the configuration I decided to build on the back end. I divided one of the 120GB SSDs into three partitions: 512MB as sda1 for /boot, 18GB as sda2 for a Physical Volume for the OS Volume Group, and the remaining space as sda3 for a Physical Volume for a Fast Storage Volume Group. I then added the second SSD as another Physical Volume for the Fast Storage Volume Group. A Logical Volume (LV) for the root (/) file system was carved out of the OS Volume Group, followed by an LV for VM Storage from the Fast Storage Volume Group. A Data Storage Volume Group was created out of the two 3TB hard disk drives, and a small (64GB) logical volume for ISO Storage as well as a striped logical volume for Data Storage were created from it. The three Gluster-backing logical volumes were formatted with XFS and mounted as /glusterfs/vmstore, /glusterfs/isostore, and /glusterfs/datastore.
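
For those who want a concrete starting point, the back-end setup looked roughly like the following sketch. The device names (/dev/sda for the partitioned SSD, /dev/sdb for the second SSD, /dev/sdc and /dev/sdd for the 3TB disks) and the volume group and logical volume names are illustrative only, and the OS Volume Group and root LV were handled by the installer, so adjust everything to your own layout:

# physical volumes: the remainder of the first SSD, the whole second SSD, and both 3TB disks
pvcreate /dev/sda3 /dev/sdb /dev/sdc /dev/sdd
# fast storage (SSD) and data storage (HDD) volume groups
vgcreate vg_fast /dev/sda3 /dev/sdb
vgcreate vg_data /dev/sdc /dev/sdd
# VM storage takes all of the fast VG; ISO storage gets 64GB and data storage gets the rest
# (striping the ISO volume too keeps the free space even for the final 100%FREE allocation)
lvcreate -n lv_vmstore -l 100%FREE vg_fast
lvcreate -n lv_isostore -i 2 -L 64G vg_data
lvcreate -n lv_datastore -i 2 -l 100%FREE vg_data
# format and mount the Gluster backing stores (add matching fstab entries so they mount at boot)
mkfs.xfs /dev/vg_fast/lv_vmstore
mkfs.xfs /dev/vg_data/lv_isostore
mkfs.xfs /dev/vg_data/lv_datastore
mkdir -p /glusterfs/vmstore /glusterfs/isostore /glusterfs/datastore
mount /dev/vg_fast/lv_vmstore /glusterfs/vmstore
mount /dev/vg_data/lv_isostore /glusterfs/isostore
mount /dev/vg_data/lv_datastore /glusterfs/datastore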

After configuring the storage volumes on the server, it’s time to set up Gluster. The first step is to add the Gluster repo to your package manager. This can be done using the following command:

wget -P /etc/yum.repos.d http://download.gluster.org/pub/gluster/glusterfs/LATEST/RHEL/glusterfs-epel.repo

Once the repo has been added, you can install the server using yum:

yum -y install glusterfs-server

This will need to be completed on both of the servers to facilitate the redundancy that Gluster provides. After installing it on the second server, I proceeded to configure Gluster.

First, I started the Gluster daemon on both servers:

systemctl start glusterd
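
Since these hosts are meant to survive reboots, it’s also worth enabling the daemon so it starts at boot:

systemctl enable glusterd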

Next, from the first server, I established the trusted storage pool by probing the second server:

gluster peer probe server2

So that both peers reference each other by hostname rather than IP address, I ran the reverse probe from the second server:

gluster peer probe server1
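
At this point, a quick check from either node should show the other as a connected peer:

gluster peer status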

We can now create the volumes:

gluster volume create gfs_vmstore replica 2 server1:/glusterfs/vmstore/brick1 server2:/glusterfs/vmstore/brick1
gluster volume create gfs_isostore replica 2 server1:/glusterfs/isostore/brick1 server2:/glusterfs/isostore/brick1
gluster volume create gfs_datastore replica 2 server1:/glusterfs/datastore/brick1 server2:/glusterfs/datastore/brick1
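
One step that’s easy to miss: newly created Gluster volumes have to be started before they can be mounted. Assuming the volume names above:

gluster volume start gfs_vmstore
gluster volume start gfs_isostore
gluster volume start gfs_datastore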

Once the Gluster volumes have been established in the cluster, they will need to be mounted to be used. In this case, the servers are acting as both the clients and the servers, so the mounts will need to be added to the fstab using the following lines:

server1:/gfs_vmstore /data/vmstore glusterfs defaults,_netdev,backupvolfile-server=server2 0 0
server1:/gfs_isostore /data/isostore glusterfs defaults,_netdev,backupvolfile-server=server2 0 0
server1:/gfs_datastore /data/datastore glusterfs defaults,_netdev,backupvolfile-server=server2 0 0
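
The mount points themselves also need to exist before anything can be mounted onto them; assuming the /data paths used in the fstab entries above, create them on both servers:

mkdir -p /data/vmstore /data/isostore /data/datastore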

Once the mounts have been added to the fstab, they need to be mounted:

mount -a
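
A quick way to confirm the Gluster mounts actually came up (the filesystem type should show as fuse.glusterfs):

df -hT /data/vmstore /data/isostore /data/datastore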

Now that the hypervisor has been installed and the storage has been enabled, the server is ready to have the virtualization environment configured. This can be done entirely from the command line, but using graphical tools from a workstation makes creating VMs faster and easier to follow visually.

The Interface

Virtual Machine Manager, or virt-manager, is a graphical tool for the Linux desktop that allows interfacing with libvirtd (the daemon for libvirt, an API for managing a myriad of hypervisors, notably KVM, which was installed in the first part). I use Ubuntu flavors as my main desktop. As such, the command for installing the application on my workstation is as follows:

sudo apt-get install virt-manager
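
As an aside, the same remote connections virt-manager relies on can be tested from the command line with virsh, which is a handy way to confirm SSH access and libvirt are working before blaming the GUI (using the hostnames from above):

virsh -c qemu+ssh://root@server1/system list --all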

Once the application is installed, I was able to start it and configure it to manage my virtualization cluster. The cluster servers need to be added to the interface. This is done by going to the file menu and selecting “Add Connection.” On the dialog box that opens, I left the default under “Hypervisor” (QEMU/KVM), checked “Connect to remote host,” left the method as SSH, left the username as root, and set the hostname to server1. I then did this again using the hostname of server2.

virtmgr-addserv2

Once the hosts are added to the interface, double clicking them will open a dialog for entering the root password you set at install.

virtmgr-baseview

Next, right clicking on the host and selecting “details” brings up a screen for configuring the host. At this point, I configured the storage.

Clicking the “Storage” tab and then the plus (+) button opens a new dialog with a wizard for configuring a storage pool. I selected “File System Directory” and gave it the name “VMStore.”

datastore-vmstore-name

On the next page I set the path of the directory to /data/vmstore.

datastore-vmstore-path

After clicking finish, it is now available in the storage list. I proceeded to configure the rest of the shares (DataStore – /data/datastore and ISOStore – /data/isostore).

Node-storage

Following the same steps on the second host makes the Gluster directories the storage pools on both nodes, which allows migrating VMs between the hosts.

Now that the cluster nodes are configured in Virt-Manager, I am able to proceed with uploading ISO files and provisioning my first virtual machine. To do this, I used SCP to upload the CentOS 7 ISO to the /data/isostore directory on server1.

scp ~/Downloads/CentOS-7-x86_64-Minimal-1503-01.iso root@server1:/data/isostore
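
Because /data/isostore is a replicated Gluster mount, the upload can be verified from the other node right away (assuming root SSH access, as used for the virt-manager connections):

ssh root@server2 ls -lh /data/isostore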

Doing this makes the file accessible from either host. In the interest of observing Gluster functionality, the first VM will be a CentOS 7 server created on server2, using the ISO file that was uploaded to server1. Right click on the host, select “New,” select “Local install media,” and click Forward.

newvm-1

The next screen lets you choose which ISO should be used for the installation.

newvm-2

Clicking “Browse” opens a dialog that looks like the storage tab of the details menu. Selecting the ISOStore pool, it’s apparent that Gluster is functioning, since the CentOS ISO that was uploaded to server1 is available on server2.

newvm-selectiso

Next, the OS type and Version need to be selected. This loads specific configurations for the VM to improve performance.

newvm-3

Set the memory and processor count.

newvm-4

On the storage page, choose “Select or create custom storage.”

newvm-5

Clicking “Manage” brings up the familiar storage dialog. Here, I go to the VMStore pool and click the new volume button (the “+” above the storage volume contents), then name the volume, set the capacity, and click Finish.

newvm-newvolume

Highlight the newly created volume, and click “choose volume.”

newvm-storagevolumesel

Once selected, it takes me back to the New VM wizard, and I click forward to the last screen.

newvm-7

On the final page, I name the VM, and click finish.

newvm-8

This opens a new dialog, and after entering the root password, the console of the new VM comes up. From here, I can begin installing the OS on my new VM.

newvm-console

Proceed with the normal CentOS installation process.