A few months ago we were approached by Zencoder to conduct a sponsored performance comparison of 4 encoding services, including their own. The purpose was to validate their claims of faster encoding times using an independent, credible external source. This was a new frontier for us. Our primary focus has been performance analysis of infrastructure as a service (IaaS). However, we are curious about all things related to cloud and benchmarking, and we felt this could be useful data to make available publicly, so we accepted.
This is a description of the methodology we used for conducting this performance analysis.
Following discussions with Zencoder, we opted to test encoding performance using 4 distinct media types. We were tasked with finding samples for each media type; they were not provided by Zencoder. All source media was stored in AWS S3 in the US East region (the same AWS region each of the 4 encoding services is hosted in). The 4 media types we used for testing are:
In case you are not familiar with the technical jargon and acronyms, EBS is one of two methods provided by AWS for setting up storage volumes (basically cloud hard drives) for an EC2 instance (an EC2 instance is essentially a server). Unlike a traditional hard drive located physically inside a computer, an EBS volume is stored externally on dedicated storage boxes and connected to EC2 instances over a network. The second storage option provided by EC2 is called ephemeral storage, which uses the more traditional method of hard drives located physically inside the same hardware that an EC2 instance runs on. Using EBS is encouraged by AWS and provides some unique benefits not available with ephemeral storage. One such benefit is the ability to recover quickly from a host failure (a host is the hardware that an EC2 instance runs on). If the host fails for an EBS-backed EC2 instance, the instance can quickly be restarted on another host because its storage does not reside on the failed host. By contrast, if the host fails for an ephemeral EC2 instance, that instance and all of the data stored on it are permanently lost. EBS-backed instances can also be shut down temporarily and restarted later, whereas ephemeral instances are deleted when shut down. EBS also theoretically provides better performance and reliability than ephemeral storage.
Other technical terms you may hear and should understand regarding EC2 are virtualization and multi-tenancy. Virtualization allows AWS to run multiple EC2 instances on a single physical host by creating simulated "virtual" hardware environments for each instance. Without virtualization, AWS would have to maintain a 1-to-1 ratio between EC2 instances and physical hardware, and the economics just wouldn't work. Multi-tenancy is a consequence of virtualization: multiple EC2 instances share access to the same physical hardware. Multi-tenancy often causes performance degradation in virtualized environments because instances may need to wait briefly to obtain access to physical resources like CPU, hard disk or network. The term noisy neighbor is often used to describe very busy environments where virtual instances wait for physical resources so frequently that performance noticeably declines.
EC2 is generally a very reliable service; without a strong track record, high-profile websites like Netflix would not use it. We conduct ongoing independent outage monitoring of over 100 cloud services, which shows 3 of the 5 AWS EC2 regions with 100% availability over the past year. In fact, our own EBS-backed EC2 instance in the affected US East region remained online throughout last week's outage.
AWS endorses a different architectural philosophy called designing for failure. In this context, instead of deploying highly redundant and fault-tolerant (and very expensive) "enterprise" hardware, AWS uses low-cost commodity hardware and designs its infrastructure to expect and deal gracefully with failure. AWS deals with failure using replication. For example, each EBS volume is stored on 2 separate storage arrays. In theory, if one storage array fails, its volumes are quickly replaced with the backup copies. This approach provides many of the benefits of enterprise hardware, such as fault tolerance and resiliency, while keeping hardware costs substantially lower, enabling AWS to price its services competitively.
The outage - what went wrong?
Disclaimer: This is our own opinion of what occurred during last week's EC2 outage based on our interpretation of the comments provided on the AWS Service Health Dashboard and basic knowledge of the EC2/EBS architecture.
At about 1 AM PDT on Thursday, April 21st, one of the four availability zones in the AWS US East region experienced a network fault that caused connectivity failures between EC2 instances and EBS. This event triggered a failover sequence wherein EC2 automatically swapped out the EBS volumes that had lost connectivity with backup copies. At the same time, EC2 attempted to create new backup copies of all of the affected EBS volumes (AWS refers to this as "re-mirroring"). While this procedure works fine for a few isolated EBS failures, this event was far more widespread, creating a very high load on the EBS infrastructure and the network that connects it to EC2. To make matters worse, some AWS users likely noticed problems and began attempting to restore their failed or poorly performing EBS volumes on their own. All of this activity appears to have caused a meltdown of the network connecting EC2 to EBS and exhausted the available physical EBS storage in this availability zone. Because EBS performance depends on network latency and throughput to EC2, and because those networks were saturated with activity, EBS performance became severely degraded or, in many cases, failed completely. These issues likely bled into other availability zones in the region as users attempted to recover their services by launching new EBS volumes and EC2 instances there. Overall, a very bad day for AWS and EC2.
In late 2009 we began monitoring the availability of various cloud services. To do so, we partnered or contracted with cloud vendors to let us maintain, monitor and benchmark the services they offered. These include IaaS vendors (i.e. cloud servers, storage, CDNs) such as GoGrid and Rackspace Cloud, and PaaS services such as Microsoft Azure and AppEngine. We use Panopta to provide monitoring, outage confirmation, and availability metric calculation. Panopta provides reliable monitoring metrics using a multi-node outage confirmation process wherein each outage is verified by 4 geographically dispersed monitoring nodes. Additionally, we attempt to manually confirm and document all outages greater than 5 minutes using our vendor contacts or the provider's status page (if available). Outages triggered by scheduled maintenance are removed. DoS ([distributed] denial of service) outages are also removed if the vendor is able to restore service within a short period of time. Any outages triggered by us (e.g. server reboots) are also removed.
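The filtering and availability calculation described above can be sketched roughly as follows. Note that the outage records, field names, and the 15-minute DoS grace period are our own illustrative assumptions, not Panopta's actual data model:

```python
# Hypothetical sketch of the availability calculation described above:
# an outage counts against uptime only if it was confirmed by at least
# 4 monitoring nodes and was not scheduled maintenance, a short-lived
# DoS event, or triggered by us.

MINUTES_PER_YEAR = 365 * 24 * 60

def availability(outages, confirm_nodes=4, dos_grace_minutes=15):
    """Return percent availability over one year after filtering outages."""
    downtime = 0
    for o in outages:
        if o["confirming_nodes"] < confirm_nodes:
            continue  # not independently confirmed by enough nodes
        if o["cause"] in ("maintenance", "self"):
            continue  # scheduled maintenance or triggered by us
        if o["cause"] == "dos" and o["minutes"] <= dos_grace_minutes:
            continue  # DoS outage restored within a short period
        downtime += o["minutes"]
    return 100.0 * (MINUTES_PER_YEAR - downtime) / MINUTES_PER_YEAR

outages = [
    {"minutes": 60,  "confirming_nodes": 4, "cause": "unplanned"},
    {"minutes": 120, "confirming_nodes": 4, "cause": "maintenance"},
    {"minutes": 10,  "confirming_nodes": 4, "cause": "dos"},
    {"minutes": 30,  "confirming_nodes": 2, "cause": "unplanned"},
]
# Only the first outage survives the filters, so one hour of downtime
# is charged against the year.
print(round(availability(outages), 4))
```

Under these assumptions, a single confirmed 1-hour outage in a year still yields roughly 99.989% availability, which is why small differences in filtering rules can matter when comparing vendors against "three nines" or "four nines" SLAs.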
The purpose of this post is to compare the availability metrics we have collected over the past year with vendor SLAs to determine if in fact there is any correlation between the two.
SLA Credit Policies
In researching various vendor SLA policies for this post, we discovered a few general themes with regards to SLA credit policies we'd like to mention here. These include the following:
Pro-rated Credit (Pro-rated): Credit is based on a simple pro-ration of the amount of downtime that exceeded the SLA guarantee. Credit is issued based on that calculated exceedance and a credit multiple ranging from 1X (Linode) to 100X (GoGrid) (e.g. with GoGrid a 1-hour outage gets a 100-hour service credit). Credit is capped at 100% of service fees (i.e. you can't get more in credit than you paid for the service). Generally SLA credits are just that: service credits, not redeemable for a refund
Threshold Credit (Threshold): Threshold-based SLAs may provide a high guaranteed availability, but credits are not valid until the outage exceeds a given threshold time (i.e. the vendor has a certain amount of time to fix the problem before you are entitled to a service credit). For example, SoftLayer provides a 100% network SLA, but only issues SLA credit for continuous network outages exceeding 30 minutes
Percentage Credit (Percentage): This SLA credit policy discounts your next invoice X% based on the amount of downtime and the stated SLA. For example, EC2 provides a 10% monthly invoice credit when annual uptime falls below 99.5%
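To make the three credit styles concrete, here is a rough sketch of each calculation. The GoGrid 100X multiple, SoftLayer 30-minute threshold, and EC2 10%/99.5% figures come from the examples above; the function shapes, fees, and parameter names are our own illustration:

```python
# Illustrative calculators for the three SLA credit styles described above.
# The multiples, thresholds, and percentages are taken from the examples in
# the text; the fees and function signatures are simplifying assumptions.

def prorated_credit(outage_hours, hourly_fee, monthly_fee, multiple=100):
    """Pro-rated (GoGrid-style 100X): credit the outage time times the
    multiple, capped at 100% of service fees."""
    return min(outage_hours * multiple * hourly_fee, monthly_fee)

def threshold_eligible(outage_minutes, threshold_minutes=30):
    """Threshold (SoftLayer-style): no credit at all unless a continuous
    outage exceeds the threshold."""
    return outage_minutes > threshold_minutes

def percentage_credit(annual_uptime_pct, monthly_fee, sla_pct=99.5, credit_pct=10):
    """Percentage (EC2-style): a flat 10% of the monthly invoice when
    annual uptime falls below 99.5%."""
    return monthly_fee * credit_pct / 100 if annual_uptime_pct < sla_pct else 0.0

# A hypothetical $72/month server billed at $0.10/hour:
print(prorated_credit(1, hourly_fee=0.10, monthly_fee=72.0))  # 1h outage -> $10.00
print(threshold_eligible(20), threshold_eligible(45))          # short vs long outage
print(percentage_credit(99.4, monthly_fee=72.0))               # below the 99.5% SLA
```

The cap in the pro-rated case is what keeps even a 100X multiple from ever exceeding what you paid, and the threshold check shows why short outages, which our data suggests are the most common kind, often earn nothing at all.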
The most fair and simple of these policies seems to be the pro-rated method, while the threshold method seems to give the provider the greatest protection and flexibility (based on our data, most outages tend to be shorter than the thresholds used by the vendors). In the table below, we will attempt to identify which of these SLA credit policies is used by each vendor. Vendors that apply a threshold policy are highlighted in red.