Encoding Performance - Comparing Zencoder, Encoding.com, Sorenson and Panda

A few months ago we were approached by Zencoder to conduct a sponsored performance comparison of 4 encoding services including their own. The purpose was to validate their claims of faster encoding times using an independent, credible external source. This was a new frontier for us. Our primary focus has been performance analysis of infrastructure as a service (IaaS). However, we are curious about all things related to cloud and benchmarking and we felt this could be useful data to make available publicly, so we accepted.

Testing Methodology

This is a description of the methodology we used for conducting this performance analysis.

Source Media

Following discussions with Zencoder, we opted to test encoding performance using 4 distinct media types. We were tasked with finding samples for each media type, they were not provided by Zencoder. All source media was stored in AWS S3 using the US East region (the same AWS region each of the 4 encoding services are hosted from). The 4 media types we used for testing are:

Read More

An unofficial EC2 outage postmortem - the sky is not falling

Last week Amazon Web Services (AWS) experienced a high profile outage affecting Elastic Cloud Compute (EC2) and Elastic Block Storage (EBS) in 1 of 4 data centers in the US East region. This outage caused some high profile website outages including Reddit, Quora and FourSquare and scores of negative PR. In the proceeding days media outlets and bloggers have written literally hundreds of articles such as Amazon's Trouble Raises Cloud Computing Doubts (New York Times), The Day The Cloud Died (Forbes), Amazon outage sparks frustration, doubts about cloud (Computerworld), and many others.

EC2 and EBS in a nutshell

In case you are not familiar with the technical jargon and acronyms, EBS is one of two methods provided by AWS for setting up an EC2 instance (an EC2 instance is essentially a server) storage volumes (basically a cloud hard drive). Unlike a traditional hard drive that is located physically inside of a computer, EBS is stored externally on dedicated storage boxes and connected to EC2 instances over a network. The second storage option provided by EC2 is called ephemeral, which uses this more traditional method of hard drives located physically inside the same hardware that an EC2 instance runs on. Using EBS is encouraged by AWS and provides some unique benefits not available with ephemeral storage. One such benefit is the ability to recover quickly from a host failure (a host is the hardware that an EC2 instance runs on). If the host fails for an EBS EC2 instance, it can quickly be restarted on another host because its storage does not reside on the failed host. On the contrary, if the host fails for an ephemeral EC2 instance, that instance and all of the data stored on it will be permanently lost. EBS instances can also be shutdown temporarily and restarted later, whereas ephemeral instances are deleted if shut down. EBS also theoretically provides better performance and reliability when compared to ephemeral storage.

Other technical terms you may hear and should understand regarding EC2 are virtualization and multi-tenancy. Virtualization allows AWS to run multiple EC2 instances on a single physical host by creating simulated "virtual" hardware environments for each instance. Without virtualization, AWS would have to maintain a 1-to-1 ratio between EC2 instance and physical hardware, and the economics just wouldn't work. Multi-tenancy is a consequence of virtualization in that multiple EC2 instances share access to physical hardware. Multi-tenancy often causes performance degradation in virtualized environments because instances may need to wait briefly to obtain access to physical resources like CPU, hard disk or network. The term noisy neighbor is often used to describe this scenario in very busy environments where virtual instances are waiting frequently for physical resources causing noticeable declines in performance.

EC2 is generally a very reliable service. Without a strong track record high profile websites like Netflix would not use it. We conduct ongoing independent outage monitoring of over 100 cloud services which shows 3 of the 5 AWS EC2 regions having 100% availability the past year. In fact, our own EBS backed EC2 instance in the affected US East region remained online throughout last week's outage.

AWS endorses a different type of architectural philosophy called designing for failure. In this context, instead of deploying highly redundant and fault tolerant (and very expensive) "enterprise" hardware, AWS uses low cost commodity hardware and designs their infrastructure to expect and deal gracefully with failure. AWS deals with failure using replication. For example, each EBS volume is stored on 2 separate storage arrays. In theory, if one storage array fails, its' volumes are quickly replaced with the backup copies. This approach provides many of the benefits of enterprise hardware, such as fault tolerance and resiliency, while at the same time providing substantially lower hardware costs enabling AWS to price their services competitively.

The outage - what went wrong?

Disclaimer: This is our own opinion of what occurred during last week's EC2 outage based on our interpretation of the comments provided on the AWS Service Health Dashboard and basic knowledge of the EC2/EBS architecture.

At about 1AM PST on Thursday April 21st, one of the four availability zones in the AWS US East region experienced a network fault that caused connectivity failures between EC2 instances and EBS. This event triggered a failover sequence wherein EC2 automatically swapped out the EBS volumes that had lost connectivity with backup copies. At the same time, EC2 attempted to create new backup copies of all of the affected EBS volumes (they refer to this as "re-mirroring"). While this procedure works fine for a few isolated EBS failures, this event was more widespread which created a very high load on the EBS infrastructure and the network that connects it to EC2. To make matters worse, some AWS users likely noticed problems and began attempting to restore their failed or poorly performing EBS volumes on their own. All of this activity appears to have caused a meltdown of the network connecting EC2 to EBS and exhausted the available EBS physical storage in this availability zone. Because EBS performance is dependent on network latency and throughput to EC2, and because those networks were saturated with activity, EBS performance became severely degraded, or in many cases completely failed. These issues likely bled into other availability zones in the region as users attempted to recover their services by launching new EBS volumes and EC2 instances in those availability zones. Overall, a very bad day for AWS and EC2.

Do SLAs really matter? A 1 year case study of 38 cloud services

In late 2009 we began monitored the availability of various cloud services. To do so, we partnered or contracted with cloud vendors to let us maintain, monitor and benchmark the services they offered. These include IaaS vendors (i.e. cloud servers, storage, CDNs) such as GoGrid and Rackspace Cloud, and PaaS services such as Microsoft Azure and AppEngine. We use Panopta to provide monitoring, outage confirmation, and availability metric calculation. Panopta provides reliable monitoring metrics using a multi-node outage confirmation process wherein each outage is verified by 4 geographically dispersed monitoring nodes. Additionally, we attempt to manually confirm and document all outages greater than 5 minutes using our vendor contacts or the provider's status page (if available). Outages triggered due to scheduled maintenance are removed. DoS ([distributed] denial of service) outages are also removed if the vendor is able to restore service within a short period of time. Any outages triggered by us (e.g. server reboots) are also removed.

The purpose of this post is to compare the availability metrics we have collected over the past year with vendor SLAs to determine if in fact there is any correlation between the two.

SLA Credit Policies

In researching various vendor SLA policies for this post, we discovered a few general themes with regards to SLA credit policies we'd like to mention here. These include the following:

The most fair and simple of these policies seems to be the pro-rated method, while the threshold method seems to give the provider the greatest protection and flexibility (based on our data, most outages tend to be shorter than the thresholds used by the vendors). In the table below, we will attempt to identify which of these SLA credit policies used by each vendor. Vendors that apply a threshold policy are highlighted in red.

Cloud Availability;
Cloud Outages;
Cloud Service Disruptions;
Storm on Demand
Read More

Introducing Web Services for Cloud Performance Metrics

Over the past year we've amassed a large repository of cloud benchmarks and metrics. Today, we are making most of that data available via web services. This data includes the following:

  • Available Public Clouds: What public clouds are around and which cloud services they offer including:
    • Cloud Servers/IaaS: e.g. EC2, GoGrid
    • Cloud Storage: e.g. S3, Google Storage
    • Content Delivery Networks/CDNs: e.g. Akamai, MaxCDN, Edgecast
    • Cloud Platforms: e.g. Google AppEngine, Microsoft Azure, Heroku
    • Cloud Databases: e.g. SimpleDB, SQL Azure
    • Cloud Messaging: e.g. Amazon SQS, Azure Message Queue
  • Cloud Servers: What instance sizes, server configurations and pricing are offered by public clouds. For example, Amazon's EC2 comes in 10 different instances sizes ranging from micro to 4xlarge. Our cloud servers pricing data includes typical hourly, daily, monthly pricing as well as complex pricing models such as spot pricing (dynamically updated) and reserve pricing where applicable
  • Cloud Benchmark Catalog: This includes names, descriptions and links to the benchmarks we run. Our benchmarks cover both system and network performance metrics
  • Cloud Benchmark Results: Access to our repository of 6.5 million benchmarks including advanced filtering, aggregation and comparisons. We are continually conducting benchmarks so this data is constantly being updated

We are releasing this data in hopes of improving transparency and making the comparison of cloud services easier. There are many ways that this data might be used. In this post, we'll go through a few examples to get you started and let you take it from there.

Our web services API provides both RESTful (HTTP query request and JSON or XML response) and SOAP interfaces. The API documentation and SOAP WSDLs are published here: http://api.cloudharmony.com/api

The sections below are separated into individual examples. This is not intended to be comprehensive documentation for the web services, but rather a starting point and a reference for using them. More comprehensive technical documentation is provided for each web service on our website.