Thursday, November 17, 2011

Is Joyent Really 14X Faster than EC2 and Azure the "Fastest Cloud"? Questions to Ask About Benchmark Studies

Many are skeptical of claims that involve benchmarks. Over the years benchmarks have been manipulated and misrepresented. Benchmarks aren't inherently bad or created in bad faith. To the contrary, when understood and applied correctly, benchmarks can often provide useful insight for performance analysis and capacity planning. The problem with benchmarks is they are often misunderstood or misrepresented, frequently resulting in bold assertions and questionable claims. Oftentimes there are also extraneous factors involved such as agenda-driven marketing organizations. In fact, the term "benchmarketing" was coined to describe questionable marketing-driven, benchmark-based claims. This post will discuss a few questions one might consider when reading benchmark-based claims. We'll then apply these questions to 2 recent cloud related, benchmark-based studies.

Questions to consider

The following are 7 questions one might ask when considering benchmark-based claims. Answering these questions will help to provide a clearer understanding of the validity and applicability of the claims.
  1. What is the claim? Typically the bold-face, attention grabbing headline like Service Y is 10X faster than Service Z
  2. What is the claimed measurement? Usually implied by the headline. For example the claim Service Y is 10X faster than Service Z implies a measurement of system performance
  3. What is the actual measurement? To answer this question, look at the methodology and benchmark(s) used. This may require some digging, but can usually be found somewhere in the article body. Once found, do some research to determine what was actually measured. For example, if Geekbench was used, you would discover the actual measurement is processor and memory performance, but not disk or network IO
  4. Is it an apples-to-apples comparison? The validity of a benchmark-based claim ultimately depends on the fairness of the testing methodology. Claims involving comparisons should compare similar things. For example, Ford could compare a Mustang Shelby GT500 (top speed 190 MPH) to a Chevy Aveo (top speed 100 MPH) and claim their cars are nearly twice as fast, but the Aveo is not a comparable vehicle and therefore the claim would be invalid. A fairer, apples-to-apples comparison would be a Mustang GT500 and a Chevy Camaro ZL1 (top speed 186 MPH).
  5. Is the playing field level? Another important question to ask is whether or not there are any extraneous factors that provided an unfair advantage to one test subject over another. For example, using the top speed analogy, Ford could compare a Mustang with 92 octane fuel and a downhill course to a Camaro with 85 octane fuel and an uphill course. Because there are extraneous factors (fuel and angle of the course) which provided an unfair advantage to the Mustang, the claim would be invalid. To be fair, the top speeds of both vehicles should be measured on the same course, with the same fuel, fuel quantity, driver and weather conditions.
  6. Was the data reported accurately? Benchmarking often results in large datasets. Summarizing the data concisely and accurately can be challenging. Things to watch out for include lack of good statistical analysis (e.g. reporting average only), math errors, and sloppy calculations. For example, if a large, highly variable dataset is collected, it is generally a best practice to report the median value in place of the mean (average) to mitigate the effects of outliers. Standard deviation is also a useful metric to include as an indicator of data consistency.
  7. Does it matter to you? The final question to ask is, assuming the results are valid, does it actually mean anything to you? For example, purchasing a vehicle based on a top speed comparison is not advisable if fuel economy is what really matters to you.
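The point in question 6 about median versus mean can be illustrated with a minimal sketch. The response times below are made-up values chosen to include one outlier; they do not come from any study discussed in this post.

```python
import statistics

# Hypothetical response times in seconds; the last value is an outlier.
samples = [1.1, 1.2, 1.0, 1.3, 1.2, 1.1, 12.0]

mean = statistics.mean(samples)
median = statistics.median(samples)
stdev = statistics.stdev(samples)  # sample standard deviation

print(f"mean={mean:.2f}s median={median:.2f}s stdev={stdev:.2f}s")
```

With data like this, a single slow sample pulls the mean to more than twice the median, and the large standard deviation flags the inconsistency that the mean alone would hide.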

Case Study #1: Joyent Cloud versus AWS EC2

In this case study, Joyent sponsored a third party benchmarking study to compare Joyent Cloud to AWS EC2. The study utilized our own (CloudHarmony) benchmarking methodology to compare 3 categories of performance: CPU, Disk IO and Memory. The end results of the study are published on the Joyent website available here. In the table below, we'll apply the questions listed above to this study. Answers will be color coded green where the study provided a positive response to the question, and red where the results are misleading or misrepresented.
Questions & Answers

What is the claim? Joyent Cloud is 3x - 14x Faster than AWS EC2
The claims are broken down by measurement type (CPU, Disk IO, Memory), and OS type (SmartOS/Open Solaris, Linux). The resulting large, colorful icons on the Joyent website claim that Joyent Cloud is faster than EC2 by a margin of 3x - 14x
What is the claimed measurement? CPU, Disk IO, Memory Performance
Our benchmarking methodology was used to measure these different categories of performance. This methodology consists of running multiple benchmarks per category and creating a composite measurement based on a summary of the results for all benchmarks in each category. The methodology is described in more detail on our blog here (CPU), here (Disk IO) and here (Memory).
What is the actual measurement? CPU, Disk IO, Memory Performance
Is it an apples-to-apples comparison? Dissimilar instance types were compared
In the Linux comparison, Joyent claims 5x faster CPU, 3x faster Disk IO, and 4x faster memory. Based on the report details, it appears those ratios originate from comparing a 1GB Joyent VM to an EC2 m1.small. This selection provided the largest performance differential and hence the biggest claim. While price-wise these instance types are similar (disregarding m1.small spot and reserved pricing, where it is 1/2 the cost), that is where the similarities stop. At the time of this report, m1.small was the slowest EC2 instance with a single core and older CPU, while Joyent's 1GB instance type has 2 burstable cores and a newer CPU. The m1.small is not intended for compute-intensive tasks. For that type of workload EC2 offers other options with newer CPUs and more cores. To provide an apples-to-apples comparison on performance, the claim should be based on 2 instance types that are intended for such a purpose (e.g. an EC2 m2 or cc1).
Is the playing field level? Operating system and storage type were different
The study compares Joyent Cloud VMs running SmartOS or Ubuntu 10.04 to AWS EC2 VMs running CentOS 5.4. Joyent's SmartOS is based on Open Solaris and highly optimized for the Joyent environment. Ubuntu 10.04 uses Linux Kernel 2.6.32 (release date: Dec 2009) which is over 3 years newer than the 2.6.18 kernel (release date: Sep 2006) in CentOS 5.4. Newer and more optimized operating systems will almost always perform better for similar tasks on identical hardware. This provided an advantage to the Joyent VMs from the outset.

Additionally, the tests compared EC2 instances running on networked storage (EBS) to Joyent instances running on local storage, which also provided an advantage to the Joyent VMs for the disk IO benchmarks.
Was the data reported accurately? Mistakes were made in calculations
This study was based on a cloud performance comparison methodology we (CloudHarmony) developed for a series of blog posts in 2010. For CPU performance, we developed an algorithm that combined the results of 19 different CPU benchmarks to provide a single performance metric that attempts to approximate the AWS ECU (EC2 Compute Unit). To do so, we utilized EC2 instances and their associated ECU value as a baseline. We called this metric CCU and the algorithm for producing it was described in this blog post. Part of the algorithm involved calculating CCU when performance exceeded the largest baseline EC2 instance type, the 26 ECU m2.4xlarge. In our algorithm we used the performance differential ratio between an m1.small (1 ECU) and m2.4xlarge (26 ECUs). The third party, however, used the ratio between an m2.2xlarge (13 ECUs) and m2.4xlarge (26 ECUs). Because m2s run on the same hardware type, the performance difference between an m2.2xlarge and an m2.4xlarge is not very great, but the difference in ECUs is very high. The end result was that their calculations produced a very high CCU value for the Joyent instances (in the range of 58-67 CCUs). Had the correct algorithm been used, the reported CCUs would have been much lower.
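The effect of choosing a different baseline pair can be shown with a rough sketch. It assumes a simple linear extrapolation beyond the largest baseline and uses made-up composite benchmark scores; the actual CCU algorithm is the one described in the blog post referenced above and may differ in its details. The function name and score values are hypothetical.

```python
def extrapolate_ecu(score, low, high):
    """Linearly extrapolate an ECU-like value beyond the top baseline.

    low and high are (benchmark_score, ecu) pairs for two baseline
    instance types. Illustrative only, not the actual CCU algorithm.
    """
    (low_score, low_ecu), (high_score, high_ecu) = low, high
    ecu_per_point = (high_ecu - low_ecu) / (high_score - low_score)
    return high_ecu + (score - high_score) * ecu_per_point

# Hypothetical composite benchmark scores (NOT measured values).
M1_SMALL = (100.0, 1)       # 1 ECU
M2_2XLARGE = (900.0, 13)    # 13 ECUs, same hardware type as m2.4xlarge
M2_4XLARGE = (1000.0, 26)   # 26 ECUs, largest baseline

test_score = 1250.0  # hypothetical instance that outscores the m2.4xlarge

wide = extrapolate_ecu(test_score, M1_SMALL, M2_4XLARGE)
narrow = extrapolate_ecu(test_score, M2_2XLARGE, M2_4XLARGE)
print(f"m1.small to m2.4xlarge baseline:   {wide:.1f}")
print(f"m2.2xlarge to m2.4xlarge baseline: {narrow:.1f}")
```

Because the two m2 baselines are close in score but far apart in ECUs, each extra benchmark point is worth far more under the narrow baseline, so the same test score extrapolates to a much larger value.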
Does it matter to you? Probably not
There isn't much value or validity to the data provided in these reports. The bold headlines which state Joyent Cloud is 3X - 14X faster than EC2 rest on very shaky ground. In fact, with Joyent's approval, we recently ran our benchmarks in their environment, resulting in the following CPU, disk IO and memory performance metrics:

CloudHarmony Generated Joyent/EC2 Performance Comparison
CPU Performance: AWS EC2 vs Joyent
View Full Report
Provider | Instance Type | Memory | Cost | CCU
EC2 | cc1.4xlarge | 23 GB | $1.30/hr | 33.5
Joyent | XXXL 48GB (8 CPU) | 48 GB | $1.68/hr | 28.44
EC2 | m2.4xlarge | 68.4 GB | $2.00/hr | 26
EC2 | m2.2xlarge | 34.2 GB | $1.00/hr | 13
Joyent | XL 16GB (3 CPU) | 16 GB | $0.64/hr | 10.94
Joyent | XXL 32GB (4 CPU) | 32 GB | $1.12/hr | 6.82
EC2 | m2.xlarge | 17.1 GB | $0.50/hr | 6.5
Joyent | Large 8GB (2 CPU) | 8 GB | $0.36/hr | 6.19
Joyent | Medium 4GB (1 CPU) | 4 GB | $0.24/hr | 5.53
Joyent | Medium 2GB (1 CPU) | 2 GB | $0.17/hr | 5.45
Joyent | Small 1GB (1 CPU) | 1 GB | $0.085/hr | 4.66
EC2 | m1.large | 7.5 GB | $0.34/hr | 4
EC2 | m1.small | 1.7 GB | $0.085/hr | 1
Disk IO Performance: AWS EC2 vs Joyent
View Full Report - Note: the EC2 instances labeled EBS utilized a single networked storage volume - better performance may be possible using local storage or multiple EBS volumes. All Joyent instances utilized local storage (networked storage is not available).
Provider | Instance Type | Memory | Cost | IOP
EC2 | cc1.4xlarge (local storage - raid 0) | 23 GB | $1.30/hr | 212.06
EC2 | cc1.4xlarge (local storage) | 23 GB | $1.30/hr | 194.29
Joyent | XXXL 48GB (8 CPU) | 48 GB | $1.68/hr | 187.38
Joyent | XL 16GB (3 CPU) | 16 GB | $0.64/hr | 144.71
Joyent | XXL 32GB (4 CPU) | 32 GB | $1.12/hr | 142.19
Joyent | Large 8GB (2 CPU) | 8 GB | $0.36/hr | 130.84
Joyent | Medium 4GB (1 CPU) | 4 GB | $0.24/hr | 110.78
Joyent | Medium 2GB (1 CPU) | 2 GB | $0.17/hr | 109.2
EC2 | m2.2xlarge (EBS) | 34.2 GB | $1.00/hr | 87.58
EC2 | m2.xlarge (EBS) | 17.1 GB | $0.50/hr | 83.62
EC2 | m2.4xlarge (EBS) | 68.4 GB | $2.00/hr | 82.79
EC2 | m1.large (EBS) | 7.5 GB | $0.34/hr | 56.82
Joyent | Small 1GB (1 CPU) | 1 GB | $0.085/hr | 56.08
EC2 | m1.small (EBS) | 1.7 GB | $0.085/hr | 27.08
Memory Performance: AWS EC2 vs Joyent
View Full Report
Provider | Instance Type | Memory | Cost | CCU
EC2 | cc1.4xlarge | 23 GB | $1.30/hr | 137.2
EC2 | m2.2xlarge | 34.2 GB | $1.00/hr | 109.41
EC2 | m2.4xlarge | 68.4 GB | $2.00/hr | 109.14
EC2 | m2.xlarge | 17.1 GB | $0.50/hr | 103.35
Joyent | XL 16GB (3 CPU) | 16 GB | $0.64/hr | 100.87
Joyent | XXXL 48GB (8 CPU) | 48 GB | $1.68/hr | 92.5
Joyent | XXL 32GB (4 CPU) | 32 GB | $1.12/hr | 90.79
Joyent | Large 8GB (2 CPU) | 8 GB | $0.36/hr | 90.37
Joyent | Medium 2GB (1 CPU) | 2 GB | $0.17/hr | 84.2
Joyent | Small 1GB (1 CPU) | 1 GB | $0.085/hr | 78.51
Joyent | Medium 4GB (1 CPU) | 4 GB | $0.24/hr | 76.04
EC2 | m1.large | 7.5 GB | $0.34/hr | 61.8
EC2 | m1.small | 1.7 GB | $0.085/hr | 22.24
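
As a rough check against the headline ratios, the figures from the three tables above can be divided directly for the two $0.085/hr instance types. The numbers are copied from the tables; the dictionary layout and variable names are just for illustration.

```python
# Values copied from the tables above (both instances cost $0.085/hr).
joyent_small = {"cpu": 4.66, "disk_io": 56.08, "memory": 78.51}
ec2_m1_small = {"cpu": 1.0, "disk_io": 27.08, "memory": 22.24}

ratios = {k: joyent_small[k] / ec2_m1_small[k] for k in joyent_small}
for category, ratio in ratios.items():
    print(f"{category}: Joyent Small scores {ratio:.2f}x the EC2 m1.small")
```

The same division can be repeated for any other pair of rows, for example Joyent XXXL against EC2 cc1.4xlarge, to see how the ratio depends on which instance types are compared.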

Case Study #2: Microsoft Azure Named Fastest Cloud Service

In October 2011, Compuware published a blog post related to cloud performance. This post was picked up by various media outlets, resulting in headlines naming Microsoft Azure the "fastest cloud".
Here's how the test worked in a nutshell:
  • Two sample e-commerce web pages were created. The first with items description and 40 thumbnails (product list page), and the second with a single 1.75 MB image (product details page)
  • These pages were made accessible using a Java application server (Tomcat 6) running in each cloud environment. The exception to this is Microsoft Azure and Google AppEngine (platform-as-a-service/PaaS environments) which required the pages to be bundled and deployed using their specific technology stack
  • 30 monitoring servers/nodes were instructed to request these 2 pages in succession every 15 minutes and record the amount of time it took to render both in their entirety (including the embedded images)
  • The 30 monitoring nodes are located in data centers in North America (19), Europe (5), Asia (3), Australia (1) and South America (2) - they are part of the Gomez Performance Network (GPN) monitoring service
  • After 1 year an average response time was calculated for each service (response times above 10 seconds were discarded)
Now let's dig a little deeper...
Questions & Answers

What is the claim? Microsoft Azure is the "fastest cloud"
What is the claimed measurement? Overall performance (it's fastest)
What is the actual measurement? Network Latency & Throughput
Rendering 2 html pages and some images is not CPU intensive and as such is not a measure of system performance. The main bottleneck is network latency and throughput, particularly to distant monitoring nodes (e.g. Australia to US)
Is it an apples-to-apples comparison? Types of services tested are different (IaaS vs PaaS) and the instance types are dissimilar
Microsoft Azure and Google AppEngine are platform-as-a-service (PaaS) environments, very different from infrastructure-as-a-service (IaaS) environments like EC2 and GoGrid. With PaaS, users must package and deploy applications using custom tools and more limited capabilities. Applications are deployed to large clustered, multi-tenant environments. Because of the greater structure and more limited capabilities of PaaS, providers are able to better optimize and scale those applications, often resulting in better performance and availability when compared to a single server IaaS deployment. Not much information is disclosed regarding the sizes of instances used for the IaaS services. With some IaaS providers, network performance can vary depending on instance size. For example, with Rackspace Cloud, a 256MB cloud server is capped with a 10 Mbps uplink. With EC2, bandwidth is shared across all instances deployed to a physical host. Smaller instance sizes generally have less, and more variable, bandwidth. This test was conducted using nearly the smallest EC2 instance size, an m1.small.
Is the playing field level? Services may have an unfair advantage due to network proximity and uplink performance
Because network latency is the main bottleneck for this test, and only a handful of monitoring nodes were used, the results are highly dependent on network proximity and latency between the services tested and the monitoring nodes. For example, the Chicago monitoring node might be sitting in the same building as the Azure US Central servers, giving Azure an unfair advantage in the test. Additionally, the IaaS services where uplinks are capped on smaller instance types would be at a disadvantage to uncapped PaaS and IaaS environments.
Was the data reported accurately? Simple average was reported - no median, standard deviation or regional breakouts were provided
The CloudSleuth post provided a single metric only… the average response time for each service across all monitoring nodes. A better way to report this data would involve breaking the data down by region. For example, average response time for eastern US monitoring nodes. Reporting median, standard deviation and 90th percentile statistical calculations would also be very helpful in evaluating the data.
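A regional breakout of this kind could be sketched as follows. The samples are made up for illustration, the region names are placeholders, and the 10 second cutoff mirrors the discard rule described in the test summary; nothing here reflects the actual Gomez data.

```python
import statistics
from collections import defaultdict

# Hypothetical (region, response time in seconds) samples.
samples = [
    ("North America", 1.2), ("North America", 1.4), ("North America", 1.1),
    ("North America", 1.3), ("North America", 11.5),
    ("Europe", 2.1), ("Europe", 2.4), ("Europe", 2.0), ("Europe", 2.6),
    ("Australia", 4.8), ("Australia", 5.1), ("Australia", 4.6), ("Australia", 5.5),
]

by_region = defaultdict(list)
for region, seconds in samples:
    if seconds <= 10:  # the study discarded responses above 10 seconds
        by_region[region].append(seconds)

report = {}
for region, times in by_region.items():
    # n=10 yields 9 cut points; the last one is the 90th percentile.
    deciles = statistics.quantiles(times, n=10, method="inclusive")
    report[region] = {
        "median": statistics.median(times),
        "stdev": statistics.stdev(times),
        "p90": deciles[-1],
    }

for region, stats in report.items():
    print(region, {name: round(value, 2) for name, value in stats.items()})
```

Reported this way, a service that is fast for nearby nodes but slow for distant ones shows up as such, instead of being folded into one global average.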
Does it matter to you? Probably not
Unless your users are sitting in the same 30 data centers as the GPN monitoring nodes, this study probably means very little. It does not represent a real world scenario where static content like images would be deployed to a distributed content delivery network like CloudFront or Edgecast. It attempts to compare two different types of cloud services, PaaS and IaaS. It may use IaaS instance types like the EC2 m1.small that represent the worst case performance scenario. The 30 node test population is also very small and not indicative of a real end user population (end users don't sit in data centers). Finally, reporting only a single average value ignores most statistical best practices.

Monday, October 17, 2011

Encoding Performance: Comparing Zencoder, Encoding.com, Sorenson & Panda

A few months ago we were approached by Zencoder to conduct a sponsored performance comparison of 4 encoding services including their own. The purpose was to validate their claims of faster encoding times using an independent, credible external source. This was a new frontier for us. Our primary focus has been performance analysis of infrastructure as a service (IaaS). However, we are curious about all things related to cloud and benchmarking and we felt this could be useful data to make available publicly, so we accepted.

Testing Methodology

This is a description of the methodology we used for conducting this performance analysis.

Source Media

Following discussions with Zencoder, we opted to test encoding performance using 4 distinct media types. We were tasked with finding samples for each media type; they were not provided by Zencoder. All source media was stored in AWS S3 using the US East region (the same AWS region that each of the 4 encoding services is hosted in). The 4 media types we used for testing are:
  • HD Video: We chose an HD 1080P trailer for the movie Avatar. This file was 223.1 MB in size and 3 mins, 30 secs in duration.
  • SD Video: We chose a 480P video episode from a cartoon series. The file was 519.2 MB in size and about 23 mins in duration.
  • Mobile Video: We created a 568x320 video using an iPhone (source here). The file was 2.9 MB in size, 30 secs in duration.
  • MP3 Audio: We used an MP3 file we found on the web about Yoga (source here). The file was 42.2 MB in size, 58 mins 41 secs in duration.

Encode Settings

We used the same encode options across all of the services tested. The following is a summary of the encode options used for each corresponding media type:
Media Type | Video Codec | Video Bitrate | Audio Codec | Audio Bitrate | Encode Passes
HD Video | H.264 | 3000 Kb/s | AAC | 96 Kb/s | 2
SD Video | H.264 | 1000 Kb/s | AAC | 96 Kb/s | 2
Mobile Video | H.264 | 500 Kb/s | AAC | 96 Kb/s | 2
MP3 Audio | NA | NA | AAC | 96 Kb/s | 2

Test Scheduling

Testing was conducted during a span of 1 week. We built a test harness that integrated with the APIs of each of the 4 encoding services. The test harness invoked 2 test iterations daily with each service. Testing included both single request and 4 parallel requests. Each test iteration consisted of the following 8 test scenarios:
  • Single HD Video Request
  • Single SD Video Request
  • Single Mobile Video Request
  • Single MP3 Audio Request
  • 4 Parallel HD Video Requests
  • 4 Parallel SD Video Requests
  • 4 Parallel Mobile Video Requests
  • 4 Parallel MP3 Audio Requests
The order of the test scenarios was randomized, but the same tests were always requested at the same time for each service. Each test was run to completion on all services before the next test was invoked. The start times for 2 daily test iterations were separated by about 12 hours and incremented by 100 minutes each day. The end result was a distribution of test scenarios during many different times of the day. A total of 112 tests were performed during the 1 week test span, followed by an additional 24 test scenarios on encoding.com to test different combinations of service specific encoding options (described below).
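
The scheduling described above can be sketched roughly as follows. The first start time and the random seed are assumptions (the post does not state the actual test dates), and the function and variable names are hypothetical.

```python
import random
from datetime import datetime, timedelta

SCENARIOS = [
    "Single HD Video", "Single SD Video",
    "Single Mobile Video", "Single MP3 Audio",
    "4 Parallel HD Video", "4 Parallel SD Video",
    "4 Parallel Mobile Video", "4 Parallel MP3 Audio",
]
DAYS = 7

def build_schedule(first_start, seed=0):
    """Return (start_time, scenario_order) pairs for every test iteration."""
    rng = random.Random(seed)
    schedule = []
    for day in range(DAYS):
        # Each day's first iteration starts 100 minutes later than the day before.
        first = first_start + timedelta(days=day, minutes=100 * day)
        for start in (first, first + timedelta(hours=12)):
            # One shuffled order per iteration, shared by all services.
            order = rng.sample(SCENARIOS, k=len(SCENARIOS))
            schedule.append((start, order))
    return schedule

# Hypothetical first start time; the actual test dates are not stated.
schedule = build_schedule(datetime(2011, 9, 1, 0, 0))
total_tests = sum(len(order) for _, order in schedule)
print(total_tests)
```

Seven days of 2 iterations with 8 scenarios each gives the 112 tests mentioned above, and the 100 minute daily shift spreads the start times across different hours of the day.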

Performance Metrics

During testing, we captured the following metrics:
  • Encode Time: The amount of time in seconds required to encode the media (excludes transfer time)
  • Transfer Time: The amount of time in seconds required to transfer the source media from AWS S3 into the service processing queue
  • Failures: When a service failed to complete a job

Test Results

The following is a summary of the results for each of the 8 test scenarios. Result tables have the following columns:
  • Avg Encode Time: The mean (average) encoding time in seconds for all jobs in this test scenario
  • Standard Deviation %: Standard deviation as a percentage of the mean. A lower value indicates more consistency in performance
  • Median Encode Time: The median encoding time in seconds for all jobs in this test scenario
  • Avg Total Time: The mean (average) total job time in seconds. The total time is the sum of transfer (source files were hosted in AWS S3 US East), queue and encoding times, essentially the total turnaround time from the moment the job was submitted. Sorenson and Panda may cache source media files, thus reducing transfer time for future requests
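
As a minimal sketch of how these columns relate to the raw job data, the snippet below derives them from a handful of made-up job records. The record layout is hypothetical, and it assumes the sample standard deviation, since the post does not say which variant was used.

```python
import statistics

# Hypothetical job records in seconds: (transfer, queue, encode).
jobs = [
    (10.0, 2.0, 140.0),
    (11.0, 3.0, 150.0),
    (9.0, 1.0, 130.0),
    (12.0, 2.0, 144.0),
]

encode_times = [encode for _, _, encode in jobs]
avg_encode = statistics.mean(encode_times)
# Standard deviation expressed as a percentage of the mean.
stdev_pct = 100 * statistics.stdev(encode_times) / avg_encode
median_encode = statistics.median(encode_times)
# Total time is transfer + queue + encode for each job.
avg_total = statistics.mean([sum(job) for job in jobs])

print(f"{avg_encode:.2f} {stdev_pct:.0f} {median_encode:.1f} {avg_total:.2f}")
```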

A graph is displayed below each results table depicting sorted average encode and total times using a dual series horizontal bar chart. Also depicted on the graph is a line indicating the actual duration of the source media. Encode time bars that terminate to the left of this line signify faster than realtime encoding performance.
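
The realtime comparison can also be expressed numerically. The sketch below uses the HD source duration given in the Source Media section and a few of the average encode times reported in the Single HD Video Request results; the variable names are just for illustration.

```python
HD_DURATION_SECONDS = 3 * 60 + 30  # HD source: 3 mins, 30 secs

# Average encode times (seconds) from the Single HD Video Request results.
avg_encode_seconds = {
    "zencoder": 141.08,
    "encoding.com 2 (max+TT)": 213.38,
    "sorenson": 974.31,
    "panda": 1246.31,
}

# Speed relative to realtime: above 1.0 means the encode finished
# in less time than the media takes to play.
realtime_speed = {
    service: HD_DURATION_SECONDS / seconds
    for service, seconds in avg_encode_seconds.items()
}
faster_than_realtime = [s for s, speed in realtime_speed.items() if speed > 1.0]

for service, speed in realtime_speed.items():
    print(f"{service}: {speed:.2f}x realtime")
```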

encoding.com offers a few different encoding job options that can affect encode performance. These options include the following:

  • instant: using the instant option, encoding.com will begin encoding before the source media has been fully downloaded. For larger source files, this can decrease encode times
  • twin turbo: this setting causes jobs to be delegated to faster servers in exchange for paying a $2/GB premium for encoding. encoding.com states that this option will deliver 6-8X faster encoding time over standard servers

Test Results: Single HD Video Request

The following are the results from a total of 14 HD video encode jobs submitted at various times of the day over a period of 1 week:
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 141.08 | 6 | 137 | 155.08
encoding.com 2 (max+TT) | 213.38 | 16 | 193 | 244.53
encoding.com 1 (max+TT+instant) | 213.5 | 11 | 213.5 | 238
encoding.com 4 (plus+TT+instant) | 226.75 | 11 | 235.5 | 258.75
encoding.com 3 (max) | 618.75 | 12 | 643.5 | 653.25
sorenson | 974.31 | 2 | 975 | 982.46
panda | 1246.31 | 4 | 1247 | 1255.93

Failures - None

Test Results: Single SD Video Request

The following are the results from a total of 14 SD video encode jobs submitted at various times of the day over a period of 1 week:
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 229.14 | 9 | 225.5 | 281.64
encoding.com 4 (plus+TT+instant) | 453 | 1 | 453.5 | 718.75
encoding.com 2 (max+TT) | 454.07 | 11 | 458.5 | 689.21
encoding.com 1 (max+TT+instant) | 474 | 7 | 474 | 732.5
encoding.com 3 (max) | 1137.75 | 12 | 1162 | 1429
panda | 1649.36 | 7 | 1640.5 | 1673.93
sorenson | 2087.23 | 8 | 2052 | 2101

Failures

  • Sorenson: 1

Test Results: Single Mobile Video Request

The following are the results from a total of 14 mobile video encode jobs submitted at various times of the day over a period of 1 week. NOTE: During our testing, encoding.com jobs would occasionally experience excessive queue times with the status "Waiting for encoder". This is the reason for the long green section representing the transfer/queue time delta on the graph below.
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 19.38 | 12 | 20 | 30.23
panda | 25.69 | 4 | 25 | 26.31
encoding.com 4 (plus+TT+instant) | 35 | 7 | 34 | 122.75
encoding.com 3 (max) | 40.25 | 9 | 40.5 | 51.25
sorenson | 47 | 6 | 46 | 68.31
encoding.com 2 (max+TT) | 76.92 | 139 | 31 | 88.23
encoding.com 1 (max+TT+instant) | 97 | 67 | 97 | 112

Failures - None

Test Results: Single MP3 Audio Request

The following are the results from a total of 14 audio encode jobs submitted at various times of the day over a period of 1 week. In this test scenario, we again experienced some long queue times during one of the encoding.com test phases.
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 120.31 | 11 | 115 | 131
sorenson | 189.38 | 11 | 182 | 196.23
panda | 218.58 | 3 | 217 | 222.16
encoding.com 1 (max+TT+instant) | 221 | 0 | 221 | 255
encoding.com 4 (plus+TT+instant) | 224.75 | 9 | 225 | 248.75
encoding.com 3 (max) | 240.5 | 7 | 238 | 355.75
encoding.com 2 (max+TT) | 328.36 | 52 | 254 | 360.45

Failures

  • encoding.com max+TT+instant: 1
  • encoding.com max+TT: 2

Test Results: 4 Parallel HD Video Requests

The following are the results from a total of 56 (14 sets of 4 parallel requests) HD video encode jobs submitted at various times of the day over a period of 1 week:
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 141.23 | 4 | 140 | 153.71
encoding.com 1 (max+TT+instant) | 236.88 | 19 | 250 | 280.13
encoding.com 2 (max+TT) | 292.21 | 46 | 259.5 | 321.96
encoding.com 4 (plus+TT+instant) | 298.69 | 37 | 256.5 | 354.44
encoding.com 3 (max) | 723.44 | 25 | 751.5 | 796.38
panda | 1322.19 | 17 | 1249 | 1336.9
sorenson | 1751.44 | 3 | 1745 | 1769.71

Failures - None

Test Results: 4 Parallel SD Video Requests

The following are the results from a total of 56 (14 sets of 4 parallel requests) SD video encode jobs submitted at various times of the day over a period of 1 week:
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 240.48 | 17 | 224.5 | 327.98
encoding.com 2 (max+TT) | 439.5 | 21 | 445 | 700.15
encoding.com 4 (plus+TT+instant) | 461.56 | 14 | 459.5 | 757.87
encoding.com 1 (max+TT+instant) | 484.88 | 10 | 459 | 770.01
encoding.com 3 (max) | 1274.63 | 11 | 1274 | 1552.69
panda | 1680.96 | 10 | 1648.5 | 1710.9
sorenson | 3777.54 | 10 | 3835 | 3808.94

Failures - None

Test Results: 4 Parallel Mobile Video Requests

The following are the results from a total of 56 (14 sets of 4 parallel requests) mobile video encode jobs submitted at various times of the day over a period of 1 week. In this test scenario, we again experienced some long queue times during one of the encoding.com test phases.
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 20.4 | 15 | 20 | 27.27
panda | 26.65 | 7 | 26 | 27.21
encoding.com 1 (max+TT+instant) | 34 | 8 | 34 | 49.75
sorenson | 55.2 | 15 | 53 | 77.54
encoding.com 4 (plus+TT+instant) | 68 | 60 | 44.5 | 106.63
encoding.com 2 (max+TT) | 87.46 | 100 | 37 | 104.92
encoding.com 3 (max) | 115.88 | 197 | 46 | 149.32

Failures

  • sorenson: 8

Test Results: 4 Parallel MP3 Audio Requests

The following are the results from a total of 56 (14 sets of 4 parallel requests) audio encode jobs submitted at various times of the day over a period of 1 week:
Service | Avg Encode Time | Standard Deviation % | Median Encode Time | Avg Total Time
zencoder | 115.62 | 6 | 114 | 126.16
sorenson | 187.24 | 5 | 184 | 204.57
panda | 225.15 | 17 | 214 | 227.92
encoding.com 4 (plus+TT+instant) | 234.38 | 17 | 228.5 | 266.63
encoding.com 2 (max+TT) | 265.43 | 19 | 261 | 313.56
encoding.com 1 (max+TT+instant) | 273.67 | 17 | 274 | 344
encoding.com 3 (max) | 285.69 | 31 | 259 | 317.15

Failures

  • sorenson: 3
  • encoding.com max+TT+instant: 5
  • encoding.com max+TT: 5
  • encoding.com max: 3

Encoding Service Accounts

The following encoding services are included in this analysis: Zencoder, encoding.com, Sorenson Media, and Panda. Each service offers different account and pricing options. We set up an account with each service (including Zencoder) using their standard signup process. Because 4 parallel job requests were part of the testing, we opted for service plans that would limit the effects of queue time under such conditions. The pricing data below is for informational purposes only. The purpose of this post is not to compare service pricing, as there are simply too many variations between services to be able to do so. The following is a summary of the account options we selected and pricing with each service:
Service | Plan Used | Plan Cost | Plan Usage Included | Encoding Cost | Total Test Costs
Zencoder | Launch | $40/mo | 1000 Encoding Mins | $0.08/min HD; $0.04/min SD; $0.01/min audio | $91
encoding.com | Max | $299/mo | 75GB Encoded Media | NA | $299
Sorenson Media | Squeeze Managed Server | $199/mo | 1200 Encoding Mins | $8 per 60 Mins extra | $375 (incl addl encoding mins)
Panda | 4 dedicated encoders | $396 | Unlimited encoding - $0.15/GB to upload encoded media | NA | $436 (incl bandwidth)

Disclaimer

This comparison was sponsored by Zencoder. In order to sustain our ability to provide useful, free & publicly accessible analysis, we frequently take on paid engagements. However, in doing so, we try to maintain our credibility and objectivity by using reliable tests, being transparent about our test methods, and attempting to represent the data in a fair way.

Summary

In order to maintain our objectivity and independence, we generally do not recommend one service over another. We prefer to simply present the data as it stands, and let readers draw their own conclusions.

Monday, April 25, 2011

An unofficial EC2 outage postmortem - the sky is not falling

Last week Amazon Web Services (AWS) experienced a high profile outage affecting Elastic Compute Cloud (EC2) and Elastic Block Store (EBS) in 1 of 4 data centers in the US East region. This outage took down some high profile websites including Reddit, Quora and FourSquare and generated a great deal of negative PR. In the days that followed, media outlets and bloggers wrote literally hundreds of articles such as Amazon's Trouble Raises Cloud Computing Doubts (New York Times), The Day The Cloud Died (Forbes), Amazon outage sparks frustration, doubts about cloud (Computerworld), and many others.

EC2 and EBS in a nutshell

In case you are not familiar with the technical jargon and acronyms, EBS is one of two methods provided by AWS for setting up storage volumes (basically a cloud hard drive) for an EC2 instance (an EC2 instance is essentially a server). Unlike a traditional hard drive that is located physically inside of a computer, EBS is stored externally on dedicated storage boxes and connected to EC2 instances over a network. The second storage option provided by EC2 is called ephemeral, which uses the more traditional method of hard drives located physically inside the same hardware that an EC2 instance runs on. Using EBS is encouraged by AWS and provides some unique benefits not available with ephemeral storage. One such benefit is the ability to recover quickly from a host failure (a host is the hardware that an EC2 instance runs on). If the host fails for an EBS EC2 instance, it can quickly be restarted on another host because its storage does not reside on the failed host. On the contrary, if the host fails for an ephemeral EC2 instance, that instance and all of the data stored on it will be permanently lost. EBS instances can also be shut down temporarily and restarted later, whereas ephemeral instances are deleted if shut down. EBS also theoretically provides better performance and reliability when compared to ephemeral storage.

Other technical terms you may hear and should understand regarding EC2 are virtualization and multi-tenancy. Virtualization allows AWS to run multiple EC2 instances on a single physical host by creating simulated "virtual" hardware environments for each instance. Without virtualization, AWS would have to maintain a 1-to-1 ratio between EC2 instance and physical hardware, and the economics just wouldn't work. Multi-tenancy is a consequence of virtualization in that multiple EC2 instances share access to physical hardware. Multi-tenancy often causes performance degradation in virtualized environments because instances may need to wait briefly to obtain access to physical resources like CPU, hard disk or network. The term noisy neighbor is often used to describe this scenario in very busy environments where virtual instances are waiting frequently for physical resources causing noticeable declines in performance.

EC2 is generally a very reliable service. Without a strong track record high profile websites like Netflix would not use it. We conduct ongoing independent outage monitoring of over 100 cloud services which shows 3 of the 5 AWS EC2 regions having 100% availability the past year. In fact, our own EBS backed EC2 instance in the affected US East region remained online throughout last week's outage.

AWS endorses a different type of architectural philosophy called designing for failure. In this context, instead of deploying highly redundant and fault tolerant (and very expensive) "enterprise" hardware, AWS uses low cost commodity hardware and designs their infrastructure to expect and deal gracefully with failure. AWS deals with failure using replication. For example, each EBS volume is stored on 2 separate storage arrays. In theory, if one storage array fails, its volumes are quickly replaced with the backup copies. This approach provides many of the benefits of enterprise hardware, such as fault tolerance and resiliency, while at the same time providing substantially lower hardware costs, enabling AWS to price their services competitively.

The outage - what went wrong?

Disclaimer: This is our own opinion of what occurred during last week's EC2 outage based on our interpretation of the comments provided on the AWS Service Health Dashboard and basic knowledge of the EC2/EBS architecture.

At about 1AM PST on Thursday April 21st, one of the four availability zones in the AWS US East region experienced a network fault that caused connectivity failures between EC2 instances and EBS. This event triggered a failover sequence wherein EC2 automatically swapped out the EBS volumes that had lost connectivity with backup copies. At the same time, EC2 attempted to create new backup copies of all of the affected EBS volumes (they refer to this as "re-mirroring"). While this procedure works fine for a few isolated EBS failures, this event was more widespread which created a very high load on the EBS infrastructure and the network that connects it to EC2. To make matters worse, some AWS users likely noticed problems and began attempting to restore their failed or poorly performing EBS volumes on their own. All of this activity appears to have caused a meltdown of the network connecting EC2 to EBS and exhausted the available EBS physical storage in this availability zone. Because EBS performance is dependent on network latency and throughput to EC2, and because those networks were saturated with activity, EBS performance became severely degraded, or in many cases completely failed. These issues likely bled into other availability zones in the region as users attempted to recover their services by launching new EBS volumes and EC2 instances in those availability zones. Overall, a very bad day for AWS and EC2.

The sky is not falling

Despite what some media outlets, bloggers and AWS competitors are claiming, we do not believe this event is reason to question the viability of AWS, external instance storage, or the cloud in general. AWS has stated they will evaluate closely the events that triggered this outage, and apply appropriate remedies. The end result will be a more robust and battle hardened EBS architecture. For users of AWS affected by this outage, this should be cause to re-evaluate their cloud architecture. There are many techniques suggested by AWS and prominent AWS users that will help to deal with these types of outages in the future without incurring significant downtime. These include deploying load balanced servers across multiple availability zones and using more than one AWS region.

Netflix is a large and very visible client of AWS that was not affected by this outage. The reason for this is that they have learned to design for failure. In a recent blog post, Adrian Cockcroft (Netflix's Cloud Architect) wrote about some of the technical details and shortcomings of EBS. At a high level, the takeaway points from his post are:

  • EC2, EBS and the network that attaches them are all shared resources. As such, performance will vary significantly depending on multi-tenancy and shared load. Performance variance will be greater on smaller EC2 instances and EBS volumes where multi-tenancy is a greater factor
  • Users can reduce the potential effects of multi-tenancy by using larger EC2 instances and EBS volumes. To reduce EBS multi-tenancy, Netflix uses the largest possible volume size, 1TB. Because each EBS storage array has a limited amount of storage capacity, using larger sized volumes reduces the number of other users that may share that hardware. The same is true of larger EC2 instances. In fact, the largest EC2 instances (any of the 4xlarges) run on dedicated hardware. Because each physical EC2 host has one shared network interface, use of larger EBS volumes and EC2 instances also has the added benefit of increased network throughput
  • Use ephemeral storage on EC2 instances where predictable and consistent performance is necessary. Netflix uses ephemeral storage for their Cassandra datastore and has found it to be more consistently reliable compared to EBS

Too early to throw in the towel

AWS is not alone in experiencing performance and reliability issues with external storage. Based on our independent monitoring Visi, GigeNet, Tata InstaCompute, Flexiscale, Ninefold and VPS.NET have all experienced similar outages. Our monitoring shows that external storage failures are a very significant cause of cloud outages. When external storage systems fail, vendors often have a very difficult time recovering quickly. Designing fault tolerant and performant external storage for the cloud is a very complex problem, so much so that many vendors including Rackspace Cloud and Joyent avoid it entirely. Joyent for example, recently documented their unsuccessful attempt to deploy external storage in their cloud service. However, despite the complexity of this problem, we believe it is far too early for cloud vendors and users to throw in the towel. There are significant advantages to external storage versus ephemeral including:

  • Host failure tolerance: If the power supply, motherboard, or any component of a host system fails, the instances running on it can be quickly migrated to another host
  • Shutdown capability: With most providers, external storage instances can be shut down temporarily and then incur only storage fees
  • Greater flexibility: External storage offers features and flexibility generally unavailable with ephemeral storage. These may include the ability to backup volumes, create snapshots, clone, create custom OS templates, resize partitions and attach multiple storage volumes to a single instance

Innovation in external storage

Besides AWS, there are other providers innovating in the external storage space. OrionVM, a cloud startup in Australia, has developed their own distributed, horizontally scalable, external storage architecture based on a high performance communication link called InfiniBand. Instead of using dedicated storage hardware, OrionVM uses the same hardware for both storage and server instances. The server instances use storage located on multiple external hosts connected to it via redundant 40 Gb/s InfiniBand links. If a physical host fails, the instances running on it can be restored on another host because their storage resides externally. OrionVM also replicates storage across multiple host systems, allowing for fault tolerance should a storage host fail. This hybrid approach combines the benefits of ephemeral storage (i.e. lower multi-tenancy ratio, faster IO throughput) with those of external storage (i.e. host failure tolerance). Multi-tenancy performance degradation is also not a significant factor because OrionVM uses a distributed, non-centralized storage architecture. This approach scales well horizontally because adding a new host increases both instance and storage capacity. Use of 40 Gb/s InfiniBand also provides very high instance to storage throughput. Our own benchmarking shows very good IO performance with OrionVM. Complete results for these benchmarks are available on our website. A summary is provided below comparing OrionVM to both external and ephemeral instances with EC2, GoGrid, Joyent, Rackspace and SoftLayer. In these results, OrionVM performed very well, as did EC2's cluster compute instance using ephemeral or EBS RAID 0 volumes. GoGrid also performed well running on their new Westmere hardware and ephemeral storage. Details on the IO metric are available here. We are including these benchmark results to demonstrate that external storage can perform as well as or better than ephemeral storage.

Legend

Label | Storage Type | Description
ec2-us-east.cc1.4xlarge-raid0-local | Ephemeral | EC2 cluster instance cc1.4xlarge, Raid 0, 2 ephemeral volumes
ec2-us-east.cc1.4xlarge-raid0x4-ebs | External | EC2 cluster instance cc1.4xlarge, Raid 0, 4 EBS volumes
ec2-us-east.cc1.4xlarge-local | Ephemeral | EC2 cluster instance cc1.4xlarge, single ephemeral volume
gg-16gb-us-east | Ephemeral | 16GB GoGrid instance
or-16gb | External | 16GB OrionVM instance
jy-16gb-linux | Ephemeral | 16GB Joyent Linux Virtual Machine
ec2-us-east.cc1.4xlarge | External | EC2 cluster instance cc1.4xlarge, single EBS volume
ec2-us-east.m2.4xlarge-raid0x4-ebs | External | EC2 high memory instance m2.4xlarge, Raid 0, 4 EBS volumes
rs-16gb | Ephemeral | 16GB Rackspace Cloud instance
ec2-us-east.m2.4xlarge | External | EC2 high memory instance m2.4xlarge, single EBS volume
sl-16gb-wdc | External | 16GB SoftLayer CloudLayer instance

Summary

Last week's EBS outage has shed some light on what we consider to be one of the biggest challenges of the cloud, the problem of external storage. However, we see this event more in terms of the glass half full. First, we believe that AWS will thoroughly dissect this outage and use it to improve the fault tolerance and reliability of EBS in the future. Next, cloud users affected by this outage will re-evaluate their own cloud architecture and adopt a more failure tolerant approach. Finally, we hope that AWS and other vendors like OrionVM will continue to innovate in the external storage space.

 

Saturday, January 15, 2011

Do SLAs really matter? A 1 year case study of 38 cloud services

In late 2009 we began monitoring the availability of various cloud services. To do so, we partnered or contracted with cloud vendors to let us maintain, monitor and benchmark the services they offered. These include IaaS vendors (i.e. cloud servers, storage, CDNs) such as GoGrid and Rackspace Cloud, and PaaS services such as Microsoft Azure and AppEngine. We use Panopta to provide monitoring, outage confirmation, and availability metric calculation. Panopta provides reliable monitoring metrics using a multi-node outage confirmation process wherein each outage is verified by 4 geographically dispersed monitoring nodes. Additionally, we attempt to manually confirm and document all outages greater than 5 minutes using our vendor contacts or the provider's status page (if available). Outages triggered due to scheduled maintenance are removed. DoS ([distributed] denial of service) outages are also removed if the vendor is able to restore service within a short period of time. Any outages triggered by us (e.g. server reboots) are also removed.
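The filtering rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described, not Panopta's or our actual implementation: the `Outage` record, the cause labels, and the 15 minute DoS grace period are all assumptions (the post only says "a short period of time").

```python
from dataclasses import dataclass


@dataclass
class Outage:
    minutes: float         # outage duration in minutes
    confirming_nodes: int  # monitoring nodes that verified the outage
    cause: str             # hypothetical labels: "unknown", "maintenance", "dos", "self"


def counts_toward_availability(outage, required_nodes=4, dos_grace_minutes=15.0):
    """Return True if the outage should count against the vendor."""
    if outage.confirming_nodes < required_nodes:
        return False  # not verified by all 4 monitoring nodes
    if outage.cause in ("maintenance", "self"):
        return False  # scheduled maintenance, or triggered by us (e.g. a reboot)
    if outage.cause == "dos" and outage.minutes <= dos_grace_minutes:
        return False  # DoS outage restored quickly (grace period is assumed)
    return True


outages = [
    Outage(12.0, 4, "unknown"),
    Outage(30.0, 4, "maintenance"),
    Outage(5.0, 2, "unknown"),
    Outage(8.0, 4, "dos"),
    Outage(3.0, 4, "self"),
]
counted = [o for o in outages if counts_toward_availability(o)]
print(sum(o.minutes for o in counted))  # 12.0
```

Only the first sample outage survives the filters, so 12 minutes would be charged against the vendor.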

The purpose of this post is to compare the availability metrics we have collected over the past year with vendor SLAs to determine if in fact there is any correlation between the two.

SLA Credit Policies

In researching various vendor SLA policies for this post, we discovered a few general themes with regards to SLA credit policies we'd like to mention here. These include the following:

  • Pro-rated Credit (Pro-rated): Credit is based on a simple pro-ration on the amount of downtime that exceeded the SLA guarantee. Credit is issued based on that calculated exceedance and a credit multiple ranging from 1X (Linode) to 100X (GoGrid) (e.g. with GoGrid a 1 hour outage gets a 100 hour service credit). Credit is capped at 100% of service fees (i.e. you can't get more in credit than you paid for the service). Generally SLA credits are just that, service credit and not redeemable for a refund
  • Threshold Credit (Threshold): Threshold-based SLAs may provide a high guaranteed availability, but credits are not valid until the outage exceeds a given threshold time (i.e. the vendor has a certain amount of time to fix the problem before you are entitled to a service credit). For example, SoftLayer provides a 100% network SLA, but only issues SLA credit for continuous network outages exceeding 30 minutes
  • Percentage Credit (Percentage): This SLA credit policy discounts your next invoice X% based on the amount of downtime and the stated SLA. For example, EC2 provides a 10% monthly invoice credit when annual uptime falls below 99.5%

The fairest and simplest of these policies seems to be the pro-rated method, while the threshold method seems to give the provider the greatest protection and flexibility (based on our data, most outages tend to be shorter than the thresholds used by the vendors). In the table below, we will attempt to identify which of these SLA credit policies is used by each vendor. Vendors that apply a threshold policy are highlighted in red.
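To make the difference between the three policies concrete, here is a minimal Python sketch. The function names and the 720 hour (30 day) billing month are our own assumptions for illustration; the sample figures (GoGrid 100x, SoftLayer 5% per outage over 30 minutes, EC2 10% below 99.5%) are taken from the policies described in this post.

```python
def prorated_credit_hours(downtime_hours, allowed_hours, multiple, paid_hours):
    """Pro-rated: credit the downtime beyond the SLA allowance, times a multiple,
    capped at 100% of the service period paid for."""
    excess = max(0.0, downtime_hours - allowed_hours)
    return min(excess * multiple, paid_hours)


def threshold_credit_pct(outage_minutes, threshold_minutes, pct_per_outage):
    """Threshold: only continuous outages longer than the threshold earn credit."""
    qualifying = [m for m in outage_minutes if m > threshold_minutes]
    return min(100.0, len(qualifying) * pct_per_outage)


def percentage_credit_pct(uptime_pct, sla_pct, credit_pct):
    """Percentage: a fixed invoice discount once uptime falls below the SLA."""
    return credit_pct if uptime_pct < sla_pct else 0.0


# GoGrid style: 100% SLA (no allowance), 100x multiple, 1 hour outage
print(prorated_credit_hours(1.0, 0.0, 100, 720.0))     # 100.0
# SoftLayer style: three outages, only the 45 minute one exceeds 30 minutes
print(threshold_credit_pct([10, 25, 45], 30, 5.0))     # 5.0
# EC2 style: 10% credit because 99.4% is below the 99.5% SLA
print(percentage_credit_pct(99.4, 99.5, 10.0))         # 10.0
```

Note how the threshold policy ignores the two shorter outages entirely, which is why it tends to favor the provider.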

SLAs versus Measured Availability

The SLA data provided below is based on current documentation provided on each vendor's website. The Actual column is based on 1 year of monitoring (a few of the services listed have been monitored for less than 1 year), using servers we maintain with each of these vendors. We have included 38 IaaS providers in the table. We currently monitor and maintain availability data on 90 different cloud services. The Actual column is highlighted green if it is equal to or exceeds the SLA.

Provider | Data Center | Total # Outages / Mins Down | SLA Credit Policy | SLA | Actual
AWS EC2 | US East | 0/0 | Percentage: 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 100%
AWS EC2 | US West | 0/0 | Percentage: 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 100%
GoGrid | US West | 0/0 | Pro-rated: 100x credit for any downtime | 100% | 100%
Linode VPS | London | 0/0 | Pro-rated: 1x credit for downtime exceeding 0.1% | 99.9% | 100%
OpSource Cloud | VA, US | 0/0 | Percentage: 5% invoice credit for 60 minutes downtime, 10% for up to 120 minutes and so on | 100% | 100%
Storm on Demand | MI, US | 0/0 | Pro-rated: 10x credit for any downtime | 100% | 100%
VoxCLOUD | EU | 0/0 | Percentage: 5% invoice credit per 0.1% downtime up to 100% | 100% | 100%
GoGrid | US East | 1/2.3 | Pro-rated: 100x credit for any downtime | 100% | 99.999%
Joyent Smart Machines | Andover, MA | 1/3 | Percentage: 5% of the monthly fee for each 30 minutes of downtime | 100% | 99.999%
VoxCLOUD | Singapore | 1/5.5 | Percentage: 5% invoice credit per 0.1% downtime up to 100% | 100% | 99.999%
Speedyrails VPS | Peer1 Quebec | 1/2.2 | Percentage: 3% of monthly fees for every 0.1% of downtime | 99.9% | 99.999%
Rackspace Cloud | Dallas, TX | 1/8.7 | Threshold/Percentage: 5% of the fees for each 30 minutes of network downtime (1 hour for hardware) up to 100%; host hardware failures guaranteed to be fixed within 1 hour of problem identification | 100% [1] | 99.998%
SoftLayer CloudLayer | Dallas, TX | 4/13.9 | Threshold/Percentage: 5% monthly invoice credit for each continuous network outage over 30 minutes; 20% monthly invoice credit for failed hardware not replaced within 2 hours; max 100% credit | 100% [1] | 99.997%
Hosting.com | Colorado | 1/1.4 | Percentage: 1/30th monthly invoice credit for every 30 minutes network downtime; 1/30th monthly invoice credit for every 30 minutes hardware downtime after 1 hour buffer for hardware repair | 100% [1] | 99.997%
AWS EC2 | APAC | 5/14.8 | Percentage: 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 99.996%
Linode | Atlanta | 10/26.9 | Pro-rated: 1x credit for downtime exceeding 0.1% | 99.9% | 99.995%
Joyent Smart Machines | Emeryville, CA | 4/15.2 | Percentage: 5% of the monthly fee for each 30 minutes of downtime | 100% | 99.994%
Terremark vCloud | FL, US | 7/37.9 | Unique: $1 for every fifteen (15) minute downtime period up to a maximum amount equal to 50% of the usage fees | 100% | 99.993%
AWS EC2 | EU West | 3/36 | Percentage: 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 99.993%
Speedyrails VPS | Canix Quebec | 9/38.7 | Percentage: 3% of monthly fees for every 0.1% of downtime | 99.9% | 99.992%
Linode | Fremont, CA | 13/71.9 [2] | Pro-rated: 1x credit for downtime exceeding 0.1% | 99.9% | 99.986%
Zerigo | CO, CA | 9/66.8 | Pro-rated: 4x the total (starting from 100%, not 99.99%) non-compliant time | 99.99% | 99.985%
SoftLayer CloudLayer | DC, US | 31/86.7 | Threshold/Percentage: 5% monthly invoice credit for each continuous network outage over 30 minutes; 20% monthly invoice credit for failed hardware not replaced within 2 hours; max 100% credit | 100% [1] | 99.984%
SoftLayer CloudLayer | WA, US | 13/106.8 | Threshold/Percentage: 5% monthly invoice credit for each continuous network outage over 30 minutes; 20% monthly invoice credit for failed hardware not replaced within 2 hours; max 100% credit | 100% [1] | 99.980%
Linode | NJ, CA | 14/145.7 | Pro-rated: 1x credit for downtime exceeding 0.1% | 99.9% | 99.972%
VoxCLOUD | NY, US | 12/146.3 [3] | Percentage: 5% invoice credit per 0.1% downtime up to 100% | 100% | 99.972%
CloudSigma | Switzerland | 22/59.9 | Threshold/Percentage: 50x credit for any downtime (network or hardware) over 15 minutes | 100% | 99.972%
Hosting.com | KY, US | 4/38.7 [4] | Percentage: 1/30th monthly invoice credit for every 30 minutes network downtime; 1/30th monthly invoice credit for every 30 minutes hardware downtime after 1 hour buffer for hardware repair | 100% [1] | 99.955%
ThePlanet Cloud Servers | TX, US | 34/144.3 | Threshold/Percentage: 5% monthly invoice credit for first 5 minute continuous outage (hardware or network); then, 5% additional credit for each additional 30 minute continuous outage | 100% | 99.955%
Gandi VPS | France | 4/147.7 | Pro-rated: 1 day credit for every outage over 7 minutes within a single day | 99.95% | 99.955%
Linode | Dallas | 21/258.2 | Pro-rated: 1x credit for downtime exceeding 0.1% | 99.9% | 99.951%
NewServers | FL, US | 39/288.7 | Pro-rated: 24x credit for every 1 hour of downtime exceeding 0.001% | 99.999% | 99.945%
VPS.NET | UK | 8/250.3 | Percentage: 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.921%
VPS.NET | US Central | 12/342.9 | Percentage: 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.892%
Flexiant | UK | 83/820.3 [6] | Percentage: 5% monthly invoice credit for each 30 minutes of downtime | 100% | 99.844%
VPS.NET | US West | 32/576.5 | Percentage: 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.819%
ReliaCloud | MN, US | 23/1941.5 [7] | Pro-rated: 30x hourly credit for each hour downtime | 100% | 99.626%
VPS.NET | US East | 6/1224.1 [8] | Percentage: 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.616%

1 Applies to network connectivity only, not hardware outages

2 Linode does not own or operate this data center (or any of its data centers to our knowledge). This particular data center in Fremont, CA is owned and operated by Hurricane Electric. About 20 minutes of the outages triggered for this location were due to data center wide power outages completely outside of the control of Linode

3 A majority of this downtime (114 minutes) was due to a SAN failure on 10/15/2010

4 A majority of this downtime (34.5 minutes) was due to an internal network failure on 1/5/2011. We've been told this problem has since been resolved

5 Applies only to clients who have signed up for the VPS.net "Managed Support" package ($99/mo). It appears that VPS.net does not provide any SLA guarantees to other customers.

6 Approximately 560 minutes of these outages occurred due to failure of their SAN

7 A majority of these outages (1811 minutes) occurred between Jan-Feb 2010 immediately following ReliaCloud's public launch (post beta). A majority of the downtime seems to have occurred due to SAN failures

8 Explanation provided for approximately 1200 minutes of these outages (2 separate outages) was "We had a problem on the cloud. Now your VPS is up and running"

Is there a correlation between SLA and actual availability?

The short answer based on the data above is absolutely not. Here is how we arrived at this conclusion:

Total # of services analyzed:38
Services that meet or exceeded SLA:15/38 [39%]
Services that did not meet SLA:23/38 [61%]
Vendors with 100% SLAs:23/38 [61%]
Vendors with 100% SLAs achieving their SLA:4/23 [17%]
Mean availability of vendors with 100% SLAs:99.929% [6.22 hrs/yr]
Median availability of vendors with 100% SLAs:99.982% [1.58 hrs/yr]
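
The bracketed hours-per-year figures above, and the Actual percentages in the table, follow from simple arithmetic. A short Python sketch of the conversions, assuming a 365 day year and a full year of monitoring (some services were monitored for less, so not every table row will reproduce exactly):

```python
HOURS_PER_YEAR = 365 * 24                # 8760
MINUTES_PER_YEAR = HOURS_PER_YEAR * 60   # 525600


def downtime_hours_per_year(availability_pct):
    """Convert an availability percentage into hours of downtime per year."""
    return (100.0 - availability_pct) / 100.0 * HOURS_PER_YEAR


def availability_pct(minutes_down, monitored_minutes=MINUTES_PER_YEAR):
    """Convert minutes of downtime over a monitoring period into availability."""
    return 100.0 * (1.0 - minutes_down / monitored_minutes)


print(round(downtime_hours_per_year(99.929), 2))  # 6.22 (mean, 100% SLA vendors)
print(round(downtime_hours_per_year(99.982), 2))  # 1.58 (median, 100% SLA vendors)
print(round(availability_pct(71.9), 3))           # 99.986 (Linode, Fremont)
```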

It is very interesting to observe that the bottom 6 vendors all provided 100% SLAs, while 3 of the top 7 provide the lowest SLAs of the group (EC2 99.5% and Linode 99.9%). SLAs were only achieved for a minority (39%) of the vendors. This is particularly applicable to vendors with 100% SLAs where only 4 of 23 (17%) actually achieved 100% availability.

Vendors with generous SLA credit policies

In most cases SLA credit policies provide extremely minimal financial recourse, not considering all of the hoops you'll have to jump through to get them. Not one of the SLAs we reviewed allowed for more than 100% of service fees to be credited. There are a few vendors that stood out by providing relatively generous SLA credit policies:

  • GoGrid: provides a 100x credit policy combined with 100% SLA for any hardware and network outages and no minimum thresholds (e.g. 1 hour outage = 100 hour credit). This is by far the most generous of the 38 IaaS vendors we evaluated. GoGrid's service is also one of the most reliable IaaS services we currently monitor (100% US West and 99.999% US East)
  • Joyent: provides a 5% invoice SLA credit for each 30 minutes of monthly (non-continuous) downtime (equates to about 72x pro-rated credit) combined with 100% SLA and no minimum outage thresholds
  • VoxCloud: provides a 5% invoice credit per 0.1% of monthly (non-continuous) downtime (about every 45 minutes - equates to about 48x pro-rated credit) combined with 100% SLA and no minimum outage thresholds
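
The "pro-rated equivalent" multiples quoted above can be reproduced with a small calculation. This sketch assumes a 30 day month (43,200 minutes); the function name is ours.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43200, assuming a 30 day month


def prorated_multiple(credit_pct, per_downtime_minutes):
    """How many minutes of credit each minute of downtime earns."""
    credit_minutes = credit_pct * MINUTES_PER_MONTH / 100.0
    return credit_minutes / per_downtime_minutes


print(prorated_multiple(5, 30))  # 72.0 (Joyent: 5% per 30 minutes)
print(prorated_multiple(5, 45))  # 48.0 (VoxCloud: 5% per roughly 45 minutes)
```

The VoxCloud figure uses the rounded 45 minutes from the bullet above; taking 0.1% of a 30 day month literally (43.2 minutes) would give 50x instead.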

Some Extra Cool Stuff: Cloud Availability Web Services and RSS Feed

We've recently released web services and an RSS feed to make our availability metrics and monitoring data more accessible. Up to this point, this data was only available on the Cloud Status Tab of our website. We currently offer 30 different REST and SOAP web services for accessing cloud benchmarking and monitoring data, and vendor information.

Cloud Outages RSS Feed

This feed provides information about potential outages we are currently observing with any of the 90 cloud services we monitor. Click here to view and subscribe to this feed.

getAvailability Web Service

This post includes a small snapshot of the data we maintain on cloud availability. We have released a new web service that allows users to calculate availability and retrieve outage details (including supporting comments) for any of the 90 cloud services we currently monitor. Monitoring for many of these services began between October 2009 and January 2010, but we are also continually adding new services to the list. This web service allows users to calculate availability and retrieve outage information for any time frame, service type, vendor, etc. To get you started, we have provided a few example RESTful request URLs. These example requests return JSON formatted data. To request XML formatted data append &ws-format=xml to any of these URLs. Full API documentation for this web service is provided here. A SOAP WSDL is also provided here. You may invoke this web service for free up to 5 times daily. To purchase a web service token allowing additional web service invocations click here.
Retrieve availability for all IaaS vendors for the past year (first 10 of 46 results)
Retrieve availability for all IaaS vendors for the past year (results 11-20 of 46)
Retrieve availability for all CDNs for 2010 (first 10 of 13 results)
Retrieve availability for all CDNs for 2010 (results 11-13 of 13)
Retrieve availability for all AWS services (EC2, S3, CloudFront) for the past 6 months
Retrieve availability for GoGrid Cloud Servers for the past 2 weeks
Retrieve availability for VPS.net's US East data center since 1/1/2010 - include full outage documentation
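
As a small illustration of the ws-format parameter mentioned above, the Python sketch below switches any request URL to XML output. The example URL is a placeholder: the getAvailability path is assumed by analogy with other endpoints on api.cloudharmony.com, and example-param is not a real parameter. Consult the API documentation for the actual request parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_format(url, fmt="xml"):
    """Return url with its ws-format query parameter set to fmt."""
    parts = urlsplit(url)
    # Drop any existing ws-format value, then append the requested one
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "ws-format"]
    query.append(("ws-format", fmt))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder URL: path and example-param are assumptions, not documented values
json_url = "http://api.cloudharmony.com/getAvailability?example-param=1"
print(with_format(json_url))
# http://api.cloudharmony.com/getAvailability?example-param=1&ws-format=xml
```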

Summary

Don't let SLAs lull you into a false sense of security. SLAs are most likely influenced more by marketing and legal wrangling than having any basis in technical merits or precedent. SLAs should not be relied upon as a factor in estimating the stability and reliability of a cloud service or for any form of financial recourse in the event of an outage. Most likely any service credits provided will be a drop in the bucket relative to the reduced customer confidence and lost revenue the outage will cause your business. The only reasonable way to determine the actual reliability of a vendor is to use their service or obtain feedback from existing clients or services such as ours. For example, AWS EC2 maintains the lowest SLA of any IaaS vendor we know of, and yet they provide some of the best actual availability (100% for 2 regions, 99.996% and 99.993%). Beware of the fine print. Many cloud vendors utilize minimum continuous outage thresholds such as 30 minutes or 2 hours (e.g. SoftLayer) before they will issue any service credit regardless of whether or not they have met their SLA. In short, we are of the opinion that SLAs really don't matter much at all.

 

Sunday, October 24, 2010

Introducing Web Services for Cloud Performance Metrics

Over the past year we've amassed a large repository of cloud benchmarks and metrics. Today, we are making most of that data available via web services. This data includes the following:

  • Available Public Clouds: What public clouds are around and which cloud services they offer including:
    • Cloud Servers/IaaS: e.g. EC2, GoGrid
    • Cloud Storage: e.g. S3, Google Storage
    • Content Delivery Networks/CDNs: e.g. Akamai, MaxCDN, Edgecast
    • Cloud Platforms: e.g. Google AppEngine, Microsoft Azure, Heroku
    • Cloud Databases: e.g. SimpleDB, SQL Azure
    • Cloud Messaging: e.g. Amazon SQS, Azure Message Queue
  • Cloud Servers: What instance sizes, server configurations and pricing are offered by public clouds. For example, Amazon's EC2 comes in 10 different instance sizes ranging from micro to 4xlarge. Our cloud servers pricing data includes typical hourly, daily, monthly pricing as well as complex pricing models such as spot pricing (dynamically updated) and reserve pricing where applicable
  • Cloud Benchmark Catalog: This includes names, descriptions and links to the benchmarks we run. Our benchmarks cover both system and network performance metrics
  • Cloud Benchmark Results: Access to our repository of 6.5 million benchmarks including advanced filtering, aggregation and comparisons. We are continually conducting benchmarks so this data is constantly being updated

We are releasing this data in hopes of improving transparency and making the comparison of cloud services easier. There are many ways that this data might be used. In this post, we'll go through a few examples to get you started and let you take it from there.

Our web services API provides both RESTful (HTTP query request and JSON or XML response) and SOAP interfaces. The API documentation and SOAP WSDLs are published here: http://api.cloudharmony.com/api

The sections below are separated into individual examples. This is not intended to be comprehensive documentation for the web services, but rather a starting point and a reference for using them. More comprehensive technical documentation is provided for each web service on our website.

Example 1: Lookup Available Clouds

In this first example we'll use the getClouds web service to lookup all available public clouds. The table at the top of the web service documentation describes the data structure that is used by this web service call.

Request URI: JSON Response

Request URI: XML Response

Note on pagination

Due to the large amount of data that can be returned, our web services utilize results pagination (similar to getting multiple pages of results for a web search). The maximum number of results a request to this web service will return is 10. You may set a limit lower than 10 using the ws-limit request parameter, but not greater than 10. The example request URIs above return only the first 10 results (as determined by the limit response value). At the time of this writing there were 37 total records (as determined by the count response value). To return the remaining 27 results, utilize the following URIs:

Request URI: Results 11-20

Request URI: Results 21-30

Request URI: Results 31+
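
For clients that page through results programmatically, the page boundaries can be derived from the count and limit response values. A minimal Python sketch, with function names of our own choosing; how the starting offset is passed to the API is left to the API documentation.

```python
def page_offsets(count, limit=10):
    """Zero-based starting offsets needed to fetch `count` records."""
    if not 1 <= limit <= 10:
        raise ValueError("limit must be between 1 and 10")
    return list(range(0, count, limit))


def page_ranges(count, limit=10):
    """One-based (first, last) result numbers for each page."""
    return [(start + 1, min(start + limit, count))
            for start in page_offsets(count, limit)]


print(page_ranges(37))  # [(1, 10), (11, 20), (21, 30), (31, 37)]
```

With 37 records and the maximum limit of 10, four requests are needed, matching the four request URIs listed in this example.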

Note on SOAP

In this post we'll only be showing use of the RESTful API interface. A SOAP interface is also provided. The base API documentation page includes links to WSDLs you may use to import and utilize the SOAP interface (some IDEs let you import WSDLs). The parameters and response structure for SOAP requests are very similar, but not identical to the REST interface (XML names may differ slightly from HTTP request and response names). The WSDL for the getClouds web service is available here: http://api.cloudharmony.com/getClouds/wsdl

Example 2: Search for cloud with servers, storage and CDN services

In this example we'll use the same getClouds web service again, but we'll add a few constraints so only clouds with server, storage and CDN services are returned. Unless otherwise stated, from this point forward the same rules with regards to pagination apply (see Example #1 for more details). Additionally, only JSON URIs will be shown (to use XML responses, simply add the parameter ws-format=xml to the URI).

Request URI

Note on constraints

In the example URI above, we used ws-constraint parameters to filter the results. Constraints can be applied to specific attributes defined by the data structure for a given web service. The data structure is documented in a table at the top of the web services documentation page. In this example, we used 3 such attributes: hasServers, hasStorage, and hasContentDelivery. Because these are boolean type attributes, we assigned constraint values 1, signifying that only clouds where those attributes are TRUE should be returned.

The API also supports more complex constraint parameters. This example utilizes the simplest form of constraints by testing for equality and joining the 3 constraints with an AND connective. Constraints can also check that an attribute is less than or greater than a desired value, and can be joined with an OR connective if multiple constraints are specified. We'll go into these types of constraints in a later example.

Response

In this example, only 5 public clouds are returned instead of the 37 returned in the previous example, signifying that only 5 public clouds offer all three of cloud server, storage and CDN services.
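
The semantics of these equality constraints joined with AND can be mimicked locally. The Python sketch below filters a few made-up records the same way; only the three attribute names come from the getClouds data structure, while the cloud names and values are invented for illustration. It does not call the API.

```python
# Hypothetical sample records; only the attribute names come from the post
clouds = [
    {"name": "Cloud A", "hasServers": True, "hasStorage": True, "hasContentDelivery": True},
    {"name": "Cloud B", "hasServers": True, "hasStorage": True, "hasContentDelivery": False},
    {"name": "Cloud C", "hasServers": True, "hasStorage": False, "hasContentDelivery": False},
]


def matches_all(record, constraints):
    """AND connective: every constrained attribute must equal its value."""
    return all(bool(record.get(attr)) == bool(value)
               for attr, value in constraints.items())


# A constraint value of 1 means the boolean attribute must be TRUE
constraints = {"hasServers": 1, "hasStorage": 1, "hasContentDelivery": 1}
matching = [c["name"] for c in clouds if matches_all(c, constraints)]
print(matching)  # ['Cloud A']
```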

Example 3: Retrieve a specific public cloud

Each data structure has an attribute that is the unique identifier. This attribute is called the primary key. The getClouds web service can be used to retrieve a specific public cloud if you know the primary key for the cloud. In this example, we'll use this feature to retrieve the AWS (Amazon Web Services) public cloud.

Request URI

Response

The response is almost identical to the previous 2 requests with the exception that the base response value is not an array. When a web service is invoked for a specific cloud using the primary key as we've done here, the response will always be a single data structure value. This is in contrast to the previous 2 requests that returned multiple clouds using an array as the base data structure.

Example 4: Retrieve Cloud Server Service for AWS

In this Example we'll use the getCloudServerServices web service to return the cloud server/IaaS service for AWS (Amazon Web Services) - EC2. We know AWS has such a service because the boolean hasServers attribute was true in the previous examples. The API documentation shows that the CH_CloudServer data structure contains an attribute named cloud that references the cloud that the service belongs to. In order for this web service to return only the cloud server service belonging to AWS, we'll just need to add a single ws-constraint for this attribute.

Request URI

Response

Even though only a single result for EC2 was returned, the base response data structure is an array. This will always be the case when invoking the API for a data structure without a primary key (as we did in Example 4 above), because when this is the case, there is always the possibility that multiple results could be returned.
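
Client code that handles both kinds of request may find it convenient to normalize the two response shapes. A minimal Python sketch, assuming the base response value has already been parsed from JSON; the `name` field in the samples is a placeholder, not necessarily an attribute of the real data structure.

```python
import json


def as_list(base_value):
    """Wrap a single data structure in a list; leave arrays unchanged."""
    return base_value if isinstance(base_value, list) else [base_value]


# Placeholder payloads standing in for a primary key lookup and a constrained search
single = json.loads('{"name": "Example Cloud"}')
many = json.loads('[{"name": "Example Cloud"}]')

print(as_list(single) == as_list(many))  # True
```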

Example 5: Find all cloud services that support Windows

Suppose you are looking to deploy a Windows server in the cloud. Because the CH_CloudServer data structure has an attribute operatingSystemsSupported that defines which operating systems are supported by that service, we can use it in conjunction with a ws-constraint request parameter to filter the results accordingly. In our previous use of constraints, we used the default equality operator. In this example, we'll need to change the operator to a substring search. This is because the operatingSystemsSupported attribute is an array which may contain multiple values representing all of the operating systems supported by the service (i.e. Linux, Windows, etc.). By using the substring operator, the request will search for services where the operatingSystemsSupported attribute contains Windows. The substring operator is the numeric value 32 (operators are numeric to support multiple operators using bitmasks). The operators supported and their corresponding values are shown on the API documentation.

Request URI

Response

At the time of this writing, this request returned 14 services that support the Windows operating system.

Example 6: Find other cloud services

In addition to the getCloudServerServices web service discussed in Examples 4-5, the following additional web services are provided: getCloudDatabaseServices, getCloudMessagingServices, getCloudPlatformServices, getCloudStorageServices and getCDNs. The usage for each of these is identical. In the examples below we'll use them for various lookups.

We are still in the process of populating vendor profiles for different cloud services. Currently only basic information is provided by the web services. In the future, the data structures for these services will be expanded to include many additional details such as pricing, SLAs, features, technical details, etc.

Lookup all CDNs

Lookup storage services for AWS

Lookup database services for Azure

Lookup all cloud platforms (i.e. Google AppEngine, Microsoft Azure)

Lookup all cloud messaging services (i.e. AWS SQS)

Example 7: Get the full benchmark catalog

Up to this point we've been using web services to look up which clouds and cloud services are available. The remaining examples will involve retrieving benchmarking related data. To get started, we'll first need to determine which benchmarks are available using the getBenchmarks web service. This web service provides access to information about the benchmarks we conduct. Unlike the previous web services, getBenchmarks does not support the ws-constraint parameters to filter results. This is always the case when the top section of the API documentation page does not show a data structure table. Instead of ws-constraint filters, getBenchmarks supports 4 request parameters (these are shown on the right column of the API documentation table):

  • aggregateOnly: set to TRUE if only aggregate benchmarks should be returned (see note on aggregate benchmarks below)
  • nonAggregateOnly: set to TRUE if only non-aggregate benchmarks should be returned (see note on aggregate benchmarks below)
  • category: return only benchmarks in this category. Multiple categories may be specified separated by pipe characters (see Example 8 below)
  • serverOnly: set to TRUE if only cloud server benchmarks should be returned. These are benchmarks we run on cloud servers only

Benchmarks are assigned to 1 or more categories. The getBenchmarkCategories web service may be used to obtain all of the available benchmark categories (see Example 8 below).

In this example, we'll retrieve all benchmarks (or at least the first 10 due to pagination).

Request URI: Results 1-10

Request URI: Results 11-20

Response

The response from this web service is an array of benchmarks each containing the following values:

  • benchmarkId: the identifier of the benchmark
  • title: the benchmark title
  • subtitle: the benchmark subtitle
  • categories: the categories for this benchmark (an array)
  • description: the benchmark description
  • url: URL to this benchmark's website (if available)
  • lowerIsBetter: TRUE if a lower score is better for this benchmark
  • aggregate: TRUE if this benchmark is an aggregate of multiple benchmarks (see note on aggregate benchmarks below)
  • benchmarks: if this is an aggregate benchmark, this value will provide details about which individual benchmarks are included in it and their corresponding weights. This return value is an array of values each containing the following keys:
    • benchmarkId: the id of the individual benchmark
    • weight: the weight assigned to this benchmark
    • alternates: if this benchmark has alternate benchmarks (benchmarks used if this benchmark is not available), this will be an array representing the IDs of those alternate benchmarks
  • baseline: if this is an aggregate benchmark, this value defines the baseline used to calculate the aggregate metric. It is a hash of key/value pairs, one per baseline server, where the key is the serverId and the value is the score that is assigned if the benchmarked server performs exactly the same as that server. The aggregate score reflects how the benchmarked server performs relative to the baseline servers: if it performs better, the metric will be higher than the baseline; if it performs worse, the metric will be lower. More information on baselines and aggregate metric calculation is available in the What is an ECU? CPU Benchmarking in the Cloud post on our blog

Note on aggregate benchmarks

Aggregate benchmarks are a special type of benchmark that aren't benchmarks themselves, but rather a compilation of multiple benchmark result metrics. This compilation is used in conjunction with a baseline configuration to produce a more comprehensive benchmark metric related to some facet of performance. A more detailed description of aggregate benchmarks and baselines is discussed on our blog. CCU is one such aggregate benchmark discussed here.

Example 8: Get benchmark categories

Every benchmark is assigned to one or more categories. The getBenchmarkCategories web service returns a list of all possible benchmark category names. This web service is very simple. It does not use any parameters or pagination.

Request URI

Example 9: Get only server benchmarks in category System: CPU

In this example we'll use the same getBenchmarks web service to retrieve only server benchmarks in the category System: CPU (we discovered this category previously using the getBenchmarkCategories web service). To accomplish this, we'll use the serverOnly and category request parameters.

Request URI

Example 10: What server benchmarks have been run

Before attempting to analyze benchmark results, it may be helpful to first determine what benchmark results data is available including which clouds and server configurations have been benchmarked. Generally, we conduct cloud server benchmarking 3-4 times each year. Every benchmark test run has a unique testId. The typical format of a testId is MMYY-[SEQ]. For example, the test 0410-1 was conducted in April 2010. To determine what tests have been run within clouds, the getServerCloudsBenchmarked web service may be used. This web service uses the following parameters:

  • serviceId: the ID of a service or cloud to return test information for. Multiple IDs may be specified each separated by a pipe character
  • start: if specified, only services that have been benchmarked on or after this date will be returned
  • stop: if specified, only services that have been benchmarked on or before this date will be returned

The return value is an array of cloud server services and the corresponding testing information for those services including testIds and testing dates.
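Since the testId encodes when a test was run, it can be handy to decode it on the client side. The following is a minimal Python sketch of that idea, not part of the API itself. The function name parse_test_id is hypothetical, and it assumes that the two-digit year always refers to 20YY and that every testId follows the typical MMYY-[SEQ] format described above (anything else is rejected).

```python
import re
from datetime import date


def parse_test_id(test_id: str) -> tuple[date, int]:
    """Split a testId such as '0410-1' into (first day of the test month, sequence)."""
    match = re.fullmatch(r"(\d{2})(\d{2})-(\d+)", test_id)
    if match is None:
        raise ValueError(f"unexpected testId format: {test_id!r}")
    month, year, seq = (int(part) for part in match.groups())
    # Assumption: two-digit years refer to 20YY.
    return date(2000 + year, month, 1), seq


# The example from the text: 0410-1 was conducted in April 2010.
month_start, sequence = parse_test_id("0410-1")
print(month_start.strftime("%B %Y"), "run", sequence)
```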

Request URI

Response

The response from this web service is an array of services and information about the benchmark tests that have been conducted within those services.

  • id: the id of this service
  • name: the name of the service
  • testIds: the IDs of tests performed (array)
  • numTests: the number of tests that have been conducted for this service (same as number of elements in testIds)
  • lastTestId: the ID of the last test that was run for this service
  • lastTestDate: the date of the last test that was run for this service
  • url: the URL to the service's website

Example 11: What server benchmarks have been run in the GoGrid and Amazon clouds after June 2010

In this example, we'll use the same getServerCloudsBenchmarked web service to determine when testing has occurred only in the AWS and GoGrid clouds on or after June 2010. To do so, we'll use the serviceId and start parameters to filter the results. The serviceId parameter can be either the ID of the specific server service or the ID of a cloud.

Note on dates and times

When specifying dates or dates and times, most standard formats are supported such as 6/1/2010 or 2010-06-01 or June 1 2010. Date data types are returned by web services as a text value unless the ws-js-dates parameter is set to TRUE, in which case they will be returned as a JavaScript Date object (only applicable to JSON responses).
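If you build requests from code, you may prefer to normalize dates to a single form before sending them. The sketch below is a client-side convenience only (the API accepts these formats on its own). It parses the three formats mentioned above and shows that they denote the same day. The helper names are hypothetical, and 6/1/2010 is read as month/day/year.

```python
from datetime import date, datetime

# The three example formats from the note above (month/day/year order assumed).
FORMATS = ("%m/%d/%Y", "%Y-%m-%d", "%B %d %Y")


def parse_date(text: str) -> date:
    """Try each known format in turn and return the first successful parse."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def to_iso(text: str) -> str:
    """Normalize any of the supported formats to YYYY-MM-DD."""
    return parse_date(text).isoformat()


for sample in ("6/1/2010", "2010-06-01", "June 1 2010"):
    print(sample, "->", to_iso(sample))
```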

Request URI

Example 12: Get all Geekbench benchmark results for Rackspace Cloud and GoGrid

When it comes down to retrieving cloud server benchmark metrics we'll use the getServerBenchmarkResults web service. This web service requires 2 parameters:

  • benchmarkId: the identifier(s) of the benchmarks that should be returned (REQUIRED). Multiple IDs may be specified separated by pipe characters
  • serviceId: the identifier(s) of the cloud server service to return the benchmarks for (REQUIRED). Multiple IDs may be specified separated by pipe characters. Alternatively, this parameter may be left out if the serverId parameter below is specified

Additionally, the following parameters may optionally be provided:

  • serverId: the identifier(s) of the server to return benchmark metrics for. Multiple IDs may be specified each separated by a pipe character
  • dataCenter: the identifier(s) of a specific service data center to return benchmarks for (if the cloud server service operates out of multiple data centers). Multiple data centers may be specified separated by pipe characters. For example, AWS EC2 operates out of 4 regions - US West, US East, EU West and APAC currently. These regions are located in California, Virginia, Ireland and Singapore respectively. To return only results for the US West data center, this parameter should be set to CA, US. To return metrics for both US West and EU West data centers, this parameter would be CA, US|IE (IE is the ISO 3166 code for Ireland)
  • testId: the identifier of a specific test for the benchmarks that should be returned. Multiple IDs may be specified separated by pipe characters
  • lastBenchmarksOnly: set to TRUE if only the latest benchmark test should be included in the results. This guarantees that only a single set of results will be returned
  • combineMultiple: If multiple benchmark metrics are included in the results, this parameter defines how those values should be returned as a single value. Valid options are:
    • average: use an average of all values (default)
    • lowest: return the worst value (may be the lowest or highest value depending on whether higher or lower scores are better for the benchmark)
    • highest: return the best value
    • earliest: return the value from the earliest test
    • latest: return the value from the latest test

As you can see, requests using this web service can be quite complex if desired. In this example, we'll keep it simple by using only the benchmarkId and serviceId parameters. The Geekbench benchmark produces a metric that rates CPU and memory performance.
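To make the combineMultiple options more concrete, here is a small Python sketch that mimics them locally on a list of results. It is only our reading of the option descriptions above, not the service's actual implementation. In particular it assumes that lowest means the worst score and highest the best score, so both depend on whether lower scores are better for the benchmark. The function name and the sample data are made up for illustration.

```python
from statistics import mean


def combine(results, mode="average", lower_is_better=False):
    """Reduce (test_date, value) pairs to one value, following the option descriptions."""
    values = [value for _, value in results]
    if mode == "average":
        return mean(values)
    if mode in ("lowest", "highest"):
        # "highest" = best score, "lowest" = worst score; which numeric extreme
        # that is depends on whether lower scores are better for the benchmark.
        want_max = (mode == "highest") != lower_is_better
        return max(values) if want_max else min(values)
    if mode == "earliest":
        return min(results, key=lambda r: r[0])[1]
    if mode == "latest":
        return max(results, key=lambda r: r[0])[1]
    raise ValueError(f"unknown combineMultiple option: {mode!r}")


# Made-up sample: three test runs with ISO dates (which sort chronologically).
sample = [("2010-04-01", 10.0), ("2010-08-01", 14.0), ("2011-01-01", 12.0)]
for option in ("average", "lowest", "highest", "earliest", "latest"):
    print(option, combine(sample, option))
```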

Request URI

Response

The response is an array of benchmark result metrics each consisting of the following values:

  • serverId: the ID of the server this metric pertains to
  • serviceId: the ID of the service serverId pertains to
  • benchmarkId: the ID of the benchmark this metric pertains to
  • value: the benchmark metric. If this result consists of multiple benchmark values, value will be an average of all results unless the combineMultiple request parameter specifies otherwise
  • testDate: the date of the test (if result is from a single benchmark test)
  • testId: the ID of the test (if result is from a single benchmark test)
  • resultsUrl: some benchmark result artifacts are accessible online. When this is the case, this will be the URL to those artifacts (if result is from a single benchmark test)
  • values: the values of the tests (if result is from multiple benchmark tests)
  • testDates: the dates of the tests (if result is from multiple benchmark tests)
  • testIds: the ID of the tests (if result is from multiple benchmark tests)
  • resultsUrls: the URLs to the test results (if result is from multiple benchmark tests)
  • numTests: the number of tests used to calculate the value

Example 13: Get the latest CCU benchmarks for the EC2 APAC region

In this example, we'll use the dataCenter and lastBenchmarksOnly parameters to return all of the CCU benchmark results for Amazon EC2's APAC region (this region is located in Singapore, hence the dataCenter parameter is set to the ISO 3166 country code SG). Unlike the previous example where multiple test results were returned, in this example because lastBenchmarksOnly is TRUE, the web service will only return a single benchmark value (the values, testDates, testIds and resultsUrls values will not be included in the response). CCU is an aggregate benchmark consisting of many underlying CPU performance related benchmarks as discussed here.

Request URI

Example 14: Get available cloud server configurations for the Amazon EC2 APAC region

Before proceeding any further with getServerBenchmarkResults examples, we'll demonstrate how to find out what server configurations are available for a given cloud service. This is useful because the getServerBenchmarkResults supports a serverId parameter that can be used to filter benchmark results using a specific server identifier. For example, you may want to compare benchmark results between EC2 m2.4xlarge and GoGrid 16GB cloud servers only.

The getCloudServerConfigurations web service allows you to lookup cloud server configurations. This web service uses a data structure containing various details about cloud servers including CPU, memory, and storage specifications; pricing and more (review the API documentation for full details). Because this web service is based on a data structure, we'll be able to use ws-constraint parameters to filter the results. In this example, we'll use 2 constraints (cloud and dataCenter) to filter the results so that only Amazon EC2 APAC region servers are returned.

Request URI

Example 15: Compare IOP benchmark results between Rackspace Cloud and GoGrid 4GB cloud servers

Now that we've been able to obtain the identifiers of cloud servers using the getCloudServerConfigurations web service, we can go back to the getServerBenchmarkResults web service and compare cloud servers using those IDs and the serverId parameter. In this example, we'll compare storage IO performance between 4GB Rackspace Cloud and GoGrid cloud servers (gg-4gb and rs-4gb) using the aggregate IOP benchmark. IOP is an aggregate storage IO benchmark based on 7 IO related benchmarks as documented here. This benchmark is NOT the same as IOPS. To retrieve the IOP benchmark results for only Rackspace and GoGrid 4GB cloud servers, we'll set the serverId parameter to gg-4gb|rs-4gb (multiple IDs can be specified each separated by a pipe character).

Request URI

Example 16: Lookup all cloud servers in the US with at least 2GB memory and costing $0.10/hr or less

In this example, we'll use the getCloudServerConfigurations web service to look up US-based cloud services offering cloud servers with at least 2GB memory and costing $0.10/hr or less. This will involve use of 4 filtering constraints: dataCenter, memory, priceHourly and priceCurrency. In order to apply these constraints, we'll first need to determine what operators should be used.

The dataCenter attribute value is either [state/province], [country] (US or Canada only) OR [country]. Thus, we'll want the dataCenter attribute to "end with" "US". According to the API documentation, the "ends with" operator is 16.

The memory attribute is a numeric value representing the # of gigabytes included with a cloud server. We'll want this attribute to be equal to or greater than 2. The operator for "equal to" is 1. The operator for "greater than" is 2. Thus, an "equal to or greater than" operator is 1+2=3 (bitmask addition).

The priceHourly attribute is also numeric representing the price of the server per hour. We'll want this attribute to be equal to or less than 0.10. The operator for "equal to" is 1, and the operator for "less than" is 4. Thus, an "equal to or less than" operator is 1+4=5.

The priceCurrency attribute is a string representing the currency code for pricing defined in the server configuration (USD = US dollar). Thus we want this attribute to be equal to "USD". Equality is the default operator, so we do not need to provide an operator value for this constraint.
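The bitmask arithmetic above can be sketched in a few lines of Python. This is a local illustration of how the operator values combine, not code from the API. Only the operator values mentioned in this post are included (1, 2, 4, 16 and 32). The Op and matches names and the sample server record are hypothetical.

```python
from enum import IntFlag


class Op(IntFlag):
    """Constraint operator values quoted in this post (others exist in the API docs)."""
    EQUAL = 1
    GREATER_THAN = 2
    LESS_THAN = 4
    ENDS_WITH = 16
    SUBSTRING = 32


def matches(value, op, target):
    """Return True if any operator bit set in op is satisfied by value vs. target."""
    if op & Op.EQUAL and value == target:
        return True
    if op & Op.GREATER_THAN and value > target:
        return True
    if op & Op.LESS_THAN and value < target:
        return True
    if op & Op.ENDS_WITH and str(value).endswith(str(target)):
        return True
    if op & Op.SUBSTRING:
        # Array attributes (e.g. operatingSystemsSupported) are searched per element.
        items = value if isinstance(value, (list, tuple)) else [value]
        return any(str(target) in str(item) for item in items)
    return False


# The four constraints from this example, applied to a made-up server record.
server = {"dataCenter": "CA, US", "memory": 2, "priceHourly": 0.08, "priceCurrency": "USD"}
constraints = [
    ("dataCenter", Op.ENDS_WITH, "US"),                 # operator 16
    ("memory", Op.EQUAL | Op.GREATER_THAN, 2),          # operator 1+2=3
    ("priceHourly", Op.EQUAL | Op.LESS_THAN, 0.10),     # operator 1+4=5
    ("priceCurrency", Op.EQUAL, "USD"),                 # default operator
]
print(all(matches(server[attr], op, target) for attr, op, target in constraints))
```

Because each operator occupies its own bit, combining two of them is a simple addition (or bitwise OR), which is why "equal to or greater than" works out to 3.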

Request URI

Response

At the time of this writing, only gigenet cloud offers a cloud server with these specifications.

Example 17: Determine average uplink throughput from GoGrid US West to Amazon S3 US West, Zetta and Google Storage

In addition to system benchmarks, we also continually collect networking benchmark metrics. These include both throughput and latency metrics within clouds, between clouds, and from clouds to consumer (i.e. residential Internet connections such as DSL and cable to various cloud services).

Suppose you are evaluating cloud services and decide to use GoGrid's cloud servers. Your business and customers are in California, so you opt to use GoGrid's US West data center. For added protection against a large scale failure, you decide to use an external storage service for backups (instead of GoGrid's own storage service). You've narrowed your storage choices down to either Amazon S3, Zetta or Google's Storage for Developers. You'd like to know which of these storage services will provide the fastest uplink throughput from your cloud servers at GoGrid in order to ensure that backups can be uploaded as quickly as possible. The getNetworkBenchmarkResults web service provides access to this sort of data. This web service uses the following parameters:

Request Parameters

  • serviceId: the ID of the service to get network benchmark metrics for (REQUIRED). Multiple IDs can be specified each separated by a pipe character. Each additional ID specified will essentially double execution time for this web service, so use this feature sparingly
  • dataCenter: if the serviceId specified is operated out of multiple data centers, this parameter must also be provided defining the location of the data center to return results for. Multiple data centers may be specified each separated by a pipe character. Adding multiple data centers will essentially double execution time so use this feature sparingly
  • testId: we conduct multiple network performance tests. This parameter should be the name of the network test that results should be returned for. The following network tests are currently used:
    • intercloud: results originate from our intracloud/intercloud network performance tests. These tests are run throughout the day at varying times to test throughput and latency between and within cloud services (DEFAULT)
    • speedtest: get results from our browser-based cloud speedtest. We allow Internet users to run this test for free. We also pay about 1000 users each month to run this test using Amazon's Mechanical Turk. When the speedtest is run, we capture the user's location (city, state, country), ISP and connection speed (netspeed) using MaxMind's GeoIP databases. Users select a test file between 1-5 MB to test download throughput, or a test file between 0.5-2.5MB to test upload throughput. Users may also test latency
    • speedtest-web: this is the same as speedtest, except that instead of downloading a single large file, many small files are downloaded to simulate an actual web page load. Users select a small (10 files), medium (19 files) or large (51 files) website to test
  • endpoint_*: the endpoint parameters are used to define a service, region, location (city, state or country), ISP or netspeed (or some combination of those) for which network benchmark metrics should be returned. At least 1 endpoint parameter must be specified. The endpoint parameters used must correspond with the testId specified. The following endpoint parameters are allowed:
    • endpoint_cloudId: return results for all services in this cloud. This parameter applies only to the intercloud testId and cannot be used in conjunction with any other endpoint parameters except for endpoint_dataCenter. Multiple IDs may be specified each separated by a pipe character. Results will be grouped by service and data center (multiple results possible for each service and data center)
    • endpoint_serviceId: return results for a specific cloud service. This parameter applies only to the intercloud testId and cannot be used in conjunction with any other endpoint parameters except for endpoint_dataCenter. Multiple IDs may be specified each separated by a pipe character. Results will be grouped by service and data center (multiple results possible for each data center)
    • endpoint_dataCenter: if the endpoint_cloudId or endpoint_serviceId parameters are specified and services are operated out of multiple data centers, this parameter may be used to limit the results to a specific data center location. Multiple data center locations may be specified each separated by a pipe character
    • endpoint_region: a specific region identifier to return results for. This parameter may be used for any test type (speedtest, speedtest-web or intercloud) but may not be used in conjunction with any other endpoint parameters except for endpoint_isp and endpoint_netspeed. Region identifiers and configurations are available using the getRegions web service. Only a single region may be specified, and results are grouped by region (single result for each invocation unless endpoint_isp or endpoint_netspeed are also specified)
    • endpoint_city: a specific city to return benchmark results for. If used, endpoint_country MUST also be specified. This parameter applies only to speedtest or speedtest-web tests. This parameter is not case sensitive. Multiple cities may be specified each separated by a pipe character. Set this parameter to the wildcard character * to return results for all cities for the endpoint_state (optional) and endpoint_country specified. Results will be grouped by city (multiple results possible for each city specified)
    • endpoint_state: a specific state or province to return benchmark results for. If used, endpoint_country will be automatically determined if not specified (US for US states, CA for Canadian provinces). This parameter applies only to speedtest or speedtest-web tests. This parameter can only be used for US and Canada test results because the GeoIP database only supports state/provinces in those countries. It should be the 2 character code for the state or province and is not case sensitive (i.e. NY, CA or QC). Multiple states/provinces may be specified each separated by a pipe character. Set this parameter to the wildcard character * to return results for all states/provinces for the endpoint_country specified. Results will be grouped by state (multiple results possible for each state). If used in conjunction with the endpoint_city parameter, results will be grouped by city
    • endpoint_country: a specific country to return benchmark results for. This parameter applies to speedtest or speedtest-web tests only. May be used in conjunction with the endpoint_city and endpoint_state parameters. It should be the 2 character ISO 3166 code for the country and is not case sensitive (i.e. US, CA or FR). Multiple countries may be specified each separated by a pipe character. Set this parameter to the wildcard character * to return results for all countries. Results will be grouped by country (multiple results possible for each country) unless endpoint_city or endpoint_state are also specified in which case results will be grouped according to those parameters
    • endpoint_isp: the name of a specific ISP to return benchmark results for. This parameter may be used alone or in conjunction with the endpoint_region, endpoint_city, endpoint_state or endpoint_country parameters. This parameter applies to speedtest or speedtest-web tests only. This parameter is not case sensitive and can also be a substring match to the ISP name (e.g. Verizon will return multiple results for Verizon Business, Verizon Internet Services and Verizon Australia PTY Limited). Multiple ISPs may be specified each separated by a pipe character. Set this parameter to the wildcard character * to return results for all ISPs. The getSpeedtestIsps web service may be used to obtain the names of ISPs for which results are available. If this parameter is specified, results will be grouped by ISP in addition to existing grouping. For example, if the endpoint_city parameter was also specified, the results will be grouped by ISP and then city. This parameter may NOT be used in conjunction with the endpoint_netspeed parameter
    • endpoint_netspeed: a specific connection type to filter results on. This parameter should be one of the following:
      • cabledsl
      • corporate
      • dialup
      • unknown

      A majority of our speedtest results are of type cabledsl. More information on how netspeed is determined is available here. Multiple netspeeds may be specified each separated by a pipe character. Set this parameter to the wildcard character * to return results for all connection speeds. If this parameter is specified, results will be grouped by netspeed in addition to existing grouping. For example, if the endpoint_city parameter was also specified, the results will be grouped by netspeed and then city. This parameter may not be used in conjunction with endpoint_isp
  • metric: the network benchmark metric to return. One of the following values:
    • downlink: the average downlink throughput measured in megabits per second (Mb/s) (DEFAULT)
    • uplink: the average uplink throughput measured in megabits per second (Mb/s)
    • latency: the average latency measured in milliseconds (ms)
  • start: only consider results from tests that occurred on or after this date
  • stop: only consider results from tests that occurred on or before this date
  • minNumTests: the minimum # of tests for a result to be included. The larger the number of tests in a result, the more reliable and accurate that metric will be
  • order: the ordering method, one of the following: asc: order results in ascending order; or desc: order results in descending order. The default ordering is descending for throughput and ascending for latency benchmark results
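With this many endpoint parameters, it is easy to build a combination the rules above do not allow. The following Python sketch checks a handful of those rules before a request is sent. It is a hypothetical client-side helper based only on the parameter descriptions in this post. It does not cover every rule (for example the endpoint_region restrictions) and the service may validate requests differently.

```python
SPEEDTEST_ONLY = {"endpoint_city", "endpoint_state", "endpoint_country", "endpoint_isp"}
INTERCLOUD_ONLY = {"endpoint_cloudId", "endpoint_serviceId"}


def check_endpoint_params(params, test_id="intercloud"):
    """Return a list of rule violations; an empty list means none were found."""
    endpoints = {name for name in params if name.startswith("endpoint_")}
    problems = []
    if not endpoints:
        problems.append("at least one endpoint_* parameter is required")
    for name in sorted(endpoints & INTERCLOUD_ONLY):
        if test_id != "intercloud":
            problems.append(f"{name} applies only to the intercloud test")
        # Only endpoint_dataCenter may accompany these two parameters.
        others = endpoints - {name, "endpoint_dataCenter"}
        if others:
            problems.append(f"{name} cannot be combined with {sorted(others)}")
    for name in sorted(endpoints & SPEEDTEST_ONLY):
        if test_id not in ("speedtest", "speedtest-web"):
            problems.append(f"{name} applies only to speedtest tests")
    if "endpoint_city" in endpoints and "endpoint_country" not in endpoints:
        problems.append("endpoint_city requires endpoint_country")
    if {"endpoint_isp", "endpoint_netspeed"} <= endpoints:
        problems.append("endpoint_isp and endpoint_netspeed are mutually exclusive")
    return problems


# An intercloud request limited to one endpoint data center passes these checks.
print(check_endpoint_params({"endpoint_serviceId": "AWS:S3", "endpoint_dataCenter": "CA, US"}))
# An ISP filter is not valid for the (default) intercloud test.
print(check_endpoint_params({"endpoint_serviceId": "AWS:S3", "endpoint_isp": "*"}))
```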

Constructing the Request

As you can see, the getNetworkBenchmarkResults web service supports a complex array of parameters. For the purposes of this example, we'll only be using a few parameters:

  • serviceId: The serviceId we'll use is GoGrid:Servers which is the ID for the GoGrid server service
  • dataCenter: GoGrid currently operates out of 2 data centers, us-west and us-east. The us-west data center is located in California, so the dataCenter parameter we'll use is CA, US
  • testId: We are looking for results from the intercloud test. This is the default value for this parameter, so we do not need to include it in the request
  • endpoint_serviceId: We want to get throughput metrics for AWS S3, Zetta and Google Storage. This parameter supports multiple service IDs each separated by a pipe character. Thus, this parameter will be AWS:S3|Zetta:Storage|Google:Storage
  • endpoint_dataCenter: AWS, Zetta and Google all run storage services out of California. AWS also offers storage in Virginia, Ireland and Singapore. Since we do not want to include those data centers in the results, we'll set this parameter to CA, US
  • metric: Since we'll be doing a lot of uploading to the storage service, our primary area of concern is uplink throughput, so we'll set this parameter value to uplink (the default is downlink)
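The parameter choices above can be assembled into a query string programmatically. The Python sketch below shows one way to do that with the standard library. The BASE_URL is a placeholder and not the real API endpoint, and passing the parameters as a URL query string is an assumption on our part. The parameter names and values are the ones listed above.

```python
from urllib.parse import urlencode

# Placeholder only: substitute the actual web service URL from the API documentation.
BASE_URL = "https://api.example.com/getNetworkBenchmarkResults"


def pipe(*values):
    """Join multiple IDs with the pipe separator used by the web services."""
    return "|".join(values)


params = {
    "serviceId": "GoGrid:Servers",
    "dataCenter": "CA, US",
    "endpoint_serviceId": pipe("AWS:S3", "Zetta:Storage", "Google:Storage"),
    "endpoint_dataCenter": "CA, US",
    "metric": "uplink",
    # testId is omitted because intercloud is the default
}

# urlencode percent-encodes the colon, comma and pipe characters and turns spaces into '+'.
request_uri = f"{BASE_URL}?{urlencode(params)}"
print(request_uri)
```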

Request URI

Response

The API documentation states that the results will be an array of hashes each with the following possible values:

  • value: the average downlink (Mb/s), uplink (Mb/s) or latency (ms) for this network benchmark result. This value is based on the metric parameter specified (default is downlink)
  • originServiceId: The ID of the service this result originates from. Returned only when multiple serviceId parameters were specified
  • serviceId: The ID of the endpoint service this result pertains to. Returned if the endpoint_cloudId or endpoint_serviceId parameters were used
  • originDataCenter: The location of the data center this result originates from. Returned only if multiple dataCenter parameters were specified
  • dataCenter: The location of the endpoint data center this result pertains to. Returned if the endpoint_cloudId or endpoint_serviceId parameters were used
  • region: the geographical region this result pertains to. Returned if the endpoint_region, endpoint_city, endpoint_state or endpoint_country parameters were specified. For more information, see the API documentation for the getRegions web service
  • city: The name of the city this result pertains to. Returned if the endpoint_city parameter was specified. More information on how this data is obtained is available here
  • state: The 2 character identifier of the state or province this result pertains to. Only available for US or Canada based results. Returned if the endpoint_state parameter was specified. More information on how this data is obtained is available here
  • country: The 2 character ISO 3166 identifier of the country this result pertains to. Returned if the endpoint_country parameter was specified. More information on how this data is obtained is available here
  • isp: The name of the ISP this result pertains to. Returned if the endpoint_isp parameter was specified. More information on how this data is obtained is available here
  • netspeed: The connection speed used by the tester. Returned if the endpoint_netspeed parameter was specified. More information on how this data is obtained is available here
  • numTests: The number of tests that were averaged to produce this result
  • earliestTest: The date/time of the earliest test included in this result
  • latestTest: The date/time of the latest test included in this result

Because we are testing throughput from one cloud service to another, the results only include value, serviceId, dataCenter, numTests, earliestTest and latestTest. In the following examples we'll see when the other response values are used. The results for this example at the time of this writing are:

  • AWS S3 US West: 161.22 Mb/s uplink (out of 212 tests)
  • Zetta: 63.1 Mb/s uplink (out of 213 tests)
  • Google Storage: 31.07 Mb/s uplink (out of 215 tests)

These results signify that AWS S3 US West region storage will generally provide the fastest uplink throughput from GoGrid US West cloud servers and may be the best service to use for backups (subject to other decision making criteria like price and support).

Example 18: Which CDN has the lowest latency in Europe

In the previous example we obtained network performance results based on our intercloud network testing. These tests are run periodically throughout the day to test throughput and latency between and within cloud services and Internet data centers. We also host a browser-based cloud speedtest to track throughput and latency between cloud services and primarily consumer-based high-speed Internet connections such as DSL and Cable. Users of the cloud speedtest select one or more cloud services to test, a test file size (1-5MB for download tests or 0.5-2.5MB for upload tests) and test to perform (uplink, downlink or latency). The speedtest then uploads/downloads the test file to/from the selected cloud services and displays the latency and/or throughput results. We use MaxMind's GeoIP databases to track where the user is (city, state, country), the name of their ISP, and their connection speed using their IP address. This is a generally reliable method for obtaining this data with accuracy of about 99.8%. In addition to allowing Internet users to run this test for free, we also pay about 1000 users per month to run the test using Amazon's Mechanical Turk. All of these results are stored in our database and accessible through the getNetworkBenchmarkResults web service.

In this example, we want to find the CDN (Content Delivery Network) with the lowest latency in Europe. We used the getRegions web service to discover that the region code for Europe is eu. CloudHarmony currently collects network benchmark metrics for about a dozen different CDNs. However, in this example, we've narrowed our CDN choices down to four: AWS CloudFront, MaxCDN, Edgecast and Akamai (resold by VPS.net). The request will be fairly simple, using the serviceId, testId, endpoint_region and metric parameters. The serviceId parameter supports multiple IDs, each separated by a pipe character, so we will use that to specify the IDs of each of these 4 CDNs.
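As a rough illustration of how such a request could be assembled, the sketch below joins several service IDs with a pipe character and appends the four parameters named above. The base URL, the service IDs, and the testId and metric values are hypothetical placeholders, since the real values are not given in this post.

```python
from urllib.parse import urlencode

# Hypothetical placeholders: the real endpoint URL and service IDs are not
# given in this post, so substitute the actual values before use.
BASE_URL = "https://example.invalid/ws/getNetworkBenchmarkResults"
CDN_SERVICE_IDS = ["cdn-a", "cdn-b", "cdn-c", "cdn-d"]


def build_request_uri(base_url, service_ids, test_id, region, metric):
    """Return a request URI with serviceId values joined by a pipe character."""
    params = {
        "serviceId": "|".join(service_ids),
        "testId": test_id,
        "endpoint_region": region,
        "metric": metric,
    }
    # safe="|" keeps the pipe separator readable instead of encoding it as %7C
    return base_url + "?" + urlencode(params, safe="|")


# "latency" and "avg" are assumed example values, not documented ones.
uri = build_request_uri(BASE_URL, CDN_SERVICE_IDS, "latency", "eu", "avg")
print(uri)
```

Only the parameter names (serviceId, testId, endpoint_region, metric) and the eu region code come from the text; everything else would need to be replaced with values from the web service documentation.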

Request URI

Response

At the time of this writing, the results from these benchmarks were:

  • AWS CloudFront: 51.15ms (out of 191 tests)
  • MaxCDN: 56.65ms (out of 190 tests)
  • Edgecast: 51.38ms (out of 182 tests)
  • Akamai: 34.22ms (out of 195 tests)

So in this example, the clear winner was Akamai, with latency roughly a third lower than the next-best CDN (AWS CloudFront at 51.15ms). However, latency is not bad for any of these CDNs.
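The size of the winning margin can be checked directly from the figures listed above. This is a small sketch using only those four values; the variable names are our own.

```python
# Average latency in milliseconds, copied from the results listed above.
latency_ms = {
    "AWS CloudFront": 51.15,
    "MaxCDN": 56.65,
    "Edgecast": 51.38,
    "Akamai": 34.22,
}

# Rank services from lowest (best) to highest latency.
ranked = sorted(latency_ms.items(), key=lambda item: item[1])
(best_name, best_ms), (runner_up_name, runner_up_ms) = ranked[0], ranked[1]

# Margin: how much lower the best latency is, relative to the runner-up.
margin_pct = (runner_up_ms - best_ms) / runner_up_ms * 100
print(f"{best_name} is {margin_pct:.1f}% lower than {runner_up_name}")
```

Measuring against the runner-up rather than the slowest service gives the most conservative margin.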

Example 19: What is the average downlink throughput for CDNs in California

In this example, we'll use the endpoint_isp and endpoint_state parameters to view performance of the Internap and AWS CloudFront CDNs in California, grouped by ISP. The endpoint_isp parameter can either be the name (or partial name) of an ISP such as Verizon, or a wildcard character * to indicate that all ISPs should be returned in the results. In this example, we'll use the wildcard option so the results are grouped by ISP. We will also use the minNumTests parameter so that only results with at least 5 completed tests are returned. Finally, we use the order=asc parameter so that the slowest ISPs appear first in the results.

Request URI

Response

The response includes the average downlink throughput value, name of the isp, and the region identifier (us_west_pacific for all results in this example). Because more than 10 results are returned, we'll have to use the ws-offset=10 parameter to view the second page of results, ws-offset=20 for the third page and so on.
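The paging arithmetic can be sketched as follows. This assumes a fixed page size of 10 and that the first page needs no ws-offset parameter; the total of 34 results is a made-up figure for illustration.

```python
PAGE_SIZE = 10  # the text above describes 10 results per page


def page_offsets(total_results, page_size=PAGE_SIZE):
    """Return the offset needed to reach each page of results."""
    return list(range(0, total_results, page_size))


def offset_query(offset):
    """Return the query fragment for a page (assumed empty for the first page)."""
    return "" if offset == 0 else f"ws-offset={offset}"


# Example: 34 results (a hypothetical total) span four pages.
offsets = page_offsets(34)
print([offset_query(o) for o in offsets])
```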

Example 20: Which CDN provides the best throughput in the APAC region

In this example, we'll determine which out of a handful of CDNs provides the best overall downlink throughput in the APAC region. We used the getRegions web service to discover that the region code for APAC is asia_apac. In this example, we'll evaluate Akamai, Edgecast, CloudFront, Microsoft Azure CDN and Limelight (resold by Rackspace Cloud).

Request URI

Response

At the time of this writing, the results from these benchmarks were:

  • Akamai: 2.69 Mb/s (out of 489 tests)
  • Edgecast: 2.51 Mb/s (out of 483 tests)
  • AWS CloudFront: 2.61 Mb/s (out of 493 tests)
  • Azure CDN: 3.29 Mb/s (out of 496 tests)
  • Limelight: 2.6 Mb/s (out of 483 tests)

So in this example it appears that Microsoft's Azure CDN service provides about 22% better downlink throughput in APAC countries than the next-best CDN (Akamai), with almost 500 tests recorded for each service.

Example 21: Which cloud server vendor has the best throughput in New York City

In this example, we'll use the endpoint_city parameter to determine which of a handful of cloud server providers has the best downlink throughput in New York City. We will evaluate the following cloud server providers: AWS EC2 (US East region), GoGrid (US East region), Storm on Demand, Speedyrails (Quebec, CA), VoxCLOUD (New York) and Rackspace Cloud Servers (Chicago data center). Because we are dealing with multiple services and multiple data centers, the serviceId and dataCenter parameters need to correspond with the IDs of all 6 services and their data center locations. The web service will ignore data centers that are not valid for a given service (e.g. only Speedyrails has a data center in Quebec and only Voxel has a data center in New York).

Request URI

Response

At the time of this writing, the results from these benchmarks were:

  • AWS EC2 US East: 10.91 Mb/s (out of 45 tests)
  • GoGrid US East: 6.16 Mb/s (out of 12 tests)
  • Storm on Demand: 7.33 Mb/s (out of 39 tests)
  • Speedyrails: 10.39 Mb/s (out of 20 tests)
  • VoxCLOUD New York: 8.88 Mb/s (out of 36 tests)
  • Rackspace Cloud Servers - Chicago: 5.72 Mb/s (out of 65 tests)

So, with a limited number of test results (fewer than 100 results should not be considered reliable), AWS EC2 US East, Speedyrails and VoxCLOUD New York appear to provide the fastest downlink throughput to New York City (primarily consumer) Internet connections.
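One way to apply the 100-test rule of thumb when reading results like these is sketched below. The tuple layout and function name are our own; the figures are copied from the list above.

```python
MIN_RELIABLE_TESTS = 100  # rule-of-thumb threshold mentioned in the text

# (service, downlink Mb/s, number of tests) copied from the results above.
results = [
    ("AWS EC2 US East", 10.91, 45),
    ("GoGrid US East", 6.16, 12),
    ("Storm on Demand", 7.33, 39),
    ("Speedyrails", 10.39, 20),
    ("VoxCLOUD New York", 8.88, 36),
    ("Rackspace Cloud Servers - Chicago", 5.72, 65),
]


def split_by_reliability(rows, min_tests=MIN_RELIABLE_TESTS):
    """Separate rows with enough tests from those to treat as tentative."""
    reliable = [r for r in rows if r[2] >= min_tests]
    tentative = [r for r in rows if r[2] < min_tests]
    return reliable, tentative


reliable, tentative = split_by_reliability(results)

# Rank the tentative rows from fastest to slowest for a provisional view.
fastest_first = sorted(tentative, key=lambda r: r[1], reverse=True)
for name, mbps, num_tests in fastest_first:
    print(f"{name}: {mbps} Mb/s ({num_tests} tests, tentative)")
```

Here every row falls below the threshold, which is why the ranking should be read as provisional.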

Conclusion

For now, we are offering free access to these web services for up to 10 requests per rolling 24-hour period. After 10 requests, you will receive a 503: Service Unavailable HTTP response. This is a beta service, and usage and terms are subject to change. If you would like an increased quota or professional support, please contact us. We'd of course also appreciate feedback and bug reports (send to info [at] cloudharmony.com).
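A client can avoid running into the 503 response by tracking its own requests. The sketch below is a client-side approximation of a rolling 24-hour quota; how the service itself counts requests is not described here, so the class and its behavior are assumptions.

```python
import time
from collections import deque

QUOTA = 10                      # free requests per window, per the text above
WINDOW_SECONDS = 24 * 60 * 60   # rolling 24-hour period


class RollingQuota:
    """Client-side tracker for a rolling request quota (hypothetical helper)."""

    def __init__(self, limit=QUOTA, window=WINDOW_SECONDS):
        self.limit = limit
        self.window = window
        self._timestamps = deque()

    def try_acquire(self, now):
        """Record a request at time `now` if quota remains; return success."""
        # Drop requests that have aged out of the rolling window.
        while self._timestamps and now - self._timestamps[0] >= self.window:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.limit:
            return False
        self._timestamps.append(now)
        return True


quota = RollingQuota()
if quota.try_acquire(time.time()):
    print("Quota available: safe to send a request")
else:
    print("Quota exhausted: wait before sending another request")
```

Passing the current time in explicitly keeps the tracker easy to test without waiting for real time to elapse.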