# Value of the Cloud - CPU Performance

by Jason Read

## Abstract

This post compares CPU performance and value for 18 compute instance types from 5 cloud compute platforms - AWS EC2, Google Compute Engine, Windows Azure, HP Cloud and Rackspace Cloud. The most interesting content is the data and resulting analysis. If you're in a rush, skip ahead to it.

## Overview

In the escalating cloud arms race, performance is a frequent topic of conversation. Often, overly simplistic test models and fuzzy logic are used to substantiate sweeping claims. In a general sense, computing performance is relative to, and dependent on, workload type. There is no single metric or measurement that encapsulates performance as a whole.

In the context of cloud, performance is also subject to variability due to nondeterministic factors such as multitenancy and hardware abstraction. These factors combined increase the complexity of cloud performance analysis because they reduce one's ability to dependably repeat and reproduce such analysis. This is not to say that cloud performance cannot be measured, rather that doing so is not a precise science, and differs somewhat from traditional hardware performance analysis where such factors are not present.

Performance is workload dependent. Cloud performance is hard to measure consistently because of variability from multitenancy and hardware abstraction.

## Motivation

My goal in starting CloudHarmony in 2010 was to provide a credible source of objective and reliable performance analysis of cloud services. Since then, cloud has grown extensively and become an even more confusing place. The intent of this post is to present techniques and a visual tool we're using to help assess and compare the performance and value of cloud services. The focus of this post is cloud compute CPU performance and value. In the coming weeks, follow-up posts will cover other performance topics including block storage, network, and object storage. As is our general policy, we have not been paid or otherwise influenced in the testing or analysis presented in this post.

The focus of this post is compute CPU performance and value. Follow-up posts will cover other performance topics. We were not paid to write this post.

## Testing Methods

To test the performance of compute services, we run a suite of about 100 benchmarks on each type of compute instance offered. These benchmarks measure various performance properties including CPU, memory and disk IO. Each test iteration takes 1 to 2 days to complete. When multiple configuration options are offered, we usually run additional test iterations for each such option (e.g. compute services often offer multiple block storage options). Linux CentOS 6.* is our operating system of choice because of its nearly ubiquitous availability across services.

### CPU Performance

Although our test suite includes many CPU benchmarks, our preferred method for compute CPU performance analysis is based on metrics provided by the CPU2006 benchmark suites. CPU2006 is an industry-standard benchmark created by the Open Systems Group of the non-profit Standard Performance Evaluation Corporation (SPEC). CPU2006 consists of 2 benchmark suites that measure Integer and Floating Point CPU performance. The Integer suite contains 12 benchmarks, and the Floating Point suite contains 17. According to the CPU2006 website, "SPEC designed CPU2006 to provide a comparative measure of compute-intensive performance across the widest practical range of hardware using workloads developed from real user applications." Thorough documentation about CPU2006, including details about the benchmarks used, is available on the CPU2006 website. CloudHarmony is a SPEC CPU2006 licensee.

The results table below contains CPU2006 SPECint (Integer) and SPECfp (Floating Point) metrics for each compute instance type included in this post. Each score is linked to a PDF report generated by the CPU2006 runtime for that specific test run. CPU2006 run and reporting rules require disclosure of settings and parameters used when compiling and running the CPU2006 test suites and this data is included in the reports. To summarize, our runs are based on the following settings:

| Setting | Value |
|---|---|
| Compiler | Intel C++ and Fortran Compilers version 12.1.5 |
| Compilation Guidelines | Base |
| Run Type | Rate |
| Rate Copies | 1 copy per CPU core or per 1GB memory (lesser of the two) |
| SSE Compiler Option | SSE4.2 or SSE4.1 (if supported by the compute instance) |

Our preferred method for compute CPU performance analysis is based on metrics provided by the SPEC CPU2006 benchmark suites
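The Rate Copies setting above amounts to a one-line rule. The following is a minimal Python sketch, not CloudHarmony's actual test harness; the function name `rate_copies`, rounding partial gigabytes down, and the floor of one copy are our assumptions.

```python
import math


def rate_copies(cores: int, memory_gb: float) -> int:
    """Hypothetical helper: number of CPU2006 rate copies for an instance.

    One copy per CPU core or per 1GB of memory, whichever gives fewer copies.
    Assumptions: partial gigabytes are rounded down, and at least one copy runs.
    """
    return max(1, min(cores, math.floor(memory_gb)))


# Windows Azure A4: 8 cores, 14GB memory, so cores are the limit
print(rate_copies(8, 14))  # 8
```

For a hypothetical 8-core instance with only 7GB of memory, `rate_copies(8, 7)` would return 7, so memory rather than cores would set the copy count.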

### CPU2006 Test Results

To be considered official, CPU2006 results must adhere to specific run and reporting guidelines. One such guideline states that results should be reproducible. While this is important in the context of hardware testing, it is impractical for cloud due to performance variability resulting from multitenancy and hardware abstraction. However, CPU2006 guidelines allow for reporting of estimated results in cases where not all guidelines can be adhered to. In such cases, results must be clearly designated as estimates. For this reason, results in the table below are designated as such.

| Compute Service | Instance Type | CPU Type | Cores | Price² | SPECint¹ | SPECfp¹ |
|---|---|---|---:|---:|---:|---:|
| AWS EC2 | cc2.8xlarge | Intel E5-2670 2.60GHz | 32 | 2.40 | 441.511194 | 357.602046 |
| HP Cloud | double-extra-large | Intel T7700 2.40GHz | 8 | 1.12 | 168.55417 | 132.3234 |
| AWS EC2 | m3.2xlarge | Intel E5-2670 2.60GHz | 8 | 1.00 | 150.30509 | 128.159625 |
| HP Cloud | extra-large | Intel T7700 2.40GHz | 4 | 0.56 | 98.430955 | 85.24574 |
| Rackspace Cloud | 30gb | AMD Opteron 4170 | 8 | 1.00 | 95.43979 | 83.89602 |
| Windows Azure | A4 | AMD Opteron 4171 | 8 | 0.48 | 91.33294 | 77.93744 |
| AWS EC2 | m3.xlarge | Intel E5-2670 2.60GHz | 4 | 0.50 | 80.180578 | 71.753345 |
| Rackspace Cloud | 8gb | AMD Opteron 4170 | 4 | 0.32 | 51.709779 | 47.562079 |
| Windows Azure | A3 | AMD Opteron 4171 | 4 | 0.24 | 51.58953 | 46.9475 |
| HP Cloud | medium | Intel T7700 2.40GHz | 2 | 0.14 | 48.825275 | 44.085027 |
| AWS EC2 | m1.large | Intel E5645 2.40GHz | 2 | 0.24 | 39.023586 | 34.7884 |
| AWS EC2 | m1.large | Intel E5-2650 2.00GHz | 2 | 0.24 | 38.816635 | 37.10992 |
| AWS EC2 | m1.large | Intel E5430 2.66GHz | 2 | 0.24 | 29.534628 | 23.805172 |
| Windows Azure | A2 | AMD Opteron 4171 | 2 | 0.18 | 27.38071 | 25.92939 |
| Rackspace Cloud | 4gb | AMD Opteron 4170 | 2 | 0.16 | 25.854861 | 24.25972 |

¹ Base/Rate - Estimate

² Hourly, USD - On Demand

## Simplifying the Results

In order to provide simple and concise analysis derived from multiple relevant performance properties, it is helpful to reduce metrics from multiple related benchmarks to a single comparable value. The CPU2006 benchmark suites produce two metrics, SPECint for Integer, and SPECfp for Floating Point performance. A naive approach might be to combine them using a mean or sum of their values. However, doing so would be inaccurate because they are dissimilar values. Although they are calculated using the same algorithms, SPECint and SPECfp are produced from different benchmarks, and thus represent different meanings - as the idiom goes, this would be an apples to oranges comparison. An external analogy might be attempting to average 1 gallon of milk with 2 dozen eggs - in doing so, the resulting value: $$(1+2)/2=1.5$$ is meaningless because they are dissimilar values to begin with.

To merge dissimilar values like metrics from different benchmarks, the values must first be normalized to a common notional scale. One method for doing so is ratio conversion using factors from a common scale. The resulting ratios represent relationships between the original metrics and the common scale. Because the values share the same scale, they may then be operated on together using mathematical functions like mean and median.

Using the same milk and eggs analogy, and assuming a common scale of groceries needed for the week, defined as 2 gallons of milk and 3 dozen eggs, grocery deficiency ratios may then be calculated as follows:

$$\text{Milk deficiency} = \frac{\text{2 gallons needed}}{\text{1 gallon on hand}} = 2$$

$$\text{Eggs deficiency} = \frac{\text{3 dozen needed}}{\text{2 dozen on hand}} = 1.5$$

The resulting ratios, 2 and 1.5, may then be reduced to a single ratio representing the average grocery deficiency for both milk and eggs:

$$\text{Average grocery deficiency} = \frac{2 + 1.5}{2} = 1.75$$

In other words, in order to stock up on groceries for the week, we'll need, on average, 1.75 times the milk and eggs currently on hand. Take note, however, that this ratio is only relevant in the context of milk and eggs as a whole, not separately, nor does it apply to other types of groceries.
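The ratio normalization step can be sketched in a few lines of Python. This is an illustrative helper (the name `mean_ratio` is ours), applied to the grocery analogy: each pair is first converted to a unitless ratio, which is what makes gallons and dozens combinable, and only then are the ratios averaged.

```python
def mean_ratio(pairs):
    """Average the unitless ratios numerator / denominator, one per pair."""
    ratios = [numerator / denominator for numerator, denominator in pairs]
    return sum(ratios) / len(ratios)


# (needed for the week, on hand); units cancel within each pair
milk = (2, 1)  # gallons
eggs = (3, 2)  # dozens
print(mean_ratio([milk, eggs]))  # 1.75
```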

The benefit of reducing dissimilar benchmark values to a single representative metric is to simplify the expression and comparison of related performance properties. It allows us to present cloud performance more generally, and at a level more fitting to the interests and time of cloud users. As much as we'd like users to become well versed in the intricacies of benchmarking and performance analysis, this is simply not feasible for most, and is a primary reason for our existence. Our goal is to provide users with a simple starting point to help narrow the scope from hundreds of possible cloud services.

In order to more generally and simply present cloud performance information we generate a single value derived from multiple related benchmarks

### CPU Performance Metric

The CPU performance metric displayed in the graph below was calculated from both the SPECint and SPECfp metrics using the common scale ratio normalization technique described above. The common scale for each metric was the mean of the middle 80 percent (i.e., excluding the top and bottom 10 percent) of all CloudHarmony test results for that metric from the prior year. These results included many different compute services and compute instance types, not just those included in this post. This calculation produced the following common normalization factors:

| Metric | Normalization Factor |
|---|---|
| SPECint | 64.056 |
| SPECfp | 55.995 |
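The derivation of these factors can be sketched as a trimmed mean. The Python helper below assumes that "mid 80th percentile mean" means discarding the lowest and highest 10 percent of results and averaging the rest; the helper name, the rounding of the trim count, and the sample numbers are illustrative, not CloudHarmony data.

```python
from statistics import mean


def mid_80_mean(values):
    """Mean of the middle 80% of values (assumed: drop lowest and highest 10%)."""
    ordered = sorted(values)
    trim = int(len(ordered) * 0.10)  # count removed from each end, rounded down
    return mean(ordered[trim:len(ordered) - trim])


# Illustrative scores only: the outlier 1000 is trimmed away.
print(mid_80_mean([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]))  # 5.5
```

Applied separately to a year of SPECint results and a year of SPECfp results, a helper along these lines would yield one factor per metric, analogous to the two values in the table above.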

To shorten the resulting long decimal values, ratios were multiplied by 100. The metric can thus be interpreted as CPU performance relative to the mean of compute instances from many different cloud services. A value of 100 represents performance comparable to the mean, 200 twice the mean, and 50 half the mean. For example, the HP double-extra-large compute instance produced scores of 168.55417 for SPECint and 132.3234 for SPECfp. The resulting CPU performance metric of 249.72 was calculated using the following formula:

$$\text{CPU Performance} = \frac{\frac{168.55417}{64.056} + \frac{132.3234}{55.995}}{2} \times 100 \approx \frac{4.99448532}{2} \times 100 = 249.724266$$

The value 249.72 signifies that this instance type performed about 2.5 times as well as the mean.

The CPU performance metric used below represents SPECint and SPECfp scores relative to compute instances from many cloud services. A higher value is better
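The CPU performance formula above translates directly into code. This Python sketch hard-codes the two normalization factors from this post; the function and constant names are ours.

```python
SPECINT_FACTOR = 64.056  # common-scale factor for SPECint (from this post)
SPECFP_FACTOR = 55.995   # common-scale factor for SPECfp (from this post)


def cpu_performance(specint: float, specfp: float) -> float:
    """Mean of the SPECint and SPECfp ratios to the common scale, times 100."""
    ratios = (specint / SPECINT_FACTOR, specfp / SPECFP_FACTOR)
    return sum(ratios) / len(ratios) * 100


# HP Cloud double-extra-large
print(round(cpu_performance(168.55417, 132.3234), 2))  # 249.72
```

An instance scoring exactly at both factors would get a metric of 100, matching the mean.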

## Value Calculation

Cloud compute pricing is usually tied to CPU and memory allocation, with larger instance types offering more (or faster) CPU cores and more memory. The CPU2006 benchmark suites are designed to take advantage of multicore systems when compiled and run correctly. Given the same hardware type, our test results generally show a near-linear correlation between CPU allocation and CPU2006 scores. Because of these factors, the CPU performance metric derived from CPU2006 is well suited to estimating the value of compute instance types. To do so, we calculate value by dividing the metric by the hourly USD instance cost. For example, the HP extra-large compute instance costs 0.56 USD per hour and has a performance metric of 152.96. The resulting value metric of 273.14 is calculated using the following formula:

$$\text{Fixed Value} = \frac{152.96}{0.56} \approx 273.142857$$
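Fixed value is a single division. A minimal Python sketch (the function name and the guard against non-positive prices are our additions):

```python
def fixed_value(cpu_performance: float, hourly_usd: float) -> float:
    """CPU performance metric per USD of hourly cost."""
    if hourly_usd <= 0:
        raise ValueError("hourly price must be positive")
    return cpu_performance / hourly_usd


# HP Cloud extra-large, on-demand pricing
print(round(fixed_value(152.96, 0.56), 2))  # 273.14
```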

### Tiered Value

The graph below allows selection of either Tiered or Fixed Value options. Tiered Value is Fixed Value with an adjustment applied to instances ranked in the top or bottom 20 percent. The table below lists the exact adjustments used. The concept behind tiered values is based loosely on CPU pricing models, where top-end processors generally command premium per-GHz pricing while the low end is often discounted. For example, the HP double-extra-large compute instance costs 1.12 USD per hour and has a performance metric of 249.72. It is also ranked in the 91st percentile, which receives a +10% value adjustment. The resulting tiered value metric of 245.256 is calculated using the following formula:

$$\text{Tiered Value} = \frac{249.72}{1.12} \times 1.1 \approx 222.96 \times 1.1 = 245.256$$

| Rank | Value Adjustment |
|---|---|
| Top 5% | +20% |
| Top 10% | +10% |
| Top 20% | +5% |
| Mid 60% | None |
| Bottom 20% | -5% |
| Bottom 10% | -10% |
| Bottom 5% | -20% |

Cloud compute pricing is usually tied to CPU and memory allocation. Value metrics in the graph below are derived by dividing CPU performance by the hourly cost
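The tier table can be sketched as a lookup from percentile rank to a multiplier. The post does not state which metric instances are ranked by or how tier boundaries are handled, so this Python sketch assumes the caller supplies a percentile rank from 0 to 100 (100 being the top), with a rank exactly on a boundary falling into the more extreme tier; all names are hypothetical.

```python
# (percentile threshold, multiplier), most extreme tier first.
# Assumption: a rank exactly on a boundary belongs to the more extreme tier.
TOP_TIERS = [(95, 1.20), (90, 1.10), (80, 1.05)]
BOTTOM_TIERS = [(5, 0.80), (10, 0.90), (20, 0.95)]


def tier_multiplier(percentile: float) -> float:
    """Value adjustment for a percentile rank in [0, 100], where 100 is best."""
    for threshold, multiplier in TOP_TIERS:
        if percentile >= threshold:
            return multiplier
    for threshold, multiplier in BOTTOM_TIERS:
        if percentile <= threshold:
            return multiplier
    return 1.0  # mid 60%: no adjustment


def tiered_value(cpu_performance: float, hourly_usd: float, percentile: float) -> float:
    """Fixed value (performance / hourly price) scaled by the tier multiplier."""
    return cpu_performance / hourly_usd * tier_multiplier(percentile)


# HP Cloud double-extra-large, ranked in the 91st percentile
print(round(tiered_value(249.72, 1.12, 91), 2))  # 245.26
```

The post rounds the intermediate fixed value to 222.96 before applying the multiplier, which gives 245.256; carrying full precision gives about 245.26.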

### Price Normalization

Most cloud providers, including all those covered in this post, offer on-demand hourly pricing for compute instances. In addition, some providers offer commit-based pricing and volume discounts. AWS EC2, for example, offers six reserve (commit-based) pricing tiers: light, medium and heavy reserve, each with a 1- or 3-year term. These pricing tiers exchange lower hourly rates for a setup fee paid in advance and, in the case of heavy reserve, a commitment to run the compute instance 24x7x365 for the duration of the term (light and medium reserve tiers do not have this requirement). To represent these pricing tiers in the graph below, the total cost was normalized to an hourly rate by amortizing the setup fee into the hourly rate. For example, the m3.xlarge instance type is offered under a 1-year heavy reserve tier for a 1489 USD setup fee plus 0.123 USD per hour. For this instance type and pricing model, the hourly rate used in the graph and for value metrics was 0.293 USD/hr, calculated using the following formula:

$$\text{Normalized Hourly Rate} = \frac{1489}{365 \times 24} + 0.123 \approx 0.17 + 0.123 = 0.293$$
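Amortizing the setup fee can be sketched as follows. The Python helper assumes the instance runs every hour of the term (as heavy reserve requires) and a 365-day year; the `term_years` parameter, which extends the 1-year example to the 3-year tiers, is our addition.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: leap days ignored


def normalized_hourly_rate(setup_fee_usd: float, hourly_usd: float, term_years: int = 1) -> float:
    """Spread the up-front setup fee over every hour of the term, then add the hourly rate."""
    return setup_fee_usd / (term_years * HOURS_PER_YEAR) + hourly_usd


# m3.xlarge, 1-year heavy reserve
print(round(normalized_hourly_rate(1489, 0.123), 3))  # 0.293
```

For light and medium reserve, where an instance may not run around the clock, dividing by every hour in the term understates the effective per-hour cost of the setup fee.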

AWS EC2 is also available under a bid-based pricing model called Spot pricing. Although spot instances are typically priced substantially below standard rates, spot prices are highly volatile and subject to transient spikes that may result in unexpected termination of instances without notice. For this reason, spot pricing is generally not recommended for long-term usage. The spot pricing included in the graph below is based on a snapshot taken in early June 2013 and may not represent current rates.

Volume discount and membership-based pricing, like Windows Azure MSDN, were not included in the graph and value analysis because they are not as straightforward, and they often require substantial monthly spend commitments at which users would likely be able to negotiate similar discounts with any vendor.

The graph provides a drop-down list allowing selection of different pricing models. When the selection changes, the graph and the table below it update automatically.

The AWS EC2 reserve hourly pricing in the graph below is based on a normalized hourly value calculated by amortizing the setup fee into the hourly rate

## Visualizing Value & Performance

On our current website and in prior posts we've often used traditional bar charts to represent data visually. While this is a typical approach to presenting comparative analysis, it often resulted in lengthy displays and did not lend itself well to large multivariate data sets. In search of a more efficient and intuitive way to visualize such data, we discovered the D3 visualization library, which provides excellent tools and examples for creating data visualizations. We used D3 to design the graph below. The goal of this graph is to present large multivariate data sets in a concise, intuitive and interactive format. In a relatively small space, this graph allows users to observe many different characteristics of cloud services, including:

Performance
The size or diameter of the circle represents the proportional CPU performance of each compute instance. Larger circles represent better-performing systems.
Price & Value
The fill color of each circle represents either the value or the price of each compute instance (defaults to value). Users can toggle between price, fixed value and tiered value fill options. Blue represents better value/lower price, while red represents lower value/higher price. A grey color is used for the midrange.
Vertical Scalability
Not all workloads lend themselves well to horizontal scaling models (where load is spread across many compute nodes). Legacy database servers, for example, often do not (easily) support multi-node clusters. By observing the variation in circle sizes from small to large, users may better understand the vertical scaling range and limits of each cloud service.
Instance Type Variability
Results are grouped by instance type and CPU architecture. In the case of EC2, this allowed display of multiple records for a single instance type. The m1.large, for example, deployed to 3 different host types during our testing, each of which demonstrated different performance characteristics.
Multiple Pricing Models
Users may view pricing and value based on different service pricing models. In the case of EC2, this allows toggling between on demand, reserve and spot pricing. Results in the graph and details table are updated instantly when the pricing model selection is changed.

Below the graph a sortable table displays details for each service and compute instance displayed in the graph. This table updates dynamically when fill color or pricing model selections are changed. Details for specific compute instances can also be viewed by hovering over a circle. In addition, users may zoom into a particular service by clicking on the container for that service. The graph can also be displayed in a larger popup view by clicking on the blue zoom icon displayed in the upper right corner when hovering over it.

The interactive graph below displays multiple characteristics of compute services and instance types including performance, price, value and vertical scalability. EC2 price and value can be toggled between on demand and reserve pricing tiers
[Interactive graph: CPU performance (circle diameter; larger is better) and price or value (fill color; blue is lower price or better value) for each compute service and instance type. Options: fill metric (value or price) and value calculation (fixed or tiered).]

## CPU2006 Results Summary Diagram

This diagram displays the actual CPU2006 SPECint and SPECfp metrics for each compute service and instance type. Hovering over a specific segment in the diagram displays these metrics.

[Interactive diagram: SPECint and SPECfp segments for each compute service and instance type, color coded from blue (better) to red (worse). Results can optionally be grouped by service, ordered by mean performance with the best performer at the 12 o'clock position.]

As is our general policy, we don't recommend any one service over another. However, we'd like to point out some observations about each compute service included in this post.

### AWS EC2

• On-demand pricing provides similar value to other compute services. However, EC2 value increases substantially under reserve pricing models
• EC2 provides a broad performance range, topping out in this post with the 16-core (32 hyperthreaded cores) cc2.8xlarge instance type
• CPU architecture varies between instance types, with higher end types generally running on newer and faster hardware
• Older instance types like m1.large may deploy to different hardware platforms, and thus demonstrate variable performance. For example, there was a notable difference in performance between Intel E5430 and Intel E5-2650 based m1.large instances
• The cc2.8xlarge provides good value for multithreaded workloads with high CPU demand

### Google Compute Engine

• Performance increased near-linearly from small to large instance types
• The n1-standard-4 performed roughly 10% slower than we expected (112 actual CPU performance versus 120-125 expected)
• The GCE hypervisor does not pass through full CPU identifiers, but Google has stated in GCE documentation that its processors are based on the Intel Sandy Bridge (E5-2670) platform
• The n1-standard-4 and n1-standard-8 instance types performed very similarly to the comparable EC2 instance types m3.xlarge and m3.2xlarge. All are based on the same Intel Sandy Bridge platform, and on-demand pricing is also nearly the same (GCE is just a few cents higher)

### Windows Azure

• The A3 and particularly A4 instance types are priced notably lower than instance types from other services with comparable CPU cores. This contributed to the higher value rankings of those instance types despite their generally lower performance
• Vertical scalability is limited, with the largest A4 VM (in terms of CPU cores) having the lowest performance ranking of all 8-core instance types; however, at half the cost, its value is still good. Exclusive use of AMD 4171 2.1GHz processors (released in 2010) is also a limiting factor. The forthcoming release of Intel Sandy Bridge Azure Big Compute instance types may address this deficiency

### HP Cloud

• HP compute instances provided marginally higher performance rankings for each of the 2-, 4- and 8-core instance type groups
• For on-demand pricing, the medium instance type provided the highest value ranking in the graph
• Performance increased 2X from the medium (2-core) to the extra-large (4-core) instance type, but the price difference is 4X. The 4-core large instance type between them was not tested

### Rackspace Cloud

• Rackspace and Windows Azure performed nearly the same; both are based on the AMD Opteron 4100 processor platform. However, Azure value is much higher for the 8-core A4 instance type (versus the Rackspace 8-core 30GB) because the cost is less than half (0.48/hr versus 1.00/hr, with 14GB memory for Azure versus 30GB for Rackspace). The same applies to a lesser extent to the 2- and 4-core instance types (Azure A2/3.5GB and A3/7GB versus Rackspace 4GB and 8GB)
• The 30GB compute instance had the lowest value of all instance types included in this post
• As with Windows Azure, vertical scalability may be limited due to the observed exclusive use of AMD 4170 2.1GHz processors (released in 2010). Rackspace does, however, offer an upgrade path through its dedicated hosting offerings.

## Next Up - Storage IO

CPU and storage IO are generally the two most important performance characteristics for compute services. Depending on workload, one might be more important than the other. Compute services often offer multiple storage options. Many storage options are networked and thus subject to higher variability than CPU and memory. Many workloads are sensitive to IO variations and may perform poorly in such environments. In the next post, we'll present IO performance and consistency analysis for the same providers covered in this post. Storage options covered will include:

AWS EC2
Ephemeral, EBS, EBS Provisioned IOPS, EBS Optimized