
Value of the Cloud: CPU Performance


This post compares CPU performance and value for 18 compute instance types from 5 cloud compute platforms – AWS EC2, Google Compute Engine, Windows Azure, HP Cloud and Rackspace Cloud. The most interesting content is the data and resulting analysis. If you’re in a rush, scroll down or click below to go straight to it.

Go To Comparisons


In the escalating cloud arms race, performance is a frequent topic of conversation. Often, overly simplistic test models and fuzzy logic are used to substantiate sweeping claims. In a general sense, computing performance is relative to, and dependent on workload type. There is no single metric or measurement that encapsulates performance as a whole.

In the context of cloud, performance is also subject to variability due to nondeterministic factors such as multitenancy and hardware abstraction. These factors combined increase the complexity of cloud performance analysis because they reduce one’s ability to dependably repeat and reproduce such analysis. This is not to say that cloud performance cannot be measured, rather that doing so is not a precise science, and differs somewhat from traditional hardware performance analysis where such factors are not present.

Performance is workload dependent. Cloud performance is hard to measure consistently because of variability from multitenancy and hardware abstraction.


My goal in starting CloudHarmony in 2010 was to provide a credible source for objective and reliable performance analysis about cloud services. Since then, cloud has grown extensively and become an even more confusing place. The intent of this post is to present techniques and a visual tool we’re using to help assess and compare performance and value of cloud services. The focus of this post is cloud compute CPU performance and value. In the coming weeks, follow up posts will be published covering other performance topics including block storage, network, and object storage. As is our general policy, we have not been paid or otherwise influenced in the testing or analysis presented in this post.

The focus of this post is compute CPU performance and value. Follow up posts will cover other performance topics. We were not paid to write this post.

Testing Methods

To test the performance of compute services we run a suite of about 100 benchmarks on each type of compute instance offered. These benchmarks measure various performance properties including CPU, memory and disk IO. Each test iteration takes 1 to 2 days to complete. When multiple configuration options are offered, we usually run additional test iterations for each such option (e.g. compute services often offer multiple block storage options). Linux CentOS 6.* is our operating system of choice because of its nearly ubiquitous availability across services.

CPU Performance

Although our test suite includes many CPU benchmarks, our preferred method for compute CPU performance analysis is based on metrics provided by the CPU2006 benchmark suites. CPU2006 is an industry standard benchmark created by the Open Systems Group of the non-profit Standard Performance Evaluation Corporation (SPEC). CPU2006 consists of 2 benchmark suites that measure Integer and Floating Point CPU performance. The Integer suite contains 12 benchmarks, and the Floating Point suite 17. According to the CPU2006 website, “SPEC designed CPU2006 to provide a comparative measure of compute-intensive performance across the widest practical range of hardware using workloads developed from real user applications.” Thorough documentation about CPU2006, including descriptions of the individual benchmarks, is available on the CPU2006 website. CloudHarmony is a SPEC CPU2006 licensee.

The results table below contains CPU2006 SPECint (Integer) and SPECfp (Floating Point) metrics for each compute instance type included in this post. Each score is linked to a PDF report generated by the CPU2006 runtime for that specific test run. CPU2006 run and reporting rules require disclosure of settings and parameters used when compiling and running the CPU2006 test suites and this data is included in the reports. To summarize, our runs are based on the following settings:

Compilers: Intel C++ and Fortran Compilers version 12.1.5
Compilation Guidelines: disclosed in the linked CPU2006 reports
Run Type: Rate
Copies: 1 copy per CPU core or per 1GB memory (lesser of the two)
SSE Compiler Option: SSE4.2 or SSE4.1 (if supported by the compute instance)
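The copies rule above can be sketched as a tiny helper (the function name is ours, not part of the CPU2006 tooling):

```python
def rate_copies(cpu_cores: int, memory_gb: float) -> int:
    """CPU2006 rate copies: 1 copy per CPU core or per 1GB of
    memory, whichever yields fewer copies."""
    return min(cpu_cores, int(memory_gb))

# A 4-core instance with 2GB memory runs 2 copies;
# an 8-core instance with 32GB memory runs 8 copies.
```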
Our preferred method for compute CPU performance analysis is based on metrics provided by the SPEC CPU2006 benchmark suites

CPU2006 Test Results

To be considered official, CPU2006 results must adhere to specific run and reporting guidelines. One such guideline states that results should be reproducible. While this is important in the context of hardware testing, it is impractical for cloud due to performance variability resulting from multitenancy and hardware abstraction. However, CPU2006 guidelines allow for reporting of estimated results in cases where not all guidelines can be adhered to. In such cases results must be clearly designated as estimates. It is for this reason that results in the table below are designated as such.

| Compute Service | Instance Type | CPU Type | Cores | Price² | SPECint¹ | SPECfp¹ |
|---|---|---|---|---|---|---|
| AWS EC2 | cc2.8xlarge | Intel E5-2670 2.60GHz | 32 | 2.40 | 441.511194 | 357.602046 |
| HP Cloud | double-extra-large | Intel T7700 2.40GHz | 8 | 1.12 | 168.55417 | 132.3234 |
| AWS EC2 | m3.2xlarge | Intel E5-2670 2.60GHz | 8 | 1.00 | 150.30509 | 128.159625 |
| Google Compute | n1-standard-8 | Intel 2.60GHz | 8 | 1.06 | 149.354133 | 143.1015 |
| HP Cloud | extra-large | Intel T7700 2.40GHz | 4 | 0.56 | 98.430955 | 85.24574 |
| Rackspace Cloud | 30gb | AMD Opteron 4170 | 8 | 1.00 | 95.43979 | 83.89602 |
| Windows Azure | A4 | AMD Opteron 4171 | 8 | 0.48 | 91.33294 | 77.93744 |
| AWS EC2 | m3.xlarge | Intel E5-2670 2.60GHz | 4 | 0.50 | 80.180578 | 71.753345 |
| Google Compute | n1-standard-4 | Intel 2.60GHz | 4 | 0.53 | 66.945866 | 66.84303 |
| Rackspace Cloud | 8gb | AMD Opteron 4170 | 4 | 0.32 | 51.709779 | 47.562079 |
| Windows Azure | A3 | AMD Opteron 4171 | 4 | 0.24 | 51.58953 | 46.9475 |
| HP Cloud | medium | Intel T7700 2.40GHz | 2 | 0.14 | 48.825275 | 44.085027 |
| Google Compute | n1-standard-2 | Intel 2.60GHz | 2 | 0.265 | 39.469478 | 39.094813 |
| AWS EC2 | m1.large | Intel E5645 2.40GHz | 2 | 0.24 | 39.023586 | 34.7884 |
| AWS EC2 | m1.large | Intel E5-2650 2.00GHz | 2 | 0.24 | 38.816635 | 37.10992 |
| AWS EC2 | m1.large | Intel E5430 2.66GHz | 2 | 0.24 | 29.534628 | 23.805172 |
| Windows Azure | A2 | AMD Opteron 4171 | 2 | 0.18 | 27.38071 | 25.92939 |
| Rackspace Cloud | 4gb | AMD Opteron 4170 | 2 | 0.16 | 25.854861 | 24.25972 |

1: Base/Rate – Estimate
2: Hourly, USD – On Demand

Simplifying the Results

In order to provide simple and concise analysis derived from multiple relevant performance properties, it is helpful to reduce metrics from multiple related benchmarks to a single comparable value. The CPU2006 benchmark suites produce two metrics, SPECint for Integer, and SPECfp for Floating Point performance. A naive approach might be to combine them using a mean or sum of their values. However, doing so would be inaccurate because they are dissimilar values. Although they are calculated using the same algorithms, SPECint and SPECfp are produced from different benchmarks, and thus represent different meanings – as the idiom goes, this would be an apples to oranges comparison. An external analogy might be attempting to average 1 gallon of milk with 2 dozen eggs – in doing so, the resulting value: $$(1+2)/2=1.5$$ is meaningless because they are dissimilar values to begin with.

To merge dissimilar values like metrics from different benchmarks, the values must first be normalized to a common notional scale. One method for doing so is ratio conversion using factors from a common scale. The resulting ratios represent relationships between the original metrics and the common scale. Because the values share the same scale, they may then be operated on together using mathematical functions like mean and median. Using the same milk and eggs analogy, and assuming a common scale of groceries needed for the week, defined as 2 gallons of milk and 3 dozen eggs, grocery deficiency ratios may then be calculated as follows: \[\text{Milk deficiency} = \text{2 gallons needed} / \text{1 gallon on hand} = \text{Deficiency ratio } 2\] \[\text{Eggs deficiency} = \text{3 dozen needed} / \text{2 dozen on hand} = \text{Deficiency ratio } 1.5\] The resulting ratios, 2 and 1.5, may then be reduced to a single ratio representing the average grocery deficiency for both milk and eggs: \[\text{Average grocery deficiency} = (2+1.5)/2 = 1.75\] In other words, in order to stock up on groceries for the week, we’ll need to buy 1.75 times the milk and eggs currently on hand. Take note, however, that this ratio is only relevant in the context of milk and eggs as a whole, not separately, nor does it apply to other types of groceries.
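The grocery analogy translates directly into code. This is just the arithmetic above, expressed in Python:

```python
# Normalize dissimilar quantities to a common scale, then average the ratios.
on_hand = {"milk_gallons": 1, "eggs_dozen": 2}
needed  = {"milk_gallons": 2, "eggs_dozen": 3}   # the common scale

# Ratio conversion: each quantity relative to the common scale
ratios = {k: needed[k] / on_hand[k] for k in on_hand}
# ratios == {'milk_gallons': 2.0, 'eggs_dozen': 1.5}

# Now that both values share a scale, a mean is meaningful
avg_deficiency = sum(ratios.values()) / len(ratios)
# avg_deficiency == 1.75
```

The averaged ratio is only meaningful for milk and eggs taken together, mirroring the caveat above.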

The benefit of reducing dissimilar benchmark values to a single representative metric is to simplify the expression and comparison of related performance properties. It allows us to present cloud performance more generally, and at a level more fitting to the interests and time of cloud users. As much as we’d like users to become well versed in the intricacies of benchmarking and performance analysis, this is simply not feasible for most, and is a primary reason for our existence. Our goal is to provide users with a simple starting point to help narrow the scope from hundreds of possible cloud services.

In order to more generally and simply present cloud performance information we generate a single value derived from multiple related benchmarks

CPU Performance Metric

The CPU performance metric displayed in the graph below was calculated using both SPECint and SPECfp metrics and the common scale ratio normalization technique described above. The common scale was the mid 80th percentile mean of all CloudHarmony SPECint and SPECfp test results from the prior year. These results included many different compute services and compute instance types, not just those included in this post. This calculation results in the following common normalization factors:

SPECint Factor: 64.056
SPECfp Factor: 55.995

To shorten the resulting long decimal values, ratios were multiplied by 100. The metric can thus be interpreted as CPU performance relative to the mean of compute instances from many different cloud services. A value of 100 represents performance comparable to the mean, 200 twice the mean, and 50 half the mean. For example, the HP double-extra-large compute instance produced scores of 168.55417 for SPECint and 132.3234 for SPECfp. The resulting CPU performance metric of 249.72 was calculated using the following formula: $$\text{CPU Performance} = ((168.55417/64.056) + (132.3234/55.995))/2 \times 100 = (4.99448532/2) \times 100 = 249.724266$$ The value 249.72 signifies this instance type performed about 2.5 times better than the mean.
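Assuming the two normalization factors stated above, the metric calculation can be sketched as follows (`cpu_performance` is an illustrative name, not part of any published tooling):

```python
# Common-scale factors: mid 80th percentile means of prior-year results
SPECINT_FACTOR = 64.056
SPECFP_FACTOR = 55.995

def cpu_performance(specint: float, specfp: float) -> float:
    """Mean of the scale-normalized SPECint and SPECfp ratios, times 100."""
    int_ratio = specint / SPECINT_FACTOR
    fp_ratio = specfp / SPECFP_FACTOR
    return (int_ratio + fp_ratio) / 2 * 100

# HP double-extra-large: scores 168.55417 (int) and 132.3234 (fp)
print(round(cpu_performance(168.55417, 132.3234), 2))  # -> 249.72
```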

The CPU performance metric used below represents SPECint and SPECfp scores relative to compute instances from many cloud services. A higher value is better

Value Calculation

Cloud compute pricing is usually tied to CPU and memory allocation, with larger instance types offering more (or faster) CPU cores and memory. The CPU2006 benchmark suites are designed to take advantage of multicore systems when compiled and run correctly. Given the same hardware type, our test results generally show a near linear correlation between CPU allocation and CPU2006 scores. Because of these factors, the CPU performance metric derived from CPU2006 is well-suited for estimating value of compute instance types. To do so, we calculate value by dividing the metric by the hourly USD instance cost. For example, the HP extra-large compute instance costs 0.56 USD per hour and has a performance metric of 152.96. The resulting value metric 273.14 is calculated using the following formula: $$\text{Fixed Value} = 152.96/0.56 = 273.142857$$
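As a minimal sketch of the value formula (the function name is ours):

```python
def fixed_value(cpu_perf: float, hourly_usd: float) -> float:
    """Value metric: CPU performance per hourly USD."""
    return cpu_perf / hourly_usd

# HP extra-large: performance 152.96 at 0.56 USD/hr
print(round(fixed_value(152.96, 0.56), 2))  # -> 273.14
```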

Tiered Value

The graph below allows selection of either Tiered or Fixed Value options. Tiered Value is Fixed Value with an adjustment applied to instances ranked in the top or bottom 20 percent. The table below lists the exact adjustments used. The concept behind tiered values is based loosely on CPU pricing models where the top end processors generally command premium per GHz pricing, while the low end is often discounted. The HP double-extra-large compute instance costs 1.12 USD per hour and has a performance metric of 249.72. It is also ranked in the 91st percentile, which receives a +10% value adjustment. The resulting tiered value metric 245.256 is calculated using the following formula: $$\text{Tiered Value} = (249.72/1.12) \times 1.1 = 222.96 \times 1.1 = 245.256$$

Tiered Value Ranking Adjustments

| Ranking Percentile | Value Adjustment |
|---|---|
| Top 5% | +20% |
| Top 10% | +10% |
| Top 20% | +5% |
| Mid 60% | None |
| Bottom 20% | -5% |
| Bottom 10% | -10% |
| Bottom 5% | -20% |
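Combining the adjustment table with the fixed-value formula, the tiered calculation might be sketched like this (function names and the handling of exact boundary percentiles are our assumptions; the post does not specify how boundaries are binned):

```python
def tier_adjustment(percentile: float) -> float:
    """Multiplier from the tiered value ranking adjustments table.
    `percentile` is the instance's value ranking, where 100 is best."""
    if percentile >= 95:
        return 1.20   # Top 5%: +20%
    if percentile >= 90:
        return 1.10   # Top 10%: +10%
    if percentile >= 80:
        return 1.05   # Top 20%: +5%
    if percentile <= 5:
        return 0.80   # Bottom 5%: -20%
    if percentile <= 10:
        return 0.90   # Bottom 10%: -10%
    if percentile <= 20:
        return 0.95   # Bottom 20%: -5%
    return 1.0        # Mid 60%: no adjustment

def tiered_value(cpu_perf: float, hourly_usd: float, percentile: float) -> float:
    return cpu_perf / hourly_usd * tier_adjustment(percentile)

# HP double-extra-large: 91st percentile, so +10%
print(round(tiered_value(249.72, 1.12, 91), 2))  # -> ~245.26
```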
Cloud compute pricing is usually tied to CPU and memory allocation. Value metrics in the graph below are derived by dividing CPU performance by the hourly cost

Price Normalization

Most cloud providers, including all those covered in this post, offer on demand hourly pricing for compute instances. In addition, some providers offer commit based pricing and volume discounts. AWS EC2, for example, offers six reserve/commit based pricing tiers with 1 and 3 year terms. These pricing tiers exchange lower hourly rates for a setup fee paid in advance and, in the case of heavy reserve, a commitment to run the compute instance 24x7x365 for the duration of the term (light and medium reserve tiers do not have this requirement). In order to represent these pricing tiers in the graph below, the total cost was normalized to an hourly rate by amortizing the setup fee into the hourly rate. For example, the m3.xlarge instance type is offered under a 1 year heavy reserve tier for a 1489 USD setup fee and 0.123 USD per hour. For this instance type and pricing model the hourly rate used in the graph and for value metrics was 0.293/hr, calculated using the following formula: $$\text{Normalized Hourly Rate} = ((1489/365)/24) + 0.123 = 0.17 + 0.123 = 0.293$$
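The amortization is straightforward to sketch (the setup fee is spread over every hour of the term, matching the heavy-reserve assumption of continuous use; the function name is ours):

```python
def normalized_hourly_rate(setup_fee: float, hourly: float,
                           term_years: int = 1) -> float:
    """Amortize an upfront reserve setup fee into an hourly rate."""
    hours_in_term = term_years * 365 * 24
    return setup_fee / hours_in_term + hourly

# m3.xlarge, 1 year heavy reserve: 1489 USD setup + 0.123 USD/hr
print(round(normalized_hourly_rate(1489, 0.123), 3))  # -> 0.293
```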

AWS EC2 is also available under a bid based pricing model called spot pricing. Although spot instances are typically priced substantially below standard rates, spot pricing is highly volatile and subject to transient spikes that may result in unexpected termination of instances without notice. For this reason, spot pricing is generally not recommended for long term usage. The spot pricing included in the graph below is based on a snapshot taken in early June 2013 and may not reflect current rates.

Volume discount and membership based pricing, like Windows Azure MSDN pricing, was not included in the graph and value analysis because it is less straightforward and often requires substantial monthly spend commitments, at which point users would likely be able to negotiate similar discounts with any vendor.

The graph provides a drop down list allowing selection of different pricing models. When changed, the graph and table below update automatically.

The AWS EC2 reserve hourly pricing in the graph below is based on a normalized hourly value calculated by amortizing the setup fee into the hourly rate

Visualizing Value & Performance

On our current website and in prior posts we’ve often used traditional bar charts to represent data visually. While this is a typical approach to presenting comparative analysis, it often resulted in lengthy displays and did not lend itself well to large multivariate data sets. In the search for a more efficient and intuitive way to visualize such data, we discovered the D3 visualization library, which provides excellent tools and examples for creating data visualizations, and used it to design the graph below. The goal of this graph is to present large multivariate data sets in a concise, intuitive and interactive format. In a relatively small space, this graph allows users to observe many different characteristics of cloud services including:

CPU Performance
The size or diameter of each circle represents the proportional CPU performance of that compute instance. A larger circle represents a more performant system.
Price & Value
The fill color of each circle represents either the value or the price of each compute instance (defaults to value). Users can toggle between price, fixed value and tiered value fill options. Blue represents better value/lower price, while red represents lower value/higher price. A grey color is used for the midrange.
Vertical Scalability
Not all workloads lend themselves well to horizontal scaling models (where load is spread across many compute nodes). Legacy database servers, for example, often do not (easily) support multi-node clusters. By observing the variation in circle sizes from small to large, users may better understand the vertical scaling range and limits of each cloud service.
Instance Type Variability
Results are grouped by instance type and CPU architecture. In the case of EC2, this allows display of multiple records for a single instance type. The m1.large, for example, deployed to 3 different host types during our testing, each demonstrating slightly different performance characteristics.
Multiple Pricing Models
Users may view pricing and value based on different service pricing models. In the case of EC2, this allows toggling between on demand, reserve and spot pricing. Results in the graph and details table are updated instantly when the pricing model selection is changed.

Below the graph a sortable table displays details for each service and compute instance displayed in the graph. This table updates dynamically when fill color or pricing model selections are changed. Details for specific compute instances can also be viewed by hovering over a circle. In addition, users may zoom into a particular service by clicking on the container for that service. The graph can also be displayed in a larger popup view by clicking on the blue zoom icon displayed in the upper right corner when hovering over it.

The interactive graph below displays multiple characteristics of compute services and instance types including performance, price, value and vertical scalability. EC2 price and value can be toggled between on demand and reserve pricing tiers
Performance

Performance is represented by the diameter of the circle. Larger circles represent more performant systems.

Price & Value

Price and value are represented by the circle fill color. Blue represents lower pricing/better value.

Fill Metric

The Value fill metric represents a ratio between performance and price, while Price represents a fixed hourly cost.

Value Calculation

Fixed values are based on a simple ratio between performance and hourly cost. Tiered values are Fixed Values with an adjustment applied to services ranked in the top or bottom 5, 10 and 20 percent.


CPU2006 Results Summary Diagram

This diagram displays the actual CPU2006 SPECint and SPECfp metrics for each compute service and instance type. Hovering over a specific segment in the diagram displays these metrics.

Benchmark Results

Segments in this diagram depict individual benchmark metrics for each compute service and instance type. Segments are color coded where blue represents a better score and red worse.

Group by Service

When grouped by service, all instances for a specific compute service are listed together. The order of cloud services is based on the mean performance for all instance types belonging to that service. The service with the highest overall value appears in the 12 o’clock position. When not grouped by service, compute instances are ordered by mean results with the best performing instance located in the 12 o’clock position.


Comments and Observations

As is our general policy, we don’t recommend any one service over another. However, we’d like to point out some observations about each compute service included in this post.


AWS EC2

  • On demand pricing provides value similar to that of other compute services. However, EC2 value increases substantially under reserve pricing models
  • EC2 provides a broad performance range, topping out in this post with the 16 core (32 core hyper threaded) cc2.8xlarge instance type
  • CPU architecture varies between instance types, with higher end types generally running on newer and faster hardware
  • Older instance types like m1.large may deploy to different hardware platforms, and thus demonstrate variable performance. For example, there was a notable difference in performance between Intel E5430 and Intel E5-2650 based m1.large instances
  • The cc2.8xlarge provides good value for multithreaded workloads with high CPU demand

Google Compute Engine (GCE)

  • Performance increased near linearly from small to large instance types
  • The n1-standard-4 performed roughly 10% slower than we expected (112 actual CPU performance versus 120-125 expected)
  • The GCE hypervisor does not pass through full CPU identifiers, but Google’s GCE documentation states that processors are based on the Intel Sandy Bridge (E5-2670) platform
  • n1-standard-4 and n1-standard-8 instance types performed very similarly to the comparable EC2 instance types m3.xlarge and m3.2xlarge. All are based on the same Intel Sandy Bridge platform, and on demand pricing is also nearly the same (GCE is just a few cents higher)

Windows Azure

  • The A3 and particularly A4 instance types are priced notably lower than instance types from other services with comparable CPU cores. This contributed to the higher value rankings for those instance types despite their generally lower performance
  • Vertical scalability is limited, with the largest A4 VM (in terms of CPU cores) having the lowest performance ranking of all 8-core instance types – however, at half the cost, the value is still good. Exclusive use of AMD 4171 2.1GHz processors (released in 2010) is also a limiting factor. The forthcoming release of Intel Sandy Bridge Azure Big Compute instance types may address this deficiency

HP Cloud

  • HP compute instances provided marginally higher performance rankings for each of the 2, 4 and 8 core instance type groups
  • For on demand pricing, the medium instance type provided the highest value ranking in the graph
  • Performance increased 2X from medium (2 core) to extra-large (4 core) instance types, but the price difference is 4X. The 4 core large instance type between them was not tested

Rackspace Cloud

  • Rackspace and Windows Azure performed nearly the same. Both are based on the AMD 4100 processor platform. However, Azure value is much higher for the 8 core A4 instance type (versus the Rackspace 8 core 30GB) because the cost is less than half (0.48/hr versus 1.00/hr – 14GB memory Azure versus 30GB Rackspace). The same applied to a lesser extent for the 2 and 4 core instance types (Azure A2/3.5GB and A3/7GB versus Rackspace 4GB and 8GB)
  • The 30GB compute instance had the lowest value of all instance types included in this post
  • As with Windows Azure, vertical scalability may be limited due to the observed exclusive use of AMD 4170 2.1GHz processors (released in 2010). Rackspace does offer an upgrade path through its dedicated hosting offerings, however.

Next Up – Storage IO

CPU and storage IO are generally the two most important performance characteristics for compute services. Depending on workload, one might be more important than the other. Compute services often offer multiple storage options. Many storage options are networked and thus subject to higher variability than CPU and memory. Many workloads are sensitive to IO variations and may perform poorly in such environments. In the next post, we’ll present IO performance and consistency analysis for the same providers covered in this post. Storage options covered will include:

AWS EC2
Ephemeral, EBS, EBS Provisioned IOPS, EBS Optimized

Google Compute Engine
Local/Scratch, Persistent Storage

HP Cloud
Local, Block/External Storage

Windows Azure
Local Replicated, Geo Replicated

Rackspace Cloud
Local, SATA and SSD Block/External Storage

Following storage IO, we will also release posts covering network performance (inter-region, intra-region and external) and object storage IO.

