Cloudscaling and KT - Private cloud validation using benchmarking

A few months ago we were contacted by Cloudscaling CEO Randy Bias regarding our work benchmarking public IaaS clouds (see previous blog posts). His team was working on a large private cloud deployment for KT, Korea's largest landline and second largest mobile carrier, and was interested in using similar techniques to validate that private cloud. The validation would include not only raw benchmark results, but also comparisons of how the private cloud stacked up against existing public clouds such as EC2 and GoGrid. This data would be useful both for Cloudscaling to validate their own work and as a reference for their client KT. We agreed to the project, and benchmarking was conducted over a 4-day period last August. Our deliverables included raw benchmark data and an executive report highlighting the results. In this post we provide the results of these benchmarks.

Multi-Tenancy and Load Simulation

Benchmarking of public IaaS clouds involves a certain amount of ambiguity due to the scheduling and allocation of resources in multi-tenant virtualized environments. One of the fundamental jobs of a hypervisor such as VMware or Xen is to allocate shared resources in a fair and consistent manner. To maximize performance and utilization, hypervisors are designed to allocate resources such as CPU and disk IO using a combination of fixed and burstable methods. For example, when a VM requests CPU resources, the hypervisor will generally provide more when neighboring VMs are idle than when they are also requesting CPU. In very busy environments, this often results in variable and inconsistent VM performance.
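
To make the fixed-plus-burstable behavior concrete, here is a toy scheduler. This is purely our illustration, not VMware's or Xen's actual algorithm, and the 25% guarantee is an arbitrary assumed value: each busy VM receives a fixed guaranteed share, and capacity left idle by neighbors is handed out as burst.

```python
# A toy sketch (our illustration, not any real hypervisor's scheduler) of
# fixed-plus-burstable CPU allocation: every busy VM gets a fixed guaranteed
# share, and capacity left idle by neighbors is divided among VMs that
# want more, capped at each VM's actual demand.
def allocate_cpu(demands, guarantee=0.25):
    """demands maps VM name -> requested fraction of the host CPU (0..1)."""
    busy = {vm: d for vm, d in demands.items() if d > 0}
    # Fixed portion: each busy VM starts at min(demand, guarantee).
    alloc = {vm: min(d, guarantee) for vm, d in busy.items()}
    spare = max(0.0, 1.0 - sum(alloc.values()))
    # Burstable portion: split the spare capacity by unmet demand.
    unmet = {vm: busy[vm] - alloc[vm] for vm in busy if busy[vm] > alloc[vm]}
    total_unmet = sum(unmet.values())
    for vm, u in unmet.items():
        alloc[vm] = min(busy[vm], alloc[vm] + spare * u / total_unmet)
    return alloc

# With idle neighbors, vm1 bursts to the whole host...
print(allocate_cpu({"vm1": 1.0, "vm2": 0.0, "vm3": 0.0, "vm4": 0.0}))
# ...but when all four VMs contend, each falls back to its guarantee.
print(allocate_cpu({"vm1": 1.0, "vm2": 1.0, "vm3": 1.0, "vm4": 1.0}))
```

This is why the same VM can benchmark very differently pre-launch versus in a busy production environment, which motivates the load simulation described next.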

Because the KT cloud benchmarking was conducted pre-launch, there was no load in the environment other than our own. To offset this, we ran the benchmarks twice. In the first run, the benchmarks were run individually to capture maximum performance. In the second run, we attempted to simulate a loaded environment by filling the cloud to about 70% capacity with VMs instructed to run a random sample of load-simulating benchmarks (mostly non-synthetic benchmarks like tpcc, blogbench and pgbench). The benchmarks for the second run were conducted concurrently with this load simulation. The tables and graphs below provide the unloaded benchmark results; differences between those and the loaded results are noted above each set of results.
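
Mechanically, each filler VM ran a loop along the lines of the hedged sketch below. The test profile names and the use of the Phoronix Test Suite's batch-benchmark command are illustrative assumptions; the exact harness we used is not described in this post.

```python
# A hedged sketch of the per-VM load-simulation loop described above. The
# profile names and the use of phoronix-test-suite's batch-benchmark command
# are assumptions for illustration, not the exact harness we used.
import random
import subprocess

LOAD_TESTS = ["pts/pgbench", "pts/blogbench", "pts/tpcc"]  # assumed names

def simulate_load(iterations=100):
    for _ in range(iterations):
        test = random.choice(LOAD_TESTS)
        # batch-benchmark runs a test profile non-interactively
        subprocess.run(["phoronix-test-suite", "batch-benchmark", test])

if __name__ == "__main__":
    simulate_load()
```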

Organization of Results

The results below are separated into two general VM types: a large (16 and 32 GB) VM and a small (2 GB) VM. Comparative data is also shown from public clouds including BlueLock, GoGrid, Amazon EC2, Terremark vCloud Express and Rackspace Cloud, where we conducted similar benchmarking earlier this year. The results provided are based on 5 aggregate performance metrics we created and discussed in previous blog posts, including:

TAGS: Cloudscaling; KT; BlueLock; GoGrid; AWS EC2; Terremark; Rackspace
Read More

Benchmarking of EC2's new Cluster Compute Instance Type

Two months ago Amazon Web Services released a new "Cluster Compute" EC2 instance type, the cc1.4xlarge. This instance type is targeted at High-Performance Computing (HPC) workloads such as computationally intensive scientific applications. The major differences between this and other EC2 instance types are:

Previously, we published 5 blog posts on cloud server performance that did not include this new EC2 instance type (cc1.4xlarge):

The purpose of this post is to highlight the new EC2 Cluster Compute instance type in the context of these benchmarks and how it performs relative to the other EC2 instance types and to servers in other IaaS clouds. For specifics on how the benchmarks are conducted and scores calculated, review the previous blog posts linked above. The benchmarks were performed on an individual cc1.4xlarge instance and measure the performance of a single instance only. The most beneficial feature of this new instance type, its clustering capability via a 10 Gbps non-blocking network, is not highlighted in this post.

The new cluster compute instance type is currently only available in Amazon's US-East region, so the benchmark results tables below show only EC2 instances from that region. NOTE: Although the EC2 documentation states that the cluster compute instance is assigned 2 quad-core processors (8 cores total), the processors' hyper-threading capabilities resulted in benchmarks reporting 16 total cores.
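
To see where the 16 comes from, the short sketch below (our example, not part of the original benchmarks) compares the logical processor count most benchmarks report against the physical core count parsed from /proc/cpuinfo on Linux.

```python
# Our illustration: logical processors (what most benchmarks report) vs.
# physical cores on Linux. With hyper-threading enabled, a 2 x quad-core
# machine reports 16 logical processors but only 8 physical cores.
import os

logical = os.cpu_count()

physical = set()
package = None
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("physical id"):
            package = line.split(":")[1].strip()
        elif line.startswith("core id"):
            # A (package, core) pair identifies one physical core.
            physical.add((package, line.split(":")[1].strip()))

print(f"logical processors: {logical}, physical cores: {len(physical)}")
```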

TAGS: AWS EC2 cc1.4xlarge; CPU Performance; Memory IO Performance; Storage IO Performance; Application Performance; Encoding Performance
Read More

Cloud Server Benchmarking Part 5 - Encoding and Encryption

This is the fifth post in our series on cloud server performance benchmarking. In this post, we'll look at encoding and encryption performance using a compilation of 7 different benchmarks.

Benchmark Setup

All benchmarked cloud servers were configured almost identically using CentOS 64-bit (or 32-bit in the case of EC2 m1.small, c1.medium, Gandi, and IBM cloud servers).

Benchmark Methodology

Individual benchmark scores are calculated using the Phoronix Test Suite. To improve statistical accuracy, Phoronix runs each test at least 3 times, repeating until the standard deviation across runs falls below 3.5%.
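
That retry policy is simple to express in code. The sketch below is our paraphrase of the documented behavior rather than Phoronix source code, and the 10-run cap is our assumption:

```python
# Minimal sketch of the run policy described above: at least 3 trials, then
# keep re-running until the relative standard deviation drops below 3.5%.
# The max_runs cap is our assumption, not a documented Phoronix limit.
import statistics

def run_until_stable(run_trial, min_runs=3, max_runs=10, threshold=0.035):
    """run_trial: callable that executes the benchmark once and returns a number."""
    results = [run_trial() for _ in range(min_runs)]
    while len(results) < max_runs:
        mean = statistics.mean(results)
        if mean and statistics.stdev(results) / mean < threshold:
            break  # runs are consistent enough; stop repeating
        results.append(run_trial())
    return statistics.mean(results)
```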

We chose to use a dedicated, bare-metal cloud server as the performance baseline for this post. This provides a more readily comparable reference to non-cloud configurations. The server we chose as the baseline is the NewServers Jumbo server, configured with dual Intel E5504 quad-core 2.00 GHz processors and 48 GB of DDR3 ECC RAM. We chose NewServers because it is the only IaaS cloud that does not utilize a virtualization layer that could adversely affect the benchmark results; all NewServers servers run on top of physical hardware. We assigned the baseline server a score of 100. All other servers were assigned a score proportional to their performance, where greater than 100 represents better results and less than 100 represents poorer results. For example, a server with a score of 50 scored 50% lower than the baseline server overall, while a server with a score of 125 scored 25% higher.

To compute the score, the results from each of the 7 benchmarks on the baseline server are compared to the same benchmark results for a cloud server. The baseline server's result represents 100% for each benchmark. If a cloud server scores higher than the baseline, it receives a score above 100% (proportional to how much higher), and vice versa for a lower score.
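
In code form, the scoring arithmetic amounts to the sketch below. This is our reconstruction with hypothetical benchmark names; it assumes higher raw results are better, whereas a lower-is-better metric (such as an encode time) would invert the ratio.

```python
# Our reconstruction of the scoring arithmetic described above: express each
# benchmark as a percentage of the baseline (NewServers Jumbo) result, then
# average the percentages. Assumes higher raw results are better; a
# lower-is-better benchmark would use baseline/server instead.
def relative_score(server, baseline):
    """server, baseline: dicts mapping benchmark name -> raw result."""
    ratios = [100.0 * server[b] / baseline[b] for b in baseline]
    return sum(ratios) / len(ratios)

# The baseline scores 100 by construction; 125 means 25% better overall.
# Benchmark names and numbers here are hypothetical.
jumbo = {"gzip": 10.0, "openssl": 200.0}
other = {"gzip": 12.5, "openssl": 250.0}
print(relative_score(jumbo, jumbo))  # 100.0
print(relative_score(other, jumbo))  # 125.0
```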

Benchmarks

The following benchmarks were used to calculate the aggregate encoding performance (Encode) score displayed in the results tables below.

TAGS: Encoding Performance; AWS EC2; Rackspace; Storm on Demand; GoGrid; VoxCLOUD; NewServers; baremetalcloud
Read More

Cloud Server Benchmarking Part 4 - Memory IO

This is the fourth post in our series on cloud server performance benchmarking. In this post, we'll examine memory IO performance using a compilation of 7 different benchmarks. This is the last of the performance posts to focus on synthetic benchmarks. Memory IO performance is particularly important for applications that read and write memory heavily, such as caching systems and memory-based data stores like memcached and Redis.

Benchmark Setup

All benchmarked cloud servers were configured almost identically using CentOS 64-bit (or 32-bit in the case of EC2 m1.small, c1.medium, Gandi, and IBM cloud servers).

Benchmark Methodology

We chose to use a dedicated, bare-metal cloud server as the performance baseline for this post. This provides a more readily comparable reference to non-cloud configurations. The server we chose as the baseline is the NewServers Jumbo server, configured with dual Intel E5504 quad-core 2.00 GHz processors and 48 GB of DDR3 ECC RAM. We chose NewServers because it is the only IaaS cloud that does not utilize a virtualization layer that could adversely affect the benchmark results; all NewServers servers run on top of physical hardware. We assigned the baseline server a score of 100. All other servers were assigned a score proportional to their performance, where greater than 100 represents better results and less than 100 represents poorer results. For example, a server with a score of 50 scored 50% lower than the baseline server overall, while a server with a score of 125 scored 25% higher.

To compute the score, the results from each of the 7 benchmarks on the baseline server are compared to the same benchmark results for a cloud server. The baseline server's result represents 100% for each benchmark. If a cloud server scores higher than the baseline, it receives a score above 100% (proportional to how much higher), and vice versa for a lower score.

Benchmarks

The following benchmarks were used to calculate the aggregate memory IO performance (MIOP) score displayed in the results tables below. Geekbench and Unixbench measure both CPU and memory IO performance and are included in the aggregate MIOP metric with a weight of 25 points each. To view the raw Geekbench and Unixbench scores for each of the servers in this post, see the post What is an ECU? CPU Benchmarking in the Cloud.
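
As a sketch of how such a weighted aggregate works, consider the following. This is our illustration; the assumption that the remaining benchmarks split the other 50 points evenly is ours, not a published detail of the MIOP metric.

```python
# Our illustration of the weighted MIOP aggregation: Geekbench and Unixbench
# carry 25 points each; we assume the remaining benchmarks split the other
# 50 points evenly (the exact per-benchmark weights are not published here).
FIXED_WEIGHTS = {"geekbench": 25.0, "unixbench": 25.0}

def miop_score(relative):
    """relative: dict mapping benchmark name -> % of baseline result."""
    others = [b for b in relative if b not in FIXED_WEIGHTS]
    other_w = 50.0 / len(others) if others else 0.0
    weights = {b: FIXED_WEIGHTS.get(b, other_w) for b in relative}
    # Weighted average of the baseline-relative percentages.
    return sum(relative[b] * w for b, w in weights.items()) / sum(weights.values())
```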