This is the fourth post in our series on cloud server performance benchmarking. In this post, we'll examine memory IO performance using a compilation of 7 different benchmarks. This is the last of the performance posts to focus on synthetic benchmarks. Memory IO performance is particularly important for applications that read from and write to memory heavily, like caching systems and memory-based data stores such as memcached and Redis.
All benchmarked cloud servers were configured almost identically using CentOS 64-bit (or 32-bit in the case of EC2 m1.small, c1.medium, Gandi, and IBM cloud servers).
We chose to use a dedicated, bare-metal cloud server as the performance baseline for this post. This will provide a more readily comparable reference to non-cloud configurations. The server we chose as the baseline is the NewServers Jumbo server configured with dual Intel E5504 quad core 2.00 GHz processors and 48GB DDR3 ECC RAM. We chose NewServers because they are the only IaaS cloud that does not utilize a virtualization layer that could adversely affect the benchmark results. All NewServers servers run on top of physical hardware. We assigned the baseline server a score of 100. All other servers were assigned a score proportional to their performance relative to the baseline, where greater than 100 represents better results and less than 100 represents poorer results. For example, a server with a score of 50 scored 50% lower than the baseline server overall, while a server with a score of 125 scored 25% higher.
To compute the score, the results from each of the 7 benchmarks on the baseline server are compared to the same benchmark results for a cloud server. The baseline server's benchmark score represents 100% for each benchmark. If a cloud server scores higher than the baseline, it receives a score higher than 100% (based on how much higher the score is), and vice versa for a lower score.
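In other words, each score is a weighted average of the server's benchmark results relative to the baseline. Expressed as a formula (our notation): score = 100 * [sum of weight_i * (server_result_i / baseline_result_i)] / (sum of all weights), with the ratio inverted (baseline/server) for benchmarks where a lower result is better.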
The following benchmarks were used to calculate the aggregate memory IO performance (MIOP) score displayed in the results tables below. Geekbench and Unixbench measure both CPU and memory IO performance and are included in the aggregate MIOP metric with a weight of 25 points each. To view the raw Geekbench and Unixbench scores for each of the servers in this post see the post What is an ECU? CPU Benchmarking in the Cloud.
CacheBench [weight=100]: CacheBench, which is part of LLCbench, is designed to test memory and cache bandwidth performance. 50 points are assigned to each of the read and write benchmarks.
Geekbench [weight=25]: Geekbench provides a comprehensive set of benchmarks engineered to quickly and accurately measure processor and memory performance. Designed to make benchmarks easy to run and easy to understand, Geekbench takes the guesswork out of producing robust and reliable benchmark results.
hdparm cached reads [weight=50]: Determines the speed of reading directly from the Linux buffer cache without disk access. This measurement is essentially an indication of the throughput of the processor, cache, and memory of the system under test.
RAMspeed [weight=100]: This benchmark tests the system memory (RAM) performance. 33 points are assigned to each of the benchmarks add, copy and scale.
Redis Benchmark [weight=50]: Redis is an in-memory key-value store. It includes the redis-benchmark utility that simulates SETs/GETs/INCRs/LPUSHs/LPOPs/PINGs/LRANGEs done by N clients at the same time sending M total queries (similar to Apache's ab utility). Our benchmark is performed with 50 simultaneous clients performing 100000 requests (./redis-benchmark -n 100000). The result is the average requests per second across all benchmark actions (gets, sets, etc.). For specific results, download the benchmark output source.
Stream [weight=100]: This benchmark tests the system memory (RAM) performance. 25 points are assigned to each of the benchmarks add, copy, scale and triad.
Unixbench [weight=25]: UnixBench provides a basic indicator of the performance of a Unix-like system. Multiple tests are used to test various aspects of the system's performance. These test results are then compared to the scores from a baseline system to produce an index value, which is generally easier to handle than the raw scores. The entire set of index values is then combined to make an overall index for the system.
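To illustrate how these weights combine into the MIOP score, here is a worked example with hypothetical (not measured) numbers:

Server X aggregate score = CacheBench: 90/100; Geekbench: 20/25; hdparm: 45/50; RAMspeed: 85/100; Redis Benchmark: 40/50; Stream: 95/100; Unixbench: 20/25; Total aggregate score = 395

Total baseline score = 450

Server X MIOP = (395/450) * 100 = 87.78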
This is the third post in our series on cloud server performance benchmarking. The previous blog posts, What is an ECU? CPU Benchmarking in the Cloud and Disk IO Benchmarking in the Cloud, focused on more synthetic (i.e. raw performance numbers with no real-world application) CPU and Disk IO performance. In this post, we'll look at performance using 4 common interpreted (full or byte-code) programming languages: Java, Ruby, Python and PHP.
This post is by no means intended to provide exhaustive and exact metrics for cloud server performance with these languages. The purpose of this and the other benchmarking posts is to provide a quick reference and starting point for further research and evaluation of different cloud providers. There is much variation in performance, features, reliability, support, pricing and other factors of different cloud providers, yet if you were to evaluate those providers based solely on marketing literature, you'd be hard pressed to distinguish one from another. Our goal is to reduce this ambiguity by providing objective, quantifiable measurements for comparing cloud providers.
All benchmarked cloud servers were configured almost identically in terms of OS and software:
Operating System: CentOS 64-bit (except for IBM Developer Cloud using RHEL 5 32-bit)
We chose to use a dedicated, bare-metal cloud server as the performance baseline for this post. This will provide a more readily comparable reference to non-cloud configurations. The server we chose as the baseline is Storm on Demand's bare-metal instance on E5506 2.13 GHz hardware (dual processors - 8 cores total), with 4 x 15K RPM SAS drives (RAID 10) and 8GB RAM. This is a fairly high-end server with many cores and very fast IO. We assigned this server a score of 100. All other servers were assigned a score proportional to their performance relative to the baseline, where greater than 100 represents better results and less than 100 represents poorer results. For example, a server with a score of 50 scored 50% lower than the baseline server overall, while a server with a score of 125 scored 25% higher.
To compute the score, the results from each of the 4 language benchmarks on the baseline server are compared to the same benchmark results for a cloud server. The baseline server's benchmark score represents 100% for each benchmark. If a cloud server scores higher than the baseline, it receives a score higher than 100% (based on how much higher the score is), and vice versa for a lower score. For example, for a hypothetical Server X:
Server X aggregate score = SPECjvm2008: 80/100; ruby-benchmark-suite: 90/100; PyBench: 70/100; PHPBench: 95/100; Total aggregate score = 335
Total baseline score = 400
Server X Score = (335/400) * 100 = 83.75
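The same arithmetic is easy to script. The following sketch (our own illustration, using the hypothetical Server X numbers above as weight,baseline,server triples) reproduces the calculation with awk; for lower-is-better benchmarks, the server/baseline ratio would be inverted before this step:

#!/bin/sh
# Compute an aggregate score from weight,baseline,server triples.
# Each benchmark contributes weight * (server/baseline) points.
awk 'BEGIN {
  # SPECjvm2008, ruby-benchmark-suite, PyBench, PHPBench (hypothetical values)
  n = split("100,100,80 100,100,90 100,100,70 100,100,95", rows, " ")
  for (i = 1; i <= n; i++) {
    split(rows[i], f, ",")
    total  += f[1]                 # sum of weights = baseline aggregate (400)
    server += f[1] * f[3] / f[2]   # weighted relative score (335)
  }
  printf "Score = %.2f\n", 100 * server / total   # prints Score = 83.75
}'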
The following benchmarks were used in this post:
SPECjvm2008 (Java Virtual Machine Benchmark) [Higher score is better]: A benchmark suite for measuring the performance of a Java Runtime Environment (JRE), containing several real-life applications and benchmarks focusing on core Java functionality. The suite focuses on the performance of the JRE executing a single application; it reflects the performance of the hardware processor and memory subsystem, but has low dependence on file I/O and includes no network I/O across machines. The SPECjvm2008 workload mimics a variety of common general-purpose application computations. These characteristics reflect the intent that this benchmark be applicable to measuring basic Java performance on a wide variety of both client and server systems.
Ruby (ruby-benchmark-suite) [Lower score is better]: A suite for measuring the performance of Ruby implementations, including micro-benchmarks that focus on core Ruby functionality as well as macro-benchmarks that represent a variety of real, common workloads. The project aims to provide a useful suite for comparing the performance of the available Ruby implementations and, as a positive side effect, to give VM implementers an additional tool to measure and identify performance-related issues. The score for this benchmark is the average of the median times to execute 13 of the macro benchmarks (all macro benchmarks minus bm_hilbert_matrix.rb). Specific times for the individual benchmarks are available in the benchmark source file.
Python (PyBench) [Lower score is better]: A collection of tests that provides a standardized way to measure the performance of Python implementations. The score for this benchmark is the average of all tests performed by PyBench.
PHP (PHPBench) [Higher score is better]: A benchmark suite for PHP. It performs a large number of simple tests in order to benchmark various aspects of the PHP interpreter. PHPBench can be used to compare hardware, operating systems, PHP versions, PHP accelerators and caches, compiler options, etc.
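For readers who want to reproduce these benchmarks, here is roughly how two of the suites are invoked; exact entry points vary by version and packaging, so treat these as illustrative rather than the precise commands we used:

# SPECjvm2008 provides a documented launcher JAR:
java -jar SPECjvm2008.jar

# PyBench ships with CPython 2.x sources under Tools/pybench:
python pybench.py

ruby-benchmark-suite and PHPBench are run from their respective source checkouts; consult each project's README, as their entry points have changed between releases.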
Most storage services like Amazon's S3 and Azure Blob Storage physically host each object from a single geographical location. S3, for example, is divided into 4 regions: US West, US East, EU West and APAC. If you store a file in an S3 US West bucket, it is not replicated to the other regions and can only be downloaded from that region's servers. The result is much slower network performance from locations with poor connectivity to that region. Hence, you need to choose your S3 region wisely based on geographical and/or network proximity. Azure's Blob storage uses a similar approach. Users can add a CDN on top of those services (CloudFront and Azure CDN), but the CDN does not provide the same access control and consistency features as the storage service.
In contrast, Google's new Storage for Developers service appears to store uploaded files on a globally distributed network of servers. When a file is requested, Google uses some DNS magic to direct the user to a server that will provide the fastest access to that file. This is very similar to the way Content Delivery Networks like Akamai and Edgecast work, wherein files are distributed to multiple globally placed PoPs (points of presence).
Our simple test consisted of requesting a 10MB test file from 17 different servers located in the US, EU and APAC. The test file was set to public and we used wget to test downlink and Google's gsutil to test uplink throughput (wget was faster than gsutil for downloads). In doing so, we found the same test URL resolved to 11 different Google servers with an average downlink of about 40 Mb/s! This hybrid model of CDN-like performance with enterprise storage features like durability, consistency and access control represents an exciting leap forward for cloud storage!
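For reference, the core of this test can be reproduced with commands along these lines (bucket and file names are placeholders):

# Downlink: fetch the public 10MB test object; wget reports the throughput.
wget -O /dev/null http://commondatastorage.googleapis.com/YOUR_BUCKET/10mb.bin

# Uplink: time an upload of the same file with Google's gsutil.
time gsutil cp 10mb.bin gs://YOUR_BUCKET/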
This is the second post in our series on cloud server performance benchmarking. The previous blog post, What is an ECU? CPU Benchmarking in the Cloud, focused strictly on CPU performance. In this post, we'll look at disk IO performance. Choosing a cloud provider should be based on many factors including performance, price, support, reliability/uptime, scalability, network performance and features. Our intent is to provide a good reference for those looking to use cloud services by providing objective analysis with regard to all of these factors.
All benchmarked cloud servers were configured almost identically in terms of OS and software: CentOS 5.4 64-bit (or 32-bit in the case of EC2 m1.small and c1.medium, and IBM's Development Cloud, where 64-bit is not supported). File systems were formatted as ext3.
In the previous post we used Amazon's EC2 ECU (Elastic Compute Unit) as a baseline for comparison of CPU performance between providers. In terms of disk performance, there really isn't a common term synonymous with the ECU. However, most readers will at least be somewhat familiar with hardware disk IO performance factors such as drive types (SAS, SATA), spindle speeds (10K, 15K) and RAID levels. So, we chose to use a "bare-metal" cloud server as the baseline for disk IO performance in this post. Our experience has been that most providers will not disclose much technical detail about their underlying storage systems. However, in the case of Storm on Demand's new Bare Metal Cloud Servers, they fully disclose most technical details about their servers (one of the selling points of this service). With this service, you are assigned a dedicated server. Your OS still runs on a hypervisor (Storm uses Xen), but the underlying hardware is not shared with any other virtual servers, so you have the full resources of that server available (no CPU limits, disk or memory sharing).
The server model we chose as the performance comparison baseline is the dual processor Intel E5506 2.13 GHz (8 cores total) with 4 x 15K RPM SAS drives configured in hardware-managed RAID 1+0. 15K RPM SAS is one of the fastest storage configurations, with the SAS interface supporting throughput up to 6 Gb/s, and RAID 1+0 can improve performance through striping.
To use this server as the baseline we assigned it an aggregate IO Performance score (IOP) of exactly 100 points. Other server disk IO benchmark results were then compared to the baseline results and assigned a relative score, where 100 is equal in performance, less than 100 worse, and greater than 100 better. For example, a server with a score of 50 scored 50% lower than the baseline server, while a server with a score of 125 scored 25% higher.
To compute the IOP, the results from each of the benchmarks are first calculated for the baseline server. Each benchmark has a weight in the overall aggregate score. The baseline server's benchmark scores represent the 100% mark for each of these weights (i.e. the baseline server receives the full weight of each benchmark for its IOP). Once the weighted scores are calculated for a server, they are summed to create an aggregate score. This score is then compared with the aggregate score of the baseline server and used to produce the IOP.
Server X aggregate score = blogbench-read: 180/200; bonnie++: 80/100; dbench: 25/30; fio: 20/30; hdparm: 90/100; iozone: 175/200; tiobench: 28/30; Total aggregate score = 598
Server X IOP = (598/690) * 100 = 86.67
We used a combination of 7 disk IO performance benchmarks to create the IOP score. The following is a description of the benchmarks and the corresponding weights used:
Blogbench [weight=200]: BlogBench is designed to replicate the load of a real-world busy file server by stressing the file-system with multiple threads of random reads, writes, and rewrites. It mimics the behavior of a blog by creating blogs with content and pictures, modifying blog posts, adding comments to these blogs, and then reading the content of the blogs. All of the blogs generated are created locally with fake content and pictures.
Bonnie++ [weight=100]: Bonnie++ is based on the Bonnie hard drive benchmark by Tim Bray. This program is used by ReiserFS developers, but can be useful for anyone who wants to know how fast their hard drive or file system is.
Dbench (128 clients) [weight=30]: Dbench is a benchmark designed by the Samba project as a free alternative to netbench, but dbench contains only file-system calls for testing the disk performance.
Flexible IO Tester (fio) [weight=30]: fio is an I/O tool meant to be used both for benchmarking and stress/hardware verification. It has support for 13 different types of I/O engines (sync, mmap, libaio, posixaio, SG v3, splice, null, network, syslet, guasi, solarisaio, and more), I/O priorities (for newer Linux kernels), rate I/O, forked or threaded jobs, and much more.
hdparm buffered disk reads [weight=100]: Determines the speed of reading through the buffer cache to the disk without any prior caching of data. This measurement is an indication of how fast the drive can sustain sequential data reads under Linux, without any filesystem overhead (an example invocation follows this list).
IOzone (4GB reads & writes) [weight=200]: The IOzone benchmark tests the hard disk drive/file-system performance.
Threaded I/O Tester (64 MB random write; write; read) [weight=30]: tiobench (Threaded I/O Tester) benchmarks the hard disk drive/file-system performance.
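As noted above, both hdparm measurements used in this series come from a single command (the device name below is a placeholder for your disk):

# -t: buffered disk reads (sustained sequential reads, used in this post)
# -T: cached reads (memory/cache throughput, used in the memory IO post)
hdparm -tT /dev/sda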
We credit the Phoronix Test Suite for making it easier to run the benchmarks. The tests above come from the Disk test suite (except for bonnie++). If you'd like to compare your own server to the baseline server in this post, you can install Phoronix and use the comparison feature when running your own IO benchmarks. The full baseline server results are available here (including the ID for comparison tests). The baseline server bonnie++ results are available here.
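Assuming a working Phoronix Test Suite installation, a comparison run looks roughly like this, where the argument is a placeholder for the baseline result ID mentioned above:

# Runs the same tests as the saved result and merges them for comparison
phoronix-test-suite benchmark BASELINE-RESULT-ID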