Do SLAs really matter? A 1 year case study of 38 cloud services

In late 2009 we began monitoring the availability of various cloud services. To do so, we partnered or contracted with cloud vendors to let us maintain, monitor and benchmark the services they offered. These include IaaS vendors (i.e. cloud servers, storage, CDNs) such as GoGrid and Rackspace Cloud, and PaaS services such as Microsoft Azure and AppEngine. We use Panopta to provide monitoring, outage confirmation, and availability metric calculation. Panopta provides reliable monitoring metrics using a multi-node outage confirmation process in which each outage is verified by 4 geographically dispersed monitoring nodes. Additionally, we attempt to manually confirm and document all outages longer than 5 minutes using our vendor contacts or the provider's status page (if available). Outages caused by scheduled maintenance are removed. DoS ([distributed] denial of service) outages are also removed if the vendor is able to restore service within a short period of time. Any outages triggered by us (e.g. server reboots) are also removed.
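The filtering rules described above can be sketched roughly as follows. This is a minimal illustration, not Panopta's or our actual implementation: the `Outage` record, its field names, the cause labels, and the reading of "verified by 4 nodes" as "all 4 nodes must agree" are assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical cause labels; the real monitoring data may categorize outages differently.
EXCLUDED_CAUSES = {"scheduled_maintenance", "self_triggered"}


@dataclass
class Outage:
    minutes: float                  # duration of the outage
    confirming_nodes: int           # monitoring nodes that observed the outage
    cause: str = "unknown"          # e.g. "dos", "scheduled_maintenance"
    restored_quickly: bool = False  # only meaningful for DoS outages


def counts_toward_availability(outage: Outage, required_nodes: int = 4) -> bool:
    """Return True if the outage should be included in availability metrics."""
    # Assumption: an outage is only confirmed when all required nodes agree.
    if outage.confirming_nodes < required_nodes:
        return False
    # Scheduled maintenance and outages we triggered ourselves are removed.
    if outage.cause in EXCLUDED_CAUSES:
        return False
    # DoS outages are removed only if the vendor restored service quickly.
    if outage.cause == "dos" and outage.restored_quickly:
        return False
    return True


observed = [
    Outage(12.0, 4),
    Outage(2.0, 3),                               # not confirmed by all nodes
    Outage(30.0, 4, cause="scheduled_maintenance"),
    Outage(3.0, 4, cause="dos", restored_quickly=True),
]
counted = [o for o in observed if counts_toward_availability(o)]
print(len(counted), sum(o.minutes for o in counted))
```

Only the first record survives the filter here, so it alone would contribute to the downtime total.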

The purpose of this post is to compare the availability metrics we have collected over the past year with vendor SLAs to determine if in fact there is any correlation between the two.

SLA Credit Policies

In researching various vendor SLA policies for this post, we discovered a few general themes with regard to SLA credit policies that we'd like to mention here. The table below labels them Pro-rated, Percentage, and Threshold.

The fairest and simplest of these policies seems to be the pro-rated method, while the threshold method seems to give the provider the greatest protection and flexibility (based on our data, most outages tend to be shorter than the thresholds used by the vendors). In the table below, we attempt to identify which of these SLA credit policies is used by each vendor. Vendors that apply a threshold policy are highlighted in red.
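To make the difference between these policies concrete, here is a rough sketch of how each might be computed. The three functions are simplified, hypothetical readings of credit terms that appear in the table (a 10x pro-rated credit, 5% per 30 minutes of downtime, and 5% per continuous outage over 30 minutes); they are not any vendor's actual formula, and the 30-day billing month and the sample outage durations are assumptions.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # assumes a 30-day billing month


def prorated_credit(monthly_fee, downtime_min, multiplier=1.0):
    """Credit proportional to total downtime, scaled by a multiplier, capped at the fee."""
    credit = multiplier * monthly_fee * downtime_min / MINUTES_PER_MONTH
    return min(credit, monthly_fee)


def percentage_credit(monthly_fee, downtime_min, pct_per_block=5.0, block_min=30.0):
    """Fixed percentage of the invoice per completed block of downtime, capped at 100%."""
    blocks = int(downtime_min // block_min)
    return monthly_fee * min(blocks * pct_per_block, 100.0) / 100.0


def threshold_credit(monthly_fee, outages_min, threshold_min=30.0, pct_per_outage=5.0):
    """Only continuous outages longer than the threshold earn any credit."""
    qualifying = sum(1 for minutes in outages_min if minutes > threshold_min)
    return monthly_fee * min(qualifying * pct_per_outage, 100.0) / 100.0


# Hypothetical month: six short outages totalling 60 minutes on a $100 plan.
outages = [5.0, 8.0, 12.0, 10.0, 15.0, 10.0]
total = sum(outages)

print(round(prorated_credit(100.0, total, multiplier=10), 2))  # 1.39
print(percentage_credit(100.0, total))                         # 10.0
print(threshold_credit(100.0, outages))                        # 0.0
```

Under these assumptions the threshold policy pays nothing, because no single outage exceeds 30 minutes even though the month accumulated an hour of downtime.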

SLAs versus Measured Availability

The SLA data provided below is based on current documentation provided on each vendor's website. The Actual column is based on 1 year of monitoring (a few of the services listed have been monitored for less than 1 year), using servers we maintain with each of these vendors. We have included 38 IaaS providers in the table. We currently monitor and maintain availability data on 90 different cloud services. The Actual column is highlighted green if it is equal to or exceeds the SLA.

| Provider | Data Center | Total # Outages / Mins Down | SLA Credit Policy | SLA | Actual |
|---|---|---|---|---|---|
| AWS EC2 | US East | 0 / 0 | Percentage: 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 100% |
| AWS EC2 | US West | 0 / 0 | Percentage: 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 100% |
| GoGrid | US West | 0 / 0 | Pro-rated: 100x credit for any downtime | 100% | 100% |
| Linode VPS | London | 0 / 0 | Pro-rated: 1x credit for downtime exceeding 0.1% | 99.9% | 100% |
| OpSource Cloud | VA, US | 0 / 0 | Percentage: 5% invoice credit for 60 minutes downtime, 10% for up to 120 minutes, and so on | 100% | 100% |
| Storm on Demand | MI, US | 0 / 0 | Pro-rated: 10x credit for any downtime | 100% | 100% |
| VoxCLOUD | EU | 0 / 0 | Percentage: 5% invoice credit per 0.1% downtime up to 100% | 100% | 100% |
| GoGrid | US East | 1 / 2.3 | Pro-rated: 100x credit for any downtime | 100% | 99.999% |
| Joyent Smart Machines | Andover, MA | 1 / 3 | Percentage: 5% of the monthly fee for each 30 minutes of downtime | 100% | 99.999% |
| VoxCLOUD | Singapore | 1 / 5.5 | Percentage: 5% invoice credit per 0.1% downtime up to 100% | 100% | 99.999% |
| Speedyrails VPS | Peer1 Quebec | 1 / 2.2 | Percentage: 3% of monthly fees for every 0.1% of downtime | 99.9% | 99.999% |
| Rackspace Cloud | Dallas, TX | 1 / 8.7 | Threshold/Percentage: 5% of the fees for each 30 minutes of network downtime (1 hour for hardware) up to 100%. Host hardware failures guaranteed to be fixed within 1 hour of problem identification | 100% [1] | 99.998% |
| SoftLayer CloudLayer | Dallas, TX | 4 / 13.9 | Threshold/Percentage: 5% monthly invoice credit for each continuous network outage over 30 minutes. 20% monthly invoice credit for failed hardware not replaced within 2 hours. Max 100% credit | 100% [1] | 99.997% |
| Hosting.com | Colorado | 1 / 1.4 | Percentage: 1/30th monthly invoice credit for every 30 minutes network downtime. 1/30th monthly invoice credit for every 30 minutes hardware downtime after 1 hour buffer for hardware repair | 100% [1] | 99.997% |
| AWS EC2 | APAC | 5 / 14.8 | Percentage: 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 99.996% |
| Linode | Atlanta | 10 / 26.9 | Pro-rated: 1x credit for downtime exceeding 0.1% | 99.9% | 99.995% |
| Joyent Smart Machines | Emeryville, CA | 4 / 15.2 | Percentage: 5% of the monthly fee for each 30 minutes of downtime | 100% | 99.994% |
| Terremark vCloud | FL, US | 7 / 37.9 | Unique: $1 for every fifteen (15) minute downtime period up to a maximum amount equal to 50% of the usage fees | 100% | 99.993% |
| AWS EC2 | EU West | 3 / 36 | Percentage: 10% invoice credit anytime annual uptime falls below 99.5% | 99.5% | 99.993% |
| Speedyrails VPS | Canix Quebec | 9 / 38.7 | Percentage: 3% of monthly fees for every 0.1% of downtime | 99.9% | 99.992% |
| Linode | Fremont, CA | 13 / 71.9 [2] | Pro-rated: 1x credit for downtime exceeding 0.1% | 99.9% | 99.986% |
| Zerigo | CO, CA | 9 / 66.8 | Pro-rated: 4x the total (starting from 100%, not 99.99%) non-compliant time | 99.99% | 99.985% |
| SoftLayer CloudLayer | DC, US | 31 / 86.7 | Threshold/Percentage: 5% monthly invoice credit for each continuous network outage over 30 minutes. 20% monthly invoice credit for failed hardware not replaced within 2 hours. Max 100% credit | 100% [1] | 99.984% |
| SoftLayer CloudLayer | WA, US | 13 / 106.8 | Threshold/Percentage: 5% monthly invoice credit for each continuous network outage over 30 minutes. 20% monthly invoice credit for failed hardware not replaced within 2 hours. Max 100% credit | 100% [1] | 99.980% |
| Linode | NJ, CA | 14 / 145.7 | Pro-rated: 1x credit for downtime exceeding 0.1% | 99.9% | 99.972% |
| VoxCLOUD | NY, US | 12 / 146.3 [3] | Percentage: 5% invoice credit per 0.1% downtime up to 100% | 100% | 99.972% |
| CloudSigma | Switzerland | 22 / 59.9 | Threshold/Percentage: 50x credit for any downtime (network or hardware) over 15 minutes | 100% | 99.972% |
| Hosting.com | KY, US | 4 / 38.7 [4] | Percentage: 1/30th monthly invoice credit for every 30 minutes network downtime. 1/30th monthly invoice credit for every 30 minutes hardware downtime after 1 hour buffer for hardware repair | 100% [1] | 99.955% |
| ThePlanet Cloud Servers | TX, US | 34 / 144.3 | Threshold/Percentage: 5% monthly invoice credit for first 5 minute continuous outage (hardware or network). Then, 5% additional credit for each additional 30 minute continuous outage | 100% | 99.955% |
| Gandi VPS | France | 4 / 147.7 | Pro-rated: 1 day credit for every outage over 7 minutes within a single day | 99.95% | 99.955% |
| Linode | Dallas | 21 / 258.2 | Pro-rated: 1x credit for downtime exceeding 0.1% | 99.9% | 99.951% |
| NewServers | FL, US | 39 / 288.7 | Pro-rated: 24x credit for every 1 hour of downtime exceeding 0.001% | 99.999% | 99.945% |
| VPS.NET | UK | 8 / 250.3 | Percentage: 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.921% |
| VPS.NET | US Central | 12 / 342.9 | Percentage: 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.892% |
| Flexiscale | UK | 83 / 820.3 [6] | Percentage: 5% monthly invoice credit for each 30 minutes of downtime | 100% | 99.844% |
| VPS.NET | US West | 32 / 576.5 | Percentage: 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.819% |
| ReliaCloud | MN, US | 23 / 1941.5 [7] | Pro-rated: 30x hourly credit for each hour downtime | 100% | 99.626% |
| VPS.NET | US East | 6 / 1224.1 [8] | Percentage: 10% monthly invoice credit for each hour of downtime | 100% [5] | 99.616% |

[1] Applies to network connectivity only, not hardware outages.

[2] Linode does not own or operate this data center (or any of its data centers, to our knowledge). This particular data center in Fremont, CA is owned and operated by Hurricane Electric. About 20 minutes of the outages triggered for this location were due to data center wide power outages completely outside of the control of Linode.

[3] A majority of this downtime (114 minutes) was due to a SAN failure on 10/15/2010.

[4] A majority of this downtime (34.5 minutes) was due to an internal network failure on 1/5/2011. We've been told this problem has since been resolved.

[5] Applies only to clients who have signed up for the VPS.net "Managed Support" package ($99/mo). It appears that VPS.net does not provide any SLA guarantees to other customers.

[6] Approximately 560 minutes of these outages occurred due to failure of their SAN.

[7] A majority of these outages (1811 minutes) occurred between Jan-Feb 2010, immediately following ReliaCloud's public launch (post beta). A majority of the downtime seems to have occurred due to SAN failures.

[8] Explanation provided for approximately 1200 minutes of these outages (2 separate outages) was "We had a problem on the cloud. Now your VPS is up and running".

Is there a correlation between SLA and actual availability?

The short answer based on the data above is absolutely not. Here is how we arrived at this conclusion:

Total # of services analyzed: 38
Services that met or exceeded SLA: 15/38 [39%]
Services that did not meet SLA: 23/38 [61%]
Vendors with 100% SLAs: 23/38 [61%]
Vendors with 100% SLAs achieving their SLA: 4/23 [17%]
Mean availability of vendors with 100% SLAs: 99.929% [6.22 hrs/yr]
Median availability of vendors with 100% SLAs: 99.982% [1.58 hrs/yr]
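The bracketed hours-per-year figures follow directly from the availability percentages. As a quick sketch of the arithmetic (the function names are ours, and a 365-day year is assumed):

```python
HOURS_PER_YEAR = 365 * 24          # 8,760; ignores leap years
MINUTES_PER_YEAR = HOURS_PER_YEAR * 60


def downtime_hours_per_year(availability_pct):
    """Convert an availability percentage into expected hours of downtime per year."""
    return (100 - availability_pct) / 100 * HOURS_PER_YEAR


def availability_pct(minutes_down, window_minutes=MINUTES_PER_YEAR):
    """Availability over a monitoring window, given total minutes of downtime."""
    return (1 - minutes_down / window_minutes) * 100


print(round(downtime_hours_per_year(99.929), 2))  # 6.22 (mean for 100% SLA vendors)
print(round(downtime_hours_per_year(99.982), 2))  # 1.58 (median for 100% SLA vendors)
print(round(availability_pct(258.2), 3))          # 99.951 (Linode Dallas, full-year window)
```

The last line assumes a full year of monitoring. Services monitored for less than a year need a shorter `window_minutes`, so not every row in the table can be reproduced with the default.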

It is very interesting to observe that the bottom 6 vendors all provided 100% SLAs, while 3 of the top 7 provide the lowest SLAs of the group (EC2 99.5% and Linode 99.9%). SLAs were only achieved for a minority (39%) of the vendors. This is particularly applicable to vendors with 100% SLAs where only 4 of 23 (17%) actually achieved 100% availability.

Vendors with generous SLA credit policies

In most cases SLA credit policies provide extremely minimal financial recourse, even before considering all of the hoops you'll have to jump through to get them. Not one of the SLAs we reviewed allowed for more than 100% of service fees to be credited. There are a few vendors that stood out by providing relatively generous SLA credit policies:

Some Extra Cool Stuff: Cloud Availability Web Services and RSS Feed

We've recently released web services and an RSS feed to make our availability metrics and monitoring data more accessible. Up to this point, this data was only available on the Cloud Status Tab of our website. We currently offer 30 different REST and SOAP web services for accessing cloud benchmarking and monitoring data, and vendor information.

Cloud Outages RSS Feed

This feed provides information about potential outages we are currently observing with any of the 90 cloud services we monitor. Click here to view and subscribe to this feed.

getAvailability Web Service

This post includes a small snapshot of the data we maintain on cloud availability. We have released a new web service that allows users to calculate availability and retrieve outage details (including supporting comments) for any of the 90 cloud services we currently monitor, filtered by time frame, service type, vendor, etc. Monitoring for many of these services began between October 2009 and January 2010, and we are continually adding new services to the list. To get you started, we have provided a few example RESTful request URLs. These example requests return JSON-formatted data. To request XML-formatted data, append &ws-format=xml to any of these URLs. Full API documentation for this web service is provided here. A SOAP WSDL is also provided here. You may invoke this web service for free up to 5 times daily. To purchase a web service token allowing additional web service invocations, click here.

Retrieve availability for all IaaS vendors for the past year (first 10 of 46 results)

Retrieve availability for all IaaS vendors for the past year (results 11-20 of 46)

Retrieve availability for all CDNs for 2010 (first 10 of 13 results)

Retrieve availability for all CDNs for 2010 (results 11-13 of 13)

Retrieve availability for all AWS services (EC2, S3, CloudFront) for the past 6 months

Retrieve availability for GoGrid Cloud Servers for the past 2 weeks

Retrieve availability for VPS.net's US East data center since 1/1/2010 - include full outage documentation
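If you are scripting against these requests, switching a request from JSON to XML is just a matter of adding the ws-format parameter. A minimal sketch using Python's standard library follows. The URL shown is a placeholder, not the real endpoint, and `placeholder=1` stands in for whatever query parameters the API documentation specifies.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_xml_format(url):
    """Return the URL with ws-format=xml set, replacing any existing ws-format value."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "ws-format"
    ]
    query.append(("ws-format", "xml"))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Placeholder URL for illustration only; substitute one of the example request URLs.
example = "https://example.invalid/getAvailability?placeholder=1"
print(with_xml_format(example))
```

Keep in mind the free limit of 5 invocations per day when testing.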

Summary

Don't let SLAs lull you into a false sense of security. SLAs are most likely influenced more by marketing and legal wrangling than by technical merit or precedent. SLAs should not be relied upon as a factor in estimating the stability and reliability of a cloud service, or for any form of financial recourse in the event of an outage. Most likely any service credits provided will be a drop in the bucket relative to the reduced customer confidence and lost revenue the outage will cause your business. The only reasonable way to determine the actual reliability of a vendor is to use their service or obtain feedback from existing clients or services such as ours. For example, AWS EC2 maintains the lowest SLA of any IaaS vendor we know of, and yet it provides some of the best actual availability (100% for 2 regions, 99.996% and 99.993% for the other two). Beware of the fine print. Many cloud vendors use minimum continuous outage thresholds such as 30 minutes or 2 hours (e.g. SoftLayer) before they will issue any service credit, regardless of whether or not they have met their SLA. In short, we are of the opinion that SLAs really don't matter much at all.