In a previous post, we took a look at the different types of EC2 storage and how they impact the overall performance of whatever EC2 instance type you select.
Now, we’ll look at the instance types themselves and focus on rendering performance for two and ten concurrent users executing a fairly basic workload.
The two concurrent user tests attempt to imitate the work you might see on a small, twenty-five user Tableau installation with a roughly ten percent concurrency rate. The ten concurrent user test imitates a sub-one-hundred-user Tableau Server.
Ten percent is generally considered a “good enough” (and conservatively high-ish) guesstimate of what concurrency might look like in an on-premises enterprise implementation of Tableau. In a SaaS scenario, you’ll see a much lower number – somewhere between one and five percent. FYI, I’ve never actually seen 5%+ myself in SaaS-land.
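For the arithmetically inclined, here’s that rule of thumb as a trivial Python sketch (the numbers are just the assumptions above, nothing more):

```python
# Back-of-the-envelope concurrency estimate using the ~10% rule of thumb.
total_users = 25          # size of the Tableau user community
concurrency_rate = 0.10   # on-premises guess; SaaS tends to run 0.01-0.05

concurrent_users = total_users * concurrency_rate
print(f"Plan for ~{concurrent_users:.0f} concurrent users")  # ~2 for a 25-user shop
```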
The exact mix of reports in my test workload is detailed in the previous post, so if you haven’t given it a look, you probably should.
Need a reminder of the instances I tested? Here:
EC2 Instance Types:
- c3.xlarge (4 vCPUs, 7.5 GB RAM)
- c3.2xlarge (8 vCPUs, 15 GB RAM)
- c3.4xlarge (16 vCPUs, 30 GB RAM – basic “quick” test only)
- m3.xlarge (4 vCPUs, 15 GB RAM)
- m3.2xlarge (8 vCPUs, 30 GB RAM)
Obligatory warning before you read any more: My workload is different from yours, so your results will be different. The goal of this exercise is to see how the same workload varies in performance across different instance & storage types…not to predict how your workload will behave on these combinations. It’s up to you to do your own testing if you need something prescriptive.
You really should read this article from start to finish (and the previous post, too) since the whole idea is to teach yourself how to fish. But if you’re in a hurry and comfortable winging it, here’s the top line:
c3.xlarge or m3.xlarge for two concurrent users running only a very, very basic workload; c3.2xlarge or m3.2xlarge for everything else up to around ten concurrent users.
- For two concurrent users running only a simple workload, I found that either the m3.xlarge or c3.xlarge was OK-ish. Mostly. They rendered basic reports only ~1 second slower than beefier EC2 instances.
- Even 2 concurrent users running “big” reports are enough to slow down xlarge rendering significantly.
- On average, an m3.xlarge rendered 51% more slowly than an m3.2xlarge
- A c3.xlarge rendered the same reports (on average) 77% more slowly than a c3.2xlarge
- xlarge machines offer unacceptable performance for ten concurrent user workloads. The difference between initial rendering and average rendering time is approximately 100% on 2xlarge instances. On xlarge instances, the delta is more like 500%+
It’s not news that a virtualized core or CPU generally doesn’t offer the same performance as its physical counterpart. EC2 is no different in this regard. If you read the information on http://aws.amazon.com/EC2/instance-types/, you’ll see that:
Each vCPU is a hyperthread of an Intel Xeon core for M3, C3, R3, HS1, G2, and I2
In general, one vCPU more-or-less correlates to half a physical core (actually, maybe a touch more, based on how things are implemented). I’d suggest you read a great third-party post on this subject here:
http://www.pythian.com/blog/virtual-cpus-with-amazon-web-services/
So, based on how AWS implements virtual CPUs right at this moment, a quick-and-dirty correlation is 2 EC2 vCPUs = 1 physical core. Note how I’m using the word “correlation” over and over again. I’m doing that on purpose. Also keep in mind that AWS could change the way this stuff works, so this quick-and-dirty guesstimate could change as hardware capability evolves and improves over time.
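To make that correlation concrete, here’s a minimal Python sketch applying the 2-vCPUs-to-1-core guesstimate to the instances in this test (the halving rule is my rough correlation from above, not anything AWS publishes as gospel):

```python
# Rough physical-core equivalents under the "2 vCPUs ~= 1 core" correlation.
# vCPU counts come from the instance specs listed earlier in this post.
instances = {
    "c3.xlarge": 4,
    "c3.2xlarge": 8,
    "c3.4xlarge": 16,
    "m3.xlarge": 4,
    "m3.2xlarge": 8,
}

for name, vcpus in instances.items():
    print(f"{name}: {vcpus} vCPUs ~= {vcpus // 2} physical cores")
```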
And I know what you’re thinking right about now…something along these lines, I bet:
Hold on, I bought X cores of <any software product that does hardware-based licensing> and if I run them on top of X vCPUs in EC2 I’m essentially getting the horsepower of X/2 physical CPUs? Aren’t I wasting my investment in the software?
...That would be a “Yes” in my humble opinion. Named User licensing is probably your friend in this sort of scenario – you can throw as much hardware at your named users as you wish.
EDIT: 24 November 2014:
<edit begins>
Jesse Sturges, an SC here at Tableau, has been working with a customer implementing core licensing on AWS. He and his customer discovered that the Windows OS running in EC2 reports only half of an instance’s EC2 vCPUs as cores. Meaning: if you run a 16 vCPU instance, the OS (and therefore Tableau) sees 8 cores from a licensing perspective. If you run 8 vCPUs, we see and need to license 4 cores, and so on.
We talked to some of the fine folks over at AWS and they confirmed the behavior. Great news!
Essentially, by “doubling up” on virtual CPUs in EC2, you can approximate the (CPU) performance of Tableau on physical hardware, and you won’t be penalized by our licensing logic demanding you buy 2x the number of Tableau cores to run that way. This’ll let you get the most bang possible for your Tableau software buck.
</edit ends>
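If you want to verify what the OS (and therefore Tableau’s core licensing) sees on your own instance, here’s a quick sketch using the third-party psutil library – my choice of tool, not something Tableau or AWS prescribes:

```python
import psutil  # third-party: pip install psutil

logical = psutil.cpu_count(logical=True)    # vCPUs / hyperthreads the OS sees
physical = psutil.cpu_count(logical=False)  # cores as reported to the OS --
                                            # the number core licensing counts

print(f"{logical} vCPUs reported as {physical} cores")
# On a 16 vCPU EC2 instance, expect something like "16 vCPUs reported as 8 cores"
```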
So now, let’s look at some numbers!!
Note: I’ll post all the numbers at the end of this post so you can tinker with them yourself. I’m going to compare instance types using the same baseline storage solution of two striped general purpose SSD disks. Read part one of this article if you want to know why (good price-to-performance as long as you’re not executing disk-centric workloads – like using lots of extracts).
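If you want to reproduce that baseline, here’s a hedged boto3 sketch of provisioning the two general purpose SSD volumes (instance ID, sizes, region, and device names are all placeholders – and the actual striping happens afterward, at the OS level):

```python
import boto3  # AWS SDK for Python

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is a placeholder

# Create two general purpose (gp2) SSD volumes to stripe at the OS level.
volumes = [
    ec2.create_volume(AvailabilityZone="us-east-1a", Size=100, VolumeType="gp2")
    for _ in range(2)
]

# Wait until both volumes are ready, then attach them to the instance.
ids = [v["VolumeId"] for v in volumes]
ec2.get_waiter("volume_available").wait(VolumeIds=ids)

for vol_id, device in zip(ids, ["xvdf", "xvdg"]):
    ec2.attach_volume(
        VolumeId=vol_id,
        InstanceId="i-0123456789abcdef0",  # placeholder instance ID
        Device=device,                     # placeholder device names
    )
```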
Anyway, two concurrent users “doing things on Tableau at the same time” probably equate to an overall user community of about twenty to thirty – a small server.
We’ll start with a complex viz being rendered. This is the sucker that uses a 200M-row extract of stock transactions. The dashboard itself looks like this:
Several metrics are being calculated across time, we have a few other dimensions in play, and we even do some data blending. Here’s what the performance recorder output looks like…we’re doing a fair amount of work:
As you’re about to see, the 2xl compute-optimized c3 and general purpose m3 instances complete rendering of this dashboard twice as fast as the xl instances:
It’s important to note that I am attempting to filter out cached renders in the view above. I’m doing this because I don’t want the average elapsed render time to get artificially pulled down by very-fast cached executions. I’m trying to remove “smoke and mirrors” speediness in order to get a true feel for how long these things take to execute.
If I don’t filter out all the (fully cached) sub-400 millisecond renders, we get a very different picture:
You’ll see that the average elapsed render time looks pretty much the same across all hardware. If all of your users were lucky enough to hit a cached copy of the viz, they’d be happy on “low end” instances…but we know they won’t, so don’t try to go cheap on hardware hoping that caching will make up for an under-provisioned server. It won’t.
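For the curious, the cache filter itself is nothing fancy. Here’s a minimal pandas sketch of the idea (the CSV file and column names are made up for illustration; the real numbers live in the embedded viz at the end of this post):

```python
import pandas as pd

# Hypothetical export of the test results; column names are assumptions.
df = pd.read_csv("render_times.csv")  # columns: instance_type, elapsed_ms

# Drop fully cached renders (anything under 400 ms) so they don't
# artificially pull the average elapsed render time down.
uncached = df[df["elapsed_ms"] >= 400]

print(uncached.groupby("instance_type")["elapsed_ms"].mean())
```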
Care to have me drive home the point about not using magnetic disks as storage? (I guess you COULD stripe a ton of them, but I’m not a big enough masochist to actually do this for a blog post):
Note how the relatively quick renders we were getting on the 2xl instances are now taking twice as long. Rendering of the dashboard on the xl machines slows a bit too, but the difference isn’t as dramatic. This shows us that the bottleneck for rendering the viz on an xl instance is more CPU than disk.
Finally, what does a purely “simple” rendering workload look like?
Gotta admit it, I’m a bit surprised. The lowly xl instances are actually marginally faster than their big-city cousins. Why? I have no idea.
But when all is said and done, I still probably wouldn’t go into production with xls since I don’t think anyone’s workload will always consist of two people hitting Tableau’s sample reports at the same time 🙂
Let’s look at our tests run with ten concurrent users.
No surprise: the 2xl machines have average and max render times of less than half those of the xl instances.
Ten users doing “simple things” at the same time also create a scenario where the xl instances fall down. Look at the nice tight groupings on the 2xl instances below and compare to the wide variance on xl machines. There is a 10x difference in maximum render time between the 2xl and xl m3 instances. Wowsers!
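If you’d like to quantify that spread yourself, the same hypothetical export from the earlier sketch will do – compare the max to the mean per instance type:

```python
import pandas as pd

df = pd.read_csv("render_times.csv")  # same hypothetical export as before

stats = df.groupby("instance_type")["elapsed_ms"].agg(["mean", "max"])
stats["max_over_mean"] = stats["max"] / stats["mean"]  # big ratio = wide variance

print(stats)  # expect tight groupings for 2xl and a big spread for xl
```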
For kicks, I did some very basic testing with a c3 4xl instance (16 vCPUs) and 3 striped GP disks, too. While results were better than the 2xl, they weren’t ALL that much better:
- c3 2xl and 4xl faster than m3 2xl
- c3 2xl with 3 striped disks marginally quicker than c3 2xl with 2 striped disks
- c3 4xl with 3 striped disks faster still than the c3 4xl & 2 striped disk combo
So we see we’re getting a small bump from that third disk and another bump from having extra cores available, I guess.
Here’s a summary of what we’ve learned:
- m3 and c3 xl instances are only good for very simple workloads with very few (think two) concurrent users
- If you have semi-complex (or more complex) dashboards or more than a couple of users hitting your machine, you need at least a 2xl instance.
Other thoughts:
- Don’t even CONSIDER taking that 8 core license you bought and splitting it up as two 4-core instances on EC2. I know, I know…you’re going to say “But, I need HA”. Having a highly available system that doesn’t perform worth a damn means you have a highly available system that no one will use. Put those 8 vCPUs on a single instance, thank you. (This also applies to other VM solutions and even physical hardware – don’t split an 8 core.)
I’ve embedded a viz with all my results in StoryPoint form right here. Feel free to browse / filter and download as you please. There are also screenshots of perfmon measuring interesting counters on each of the tests I tried. You’ll find those beneath the viz.