# Tableau TabJolt Testing - The Light Load

You might just want to pop this open as a distinct window. I’ll be pasting screenshots of my vizzes in, but it’s much more fun to interact.

I tested my “light” load (see the previous post) by running the following tests. Note I mostly executed the (heavier) interact test with a mix of (lighter) view only thrown in for good measure.

I ran each test for 10 minutes, bounced the server then ran the next. The value in the “Users” column represents the max concurrent users I ran during the test.

I got pretty bored running the TabJolt command over and over again, so I whipped up a little set of node scripts that ran on both the Load Generator (TabJolt) machine, and the Tableau Server itself. The script on the Load Generator ran TabJolt and signaled the script on the Tableau Server that “a test just finished”.

The node app on Tableau Server would then cycle Tableau and signal back that the tabadmin restart completed…at which point the next 10 minute test would begin.
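For a rough idea of the shape of that run / restart / run-again loop, here is a minimal Python sketch. It is not the node scripts described above. The names `run_suite`, `run_test`, and `restart_server` are hypothetical, and the two callables stand in for "run TabJolt for N seconds" and "restart Tableau and wait until it is back", which you would have to supply yourself.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical test descriptor: (test name, max concurrent users).
TestCase = Tuple[str, int]


def run_suite(
    tests: Iterable[TestCase],
    run_test: Callable[[str, int, int], None],
    restart_server: Callable[[], None],
    duration_seconds: int = 600,  # 10 minutes per test, as in the post
) -> List[TestCase]:
    """Run each test for a fixed duration, bouncing the server after each one."""
    completed: List[TestCase] = []
    for name, users in tests:
        # Assumed to block until the load test has finished.
        run_test(name, users, duration_seconds)
        completed.append((name, users))
        # Assumed to block until the server restart has completed.
        restart_server()
    return completed
```

Passing the two steps in as callables keeps the loop itself trivial to dry-run: you can hand it stubs that just log, check the ordering, and only then wire in the real commands.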

You’re welcome to these scripts, but I wrote ‘em even quicker and dirtier than usual…so you’ll need to figure them out yourself. No help from me as I plan to re-write something to support coordinating multiple copies of TabJolt in a few weeks anyway.

OK, to the results, really.

I ran the “set” of tests you see above on each of the configurations I wanted to test. Here’s the high-level view of what happened. Remember, your results will be completely different based on your workload, so don’t take these numbers as gospel!

Housekeeping

Are these vizzes part of TabJolt? No. I used Neelesh’s data in PostgreSQL as my own source and did my own thing.
What is an error? I tracked Error % pretty closely, and an error isn’t necessarily bad. TabJolt generally “times out” any request for a viz / interaction that doesn’t come back in 60 seconds. So, about 85% of the time, error = “long running report that eventually came back”. Given that I had a few vizzes (not in this workload) that could take upwards of 30 seconds to run in isolation, you WILL see errors “that aren’t really errors”. Keep that in mind at all times.
The 4 x (4 Cores) v2 configuration (in which I isolated data engines and vizqls) just stunk. I had already guessed it would, so I stopped the test for that one early when I saw things going poorly. It started with a higher error rate, was slower, and returned fewer samples. I put it out of its misery.
Midway through the 1 x (16 Core) test, the machine ran out of HD space (lots of temp files) because I was miserly with the HD. You’ll see a 75%+ error rate at around 120-130 concurrent users. I stopped the test, did a tabadmin cleanup --restart, and then continued; things settled down again. I should have kept a backgrounder running on this machine to clean up temp files every 30 minutes. Whoops!
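To make the "timeout counts as an error" idea concrete, here is a small Python sketch of how an Error % could be derived from raw response times. It only models the timeout case described above (not other failure types), the 60 second cutoff comes from the post, and the function name and the sample numbers in the usage note are made up for illustration.

```python
from typing import Sequence

TIMEOUT_SECONDS = 60.0  # cutoff described in the post


def error_percent(
    response_seconds: Sequence[float],
    timeout: float = TIMEOUT_SECONDS,
) -> float:
    """Percentage of requests that took longer than the timeout.

    Assumption: a request is counted as an error only when it exceeds
    the timeout; real error counts also include other failures.
    """
    if not response_seconds:
        return 0.0
    timed_out = sum(1 for s in response_seconds if s > timeout)
    return 100.0 * timed_out / len(response_seconds)
```

For example, `error_percent([1.0, 2.0, 61.0, 75.0])` counts two of the four requests as errors, even though both of those vizzes may well have rendered a little later.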

Results

For this workload, we had two clear losers:

1 x (16 Cores): I posit that 2 VizQLs and 1 data engine (even with tons of power behind them) weren’t enough to clear the “queue” of work that needed to be done. That’s why this machine config performed more slowly and errored first and fast.

4 x (4 Cores) v2: Rather than just sticking a vizql and data engine on each of four machines, I had 2 machines with a data engine and 2 with a vizql, for a total of 2 DEs and 2 VizQLs. Clearly not enough.

The winners were my 2 x (8 Cores) configs. Surprisingly, the v1 config (with only a single data engine) edged out the v2 config (with a data engine on each node, for a total of 2).

OK, gotta run, so quick review of each viz – you can play with ‘em yourself.

Samples vs Response Time

Here you can see some of the “View” tests in action, driving between 50 and 200 concurrent users. Excellent response time, high volume.

Metric Drill Down

Compare metrics between any combination of concurrent users & test configuration. You can use this viz to discover the “sweet spots” for concurrency on each config. v9 Instant Analytics makes this one fun to play with.

Compare Different Configurations

Using the 4 x (4 Cores) v1 configuration as a baseline, see how each configuration performed. For example, we can see above that the 2 x (8 Cores) v2 delivered on average ~9% more TPS and at times was 20% better. Below, you can see how stinky the “bad” 4 core config was compared to the OK one:

Waiting is a terrible thing

Let’s say I’m a user running a viz along with 9 other folks. What happens to my experience as more and more and more users get loaded onto the system? How much does my wait time (established at the 10 concurrent user level) go up as 30, 40, 100 more users are added to the system?

I’m not sure this viz is particularly relevant against a “light” workload, btw. After all, I probably won’t even really notice the difference if the viz which used to take 0.41 seconds to render now takes about four times as long and I wait a whopping 1.6 seconds. No big whoop unless you’re the Flash.
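If you want to reproduce this kind of "how much worse than my baseline" number from your own results, the arithmetic is simple. The Python sketch below uses hypothetical function names; the only figures taken from the post are the 0.41 and 1.6 second render times, for which it gives roughly a 3.9x slowdown, or about 290% longer.

```python
def percent_increase(baseline_seconds: float, loaded_seconds: float) -> float:
    """How much longer the wait is under load, as a percentage of the baseline."""
    if baseline_seconds <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (loaded_seconds - baseline_seconds) / baseline_seconds


def slowdown_factor(baseline_seconds: float, loaded_seconds: float) -> float:
    """How many times as long the wait is under load."""
    if baseline_seconds <= 0:
        raise ValueError("baseline must be positive")
    return loaded_seconds / baseline_seconds
```

The same calculation works for any metric measured against a baseline, for instance TPS on one configuration versus the 4 x (4 Cores) v1 config. Just be clear about whether you are quoting "N% more" or "N times as much", since the two differ by 100 percentage points.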

What do slow vizzes look like?

Very simple viz displaying the 95th percentile render times of vizzes in each workload. You can use this to see what the “slowest of the slow” looks like in each configuration, and when it starts occurring.
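In case the term is unfamiliar, the 95th percentile is the render time that 95% of requests come in at or under. Here is a minimal Python sketch using the nearest-rank method. That method is an assumption made for illustration: Tableau's own percentile calculation may interpolate and give slightly different values. The function name is hypothetical too.

```python
import math
from typing import Sequence


def percentile_nearest_rank(values: Sequence[float], pct: float = 95.0) -> float:
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100.0)
    return ordered[rank - 1]
```

With only a handful of samples the result is dominated by the single slowest request, so a number like this is only meaningful when each test returns plenty of samples.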

OK…gotta run. More thoughts on this later.