Behold! The glory that is Burst Balance!
As you know, gp2 EBS storage is pretty awesome for Tableau. Fast. Relatively cheap.
That is, of course, until you empty your bag of gp2 “Performance Boost” tokens. Then things can get a bit slower. And by a bit, I mean a hell of a lot slower.
Until recently, there was no real way to track how much of your “bucket of fast” you still had left to draw on. AWS didn’t give us a CloudWatch metric to track this important value, so you kind of had to guess and/or learn from experience.
That problem is now solved. Back in November (I’m a slow reader), AWS added this metric to CloudWatch as BurstBalance. You’ll find it by selecting the block device attached to your ec2 instance, then drilling on through to the EBS volume itself:
Here are the CloudWatch metrics for the disk in question, and the very last one represents a “recovering” disk which is in the process of re-generating a pool of boost tokens after I used ’em all up.
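If you'd rather pull the number with a script than click through the console, here is a minimal sketch using boto3. The metric name (BurstBalance), namespace (AWS/EBS), and VolumeId dimension are the real CloudWatch ones; the helper names, the volume ID in the usage note, the default region, and the three-hour window are my own placeholders, so adjust to taste.

```python
from datetime import datetime, timedelta, timezone


def burst_balance_query(volume_id, hours=3, period_seconds=300, now=None):
    """Build the keyword arguments for CloudWatch get_metric_statistics.

    Hypothetical helper: it only assembles the request, so it can be
    inspected without AWS credentials.
    """
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EBS",
        "MetricName": "BurstBalance",
        "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,       # seconds per datapoint
        "Statistics": ["Minimum"],      # worst case within each period
        "Unit": "Percent",              # BurstBalance is reported as a percentage
    }


def fetch_burst_balance(volume_id, region="us-east-1"):
    """Return (timestamp, percent) pairs, oldest first.

    Needs boto3 and AWS credentials; the region default is an assumption.
    """
    import boto3  # imported lazily so the query builder works without boto3

    cloudwatch = boto3.client("cloudwatch", region_name=region)
    response = cloudwatch.get_metric_statistics(**burst_balance_query(volume_id))
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Minimum"]) for p in points]
```

Calling fetch_burst_balance("vol-0123456789abcdef0") with your own volume ID should give you the same curve the console shows, just as a list you can log or alert on.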
Let’s run a simple little test, shall we? On an 8-core ec2 instance (16 vCPU m4.4xlarge), I’ll fire up 55 concurrent users who are constantly accessing the server and executing vizzes.
For the first 25-26 minutes of the test, I just let everything run as-is. Note how during that time we had more tests executed, with a lower response time, than in the second half of the hour.
What’d I do? I hit the D: drive with fio (a free disk test & stress tool). I generated about 3000 IOPS, which is pretty much the maximum that the gp2 disk will burst to while it has performance boost tokens to burn.
During this time, both disk latency and average disk queue jumped to unacceptable levels on drive D:
The ~1600 second mark is right about where tests per minute dipped below the hourly average and response time increased.
At this point, we’re using up performance boost tokens like nobody’s business: I exhausted my bucket of “fast” in about 30 minutes. I wish I had run my load test a little bit longer, because it would have been very interesting to see what Tableau did when fio continued to bang on a disk which had no possible way to keep up... but I wasn’t thinking ahead.
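That "about 30 minutes" lines up with the published gp2 arithmetic. A gp2 volume starts with a bucket of 5.4 million I/O credits, earns credits at its baseline rate (3 IOPS per GiB, with a floor of 100 IOPS), and can burst to 3000 IOPS. Here's a rough sketch of the math. The volume sizes are assumptions for illustration, since I haven't stated how big D: is, and the function names are made up.

```python
GP2_BUCKET_CREDITS = 5_400_000  # I/O credits in a full gp2 burst bucket
GP2_BURST_IOPS = 3000           # ceiling while credits remain
GP2_MIN_BASELINE_IOPS = 100     # floor for small volumes
GP2_IOPS_PER_GIB = 3            # baseline earned per GiB of volume size


def gp2_baseline_iops(size_gib):
    """Baseline IOPS for a gp2 volume of the given size."""
    return max(GP2_MIN_BASELINE_IOPS, GP2_IOPS_PER_GIB * size_gib)


def minutes_to_drain(size_gib, demand_iops=GP2_BURST_IOPS):
    """Minutes until a full bucket is empty under a constant demand.

    Credits are spent at the demand rate (capped at the burst ceiling)
    and earned back at the baseline rate, so only the difference drains.
    """
    net_drain = min(demand_iops, GP2_BURST_IOPS) - gp2_baseline_iops(size_gib)
    if net_drain <= 0:
        return float("inf")  # demand is at or below baseline: never drains
    return GP2_BUCKET_CREDITS / net_drain / 60


# Assumed sizes, for illustration only.
for size in (30, 100, 500):
    print(f"{size} GiB: {minutes_to_drain(size):.1f} minutes at 3000 IOPS")
```

For an assumed 100 GiB volume that works out to 5,400,000 / (3000 - 300) = 2000 seconds, or about 33 minutes, which is in the same neighborhood as what I saw.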
So now we’ve seen the impact of heavy disk usage and we’ve exhausted our performance boost tokens. I really DID want to see what Tableau looked like without “boost” to fall back on, however. What happens when you can no longer burst as necessary? I’d guess “You’re slow”. Let’s prove it.
I rebooted the test machine and started a new test while the burst bucket was still depleted. Here’s what I saw:
For the first ten minutes or so, my disk was still having problems. Disk latency was high (always above 18-20ms) and I had a very long disk queue. About 8-10 minutes into the test, the burst bucket started recovering – adding some “free fast”. While it’s a bit hard to see in the chart below, at 5:05 AM I went “positive” in terms of having some burst capacity.
The drive continued to recover capacity, and you can see the recovery speeds up after the load test completed, which is about half-way through the time series.
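Why does recovery speed up once the load goes away? The bucket refills at the baseline rate minus whatever you're actually consuming, so an idle disk refills fastest. Here is a coarse one-minute-step simulation of that idea. The 300 IOPS baseline and the 200 IOPS load are made-up numbers for illustration, not measurements from my test, and the function name is hypothetical.

```python
GP2_BUCKET_CREDITS = 5_400_000  # I/O credits in a full gp2 burst bucket
GP2_BURST_IOPS = 3000           # ceiling while credits remain


def simulate_burst_balance(baseline_iops, load_iops_per_minute, start_percent=0.0):
    """Return the burst balance (percent) after each simulated minute.

    Coarse model: each minute the volume serves the requested load, capped
    at the burst ceiling if credits remain or at baseline if the bucket is
    empty, and the bucket changes by (baseline - served) * 60 credits.
    """
    balance = GP2_BUCKET_CREDITS * start_percent / 100
    history = []
    for load in load_iops_per_minute:
        ceiling = GP2_BURST_IOPS if balance > 0 else baseline_iops
        served = min(load, ceiling)
        balance += (baseline_iops - served) * 60
        balance = max(0.0, min(GP2_BUCKET_CREDITS, balance))
        history.append(100 * balance / GP2_BUCKET_CREDITS)
    return history


# Assumed scenario: 10 minutes of light load, then 10 idle minutes.
load = [200] * 10 + [0] * 10
history = simulate_burst_balance(baseline_iops=300, load_iops_per_minute=load)
under_load = history[9] - history[8]   # gain per minute while loaded
idle = history[19] - history[18]       # gain per minute while idle
print(f"gain per minute under load: {under_load:.3f}%, idle: {idle:.3f}%")
```

With these assumed numbers the idle disk regains balance three times faster than the loaded one, which is the same shape as the steeper second half of the recovery chart.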
What did Tableau test results look like? You won’t be surprised.
The load test itself began at 4:54. Note the terrible results until…you guessed it…about 5:05 AM. Once Tableau got the disk it needed, we went from 100-300 tests per minute to at least double that.
So campers, today we experienced another object lesson that “disk matters”. We also added another tool to our toolbox around how to monitor this stuff.