I’ve downloaded, massaged, and loaded the full FAA On-Time dataset from http://www.transtats.bts.gov/ to a test Redshift cluster. It’s about 150M rows and covers 1989 through July of 2016.
I’m currently running a 5-node cluster and will add / remove nodes at random — sometimes running as few as 1-2, sometimes running as many as 32. The cluster is located in Singapore, so there may be a tiny bit of lag between you and me.
Here is a sample v10 workbook which includes a live data source pointing to the cluster in question. The username for this cluster is iloveredshift and the password is abcD1234. Do as you wish.
Why in the world am I doing this? I need (you?) to generate “real world” queries against a cluster that I can monitor with a little toy I’m working on. I figure you heathens can be much more creative than I 🙂
FYI, the database itself isn’t very optimized at this point, so don’t take it as an indication of performance you can expect with your well-designed database on Redshift. The main dashboard will take at least 30 seconds to render when the cluster is running 1-2 nodes. We’ll see how fast it runs when I add more juice. I also am bumping up the concurrency on this puppy to 15 concurrent queries, so if 2-3 of you happen to start banging on it at the same time, you’ll be able to tell.
I’ll leave this sucker up and running for a week or two. If you create something cool or discover something awesome in this dataset, let me know, or post it on Tableau public!
BTW – be aware that this data source contains a customization that causes no cursors to be used on Redshift. What this means to you is that you don’t want to create an extract using the “embedded” data source (unless you have a HECK of a lot of memory on your machine). If you want to use this cluster to try and grab an extract, that’s cool, but I’d advise you to create a NEW data source without the customization first. Don’t know what the hell I’m talking about? Here you go!
I’d also recommend you do SOME aggregation in the extract, or you could inadvertently blow out the “temp space” that this cluster allows for cursors. If you do that, your extract will fail.
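To make “do SOME aggregation” a bit more concrete, here is a rough sketch of the kind of pre-aggregated query you could point a new data source at (as custom SQL, for example) instead of pulling all ~150M rows. Everything named here is an assumption for illustration only: the helper `build_extract_query`, the table name `ontime`, and the column names are made up and may not match what’s actually in the cluster.

```python
def build_extract_query(table, dimensions, measures):
    """Build a SQL string that pre-aggregates `measures` by `dimensions`.

    `measures` is a list of (column, aggregate_function) pairs.
    All table and column names passed in are hypothetical.
    """
    if not dimensions or not measures:
        raise ValueError("need at least one dimension and one measure")
    select_dims = ", ".join(dimensions)
    select_aggs = ", ".join(
        f"{func}({col}) AS {func.lower()}_{col}" for col, func in measures
    )
    # One row per combination of dimensions instead of one row per flight.
    return (
        f"SELECT {select_dims}, {select_aggs}, COUNT(*) AS flights "
        f"FROM {table} GROUP BY {select_dims}"
    )


# Hypothetical table and columns: monthly average delays per carrier.
query = build_extract_query(
    "ontime",
    ["year", "month", "carrier"],
    [("arr_delay", "AVG"), ("dep_delay", "AVG")],
)
print(query)
```

The coarser the dimensions, the fewer rows come back, which is the whole point: less data for the cursor to stage on the cluster and less for your machine to chew through.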
BTW #2: unless you have 35+ minutes and about 15 GB of RAM (for Tableau) to spare, don’t run the worksheet that says “Don’t run me”. For real. It takes every airplane in the US on each day it flies, groups it by the airline it flies for, and then clusters the result based on arrival and departure delay. I just wanted to see what happens... and I don’t know what happens yet. I’m pretty excited. [Edit: It worked. 17M+ marks]
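If you’re wondering why that sheet produces so many marks, here’s a tiny sketch of the grouping step only (not the clustering). It assumes one mark per (airline, airplane, day) and assumes the delays are averaged, which is a guess on my part; the function name `marks_per_plane_day`, the field layout, and the sample rows are all made up for illustration and are not real data from the cluster.

```python
from collections import defaultdict


def marks_per_plane_day(rows):
    """Collapse flights into one mark per (carrier, tail number, day).

    Each row is (carrier, tail_number, flight_date, arr_delay, dep_delay).
    Returns {key: (avg_arr_delay, avg_dep_delay)}.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # arr sum, dep sum, count
    for carrier, tail, day, arr, dep in rows:
        entry = totals[(carrier, tail, day)]
        entry[0] += arr
        entry[1] += dep
        entry[2] += 1
    return {key: (a / n, d / n) for key, (a, d, n) in totals.items()}


# Made-up sample rows, purely illustrative.
rows = [
    ("AA", "N101", "2016-07-01", 10.0, 5.0),
    ("AA", "N101", "2016-07-01", 20.0, 15.0),
    ("AA", "N101", "2016-07-02", 0.0, 0.0),
    ("DL", "N202", "2016-07-01", -5.0, 2.0),
]
marks = marks_per_plane_day(rows)
print(len(marks))
```

Four flights collapse to three marks here. Every airplane-day is its own mark, so decades of flying adds up to millions of them before any clustering even starts.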
Russell is CURRENTLY running this many nodes:
Five
Two
Six
One
SEVEN
Three
TEN
Can you make these numbers go higher?
You must understand that your “Don’t run me” sheet is like telling someone “Whatever you do, DO NOT press the red button.” Now that’s the only button I want to press. It calls to me.
Well, yes.