When you join multiple tables in a data source, Tableau has a nifty (and generally invisible to the user) functionality called “join culling”. Since joins cost time and resources to process on the database server, we really don’t want to use every join that we declared in our data source all the time.
For example, let’s say I have a single (fact) table with my sales in it – and that table is joined to a customers (dimension) table, and a date (dimension) table, too.
If I add the sales measure from my sales table to a sheet by itself, do I really need any information from the customers or date tables?
No, I don’t.
So do we want the query Tableau issues to the database to include JOIN statements against those tables?
No, we don’t – We’d be wasting the database’s time, and potentially end up with a viz that takes longer to render because we’re asking the database to do something which is inefficient and slower as a result. Some database engines may be smart enough to optimize this not very good query and ignore our request for unnecessary joining, others won’t.
Another example: I decide to use that date (dimension) table in a quick filter…when we populate the days/months/years/whatever in the quick filter, we don’t care about sales or customers (at least not out of the gate)…so we don’t want JOINS here, either.
Anyhow, I spent a few days figuring out how this works for myself with the help of my colleagues (Marc, Matt, Jeff, Ty – you rock!) .
I thought I’d summarize what I found for the rest of the community.
I chose to use SQL Server in my experiments, and I created a few simple tables to play with:
CREATE TABLE DimPets
(
PetID int PRIMARY KEY,
PetName VarChar(10)
)
GO
CREATE Table FactPets
(
Date DateTime,
PetID int,
Units int
)
GO
–don’t add the Check Constraint below, and interesting things will happen!
ALTER TABLE [dbo].[FactPets] WITH CHECK ADD CONSTRAINT [FK_FactPets_DimPets] FOREIGN KEY([PetID])
REFERENCES [dbo].[DimPets] ([PetID])
GO
And of course, we need some rows:
INSERT DimPets Values (1, ‘Cat’)
INSERT DimPets Values (2, ‘Dog’)
INSERT DimPets Values (3, ‘Fish’)
INSERT FactPets Values(‘1/1/2012’, 1, 10)
INSERT FactPets Values(‘1/1/2012’, 2, 5)
INSERT FactPets Values(‘½/2012’, 3, 7)
INSERT FactPets Values(‘½/2012’, 2, 8)
INSERT FactPets Values(‘½/2012’, 1, 16)
The next part involved creating a data source against these tables and creating a join between FactPets.PetID and DimPets.PetID. For giggles, I tried using a right, left, and inner join at various times:
The very first thing I learned (which was also mentioned in a forum posting I found afterwards) is that referential integrity matters. You must have relationships setup in your database between the tables you’ll be joining in Tableau. If you don’t, Tableau can’t do join culling no matter what sort of join(s) you create inside your data source.
What does this mean to you? If you are consuming database views that have been created for you by IT, then kiss join culling goodbye. At least in SQL Server, you can’t setup primary key / foreign key relationships between views which means Tableau will need to issue a query containing every JOIN you defined in your data source. Ouch! That could get expensive if you have 10-15 tables in your data source. Extracts are going to start looking good really fast!
Using SQL Profiler, here’s what I saw when I just dropped [Units] into a viz and I didn’t have my relationship setup in the SQL database:
SELECT SUM(CAST([FactPets].[Units] as bigint)) AS [sum:Units:qk]
FROM [dbo].[FactPets] [FactPets]
INNER JOIN [dbo].[DimPets] [DimPets] ON ([FactPets].[PetID] = [DimPets].[PetID])
HAVING (COUNT_BIG(1) > 0)
Note the completely unnecessary join. Next, I added my relationship back in:
ALTER TABLE [dbo].[FactPets] WITH CHECK ADD CONSTRAINT [FK_FactPets_DimPets] FOREIGN KEY([PetID])
…and when I tried the same experiment, I saw this:
SELECT SUM(CAST([FactPets].[Units] as bigint)) AS [sum:Units:qk]
FROM [dbo].[FactPets] [FactPets]
HAVING (COUNT_BIG(1) > 0)
Voila! Join culling!
Next, with the relationship still in place, I modified the JOIN type in the Tableau data source to LEFT:
What do we see when we add [Units] into the viz by itself?:
SELECT SUM(CAST([FactPets].[Units] as bigint)) AS [sum:Units:qk]
FROM [dbo].[FactPets] [FactPets]
LEFT JOIN [dbo].[DimPets] [DimPets] ON ([FactPets].[PetID] = [DimPets].[PetID])
HAVING (COUNT_BIG(1) > 0)
Whoops. No more join culling. The same thing happened with a RIGHT join, too.
So, second lesson – In the Tableau data source, inner Joins are good for join culling. The other types? Not so much.
The final thing I picked up involved a fairly complex scenario with lots of tables all joined together in Tableau. In this particular situation, I had an Orders (fact) table associated with a Date (dimension) table.
The Orders table had multiple relationships against the Date table because it contained an [OrderDate] field, a [ShippedDate] Field, and a [DueDate] field. Each one of these columns in the Orders fact table had a relationship pointed at the primary key in the Date table.
This situation confused the join culling logic. With three relationships in place between the same tables, Tableau went back to including JOIN clauses in every query, even if I had defined each as an INNER join inside the Tableau data source. When I removed 2 of the 3 relationships in the database, Tableau starting culling again. When I replaced them, no more culling. Lesson: too many relationships aren’t necessarily a good thing (although I personally think this is a bug).
Update (Q4 2012): Learned a bit more about the “by design” behavior of Join Culling I thought I’d pass along.
All of the above still holds true, with one caveat. As designed, join culling only works when we’re dealing with a simple-ish star schema. Join culling won’t operate a snowflaked schema where we do something like join the Products table to the Product Subcategory table to the Product Category table. All tables to “be culled” must be related directly to the fact table in true star schema style.
Update (Q3, 2013): Read this.