In our business databases, we know different hierarchies. They might involve classes of salable items, narrowing in scope from broad categories to more specific categories, and eventually resolving into SKUs. The most common example within a database context is probably that of employees within an organization. Virtually every firm has some sort of organizational chart and -- matrix and other nouveau management techniques not withstanding -- these charts are hierarchical in nature. Modeling them seems so very simple.
This simple technique for modeling hierarchies is known as an adjacency list:
CREATE TABLE Employee ( EmployeeID INT NOT NULL PRIMARY KEY, ManagerID INT REFERENCES Employee (EmployeeID) )
It's the most straightforward way of representing a tree. Yet, as most database professionals have discovered, these seemingly simplistic data structures can be maddeningly
SELECT COALESCE(Level5, Level4, Level3, Level2, Level1) AS EmployeeID, CASE COALESCE(Level5, Level4, Level3, Level2, Level1) WHEN Level1 THEN 1 WHEN Level2 THEN 2 WHEN Level3 THEN 3 WHEN Level4 THEN 4 WHEN Level5 THEN 5 END AS Level FROM ( SELECT CEO.EmployeeID AS Level1, CASE WHEN CEO.EmployeeID = SeniorManagement.EmployeeID THEN NULL ELSE SeniorManagement.EmployeeID END AS Level2, CASE WHEN SeniorManagement.EmployeeID = MiddleManagement.EmployeeID THEN NULL ELSE MiddleManagement.EmployeeID END AS Level3, CASE WHEN MiddleManagement.EmployeeID = JuniorManagement.EmployeeID THEN NULL ELSE JuniorManagement.EmployeeID END AS Level4, CASE WHEN JuniorManagement.EmployeeID = PeopleWhoDoWork.EmployeeID THEN NULL ELSE PeopleWhoDoWork.EmployeeID END AS Level5 FROM HumanResources.Employee CEO LEFT JOIN HumanResources.Employee SeniorManagement ON SeniorManagement.ManagerID = CEO.EmployeeID OR CEO.EmployeeID = SeniorManagement.EmployeeID LEFT JOIN HumanResources.Employee MiddleManagement ON (MiddleManagement.ManagerID = SeniorManagement.EmployeeID OR SeniorManagement.EmployeeID = MiddleManagement.EmployeeID) AND CEO.EmployeeID <> SeniorManagement.EmployeeID LEFT JOIN HumanResources.Employee JuniorManagement ON (JuniorManagement.ManagerID = MiddleManagement.EmployeeID OR MiddleManagement.EmployeeID = JuniorManagement.EmployeeID) AND MiddleManagement.EmployeeID <> SeniorManagement.EmployeeID LEFT JOIN HumanResources.Employee PeopleWhoDoWork ON (PeopleWhoDoWork.ManagerID = JuniorManagement.EmployeeID OR JuniorManagement.EmployeeID = PeopleWhoDoWork.EmployeeID) AND JuniorManagement.EmployeeID <> MiddleManagement.EmployeeID WHERE CEO.ManagerID IS NULL ) ManagementTree ORDER BY ManagementTree.Level1, ManagementTree.Level2, ManagementTree.Level3, ManagementTree.Level4, ManagementTree.Level5
This query is obviously quite complex -- but it's also nowhere near perfect. The main issue, aside from that of readability, is maintainability. What will happen if senior management decides that a new level of managers-in-training should be added between the junior managers (see the correlation name JuniorManagement) and the lowest level employees (using the correlation name PeopleWhoDoWork)? Other than making working conditions even worse for those on the bottom rungs, this decision will mean that a developer will have to edit this query, adding in yet another join to the Employee table in order to facilitate getting the data for the new level. This is hardly an optimal solution.
Getting around adjacency list limitations
A variety of techniques have been created for getting around these and other limitations of adjacency lists. Itzik Ben-Gan wrote an excellent article on representing hierarchies using materialized paths. Joe Celko popularized a technique called the Nested Sets Model. And more recently, Vadim Tropashko created a technique called the Nested Intervals Model. Although each of these techniques greatly improves upon adjacency lists when it comes to querying, none are as simple or easy to manage when it comes to actually populating and maintaining the data.
The best solution would be one that would allow data modelers to make use of the simplicity of adjacency lists, but ease the pain of querying them. Recursive Common Table Expressions (CTEs) are a new feature in SQL Server 2005 that meets these requirements. A common table expression is a construct very similar to a derived table. The following two queries, the first using a derived table and the second a (non-recursive) CTE, are functionally equivalent:
SELECT AVG(BonusTotal) FROM ( SELECT SUM(Bonus) AS BonusTotal FROM Sales.SalesPerson GROUP BY TerritoryID ) SalesBonusTotals WITH SalesBonusTotals AS ( SELECT SUM(Bonus) AS BonusTotal FROM Sales.SalesPerson GROUP BY TerritoryID ) SELECT AVG(BonusTotal) FROM SalesBonusTotals
Both of these queries compute the average total bonus per territory of the sales team in the Adventure Works database. The common table expression in the second query acts just like the derived table in the first, and can be referenced by its correlation name (SalesBonusTotals) throughout the query. In this way, derived tables and CTEs are virtually identical.
However, CTEs have a feature that derived tables do not: The ability to recursively refer to themselves. This recursion is specified by using a UNION ALL within the CTE. The second query in the UNION ALL should refer to the CTE itself, using it as a key to the next level of recursion. This is best shown by example. The following query can be used to find the EmployeeID and ManagerID of the CEO of Adventure Works:
SELECT EmployeeID FROM HumanResources.Employee WHERE ManagerID IS NULL
We know that any employee without a manager must be the CEO of the firm -- the top level of the hierarchy represented in the Employee table. To find all of the subordinates of this top level employee (every employee in the company), we can the following recursive CTE:
WITH EmployeeCTE AS ( SELECT EmployeeID FROM HumanResources.Employee WHERE ManagerID IS NULL UNION ALL SELECT E.EmployeeID FROM HumanResources.Employee E JOIN EmployeeCTE x ON x.EmployeeID = E.ManagerID ) SELECT EmployeeID FROM EmployeeCTE
The results of this query aren't too interesting; all it does is display a list of EmployeeIDs from the Employee table. But what is interesting is what it's actually doing. The query first gets the CEO's EmployeeID, then calls the recursive piece, querying for every employee managed by the CEO (x.EmployeeID = E.ManagerID). For each row returned by that recursive call, the same logic is repeated again and again until the recursion is terminated by a lack of rows returned in the recursive step.
This query can be extended. For instance, we might want to know the level of each employee within the hierarchy:
WITH EmployeeCTE AS ( SELECT EmployeeID, 1 AS Level FROM HumanResources.Employee WHERE ManagerID IS NULL UNION ALL SELECT E.EmployeeID, x.Level + 1 AS Level FROM HumanResources.Employee E JOIN EmployeeCTE x ON x.EmployeeID = E.ManagerID ) SELECT EmployeeID, Level FROM EmployeeCTE
In this case, the level of the CEO defaults to 1. Each time the recursive step is called, the level will be incremented for those rows.
At this point, the query is returning almost the same data as the flawed query above, yet it is much more readable and will require no maintenance when management makes an asinine decision. The query will dynamically adapt to new levels in the hierarchy without a problem.
The only issue left to solve is that the order of the rows returned by this query may not be quite right for use in an organizational chart. The rows should be returned in the order as if one were walking the tree from the root (the CEO), and visiting each node (employee). This can be solved by taking a cue from Ben-Gan's article and materializing a path as the CTE works:
WITH EmployeeCTE AS ( SELECT EmployeeID, 1 AS Level, CONVERT(VARBINARY(MAX), EmployeeID) AS thePath FROM HumanResources.Employee WHERE ManagerID IS NULL UNION ALL SELECT E.EmployeeID, x.Level + 1 AS Level, x.thePath + CONVERT(VARBINARY(MAX), E.EmployeeID) AS thePath FROM HumanResources.Employee E JOIN EmployeeCTE x ON x.EmployeeID = E.ManagerID ) SELECT EmployeeID, Level FROM EmployeeCTE ORDER BY thePath
By converting the EmployeeID into binary format and appending on each recursion, the query forms a trail of visited nodes. Each path representation will begin with 0x0000006D, the binary representation of 109, the Adventure Works CEO's EmployeeID. Following that will be the binary for each EmployeeID down the hierarchy, thereby creating an ordered representation that can be used in the ORDER BY clause to produce the correct results.
Using this technique, any tree or hierarchy represented by an adjacency list can easily be navigated, without using complex techniques such as the Nested Sets or Nested Intervals models. Given that hierarchical data makes up such a large part of our world, this is a very powerful feature for SQL Server 2005!
About the author: Adam Machanic is a senior database engineer for GetConnected Inc. He has several years of experience developing a variety of applications using SQL Server as a data repository and is active on numerous online technical forums. He is a Microsoft Certified Professional and a SQL Server MVP. Machanic, who also serves as our SQL Server 2005 expert, welcomes your questions.
More information from SearchSQLServer.com
This was first published in July 2005