This section briefly examines joins, purely from a performance-tuning perspective. There are various different types of joins. Some join types can be classified with inherently good performance. Some join types might need a little manual help. Some join types can be stubbornly difficult to tune. Different join types and their tuning attributes can be described as follows:
- Inner join: An inner join is an intersection between two tables. The join is usually performed between two referential integrity keys in those two tables. Intersections are the most efficient types of joins because they match records between two tables based on equality (an = sign). The following query joins the PUBLISHER and EDITION tables, based on the primary and foreign key link between the two tables (one-to-many relationship). The tables are shown in Figure 8-1.
SELECT P.NAME AS Publisher, E.ISBN
FROM PUBLISHER P JOIN EDITION E USING (PUBLISHER_ID);
A self-join is a special type of intersection where records on multiple hierarchical levels, stored in a single table, are returned as a hierarchical structure.
- Outer join: An inner join returns an intersection between two tables. An outer join returns the opposite of an inner join. An outer join returns all records in one table, which are excluded from the other table. Profligate use of outer joins in queries could indicate a possibly over-granular database model. Tuning outer joins is inherently more difficult than tuning inner joins because outer joins are more complex to write. The following query returns the intersection between PUBLISHER and EDITION, as well as all publishers currently with no titles in print (the outer part of the outer join):
SELECT P.NAME AS Publisher, E.ISBN
FROM PUBLISHER P LEFT OUTER JOIN EDITION E USING (PUBLISHER_ID);
- Cross join: Across join is a Cartesian product, joining every record in one table to every record in another table, regardless of any meaningful referential integrity connection of keys between the two tables, or any other field connection. Obviously, cross joins will be slow, depending on the number of records in both tables, relative to each other, and in total number of records. The following query returns all publishers with all editions regardless of any PUBLICATION to EDITION relationship. In other words, every book is returned with every publisher, regardless of who published the book. Even if a particular publisher did not publish a particular book, a record is still returned containing completely unrelated publishers and books (those books belonging to other publishers, or perhaps even no publisher whatsoever—self-published books).
SELECT P.NAME AS Publisher, E.ISBN FROM PUBLISHER P CROSS JOIN EDITION E;
The most important factor for performance tuning joins is minimizing the number of tables in a join, in all types of database models.
Any poorly coded join will likely be inefficient. In data warehouses, inefficient joins can easily have disastrous performance effects because of the sheer quantities of data. OLTP databases can also be crippled to the point of uselessness, but it is less likely because of smaller physical database size. One particular project I contracted on in the past, however, had a miniscule 10 GB database and joins composed of 15 tables or more in each query. Short Web page listings of 10 records were taking 30 seconds to return a response to their customers. After I spent two months persuading developers how to speed up their software, they took about a week to fix. The Web site subsequently turned those 30-second waits into less than a half a second. Internet surfers typically lose interest in Web sites taking longer than seven seconds to react.
There are some basic rules when attempting to tune SQL code joins:
- Apply largest filters first: Filter from largest tables first to reduce records joined. Retrieve tables in order from the most highly filtered table downward, preferably the largest table has the most filtering applied. It is essential to reduce record numbers from large tables as much as possible before joining them to other tables. The most highly filtered table is the table having the smallest percentage of its records retrieved; preferably, the largest table is filtered the most.
- Use indexes: Try to write code to utilize indexes wherever possible, except for very small tables. Small tables can sometimes be more efficiently read, ignoring any indexing on the small table. Small tables are often read by ignoring indexes. Sometimes even large tables are best read by ignoring indexing, especially in cases where a larger percentage of the bigger table is read.
- Nested subquery semi-joins: It is possible to tune or merely simplify joins (for easier tuning) by using nested layers of subqueries. This type of tuning is much more applicable to highly normalized OLTP database models, and does not apply to denormalized data warehouse database models.
The above tip was excerpted from Chapter 8, 'Building Fast-Performing Data Models,' from the book "Beginning Database Design" by Gavin Powell, courtesy of WROX Publishing. Click here for the complete collection of book excerpts.
This was first published in April 2006