Writing Efficient Queries
Efficient SQL code is primarily about efficient queries using the SELECT command. The SELECT command allows use of a WHERE clause, reducing the amount of data read. The WHERE clause is used to return (or not return) specific records. The UPDATE and DELETE commands can also have a WHERE clause and, thus, they can also be performance-tuned with respect to WHERE clause use, reducing the amount of data accessed.
Performance tuning of the INSERT command to add records to a database is often the job of both developers and administrators. This is because end-users usually add data to a database through the use of applications. Metadata change commands such as (CREATE TABLE and ALTER TABLE) are more database administration. Thus, INSERT commands and metadata commands and not relevant to this text.In an OLTP (transactional) database, small transactions and high concurrency are the most important aspects. Accuracy of SQL code and matching indexes is critical. In data warehouses, large queries and batch updates are prevalent. Therefore, in data warehouses, large complex queries are executed against as few tables as possible, minimizing on the number of tables in join queries. Joining too many tables at once in a query can have the most significant impact on query performance of all, in both OLTP and data warehouse databases. Data warehouses simply exacerbate problems because of huge data quantities.
There are some general philosophical rules of thumb to follow when performance-tuning SQL code:
- Database model design supports SQL code: The quality of SQL code depends completely on the quality of database model design, not only from a perspective of correct levels of normalization and denormalization, but also from the point of view of using appropriate structures. For example, a data warehouse database model design is needed for a data warehouse because over-normalized, granular, deep normal form tables, often used in OLTP databases, are completely inappropriate to the very large transactions, across many tables, required by data warehouses.
- The KISS Rule (Keep It Simple and Stupid): Any type of program code broken into simple, (preferably independent) pieces is always easier "to wrap your head around." Simple SQL commands with simple clauses are easy to write and easy to tune. Longer and more complicated queries are more difficult to write, and it's more difficult to get them to produce the proper results. Performance tuning is an additional step. If you have to tune some big nasty queries because they are running too slow, well, what can I say? If you had kept it simple, making them run faster would probably be a lot easier, and a lot more possible. Simplify first if overcomplexity is an issue. In the very least, simplicity can help you understand precisely what a query is doing, without giving you a headache just staring at lines of meaningless SQL code.
- Good table structure allows for easy construction of SQL code: Be aware of anything controlling the way or manner in which SQL code is constructed and written, other than, of course, the database model. In an ideal table structure, SQL code should be written directly from those table structures, or as subsets of it, not the other way around. Writing SQL code should not be difficult. You should not get a constant impression (a nagging doubt or hunch) that table structure doesn't quite fit. The structure of the database model should make for easy of SQL code construction. After all, SQL code supports applications. Don't forget that SQL code rests on the database table structure. If there is any kind of mismatch between application requirements and database structure, there is likely something wrong with the database model. Performance tuning SQL code in a situation such as this will likely be a serious problem.
- Breaking down into the smallest pieces: Break down the construction of SQL code commands, such as queries and DML commands (INSERT, UPDATE, and DELETE). Do not break down nonquery and non-DML type commands. For example, do not continually connect and disconnect from a database for every single code snippet of SQL database access executed. Either connect for a period of time, for each user, or connect at the start and end of sessions. On the other hand, make extensive use of subqueries if it helps to making coding easier. You can always merge subqueries back into the parent query later on.
There are a set of specific ways in which the most basic elements of SQL code can be constructed to ensure good processing performance. There are a number of general areas that are important to the most basic rules of query performance tuning. Examine how each factor is affected by the underlying structure of the database model:
- The SELECT command: This includes how many tables are involved in SELECT commands. These factors have a highly significant impact on performance of queries. The more granular a database model, the more tables retrieved from at once. The manner in which fields are retrieved can also affect performance, but table numbers in joins are more significant, especially in larger databases.
- The WHERE clause: This includes how records are filtered. Comparison conditions dictate that a WHERE clause is applied to records, such as only to retrieve records with the vowel "a" in an author's name. A comparison condition is the main factor determining the construction of a WHERE clause. There are different types of comparison conditions. The manner in which records are filtered in a query can affect the way in which a query executes. The result is a highly significant impact on performance. Indexing has a very significant affect on how well WHERE clause filtering performs.
- The GROUP BY clause: The GROUP BY clause is used to aggregate records into summarized groups of records retrieved from a database. Groupings are best achieved as a direct mapping onto one-to-many relationships between tables. Materialized views are commonly used in data warehouse to pre-calculate and pre-store GROUP BY clause aggregations.
- Joins: A join query retrieves records from multiple tables, joining tables based on related field values between those tables. Typically, relationships are based on referential integrity established between primary and foreign keys, in two tables. Perhaps the most significant factor in making queries execute at an acceptable speed is how tables are joined, and how many tables are in joins (as stated previously). When considering a database model design, the more granular a database model is (the more tables you have and the more it is broken down into small pieces), the larger the number of tables will be in join queries. In a data warehouse, this is generally much more significant because data warehouses contain huge volumes of data; however, even in OTLP databases, with a multitude of miniscule-sized transactions, large joins with ten or more tables can kill performance just as effectively as more than two large tables for a join query in a data warehouse.
The previous tip was excerpted from Chapter 8, 'Building Fast-Performing Data Models,' from the book "Beginning Database Design" by Gavin Powell. Click here for the complete collection of book excerpts.
This was first published in April 2006