As you develop SQL Server applications that hundreds of users will tap into at once -- using databases with up to millions of rows -- it is critical that you make those applications as efficient as possible so that the hardware can keep up with the workload. This applies to both the database and the client-side application code. Follow these tips and guidelines on fine-tuning your applications and you’ll improve SQL Server performance in no time.
It’s all about indexing
Pretty much everyone with even a little database experience agrees that indexes are crucial for developing efficient database applications. Yet in practice, indexing gets neglected or isn’t implemented properly. Often, senior-level software engineers develop applications that lack proper indexes (see “Four steps to proper table indexing” below).
Without proper indexes, SQL Server has to scan all data in the table to find matching rows. Imagine how difficult it would be to look for certain words in a book without having an index at the end. You would have to read the book cover to cover every time you needed to find something. But that’s essentially what SQL Server has to do anytime you look for data in a large table that’s not indexed. As a rule, you should index columns that are included in the WHERE clause.
Batch-processing and batch-import applications are used to read and process huge amounts of input data and move it over to a destination database. It is not unusual for these applications to handle millions of records. One mistake I have seen developers make in these types of applications is to write the code as if they were developing regular applications -- handling one update at a time as a self-contained piece of code.
For example, for each row processed by ADO.NET, an application starts a while loop: It creates a connection object, connects to the database, creates a command object, populates it with a newly created collection of parameters and makes a database call, and it cleans up all these objects as the loop is running. In a batch-processing application, all that initializing and cleanup is expensive. One time is enough.
For maximum efficiency, pull initialization and cleanup code out of the loop and tighten it as much as possible. The code that creates and initializes ADO.NET objects like connections, commands and parameters should not be in the while loop that processes the data. Create and initialize ADO.NET objects just once, and then modify the parameter values and update the database in the processing loop. Do the cleanup at the end, outside of the loop.
Remember, when you are processing millions of rows, even a few milliseconds saved makes a difference. When you multiply that by the astronomical number of processed records, you could be saving hours of processing time.
Careful what you COUNT(*)
Here is another coding inefficiency developers commit: “If the result of SELECT COUNT(*) is greater than 0, do something.” If you look at it more closely, you’ll see that what we want is “If at least one row meets the condition, do something.” The problem with using COUNT(*) here is that the SQL query will count all the rows that match; instead, we can stop it after finding the first match using the EXISTS query. So keep this subtle difference in mind and count everything only when you need the actual count. Use EXISTS when all you need is to check for the existence of at least one matching row.
Get queries, data types in sync
When writing Transact-SQL and ADO.NET code that uses parameters, it is important that you match data types in the queries with column data types in the tables. In some cases SQL Server can do an implicit data type conversion for you (e.g., convert a number to a string). But what you may not realize is that while the conversion of the data types will not affect the results, SQL Server may not be able to use the existing index on the column being queried.
For an example, let’s turn to Microsoft’s AdventureWorks sample database. In the Employee table, there is a column called NationalIDNumber; it is defined in the database as the variable-length data type NVARCHAR(15). The column is indexed, so SQL Server should be able to perform a quick lookup if you search for a particular ID. Now look at these two queries that return the same result:
* FROM [HumanResources].[Employee]
WHERE NationalIDNumber = N'233069302'
* FROM [HumanResources].[Employee]
WHERE NationalIDNumber = 233069302
Figure 1 shows the execution plans for both queries. (In each window, “Query cost” shows what percentage of execution time the query took, and percentages for each step are displayed below the query lines.)
Figure 1. This execution plan shows that when the data types match (top window), SQL Server performs an index seek to find matching data. When the data types don’t match in the second query, SQL Server uses a less efficient index scan.
In the first query, we search using an NVARCHAR string, N'233069302', so the data type in the query matches the data type in the table. SQL Server uses the existing index on the NationalIDNumber column and performs an index seek. That gives us a quick lookup.
In the second query, we search using an integer, 233069302, so there is a type mismatch between the query and the data type in the table. Subsequently, SQL Server converts the value in each row in the table from NVARCHAR to INTEGER to perform a comparison, and it scans all the rows in the table. It still uses the index, because it’s faster to scan all national IDs in the index pages as opposed to data pages, but a scan requires much more processing to deliver the same result.
In this particular database, the difference in execution is not very large (54% of total execution time for both queries versus 46%), but if the table contained millions of rows, we could be talking about an execution time taking valuable seconds instead of milliseconds. The difference in the query is subtle, but it’s an easy mistake to make if the programmer assumes the data type in the table was stored as an integer and doesn’t double check.
Indexing your databases makes the applications faster and more responsive, and efficient coding cuts down on processing time. When you are using small databases and your front-end applications aren’t accessed by many concurrent users, inefficiencies aren’t noticeable. But as the data piles on and your applications become more heavily used, the database and the applications can slow down to the point where they’re not usable, users will get timeouts and your customers will be unhappy. Following these coding and database efficiency practices and coding your applications with scalability in mind will ultimately improve SQL Server performance.
Four steps to proper table indexing
The importance of indexing for speedier applications cannot be stressed enough. Use this indexing checklist as a guideline for making sure your tables are indexed properly.
- Create a clustered index. To help SQL Server organize data more efficiently, each table should have a clustered index, which determines the order of data in a table. Choose your ordering method carefully (by last name, ID, date of birth) because you can only have one clustered index.
- Create a primary key. Each table should have a primary key to implement entity integrity. In some cases, you may also want to have a clustered index on the primary key. When you are doing a range search on dates, for example, it’s best to create a clustered index on the date column.
- Create unique constraints. If you have a group of columns in which each row should have a distinct value, enforce this requirement in the table design. You could implement it as a unique index or a unique constraint. The latter option is encouraged in SQL Server Books Online because it makes the objective clearer.
- Create indexes on foreign keys. You’ll want to create an index on your foreign keys so that child records can be found quickly. But don’t blindly create an index on each foreign key in the table. Take column selectivity into account and index only columns that will contain numerous distinct values. For example, indexing the OrderID column in the OrderDetails table makes sense, because the column is selective; it will have as many distinct values as the number of orders. But if you have a column like OrderStatusID, you will only have a few distinct values across the table, so the index won’t likely be used.
ABOUT THE AUTHOR
Roman Rehak is principal database architect at MyWebGrocerin Colchester, Vt. He specializes in SQL Server development, database performance tuning, ADO.NET and writing database tools. He contributes to Visual Studio Magazine, SQL Server Magazine and other publications and presents at user groups and conferences in the U.S. and Canada.