With any text search, performance is the most important factor that should be addressed. What approach is going to run fastest for the business requirements and data content?
SQL Server full-text performance can be improved by doing the following best practices:
- Reorganize weekly and rebuild the catalogs on at least a monthly basis (optimal configurations will vary by load). Failure to do so will lead to fragmentation of the full-text index and make indexes far less efficient.
- The full-text key should be a small column. It's best to pick an
- int type, since that does not require a join to the DocID mapping table, thus improving performance. It's also faster if the key is a clustered index.
- Using the TOP_N_BY_RANK argument increases query performance since it only processes the top n number of records.
- Locate the catalogs on separate disks. I would also recommend 15K RPM drives.
- The first parameter determines the execution plan, so you might want to use OPTIMIZE FOR to pass in an optimal value to get the correct execution plan.
- I would recommend against specifying the specific index unless it's a last resort, since it generally causes problems as indexes change.
- Ensure that sp_configure 'max full-text crawl range' is set to the number of CPUs that you have available.
- Update statistics and place a second index on a timestamp column to improve incremental loads of the full-text population.
- Rather then using multiple CONTAINS predicates, try to combine them into one CONTAINS predicate containing multiple columns.
Typically, this type of execution plan will occur when the parameter passed in has a low degree of selectivity. In other words, it's usually easier to do a table scan since the query will probably return many rows compared to the total size of the table.
The cost of this approach is (table size 2,116,400 rows, 4.1gb data)
Rows: 124, 784
Now let's say that we want to force a seek. We could tell the query to optimize for something that will return a smaller result set. For example, let's use the following query, which takes a query (performance) with a large result set and optimizes it for "test", which has a smaller result set. This causes the large query to use seeks as seen in the plan below:
declare @N nvarchar(30) set @N =N'*performance*'
select * from LargeProperty_test where contains(largeProp_value, @N) OPTION (OPTIMIZE FOR (@N = '*test*'))
Rows: 124, 784
Comparison to Like Predicate Performance
This test compared a table size of 2,116,400 rows consisting of 4.1GB of data. The performance difference would narrow if the number of rows decreased or if the volume of text in the column was reduced. The performance difference would also drastically increase as the volume of text and/or the number of rows increased.
In most scenarios, the CLR stored procedure approach will actually degrade your performance over these solutions rather than improve it, so approach this methodology with caution -- especially since scope creep will probably become a problem quickly.
Shredding the text columns by word into a clustered index that has a 100% fill-factor will drastically increase your speed, though at the expense of flexibility. This would be better suited as an optional feature or compliment to Microsoft's full-text search.
Caching search results from full-text search would do more than just provide you with the flexibility of the Microsoft tool. Although initial searches would be much slower, any subsequent searches on the same keywords would be returned instantly since the results would be cached in another table optionally with the keywords as the clustered index -- a huge benefit if keywords tend to be repeated often.
I would not solely rely on SQL Server full-text search unless longer response times are acceptable. I would instead recommend a hybrid approach utilizing full-text search and another option for busy systems.
For fast text searching, I would definitely recommend 15K RPM drives separated into different RAID configurations for all components, including catalogs, TempDB, dump, logs, data, non-clustered indexes, etc. If this proves to be too slow along with a hybrid solution, then you might need to look at solid state servers rather than utilizing hard drives.
ABOUT THE AUTHOR
Matthew Schroeder is a senior software engineer who works on SQL Server database systems ranging in size from 2 GB to 3+ TB, with between 2k and 40+k trans/sec. He specializes in OLTP/OLAP DBMS systems as well as highly scalable processing systems written in .NET. Matthew is a Microsoft certified MCITP, Database Developer, has a master's degree in computer science and more than 12 years of experience in SQL Server/Oracle. He can be reached at firstname.lastname@example.org.
This was first published in May 2009