Performance Pain: To LIKE or not to LIKE
Posted by Raj Kashyap on Wednesday, November 24, 2010
We have agonized over excruciating SQL performance pains. Some of them defy elegant solution. Some don’t seem to go away, like a persistent migraine. Sometimes, the problem is a specific type of query that is always slow. This type of query is typically used in stored procedures, executed by several applications and batch processes. The performance seems to get drastically degraded as the table grows over time. Oddly, it has worked fine for some time in the past and nothing in the SQL itself has changed. Moreover, it is always fast when embedded in front-end applications. So, why is it slow when it is in a stored procedure?
Applies to: Sybase 15
The Innocuous Query
At first glance, the SELECT statement looks quite typical and harmless:
CREATE PROCEDURE FindTypes @categoryCd VarChar(4), @typeCd VarChar(4) BEGIN SELECT CategoryCd, TypeCd, EffDt, TermDt FROM HugeTable WHERE CategoryCd = @categoryCd AND TypeCd LIKE (@typeCd + '%') AND TermDt > GetDate() ORDER BY CategoryCd, TypeCd, EffDt END
For this scenario, let us consider the following table structure:
CREATE TABLE HugeTable ( CategoryCd VarChar(4), -- These three columns TypeCd VarChar(8), -- are indexed, EffDt DateTime, -- unique and clustered TermDt DateTime, OtherCol_1 ..., -- Other columns not OtherCol_2 ... -- relevant to the scenario ) CREATE UNIQUE CLUSTERED INDEX idx_PK_HugeTable ON HugeTable(CategoryCd, TypeCd, EffDt)
The procedure used to work reasonably well when the table was small. Though not very efficient, the performance pain was not noticeable. Since it was returning fast nobody bothered to check the execution plan; until the table grew larger and the pain became acute.
The Naive Assumptions
The problem with above type of query, specifically in Sybase, is the simplistic assumption that LIKE operator will use the index if “%” is used as the suffix. That is we tend to assume Column LIKE ‘Value%‘ will use the index and Column LIKE ‘%Value’ will not. That assumption may be true only when the Value is a string literal. When the Value to be searched is stored in a variable, that is in the case of Column LIKE @param + ‘%‘, Sybase decides to ignore the index and instead do a full blown table scan.
The other equally dangerous assumption is trusting the database to use the index at least for the first column (CategoryCd). But after I tested many different combinations of index and WHERE clause with LIKE usage, I was disappointed to find out that Sybase betrays everything we expect from a good optimizer, and decides to go with a full table scan – if the LIKE operator is not used with care.
In my specific scenario, it still would not have helped if Sybase used the index just for the first column because of the nature of data client had in the HugeTable. Out of the 15 million (and increasing rapidly) rows in the table, more than 95% had the same CategoryCd.
The Optimizers’ Dilemma
During several trials, I observed several strange index selection behavior (based on type of index and LIKE operator usage – to keep this short, I am not going to go over every scenario that I experimented with.) This ordeal eventually led to the following workaround, guideline and/or checklist, which seems to help the optimizer out of its dilemma:
- 1. Append “%” to the @typeCd before using it in the query. If the “%” is appended in the WHERE clause, Sybase will decide to read through the entire table.
SET @typeCd = @typeCd + '%' -- ... in the SELECT ... WHERE ...: AND TypeCd LIKE @typeCd -- ...
- 2a. In the WHERE clause, use all columns that are in the index. We had queries that were not using the EffDt column in the WHERE clause. Wherever possible, rewrite to use all columns in the index. Without this, Sybase will still do a table scan, even if the previous (#1) recommendation was implemented.
DECLARE @effDt DateTime SET @effDt = DateAdd(dd, -90, GetDate()) -- ... in the SELECT ... WHERE ...: AND EffDt >= @effDt -- ...
There could be scenarios where it may not be possible to use all the columns in the index, but effort must be made. In the above example, we could have SET @effDt = ‘1/1/2010′ (of course not hard-coded), just so that Sybase will choose the index. Such a start value can be ascertained based on the nature of the data in the table and the business need.
- 2b. If it is not possible to use all the columns in the index, all is not lost. Once #1 is in place, we are in a much better situation, in terms of performance. Now we can suggest the index to use:
-- ... in the SELECT ...: FROM HugeTable (INDEX idx_PK_HugeTable) -- ...
This is not really an ideal solution, but it is way better than the table scan. Based on the index hint, optimizer will use the first two columns in the index – CategoryCd and TypeCd. Keep in mind, if #1 is not implemented, using just the index hint, the optimizer will use only the CategoryCd column.
What would Microsoft Do?
If I have a problem in Sybase Adaptive Server, I usually try to recreate the scenario in MS SQL Server – and vice versa. You can utilize the knowledge and insight you get from one system, to make the other one perform better.
So I recreated the scenario on MS SQL Server 2005, setup the table and copied all 15 million rows over. Then ran the very first query without any modification – and – lo and behold – it returned in an instant. It just worked without any help, no matter how the parameter is supplied. I can use a string literal or parameter variable; append the “%” before the query or right there in the LIKE operator. No need to worry at all. So what does Microsoft do? We need to see SQL Server’s execution plan for our query…
Nested Loops(Inner Join, OUTER REFERENCES: (LikeRangeStart([@typeCd]+'%'), LikeRangeEnd([@typeCd]+'%'), LikeRangeInfo([@typeCd]+'%')) ) Compute Scalar(DEFINE: ([ConstExpr1004]=LikeRangeStart([@typeCd]+'%'), [ConstExpr1005]=LikeRangeEnd([@typeCd]+'%'), [ConstExpr1006]=LikeRangeInfo([@typeCd]+'%')) ) Constant Scan Clustered Index Seek(OBJECT: ([Ordeals].[dbo].[HugeTable].[idx_PK_HugeTable]), SEEK: ([Ordeals].[dbo].[HugeTable].[CategoryCd]=[@CategoryCd] AND [Ordeals].[dbo].[HugeTable].[TypeCd] > LikeRangeStart([@typeCd]+'%') AND [Ordeals].[dbo].[HugeTable].[TypeCd] < LikeRangeEnd([@typeCd]+'%')), WHERE: ([Ordeals].[dbo].[HugeTable].[TermDt]>GetDate() AND [Ordeals].[dbo].[HugeTable].[TypeCd] like [@typeCd]+'%') ORDERED FORWARD )
MS SQL Server’s parser, once it encounters the LIKE operator, it rewrites the query to use the LikeRangeStart() and LikeRangeEnd() internal functions. So, no matter how you supply the parameter, this strategy will work efficiently and use the appropriate index. Also note, since I ran the original query as-is, the EffDt column was not being used in the WHERE clause. I was not using all columns from the index in the WHERE clause. It still didn’t affect the performance; although it would be even faster if I did.
Why it works from the front-end?
In the case of the query always performing well when embedded in front-end application – it is because the query gets prepared and the parameter substitution happens before the statement reaches the Sybase database. So in this case, LIKE effectively is transformed to using the string literal Column LIKE ‘Value%‘, and Sybase gets this prepared statement, then it obviously uses the index.
That’s all folks!
Use LIKE with caution. All databases are not made equal. If you have the same query running against multiple database systems (example: Sybase ASE as the main DB and MS SQL Server for reporting), check the execution plan on all systems and ensure proper index usage. Complex solutions, that could improve performance, will complicate maintenance and almost always lead to more pain. When tuning, keep on looking until a simple, effective and elegant solution is found. That should help alleviate the pain.