SSIS Lookup component – Beware the partial cache

SSIS Lookup component – Beware the partial cache

After publishing my previous blog about making a lookup case insensitive, I received a question from my most loyal reader, my wife. Just like me, she is a BI developer, so that makes her the perfect proofreader. After reading my post, she wondered what would happen if you used a partial cache mode, and the input data didn’t match the casing of the lookup data. So I started investigating…

Setup

For this test, our source data will be the following dataset (only showing top 27 records):

160913_1Figuur 1- Product Data

We will try to lookup the correct color ID for each product. Using the lookup dataset below:

160913_2Figuur 2 – Lookup data

To test the lookup we will be using a very simple SSIS package. The screenshot below shows the layout, and the total number of rows we are processing.

160913_3

Test 1 – No cache

As a baseline, we will use a lookup using “No Cache”. This will force SSIS to Query the SQL server instance for every record in the input. A trace on the SQL server instance shows us that this is indeed the case. Below you can see the bottom rows returned by the trace, but in total there were 2517 incoming calls. So as expected, we are querying the DB once for each input row.

160913_4

Test 2 – Partial, with matching lookup

For this test we will perform the lookup on column “ColorName”. This way, the input data and lookup data will match 100% (including casing). If everything works as expected, SSIS would look up each color once, and then store it in the cache. Once again, the trace proves our point. Each color is queried once, and from then on the lookup uses the cache. This results in only 19 calls to the database instance.

160913_5

And the Log Events (OnInformation) show us that the cache behaved exactly like we expected. 16 records were sent to the Server, the other 2501 rows were matched using the cache. Perfect!

160913_6

Test 3 – Partial, not matching case

So on to our main event. In this test we use “LowerColorName”, so the input data won’t match the case of the lookup data. So what will the partial lookup do now? The trace shows us a depressing result. Just like with the “No Cache” setting enabled, the database receives 2517 calls.

160913_7
And the log events confirm this terrible result.

160913_8
What happened? Shouldn’t the partial cache remember that it already looked up “Blue” and use the cache instead of querying the database? Well yes it should, and it does. What really happened is the following. The first record that arrives, tries to lookup “Silver” in the cache. At that moment the cache is empty, so it can’t find a match there. So SSIS sends the query to the database. The database does a (case insensitive) lookup and returns the value pair (“silver”,7) which is stored in the cache. The real cache uses HASHES to store its data, the table below is a human interpretation of the content of the cache.


Color ID
silver 7

Next we do a lookup for “Blue” and the same thing occurs.


Color ID
silver 7
blue 3

So far, so good. But then we receive another lookup for “Silver”. We already did a lookup for that, so we should be able to find it in the cache. However when SSIS checks the cache for “Silver”, it finds no match. And that is because the lookup in the cache IS case sensitive. So not finding a match in the cache SSIS hands the lookup to the database. And this will be repeated for all 2517 records. So in this instance, the partial cache mode became the worst version of the no cache mode. Both will query the server for each input row and the partial cache will first waste time by looking for it in his cache.

And things might even be worse than that. If we go back to the simplified cache used above, what would happen if we continue our example? So we have “silver” and “blue” already present in the cache.


Color ID
silver 7
blue 3

The next record that arrives is another “Silver” one. SSIS looks for “Silver” in the cache, but can’t find it (remember, the lookup is case sensitive). It queries the server which returns “silver”, 7. That value, however, is already present in the cache. So what will SSIS do? Is it smart enough to realize the duplicate record, or will it add the value again. Unfortunately, the event log doesn’t return the number of records of the size of the cache for partial cache mode. Rather strange, since it does return those values when running in full cache mode.