Tuesday, 1 October 2013

Two ways to improve your delta performance in SSIS with non-cryptographic hashing.

This post is a tutorial on how to replace those slow and bulky Slowly Changing Dimension (SCD) transforms in SSIS.

To implement this you need the following:

SQL Server 2012, SSIS 2012 and an intermediate understanding of Data flow components.



Introduction:

We will look at two hashing techniques: one that puts the strain on SSIS and one that puts the strain on your SQL Server data source. Both techniques have pros and cons for your ETL.

Both techniques are solid, but you need to take a few things into consideration before you choose one, for example:

  • What is your data source?
  • How big is your delta?
  • How many resources do you have on your SSIS server? (Remember that data flow buffers are allocated in RAM; the default buffer size is 10 MB and can be raised to 100 MB.)


Technique 1: Implementing hashing as a script component:

Benefits:

  • Your data source can be of any nature: OLE DB, ADO.NET, flat files, SharePoint, ODBC, etc.
Challenges and Issues:

  • You cannot use this technique on an existing warehouse without a full reload, as the data already in your destination cannot be rehashed.
  • You will have additional buffer usage in your ETL, as additional steps are required.
Process:

The concept is quite simple and the pattern can be used in all types of ETLs.
It starts with a basic control flow: an Execute SQL task that truncates the staging table, then a Data Flow task that loads both Production and Stage, and a final Execute SQL task that updates the existing records in Production from Stage (a sketch of that update follows below).
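Here is a minimal sketch of that final Execute SQL task, assuming hypothetical table names dbo.DimLocation (Production) and dbo.StageLocation (Stage) that both carry the BusinessKeyHash and DeltaHash columns:

CODE:

 -- Stage holds only the changed rows, so joining on BusinessKeyHash updates just the delta.
 UPDATE d
 SET    d.CountryCode = s.CountryCode
       ,d.CountryName = s.CountryName
       ,d.State       = s.State
       ,d.DeltaHash   = s.DeltaHash
 FROM   dbo.DimLocation AS d
        INNER JOIN dbo.StageLocation AS s
                ON d.BusinessKeyHash = s.BusinessKeyHash;

The same statement closes out both techniques; only the way the hashes are produced differs.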


Data Flow Task:
1: A basic SELECT statement from a SQL table.
2: A Derived Column task that builds a Business Key string and a Delta string by concatenating the keys and attributes.
3: A Script Component that takes the two columns from step 2, hashes them with the FNV1a hashing algorithm and converts them to BIGINTs.
4: A Lookup task that looks up the BusinessKeyHash in the destination table.
5: An insert of the non-matched records from step 4 into the Production table.
6: A Conditional Split that discards rows where the DeltaHash is the same in source and destination.
7: An insert of the changed DeltaHash records into a staging table.
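For step 1, nothing special happens in the source query; the hashes are built later in the data flow. A minimal sketch, assuming the same WorldProvinceGeography source table used in technique 2 below:

CODE:

 -- Technique 1 source query: no hashing here; the Derived Column and Script
 -- Component steps build the BusinessKeyHash and DeltaHash downstream.
 SELECT  CONVERT(nvarchar(100), [Location_ID]) AS SourceKey
        ,[ISO]  AS CountryCode
        ,[ADM0] AS CountryName
        ,[ADM1] AS State
 FROM    WorldProvinceGeography;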

Please see the SSIS links below for the source. 

 
Technique 2: Implementing hashing from source (Hashbytes):

Benefits:

  • A simple and minimalist approach that requires fewer components and buffers.
  • Can be implemented after the fact, as existing destination tables can be rehashed.
  • Higher load speeds on large volumes of data.
Challenges and Issues:

  • You cannot use this technique if your data source is anything other than SQL Server.
Process:

The base concept is the same as in the 1st technique:
  • Create a BusinessKeyHash by concatenating your business keys.
  • Create a DeltaHash by concatenating your attribute columns.
It uses the same three control flow steps:
  1. An execute SQL task that truncates a staging table.
  2. A Data Transformation task that loads source data into Production and Staging.
  3. A second Execute SQL task that updates Production with staged data.

The 1st task is, as per technique 1, a simple TRUNCATE TABLE {StageTableName}.

The 2nd task has the same concept as technique 1 but with a fundamental change to where the hashes are generated:

To start off with, our source query now uses the HASHBYTES() function to generate the BusinessKeyHash and DeltaHash columns up front; by doing this we eliminate the Derived Column and Script Component steps that we had in technique 1.

Here are the steps and all you need to implement them:

1st Step:
This is a simple SELECT where we concatenate the relevant business keys in our source data and pass them to the HASHBYTES function; we then convert the outputs to BIGINT, e.g.:
CODE:

 SELECT  CONVERT(nvarchar(100), [Location_ID]) AS SourceKey
        ,[ISO]  AS CountryCode
        ,[ADM0] AS CountryName
        ,[ADM1] AS State
        ,CONVERT(BIGINT, HASHBYTES('SHA1', CONVERT(nvarchar(100), [Location_ID]))) AS BusinessKeyHash
        ,CONVERT(BIGINT, HASHBYTES('SHA1', ISO + ADM0 + ADM1)) AS DeltaHash
 FROM    WorldProvinceGeography

Please note that I have used SHA1 as my hashing algorithm instead of FNV1a, as FNV1a is not supported by SQL Server's HASHBYTES function.
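Because the hashes are generated with plain T-SQL, an existing destination table can also be back-filled in place, which is the rehashing benefit mentioned above. A minimal sketch, assuming a hypothetical destination table dbo.DimLocation with matching columns:

CODE:

 -- Back-fill the hash columns on an existing destination so technique 2 can be
 -- introduced without a full reload (hypothetical table and column names).
 -- The data types fed to HASHBYTES must match the source query exactly,
 -- or the hashes will not line up.
 UPDATE dbo.DimLocation
 SET    BusinessKeyHash = CONVERT(BIGINT, HASHBYTES('SHA1', CONVERT(nvarchar(100), SourceKey)))
       ,DeltaHash       = CONVERT(BIGINT, HASHBYTES('SHA1', CountryCode + CountryName + State));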

2nd Step:
We now need to do a lookup on our destination so that we can ascertain whether we already have the records in our destination table; again, it is important to surface only the lookup and reference columns, as we do not want to stress the ETL and SSIS's buffers.

In the Lookup, set the handling of rows with no matching entries to "Ignore failure".

In the Connection window, select the destination connection manager and use a SQL query.
Please note: only select the columns you actually use, as it will impact performance if you have a large data set.
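A minimal sketch of such a reference query, assuming the same hypothetical destination table dbo.DimLocation:

CODE:

 -- Lookup reference query: return only the columns needed for matching,
 -- which keeps the lookup cache small (hypothetical destination table).
 SELECT  BusinessKeyHash
        ,DeltaHash
 FROM    dbo.DimLocation;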

Finally, set your Lookup column mapping and output columns.

The 3rd step is a Conditional Split; it is not required, but I added it to illustrate the component.

All I do is create two outputs, NEW and Existing, based on whether the looked-up BusinessKeyHash column is NULL (NEW) or not (Existing).

The 4th step is a direct insert of the NEW output into the destination table.


The 5th step is a 2nd Lookup task that looks for a match on both the DeltaHash and the BusinessKeyHash in the destination table; rows that do not match (their DeltaHash has changed) are inserted into the staging table in step 6, and rows that match are discarded.



Performance Analysis:

So the performance gain over a stock-standard SCD with a sample set of 79,000 rows on a direct insert is roughly 8,000%. If we compare that against a T-SQL insert and update of the same set, the hash loads are well over twice as fast (3.4 seconds on average for the hash loads versus 9 seconds for the T-SQL).

If we look at a full update, the hashing and T-SQL approaches still produce much the same results, but the SCD's processing time ramps up by a factor of 8, from 5 minutes to 40 minutes.



Verdict and Recommendation:

If you use only Microsoft SQL Server as a data source, I would recommend using HASHBYTES, as it is faster with larger amounts of data and allows introduction into an existing data warehouse without running a back-flush or trashing your historical snapshot facts.

Should you, however, load from flat files and other sources, the Script Component makes it easier to deploy, as you don't need to import into SQL Server for HASHBYTES and you can still run a full enterprise-ready ETL.


If you don't know how to use SSIS then I guess the T-SQL option is for you, but be aware that this will only work effectively if your source database and destination database are on the same SQL Server instance.
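For reference, here is a minimal sketch of what that pure T-SQL delta load might look like, using the same hypothetical dbo.DimLocation destination and the WorldProvinceGeography source:

CODE:

 -- Pure T-SQL alternative: update changed rows, then insert new ones.
 -- Practical only when source and destination are on the same instance.
 UPDATE d
 SET    d.CountryCode = s.ISO, d.CountryName = s.ADM0, d.State = s.ADM1
 FROM   dbo.DimLocation AS d
        INNER JOIN WorldProvinceGeography AS s
                ON d.SourceKey = CONVERT(nvarchar(100), s.Location_ID)
 WHERE  d.CountryCode <> s.ISO OR d.CountryName <> s.ADM0 OR d.State <> s.ADM1;

 INSERT INTO dbo.DimLocation (SourceKey, CountryCode, CountryName, State)
 SELECT  CONVERT(nvarchar(100), s.Location_ID), s.ISO, s.ADM0, s.ADM1
 FROM    WorldProvinceGeography AS s
 WHERE   NOT EXISTS (SELECT 1 FROM dbo.DimLocation AS d
                     WHERE  d.SourceKey = CONVERT(nvarchar(100), s.Location_ID));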


Here is the Complete Project for SSIS and the databases that you need.


If you have any queries, please send me a tweet at @AndreasBergstedt or connect on Google+.

