Thursday, 3 October 2013

Need some help in the BI stack??

Just discovered sqlservercentral.com's Stairway series...  


This is worthwhile; they cover everything from SQL Server to BIML.


It is a good read with well-written tutorials,

so give it a bash and sign up, it is free :)



Wednesday, 2 October 2013

Practical use of the Conditional Split in SSIS to speed up your sequential data load (splitting the load)

This post is rooted in the challenge of handling a sequential incremental load from an event-based system that has multiple source tables or files, i.e. one table or file per month of the year.


Here is a practical example of how to speed up your sequential processing in SSIS:



First things first:

Build your initial sequential Foreach Loop that processes all your source tables or files correctly.








Then Create and test your data flow task:



When this is done and tested, create a copy of your package and start by building the following items:

1: Add a Data Flow Task to handle your list of files:



2: In the task, add a Flat File source (1) if you are reading files, or a SQL Server source if you are reading tables, that reads the list of file names together with a row number. (In SQL this would be a select from sys.objects using the ROW_NUMBER() function; for flat files you would use PowerShell or your favorite scripting tool to generate the file list.)
Then add a Derived Column transform (2) that creates two additional columns.
The first, ProcessID, is the modulus of the sequential numbering (called Index in my solution), which gives you an even spread of values from 0 to 3 across your index range.
The second column, "FullFileName", is a concatenation of a folder location and the "Filename" column from my list; I will use it as my connection string in the steps to come.
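
For reference, here is roughly what that list looks like in T-SQL for the table-based case. This is just a sketch: sys.tables stands in for the monthly source tables and the folder path is a made-up placeholder (for flat files the same two extra columns would come from the Derived Column transform instead):

 -- ROW_NUMBER() gives the sequential Index, Index % 4 spreads ProcessID evenly over 0 to 3,
 -- and FullFileName concatenates a placeholder folder with the object name.
 SELECT name                                  as [FileName]
       ,ROW_NUMBER() OVER (ORDER BY name)     as [Index]
       ,ROW_NUMBER() OVER (ORDER BY name) % 4 as ProcessID
       ,N'C:\SourceFiles\' + name             as FullFileName
   FROM sys.tables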


After this we create a Conditional Split transform (3):
It evaluates ProcessID and, based on its value, routes each row to one of four outputs, called P1, P2, P3 and P4 here (the conditions are simply ProcessID == 0, ProcessID == 1, ProcessID == 2 and ProcessID == 3).


Next, insert the data into four Recordset destinations (4, 5, 6, 7).

They will be consumed from the control flow by four separate Foreach loops.


Now comes the fun part:

We now add four Foreach Loop containers using the Foreach ADO enumerator.
Each ADO loop is assigned to its own Object variable (matching the Recordset destinations above).

Now take your previous data flow task from the sequential package and place a copy in each ADO loop (remember to change the variable mappings and to add the additional Flat File connection managers, each with an appropriate connection string expression based on the mapped FullFileName value).



Once this is done, voilà, you are done...

Here is the performance comparison I got:

The sequential package processed 99 files of 790 records each in 1 minute and 4 seconds, while the parallel package ran the same load in 23 seconds.

One thing to remember is that the packages ran at the same time, against the same data source and destination, so obviously there was some locking and I/O lag as a result, but you can see that there is definitely a performance improvement.



I have attached the SSIS project and test files (and DB) in the links below:

Database
SSIS Project and Flat Files

Have fun!

Tuesday, 1 October 2013

Why do you store a GUID in your database?

I remember way back when I first came across a database that used GUIDs to identify uniqueness.

Well, today I don't see the point in storing the GUID.

A GUID is 128 bits, displayed as a 32-character hexadecimal string with an additional 4 hyphens as separators: 36 characters in total.

Most of the stored GUIDs I have come across are kept as 36-character strings, often even Unicode (which doubles the storage again), and this is just because the warehouse architect didn't care about how much data he loaded onto the warehouse or its clients.

Let's do some number crunching:

So let's say that you have 100 million rows in a table, and the table has a row GUID and a related GUID (a foreign key), both stored as 36-character strings. Each row then consumes at least 72 bytes in identities, which means the table consumes at least 6.7GB in identities alone... so here is my suggestion: how about we use the GUID only as a seed for a hash and convert the hash to a BIGINT, i.e.:
CONVERT(BIGINT, HASHBYTES('SHA1', CONVERT(NVARCHAR(36), NEWID())))

This way we cut the stored GUID down from 36 bytes to 8 bytes, reducing the size of the stored identities by a factor of 4.5; in our example above that brings the space consumed down to roughly 1.5GB.

If you are now thinking, "hold on, how do I trace back to my source data if I convert a system GUID (SourceKey) to a hashed BIGINT?", that is simple: since you know how you seed the hash, you can always recalculate it and find the row in both your source and your warehouse.
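
To illustrate, here is a minimal sketch of that trace-back (the GUID value below is just an example):

 -- Recompute the warehouse key from a known source GUID and use it to find the row
 -- on either side. The seed is the same 36-character string used at load time.
 DECLARE @SourceGuid UNIQUEIDENTIFIER = 'A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11';

 SELECT CONVERT(BIGINT, HASHBYTES('SHA1', CONVERT(NVARCHAR(36), @SourceGuid))) as WarehouseKey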


Happy hunting!

Two ways to improve your delta performance in SSIS with non-cryptographic hashing

This post is a tutorial on how to replace those slow and bulky SCD transforms in SSIS

To implement this you need the following:

SQL Server 2012, SSIS 2012 and an intermediate understanding of Data flow components.



Introduction:

We will look at two hashing techniques: one that puts the strain on SSIS and one that puts the strain on your SQL Server data source. Both techniques have pros and cons for your ETL.

Both techniques are solid, but you need to take a few things into consideration before you choose one, i.e.:

  • What is your data source? How big is your delta? And how much resource do you have on your SSIS server? (Remember that every transformation adds data flow buffers that are consumed in RAM.)


Technique 1: Implementing hashing as a script component:

Benefits:

  • Your data source can be of any nature: OLE DB, ADO.NET, flat files, SharePoint, ODBC, etc.
Challenges and Issues:

  • You can not use this technique on an existing warehouse without a full reload, as the data already in your destination can not be rehashed.
  • You will have additional buffer usage in your ETL, as additional steps are required.
Process:

So the concept is quite simple and the pattern can be used in all types of ETL.
It starts with a basic control flow: an Execute SQL task that truncates Stage, then a Data Flow task that loads Production and Stage, and a final Execute SQL task that updates the existing records from Stage (sketched after the step list below).


Data Flow Task:
1: A basic SELECT statement from a SQL table.
2: A Derived Column task that creates a business key string and a delta hash string from the keys and attributes.
3: A Script component that takes the two columns from step 2, hashes them with the FNV1a hashing algorithm and converts them to BIGINTs.
4: A Lookup task that looks up the BusinessKeyHash in the destination table.
5: An insert of the non-matched records from step 4 into the Production table.
6: A Conditional Split that discards rows where the DeltaHash is the same in source and destination.
7: An insert of the changed DeltaHash records into a staging table.
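
The final Execute SQL task is then just a join from Stage back to Production on the business key hash, and the same task closes out technique 2 below. A minimal sketch, with placeholder table and column names (dbo.Production, dbo.Stage, Attribute1/Attribute2):

 -- Only changed rows reach Stage, so a straight join on BusinessKeyHash is all the update needs.
 UPDATE p
    SET p.Attribute1 = s.Attribute1
       ,p.Attribute2 = s.Attribute2
       ,p.DeltaHash  = s.DeltaHash
   FROM dbo.Production as p
   JOIN dbo.Stage      as s
     ON p.BusinessKeyHash = s.BusinessKeyHash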

Please see the SSIS links below for the source. 

 
Technique 2: Implementing hashing from source (Hashbytes):

Benefits:

  • A simple and minimalist approach that requires fewer components and buffers.
  • Can be implemented after the fact, as destination tables can be rehashed.
  • Higher load speeds on large volumes of data.
Challenges and Issues:

  • You can not use this technique if your data source is anything other than SQL Server.
Process:

The base concept is the same as in the 1st technique:
  • Create a BusinessKeyHash by concatenating your business keys.
  • Create a DeltaHash by concatenating your attribute columns.
It uses the same three control flow steps:
  1. An Execute SQL task that truncates a staging table.
  2. A Data Flow task that loads source data into Production and Staging.
  3. A second Execute SQL task that updates Production with the staged data.

The 1st task is, as per technique 1, a simple Truncate Table {StageTableName}.

The 2nd task follows the same concept as technique 1, but with a fundamental change to where the hashes are generated:

To start off, our source query now uses the HASHBYTES() function to generate the BusinessKeyHash and DeltaHash columns up front; by doing this we eliminate the Derived Column and Script components that we had in technique 1.

Here are the steps and all you need to implement:

1st Step:
This is a simple SELECT where we concatenate the relevant business keys in our source data, feed them to the HASHBYTES function, and then convert the outputs to BIGINT, i.e.:
CODE:

 SELECT Convert(NVARCHAR(100), [Location_ID]) as SourceKey
       ,[ISO]  as CountryCode
       ,[ADM0] as CountryName
       ,[ADM1] as State
       ,Convert(BIGINT, HASHBYTES('SHA1', Convert(NVARCHAR(100), [Location_ID]))) as BusinessKeyHash
       ,Convert(BIGINT, HASHBYTES('SHA1', [ISO] + [ADM0] + [ADM1]))               as DeltaHash
   FROM WorldProvinceGeography

Please note that I have used SHA1 as my hashing algorithm instead of FNV1a, as FNV1a is not supported by the SQL HASHBYTES function.

2nd Step:
We now need to do a lookup against our destination so that we can ascertain whether the records already exist in the destination table. Again, it is important to surface only the lookup and reference columns, as we do not want to stress the ETL and SSIS's buffers.

In the Lookup, set the no-match handling to "Ignore failure".









In the connection window, select the destination connection manager and use a SQL query.
Please note: only select the columns that are actually used and important, as it will impact performance if you have a large data set.
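
As a guide, the reference query can be as small as this (the destination table name below is a placeholder):

 -- Reference query for the first lookup: only the column needed to detect whether
 -- the business key already exists in the destination.
 SELECT BusinessKeyHash
   FROM dbo.DimWorldProvinceGeography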







Finally set your look-up column mapping and output columns. 







The 3rd step is a Conditional Split; it is not required, but I added it to illustrate the component.

All I do is create two conditions, NEW and Existing, based on whether the lookup output columns are NULL.







The 4th step is a direct insert of the NEW output into the destination table.


The 5th step is a second Lookup task that looks for matches on the DeltaHash and the BusinessKeyHash in the destination table; rows that don't match are inserted into the staging table in step 6, and rows that match are discarded.
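
For this second lookup the reference query surfaces both hashes (again with a placeholder destination table name):

 -- A row matching on both hashes is unchanged and gets discarded; a row whose
 -- DeltaHash no longer matches falls out as a no-match and is routed to staging.
 SELECT BusinessKeyHash
       ,DeltaHash
   FROM dbo.DimWorldProvinceGeography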



Performance Analysis:

So the performance gain over a stock-standard SCD transform with a sample set of 79,000 rows on a direct insert is around 8,000%. If we compare that against a T-SQL insert and update of the same set, the hash loads come out more than twice as fast (3.4 seconds on average for the hash loads versus 9 seconds for the T-SQL).

If we look at a full update, the hashing and T-SQL approaches still deliver the same results, but the SCD's processing time ramps up by a factor of 8, from 5 minutes to 40 minutes.



Verdict and Recommendation:

If you use only Microsoft SQL Server as a data source, I would recommend using HASHBYTES, as it is faster with larger amounts of data and allows for introduction into an existing data warehouse without running a back-flush or trashing your historical snapshot facts.

Should you, however, load from flat files and other sources, the script component makes it easier to deploy, as you don't need to import into SQL Server first just to run HASHBYTES, and you can still run a fully enterprise-ready ETL.


If you don't know how to use SSIS then I guess the T-SQL option is for you, but be aware that this will only work effectively if your source database and destination database are on the same SQL Server instance.
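
For completeness, here is a rough sketch of what that pure T-SQL variant could look like, reusing the hash logic from the source query above (the destination table name is a placeholder and the column list is trimmed down):

 -- MERGE keyed on BusinessKeyHash: new keys are inserted, existing keys are only
 -- updated when the DeltaHash differs.
 MERGE dbo.DimWorldProvinceGeography as d
 USING (
        SELECT Convert(BIGINT, HASHBYTES('SHA1', Convert(NVARCHAR(100), [Location_ID]))) as BusinessKeyHash
              ,Convert(BIGINT, HASHBYTES('SHA1', [ISO] + [ADM0] + [ADM1]))               as DeltaHash
              ,[ISO], [ADM0], [ADM1]
          FROM WorldProvinceGeography
       ) as s
    ON d.BusinessKeyHash = s.BusinessKeyHash
 WHEN NOT MATCHED BY TARGET THEN
      INSERT (BusinessKeyHash, DeltaHash, ISO, ADM0, ADM1)
      VALUES (s.BusinessKeyHash, s.DeltaHash, s.ISO, s.ADM0, s.ADM1)
 WHEN MATCHED AND d.DeltaHash <> s.DeltaHash THEN
      UPDATE SET d.DeltaHash = s.DeltaHash
                ,d.ISO  = s.ISO
                ,d.ADM0 = s.ADM0
                ,d.ADM1 = s.ADM1;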


Here is the Complete Project for SSIS and the databases that you need.


If you have any queries, please send me a tweet @AndreasBergstedt or connect on Google+.