Tuesday, 11 November 2014

How to make your ODS data ETL Dynamic

How to make your ODS data ETL Dynamic Part 1



Background to this blog:

Before I begin, I need to reiterate that this is a three-part series. I have decided to split it into three parts because I would never finish the blog if I tried to make it a novel.

We all know that SSIS is not dynamic out of the box, and this has been its pitfall in most business cases and value propositions when a Microsoft BI organisation comes up against the likes of the Informatica sales machine.

The truth is that Informatica is just as dynamic as SSIS; the difference is that Informatica spent millions of dollars on module development to make the business case easy, whereas with most Microsoft partners the ETL sale is always a cost vs. ability sale.

If we look at what Informatica can do vs. SSIS, we will soon realize that SSIS can do everything that Informatica can do, and at a fraction of the cost in licensing. The difference is that the Microsoft BI partner that makes the SSIS sale will need to spend a whole lot more on custom development than the Informatica partner (well, it depends on the project and all its variables).

So here is my attempt to help the Microsoft BI partner community excel in their technical ability and make the data take-on process work faster for them.

Why should I look at "Black Box" development for my data take-on?


Before I begin with the business case, we need to answer the following question:

What is an Operational data store?
  1. An operational data store (or "ODS") is a database designed to integrate data from multiple sources for additional operations on the data. Unlike a master data store the data is not passed back to operational systems. It may be passed for further operations and to the data warehouse for reporting.

As a norm you try to keep your ODS as close as possible to a representation of your source system, with minimal transformation, and this makes a dynamic and generic ETL an almost perfect fit for any ODS ETL project.

There is obviously a bit of project-specific development that comes with every project, but if you can "Black-Box" 99% of the development effort it will ensure that you complete your data take-on faster and deliver it more robustly. That is:

Traditional ODS ETL data take-on development time calculation:
Source Tables * ETL Development Time + Deployment Time = Effort

If we have 1 table to take on and each ETL takes 30 minutes to develop, deploy and send for UAT, then there is no need to go the "Black-Box" route, as it will take you just half an hour.

However, if we look at taking on 100 tables, the traditional effort amounts to an estimated 50 hours in the above case.

If we take the "Black Box" approach the development time calculation is as follows:

Configuration time + Deployment Time = Effort

The following general rule fits 99% of cases: 30 minutes to configure and 30 minutes to deploy. In other words, in 99% of all ODS ETL development cases it will take you one hour before you can run your first complete ODS ETL and send it off to UAT for testing, whether that is 1 table or 100 tables, 1 source file or 100 source files.

At this point you will probably be thinking, "OK, but what about transformations?", and my response to that will be: look at the definition of an ODS above and rethink your question.

And the choice above is obvious: you will go the "Black-Box" route, as it is 1 hour of development vs. 50 hours.
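To put the numbers side by side, using the 30-minute figures assumed above:

Traditional: 100 tables x 30 minutes per table = 3,000 minutes, roughly 50 hours
Black-Box: 30 minutes configuration + 30 minutes deployment = 1 hour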


How is it done?


The "Black-BOX" solution is designed to dynamically create all your source data as a one to one in the ODS, it is not designed to handle delta from source but this can be easily managed in the ODS itself.

Out of the box the solution can handle two types of source data: Microsoft SQL Server databases and text files such as CSV files.

You can filter source tables on schema name and table name strings (with wild cards), and text files on file name and extension string patterns.
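As an illustration, below is a minimal sketch of how such a wildcard filter could be applied to the source metadata; the patterns shown are examples only, and the actual solution may apply them elsewhere in its configuration.

-- Hypothetical example: pick up only the source tables whose schema and
-- table names match the configured wildcard (LIKE) patterns.
SELECT  t.TABLE_SCHEMA,
        t.TABLE_NAME
FROM    INFORMATION_SCHEMA.TABLES AS t
WHERE   t.TABLE_TYPE    = 'BASE TABLE'
  AND   t.TABLE_SCHEMA LIKE 'Sales%'    -- schema name pattern from configuration
  AND   t.TABLE_NAME   LIKE 'Order%';   -- table name pattern from configuration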

The solution can handle a single ETL run of both files and tables, with multiple database sources and multiple file locations and search patterns.
It is important to understand that taking on large flat files might not be very efficient with this solution; if you are expecting files with more than 300,000 rows you might want to load them separately, unless they will be delta-populated from your source system.


Let's step through the process flow in the picture below:



Load Configuration:


This sub-process creates the configuration tables in your database if necessary and loads them into two result-set object variables that are used to loop through the ETL and contain the necessary indicators for the solution to execute either the SQL Server ETL or the CSV file ETL.
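For illustration only, the configuration tables could look something like the sketch below; the table and column names here are my assumptions for the example, not the actual objects in the solution.

-- Hypothetical configuration tables that feed the two result-set variables.
CREATE TABLE dbo.ETL_SqlSourceConfig
(
    ConfigID      INT IDENTITY(1,1) PRIMARY KEY,
    ServerName    NVARCHAR(128) NOT NULL,
    DatabaseName  NVARCHAR(128) NOT NULL,
    SchemaPattern NVARCHAR(128) NOT NULL DEFAULT '%',   -- wildcard table filter
    TablePattern  NVARCHAR(128) NOT NULL DEFAULT '%',   -- wildcard table filter
    IsEnabled     BIT           NOT NULL DEFAULT 1
);

CREATE TABLE dbo.ETL_CsvSourceConfig
(
    ConfigID        INT IDENTITY(1,1) PRIMARY KEY,
    FolderPath      NVARCHAR(260) NOT NULL,
    FileNamePattern NVARCHAR(128) NOT NULL,              -- e.g. 'Customer*.csv'
    ColumnDelimiter NVARCHAR(4)   NOT NULL DEFAULT ',',
    IsEnabled       BIT           NOT NULL DEFAULT 1
);

Each enabled row would then end up in the corresponding result-set object variable that the loop iterates over.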



Pickup Source Type:

This is an evaluation made based on the SSIS variable sSourceType: if it is "SQL" then the LOAD SQL TABLE CONFIGURATION flow runs; if it is "CSV" then the LOAD CSV FILE CONFIGURATION flow runs.
At this point, don't get confused with package configurations; this step simply calls the correct control flow based on a precedence constraint.
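For readers new to SSIS: assuming the variable is User::sSourceType, the two precedence constraints would carry expressions along the lines of @[User::sSourceType] == "SQL" and @[User::sSourceType] == "CSV", so only the matching branch of the control flow executes.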


SQL Source Database Processing:

The SQL Server ETL breaks the source tables and views up into their column definitions (name, data type and null-ability).
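A minimal sketch of the kind of metadata query that could drive this step is shown below; it runs against the source database, and the exact query used by the solution may differ.

-- Hypothetical example: capture name, data type and null-ability for every
-- source column so the table can be recreated in the ODS later on.
SELECT  c.TABLE_SCHEMA,
        c.TABLE_NAME,
        c.COLUMN_NAME,
        c.ORDINAL_POSITION,
        c.DATA_TYPE,
        c.CHARACTER_MAXIMUM_LENGTH,
        c.IS_NULLABLE
FROM    INFORMATION_SCHEMA.COLUMNS AS c
ORDER BY c.TABLE_SCHEMA, c.TABLE_NAME, c.ORDINAL_POSITION;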

It then populates the ODS configuration tables so that each table can be recreated later.
After this the ETL runs a generic ETL per table that loads a vertical table, grouped per source table name, server and database.

Once the above is completed, each table is recreated as per its source type and the vertical data stage table is used to insert the correct data into the newly recreated tables.
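To make the "vertical table" idea concrete, here is a sketch of what such a stage table could look like; the name and shape are assumptions for the example.

-- Hypothetical vertical (name/value) stage table: one row per column value,
-- grouped per source server, database and table, so the data can later be
-- pivoted back into the recreated ODS tables.
CREATE TABLE dbo.ODS_VerticalStage
(
    ServerName   NVARCHAR(128) NOT NULL,
    DatabaseName NVARCHAR(128) NOT NULL,
    SchemaName   NVARCHAR(128) NOT NULL,
    TableName    NVARCHAR(128) NOT NULL,
    SourceRowID  BIGINT        NOT NULL,   -- keeps the values of one source row together
    ColumnName   NVARCHAR(128) NOT NULL,
    ColumnValue  NVARCHAR(MAX) NULL        -- everything lands as text, typed on the way out
);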



CSV Source File Processing:

The difference between the SQL processing and the CSV processing is that the source files will not be identified for their "true" data types, as we do not have rich metadata and there is no guarantee that the source files won't change or will keep containing the same data types. However, once all source files have been inserted into the data stage tables, the ODS table is updated to optimize the consumption of space, and the definitions are loaded into the configuration tables, where data types can be manually reconfigured and the same relationship mapping can be configured as in the SQL database processing above.
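As an example of the "optimize the consumption of space" step, once a file has landed as text the columns can be narrowed or re-typed to what the data actually needs; something along these lines, where the table and column names are purely illustrative:

-- Hypothetical clean-up after a CSV load: the columns arrive as wide text,
-- then get narrowed or re-typed once the data has been profiled.
ALTER TABLE dbo.Customer_File ALTER COLUMN CustomerName NVARCHAR(100) NULL;
ALTER TABLE dbo.Customer_File ALTER COLUMN CreatedDate  DATE          NULL;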



At this stage the ODS is staged and the data take-on is completed.
Part three will deal with how to configure filters and key management in the ODS, as well as how to expand the configuration of relationship mapping in your ODS.

Part two of the series will cover the SSIS package and code review; it will be out in the last week of November 2014.


