This is going to be a long post, but I see folks working way harder than needed to extract data for BI systems. Given the large quantities of data available in source systems, it is simply not practical to transfer all of the data to the Business Intelligence solution on each extract. Incremental extracts are needed to cut the quantity of data transferred down to a manageable level.
An incremental extract is simply the output of a program that pulls data from a source system for a specified date range.
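For example, the heart of such a program might be nothing more than a date-bounded query. Here is a minimal sketch in T-SQL; the sales_facts table, the sale_date column, and the parameter values are illustrative, not from any particular system:

-- sketch of the core of an incremental extract; in real use the two
-- dates would be supplied at run time rather than hard-coded
declare @startDate date = '2013-08-07'
declare @endDate date = '2013-08-08'

select *
from sales_facts
where sale_date >= @startDate
and sale_date < dateadd(d, 1, @endDate) -- half-open range picks up the entire end day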
It might be tempting to create an extract program which simply extracts data from the prior day. In actual practice, this is often an undesirable approach. It is not unusual to encounter a situation where a given day’s or range of days’ worth of data needs to be re-extracted.
A much better way to implement an extract program is to provide the start and end dates at run time. This can be done by providing them as command line arguments when the extract program is invoked, or by having the extract program pick up the start and end dates from a database table or environment variables.
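If the database-table route is taken, a small control table is all that is needed. A sketch, with the extract_control table and its columns made up for illustration:

-- one row per extract; the load process would update the dates
-- after each successful run
create table extract_control (
    extract_name varchar(50) not null primary key,
    start_date date not null,
    end_date date not null
)

-- the extract program reads its date range at run time
select start_date, end_date
from extract_control
where extract_name = 'salesExtract'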
When extract programs are designed to use start and end dates that are provided at run time, it is possible to query the existing data repository for the most recent date loaded by that extract. This yields an extract facility which is, to a great degree, adaptive to the needs of the database. If a network failure prevents the data transfer for an entire day, no manual intervention is necessary. The next day when the extract runs, it will extract data for two days instead of one. If you have to delete a week’s worth of data because of a problem with a source system, then the next execution of the extract will retrieve the appropriate amount of data.
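One wrinkle worth handling: on the very first run the warehouse table is empty, max() returns NULL, and the generated command line would be malformed. Wrapping the high-water-mark query in a coalesce with a seed date covers that case. For example, using the same sales_facts table and timestamp column as the query further down (the seed date itself is arbitrary):

-- fall back to an initial seed date when the warehouse table is empty
select coalesce(convert(char(10), max(timestamp), 101), '01/01/2013')
from sales_facts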
While there are several ways to implement this type of extract, one commonly supported approach is to create the extract program so that it gets the start and end dates as command line arguments. Command line scripts can then be used to find the appropriate start and end dates and then build the command line to start the program.
Using a Windows / SQL Server environment as an example, the following SQL statement could be used to create a command line to start an extract program that would pull whatever data from a sales_facts table in the source system is not currently in the data warehouse:
set nocount on
select 'salesExtract '
+ (select convert(char(10), max(timestamp), 101) from sales_facts) + ' '
+ (select convert(char(10), dateadd(d, -1, getdate()), 101))
The above query would create the following output:
salesExtract 08/07/2013 08/08/2013
Assuming the above query was saved in a file named “salesExtract.sql”, a batch file could be used to:
- Create a batch file which would start the extract program
- Start the extract program (which would create a text file in the landing area)
- Delete the batch file
The following script would be saved as a batch file named “salesExtract.bat”, and would perform the steps outlined above.
sqlcmd -h -1 -S "localhost" -d "biDB" -i "salesExtract.sql" -o "salesExtractCurrent.bat"
call salesExtractCurrent.bat
del salesExtractCurrent.bat
In actual use, a job scheduler would be used to start the “salesExtract.bat” batch file, which would use the “salesExtract.sql” file to create a temporary batch file (“salesExtractCurrent.bat”) that it would then run. (The call is needed so that control returns to the outer batch file for the cleanup step.) Once the extract was created in the landing zone, the salesExtractCurrent.bat batch file would be deleted.
As previously mentioned, there are several ways to create an extract facility that will retrieve all of the data needed to bring the data repository current. The above example illustrates one approach that works in a variety of environments. However it is implemented, making the extract facility do more than simply retrieve the prior day’s activity will make the finished solution much more robust.