Change Data Capture

Data needs to be extracted periodically from the source system(s) and transformed to the data warehouse. This process is commonly referred to as refreshing the data warehouse. The most efficient refresh method is to extract and transform only the data that has changed since the last extraction.

The Change Data Capture technique in the Cúram Business Intelligence and Analytics infrastructure identifies and processes only the data that has changed in each of the tables in a database and makes the changed data available to the data warehouse. Cúram Business Intelligence and Analytics infrastructure has been designed with the intention that the refresh will take place on a nightly basis. However, the implementation is flexible and it is possible to run the refresh at a different frequency.

Change Data Capture include using a control table which stores a last written date for each table that is being populated. When an ETL runs, the last written field for that table is also updated. The next time the ETL runs, it first reads from this control table and then extracts the data that has been updated since the previous ETL run.

It is important to note that for change data capture to work, all the last written fields must be populated in the source tables that the Business Intelligence infrastructure extracts data from. As already stated, a row in the ETL Control table is updated before and after every ETL run for the table which is being updated. This works having the ETLs call a pre-mapping transformation to read the previous last written date and setting the extract time. After the ETL has run, a post-mapping transformation is called which updates the last written date to the current date.

After the ETL Control table has been initially populated with data, the last written date is reset to a start date to ensure that the ETLs extract all data updated after this date. It is possible to manually set the last written date for all tables in that database to the 1st of January, 1934.