Increasing ETL Throughput

Key concepts that a developer should know before creating an ETL process.

"The ETL development team is expected to create ETL jobs that obtain the maximum possible throughput. We recommend the following ten rules, which are applicable for hand-coded solutions as well as for various ETL tools, for boosting throughput to its highest level."

by Ralph Kimball

In this first theoretical "BI-LAB" article, I will talk about performance in ETL processes. Regardless of the technology used, performance degradation of DW architectures is a common problem, and their managers urgently want to shorten run times once load windows have skyrocketed.

In most cases (although exceptions exist), the main failure is to start developing an ETL process with a transactional mindset, without spending a minute thinking about how to build a massive, set-based solution. This lack of vision clearly affects the results.

Among the errors, we frequently find the abuse of UPDATE statements (some people love the UPDATE!), as well as the use of stored procedures in the middle of a process that should be set-based; bottlenecks in massive processes are the order of the day. Even more typical is retrieving every column of a table when only one is needed.

And then there are the orthodox who do not accept the use of temporary tables and look at you as if you were the devil when you mention them. Even when the input joins 20 tables, and even when execution plans show the benefit of resolving part of the load in temporary tables, they still refuse to use them.

The absence of bulk operations is constant; the performance difference between normal row-by-row inserts and disabling indexes plus bulk-dumping the data is huge.

I could go on with a long list of errors, but I would rather you read the following list, in which Ralph Kimball summarizes ten points for improving the performance of ETL development. I hope you learn as much from it as I did.

Reduce I/O:

Minimize the use of staging tables. Pipeline the ETL to keep the data in memory from the time it is extracted to the time it is loaded.

We do maintain the stance that I/O must be reduced to an absolute minimum. Obviously, you may need to touch down data for various reasons. The number-one permissible reason to touch down data is when you need to minimize access to the source system, or when your source system allows only one shot at retrieving the data you need from it.

In those cases, it’s good practice to write the extraction result set to disk as soon as it is retrieved. That way, in case of failure, you can always reprocess data from the saved copy instead of penetrating the source system again. Excessive I/O is a remarkably common offender. In most cases, intermediary tables or files can be omitted without any loss of functionality while their respective processes benefit from increased throughput. If you find yourself creating staging tables and many jobs to read and write to them, stop! Step back and analyze the total solution. By eliminating staging tables, you not only reduce I/O—the biggest performance hit—but also you reduce the number of jobs that need to be maintained and simplify the batch and scheduling strategy.
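The pipelining idea above can be sketched in a few lines of Python: chained generators keep each row in memory from extraction to load, with no intermediate staging table. The `extract`, `transform`, and `load` functions here are hypothetical stand-ins for real source and target access.

```python
# Minimal sketch of an in-memory ETL pipeline built from generators.
# Each stage pulls rows from the previous one; nothing is staged to disk.

def extract():
    # Stand-in for a source-system query; yields rows one at a time.
    for i in range(5):
        yield {"id": i, "amount": i * 10}

def transform(rows):
    # Apply business rules in memory; no staging table is written.
    for row in rows:
        row["amount_doubled"] = row["amount"] * 2
        yield row

def load(rows):
    # Stand-in for the final insert into the target warehouse table.
    return list(rows)

loaded = load(transform(extract()))
```

Because each stage is a generator, a row flows through all three steps before the next row is even extracted, which is exactly the "pipeline the ETL" behavior the rule describes.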

Eliminate database reads/writes:

When staging tables are necessary, use flat files instead of database tables when you must touch the data down to disk.

The ETL process often requires data to be touched down to disk for various reasons: to sort, aggregate, or hold intermediate calculations, or simply to retain for safekeeping. The ETL developer has a choice of using a database or flat files for these purposes. Databases require much more overhead than simply dumping data into a flat file, and ETL tools can manipulate data from a flat file just as easily as database data. Therefore, it’s a preferred practice to utilize sequential or flat files in the data-staging area whenever possible.
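As a small illustration of flat-file staging, the sketch below dumps an extract to a CSV file and reads it back, using only the Python standard library (`csv`, `tempfile`). The rows are illustrative; writing to a flat file avoids the logging, locking, and index-maintenance overhead a database staging table would incur.

```python
# Sketch: staging intermediate data to a flat file instead of a database table.
import csv
import tempfile

rows = [{"id": "1", "name": "Alice"}, {"id": "2", "name": "Bob"}]

# Dump the extract to a flat file -- no database overhead is incurred.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name"])
    writer.writeheader()
    writer.writerows(rows)
    staged_path = f.name

# A later job reads the staged file just as easily as a database table.
with open(staged_path, newline="") as f:
    staged = list(csv.DictReader(f))
```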

Filter as soon as possible:

Reduce the number of rows processed as far upstream in the process as you can. Avoid transforming data that never makes its way to the target data warehouse table.

This tip addresses what is possibly the most common mistake in ETL design. Whenever we conduct design reviews of existing ETL processes, one of the first things we look for is the placement of filters. A filter is a component that exists in most ETL products that applies constraints to the data after it’s been retrieved. Filters are extremely useful because in many cases you need to constrain on fields from the source system that are not indexed. If you were to apply the constraint in the extraction SQL on a nonindexed field, the source database would need to perform a full table scan, a horrendously slow process. Conversely, if the source system indexes the field you want to constrain, this would be the preferred place for filtering because you eliminate extracting unwanted records.
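The decision rule above can be made concrete with a small sketch: push the constraint into the extraction SQL when the column is indexed, otherwise extract and filter downstream in the ETL tool. The `orders` table and `status` column are hypothetical.

```python
# Sketch of filter placement: constrain in the source SQL only when the
# column is indexed, to avoid forcing a full table scan on the source.

def build_extract_sql(status_is_indexed: bool) -> str:
    base = "SELECT order_id, status FROM orders"
    if status_is_indexed:
        # Indexed column: let the source database eliminate rows cheaply.
        return base + " WHERE status = 'NEW'"
    # Unindexed column: pull the rows and filter in the ETL tool's
    # filter component instead.
    return base
```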

Partition and parallelize:

The best way to increase throughput is to have multiple processes process the data in parallel.

  • Parallelize the source system query with parallel DML.
  • Pipeline and parallelize transformations and staging.
  • Partition and load target tables in parallel.
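The three bullets above can be sketched with a partition-then-parallelize pattern. In a real warehouse the workers would be parallel DML sessions or partitioned loads; here a standard-library thread pool stands in for the parallel workers, and the row data is illustrative.

```python
# Sketch: partition the input and process the partitions in parallel.
from concurrent.futures import ThreadPoolExecutor

rows = list(range(100))
num_partitions = 4

# Partition the rows so each worker gets an independent slice.
partitions = [rows[i::num_partitions] for i in range(num_partitions)]

def load_partition(partition):
    # Stand-in for transforming and loading one partition.
    return sum(partition)

with ThreadPoolExecutor(max_workers=num_partitions) as pool:
    results = list(pool.map(load_partition, partitions))

total_loaded = sum(results)
```

The key design point is that the partitions are disjoint, so no coordination is needed between workers and throughput scales with the number of parallel streams.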

Update aggregates incrementally:

Rebuilding aggregates from scratch is a process-intensive effort that must be avoided. You should process deltas only and add those records to existing aggregates.

Aggregates are summary tables that exist in the data warehouse specifically designed to reduce query time. Aggregates make a dramatic performance gain in the data warehouse because queries that had to scan through hundreds of millions of rows now can achieve the same results by scanning a few hundred rows. This drastic reduction in rows is attributable to the ETL process combining additive facts in a mechanical rollup process. More complex summaries that depend on complex business rules are not what we call aggregates in the dimensional world. Remember that an aggregate is used in conjunction with a query rewrite capability that applies a fairly simple rule to judge whether the aggregate can be used rather than a dynamic aggregation of atomic data at query time.
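A minimal sketch of the incremental approach: roll only the day's delta facts into the existing aggregate, rather than rebuilding it from the atomic fact table. The products and amounts are illustrative.

```python
# Sketch: update an aggregate with delta records only.

# Existing aggregate: total sales by product.
aggregate = {"widget": 500.0, "gadget": 120.0}

# Today's delta facts from the incremental load.
delta_facts = [
    {"product": "widget", "amount": 25.0},
    {"product": "sprocket", "amount": 10.0},
]

# Roll the additive facts into the aggregate in place; a new product
# simply starts a new aggregate row.
for fact in delta_facts:
    aggregate[fact["product"]] = (
        aggregate.get(fact["product"], 0.0) + fact["amount"]
    )
```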

Take only what you need (columns and rows):

Similar to the filtering recommendation, do not retrieve rows unessential to the process. Likewise, do not select unessential columns.

It doesn’t make much sense to retrieve hundreds of thousands or millions (or even billions) of rows if only a few hundred of the records are new or have been modified since the last incremental ETL process. You must select a mechanism for retrieving deltas from the source system only. There are several ways to approach change-data capture depending on what’s available in the source transaction system.

Once you have the rows trimmed down to a manageable size for your incremental loads, you must next ensure that you don’t return more columns than necessary. Returning excessive columns is commonly encountered in look-ups in ETL tools. Some ETL tools automatically select all of the columns in a table whether they are needed or not when it is used for a look-up. Pay special attention to explicitly unselect columns that are not vital to the process.
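Both halves of this rule can be sketched in one query builder: a watermark predicate returns only the rows changed since the last run, and an explicit column list avoids dragging unneeded columns across the wire. The `customers` table, its columns, and the watermark value are hypothetical, and a real job would bind the timestamp as a query parameter rather than interpolating it.

```python
# Sketch: delta extraction with an explicit column list and a
# last-modified watermark -- never SELECT *.

def build_delta_query(last_run: str) -> str:
    return (
        "SELECT customer_id, email, updated_at "
        "FROM customers "
        f"WHERE updated_at > '{last_run}'"  # bind as a parameter in real code
    )

query = build_delta_query("2004-01-31 00:00:00")
```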

Bulk load/eliminate logging:

  • Utilize database bulk-load utility.
  • Turn off logging.
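A real job would invoke the database's bulk-load utility (a COPY-style loader) with logging minimized; as a portable stand-in, the sketch below batches the whole load into one set-based statement inside a single transaction, using the standard-library `sqlite3` module as the target.

```python
# Sketch: one batched, transactional load instead of row-at-a-time
# logged inserts. sqlite3 stands in for the warehouse's bulk loader.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")

rows = [(i, float(i)) for i in range(1000)]

# One batched statement in one transaction, not 1000 logged round trips.
with conn:
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```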

Drop database constraints and indexes:

Foreign key (FK) constraints are unnecessary overhead; they should be dropped—permanently (unless they are required by your aggregate navigator). If FKs are required, disable them before the ETL process and enable them as a post-process. Leave indexes for updates and deletes to support WHERE clauses only. Drop all remaining indexes for inserts. Rebuild all indexes as a post-process.
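The drop-load-rebuild pattern described above can be sketched end to end, again using the standard-library `sqlite3` module as a stand-in warehouse; table and index names are illustrative.

```python
# Sketch: drop the index before the load, bulk-insert, then rebuild
# the index as a post-process.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (id INTEGER, amount REAL)")
conn.execute("CREATE INDEX ix_fact_sales_id ON fact_sales (id)")

# Pre-process: drop the index so inserts skip index maintenance.
conn.execute("DROP INDEX ix_fact_sales_id")

with conn:
    conn.executemany(
        "INSERT INTO fact_sales VALUES (?, ?)",
        [(i, float(i)) for i in range(500)],
    )

# Post-process: rebuild the index once, after the load.
conn.execute("CREATE INDEX ix_fact_sales_id ON fact_sales (id)")

loaded_count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
index_names = [r[1] for r in conn.execute("PRAGMA index_list('fact_sales')")]
```

The payoff is that the index is built once over the finished table instead of being maintained incrementally on every insert.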

Eliminate network traffic:

Keep working files on local disk drives. Also, place the ETL engine on the data warehouse server.

Let the ETL system do the work:

Minimize any dependency on DBMS functionality. Avoid stored procedures, functions, database key generators, and triggers; determine duplicates in the ETL system instead.

ETL products are specifically designed to extract, transform, and load massive amounts of data like no other non-data-warehousing solution. With minimal exception, most databases are designed to support transactional and operational applications. Database-procedural programming is good for supporting data-entry applications but is not optimal for processing large data sets at once. The use of cursors—where each record is analyzed individually before moving on to the next—is notoriously slow and usually results in unacceptable performance while processing very large data sets. Instead of using procedures stored within the database, it’s beneficial to utilize the ETL engine for manipulating and managing the data.
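One concrete case of "let the ETL system do the work" is the database key generator mentioned above: instead of a sequence or trigger assigning surrogate keys row by row, the ETL layer can assign them in a single in-memory pass. The rows and the starting key are illustrative; a real job would seed the counter from the current maximum key in the target dimension.

```python
# Sketch: surrogate key assignment in the ETL layer, replacing a
# database sequence or trigger. itertools.count is the stdlib counter.
from itertools import count

incoming = [{"natural_key": "A"}, {"natural_key": "B"}, {"natural_key": "C"}]

# Seed from the current max surrogate key in the dimension (assumed 100).
next_key = count(start=101)

keyed = [dict(row, surrogate_key=next(next_key)) for row in incoming]
```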

Ralph Kimball

*Kimball & Caserta, The Data Warehouse ETL Toolkit, Indianapolis, Wiley, 2004.