(Pentaho) Stress Test 4: Flat to ORACLE

Interaction between Pentaho and Oracle via Conventional Inserts. Performance comparison between the results obtained by Talend, Datastage y Pentaho. PDI Peformance check list, good practices and development tips implemented, that improve the performance of the Data flow.


In previous articles, Datastage was who remained in first place in the Test: Impact over Oracle (without using BULK). In this case, we used Pentaho to load the data, using Conventional Inserts, and to compare the results against previous tests we have done. Until now, we have replicated this test with Talend (Bulk Load & Conventional Inserts) (Talend) Stress Test 4: Flat to ORACLE and Datastage (only Inserts) (IBM DS 8.1) Stress Test 4: Flat to ORACLE. The times obtained were good, but remained behind their competitorswith 196 seconds.

TIMES:

When developing this test, different techniques were applied to give the best Performance to the Data Flow, among them, we can mention,CSV Steps, Launch several copies of a step, Lazy Conversion, Commit Size and others.

You can find many more tips on PDI Performance tuning check-list mentioned by Matt Casters. who in March published on 10 years of Kettle (Celebrating 10 years of kettle coding)

Matt Casters

Lazy Conversion

In these cases where you are reading data from a text file and you write the data back to a text file, you can use Lazy conversion to speed up the process. The principle behind it is that it delays data conversion with the hope that it doesn't take place at all. (reading from a file and writing it back comes to mind)
Beyond helping with data conversion, lazy conversion also helps to keep the data in "binary" storage form. This in turn will help the internal Kettle engine to do faster data serialization. (sort, clustering, etc)

The Lazy Conversion option is available in the "CSV Input" and "Fixed input" text file reading steps.

Several Copies of Step

There are 2 main reasons why launching multiple copies of a step might result in better performance:

- The step uses a lot of CPU and you have multiple processor cores in your computer.

- The step has to deal with network latencies and launching multiple copies can reduce the average latency. If you have a low network latency of say 5ms and you need to do a round trip to the database, the maximum performance you get is 200 (x5) rows per second, even if the database is running smoothly. You can try to reduce the round trips with caching, but if not, you can try to run multiple copies. Example: a database lookup or table output
- Please consider the drawback of running multiple steps in parallel, since they consume more CPU threads

CSV Step

It might make a big difference if you use the new "CSV Input" or "Fixed Input" steps as they have been re-written for optimal performance.
If you have a fixed width (field/row) input file, you can even read data in parallel. (multiple copies)

These new steps have been rewritten with Non-blocking I/O (NIO) features. Typically, the larger the NIO buffer you specify in the step, the better your read performance will be

ARCHITECTURE:

Environment: Infraestructure composed of 3 nodes

1) ESXi 5.0:

a) Physical Datastore 1: VM ETL Pentaho (10GB RAM - 2 Cores * 2 Sockets)

b) Physical Datastore 2: VM Database Server Oracle (6GB RAM - 2 Cores * 2 Sockets)

2) Monitor Performance: VM Monitor ESXi + SQL Server 2008 (with 4 GB RAM)

3) Operator ETL: ESXi Client (with 4 GB RAM)

CASE 1: CONVENTIONAL INSERTS

Objective:

- To measure elapsed time reading and writing 6 million rows, from Flat file, to Oracle Database.

- Analyze use of the resources.

ETL Tool Pentaho 4.0
Rows: 6.024.000 M
Columns: 37 Columns

Structure:

(Metadata)

Design & Run

LOG

Log:

Elapsed time (s) 196 Secs
Rows p/s (avg)

30.762 r/s avg

How to Improve

Performance

- Adjust the parameters:

- COMMIT Size

USE OF RESOURCES: VM PENTAHO

CPU:

CPU Monitoring, "Passive and Active state" in different executions. Last Execution: 19:34-19:37

Legend CPU:

Memory:

Memory Monitoring: Last Execution: 19:34-19:37

Legend Memory:

Network

Network Monitoring: Last Execution: 19:34-19:37

Legend Network

DataST

Legend DataStore

USE OF RESOURCES: VM ORACLE