Interaction between Pentaho and Oracle via conventional inserts. Performance comparison of the results obtained with Talend, DataStage, and Pentaho. PDI performance check-list, good practices, and development tips applied to improve the performance of the data flow.
In previous articles, DataStage held first place in this test: impact on Oracle without using BULK loading. This time we used Pentaho to load the data with conventional inserts and compared the results against the previous tests. So far we have replicated this test with Talend (bulk load and conventional inserts) in "(Talend) Stress Test 4: Flat to ORACLE" and with DataStage (inserts only) in "(IBM DS 8.1) Stress Test 4: Flat to ORACLE". The times obtained with Pentaho were good but, at 196 seconds, remained behind its competitors.
TIMES:
When developing this test, different techniques were applied to get the best performance out of the data flow, among them the CSV Input step, launching several copies of a step, lazy conversion, commit size, and others. You can find many more tips in the PDI performance tuning check-list by Matt Casters, who in March celebrated 10 years of Kettle (Celebrating 10 years of kettle coding).
Lazy Conversion
In cases where you read data from a text file and write the data back to a text file, you can use lazy conversion to speed up the process. The principle behind it is that data conversion is delayed in the hope that it never needs to take place at all (reading from a file and writing it straight back out is the typical case). The Lazy Conversion option is available in the "CSV Input" and "Fixed input" text file reading steps.
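As a rough sketch of the principle (not PDI's actual implementation; the class and field names below are made up), the idea is to keep each field as the raw bytes read from the file and only convert it to a typed value if a later step actually needs it; rows that are simply written back out are never converted:

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of lazy conversion: keep the raw bytes and
 *  convert to a typed value only if somebody actually needs it. */
class LazyField {
    private final byte[] raw;   // bytes exactly as read from the CSV file
    private Long parsed;        // cached typed value, filled on demand

    LazyField(byte[] raw) {
        this.raw = raw;
    }

    /** Writing the field back to a file needs no conversion at all. */
    byte[] asBytes() {
        return raw;
    }

    /** Conversion happens only here, the first time a step needs the number. */
    long asLong() {
        if (parsed == null) {
            parsed = Long.parseLong(new String(raw, StandardCharsets.UTF_8).trim());
        }
        return parsed;
    }
}

public class LazyConversionDemo {
    public static void main(String[] args) {
        LazyField id = new LazyField("42".getBytes(StandardCharsets.UTF_8));
        // Pass-through: no parsing cost is ever paid.
        System.out.write(id.asBytes(), 0, id.asBytes().length);
        System.out.println();
        // Only when a calculation needs the value is it converted.
        System.out.println(id.asLong() + 1);
    }
}
```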
Several Copies of a Step
There are two main reasons why launching multiple copies of a step might result in better performance:
- The step uses a lot of CPU and you have multiple processor cores in your computer.
- The step has to deal with network latency, and launching multiple copies can reduce the average latency. If you have a low network latency of, say, 5 ms and each row needs a round trip to the database, the maximum throughput you can get is 200 rows per second (1000 ms / 5 ms), even if the database itself is running smoothly. You can try to reduce the round trips with caching; if that is not possible, you can run multiple copies of the step. Typical examples are a Database Lookup or a Table Output step (see the sketch below).
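As a rough illustration of why extra copies help with latency (this is plain Java with simulated timings, not PDI's internals; the 5 ms delay and row counts are made up), the sketch below compares one worker against four workers doing the same simulated 5 ms round trip per row:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelCopiesDemo {
    // Simulate one database round trip of roughly 5 ms for a single row.
    static void lookupRow() throws InterruptedException {
        Thread.sleep(5);
    }

    // Process 'rows' rows using 'copies' parallel workers and report the rate.
    static void run(int rows, int copies) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(copies);
        long start = System.nanoTime();
        for (int i = 0; i < rows; i++) {
            pool.submit(() -> {
                try { lookupRow(); } catch (InterruptedException ignored) { }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d copies: %.0f rows/s%n", copies, rows / seconds);
    }

    public static void main(String[] args) throws Exception {
        run(1000, 1);   // ~200 rows/s: limited by the 5 ms round trip
        run(1000, 4);   // roughly 4x that, since round trips overlap
    }
}
```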
CSV Step
It can make a big difference to use the new "CSV Input" or "Fixed Input" steps, as they have been rewritten for optimal performance using non-blocking I/O (NIO) features. Typically, the larger the NIO buffer you specify in the step, the better your read performance will be.
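To illustrate why a larger NIO buffer helps (this is plain Java NIO, not the actual step code; the file name and buffer sizes are hypothetical), the sketch below reads a file through a ByteBuffer of configurable size, so larger buffers mean fewer, bigger reads:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NioReadDemo {
    // Count the bytes of a file, reading it through an NIO buffer of the given size.
    static long read(Path file, int bufferSize) throws IOException {
        long total = 0;
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocateDirect(bufferSize);
            while (channel.read(buffer) != -1) {
                buffer.flip();
                total += buffer.remaining();   // a real reader would parse rows here
                buffer.clear();
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get("source_6M_rows.csv");   // hypothetical input file
        for (int size : new int[]{8 * 1024, 512 * 1024}) {
            long start = System.nanoTime();
            long bytes = read(file, size);
            System.out.printf("buffer %4d KB: %d bytes in %.2f s%n",
                    size / 1024, bytes, (System.nanoTime() - start) / 1e9);
        }
    }
}
```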
ARCHITECTURE:
Environment: infrastructure composed of 3 nodes:
1) ESXi 5.0:
   a) Physical Datastore 1: VM ETL Pentaho (10 GB RAM, 2 cores x 2 sockets)
   b) Physical Datastore 2: VM Database Server Oracle (6 GB RAM, 2 cores x 2 sockets)
2) Performance monitor: VM Monitor ESXi + SQL Server 2008 (4 GB RAM)
3) ETL operator: ESXi client (4 GB RAM)
CASE 1: CONVENTIONAL INSERTS
Objective:
- Measure the elapsed time to read 6 million rows from a flat file and write them to an Oracle database.
- Analyze the use of resources.
ETL Tool: Pentaho 4.0
Rows: 6.024.000 (~6 million)
Columns: 37
Structure (Metadata):
Design & Run
Log:
Elapsed time: 196 seconds
Rows per second (avg): 30.762 r/s
How to Improve Performance
- Adjust the parameters, in particular the COMMIT size (see the sketch below).
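To show what the commit size controls (this is plain JDBC against a hypothetical table and connection string, not PDI's Table Output code), the sketch below batches the inserts and commits only every commitSize rows, reducing round trips and commit waits on Oracle:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class CommitSizeDemo {
    public static void main(String[] args) throws SQLException {
        int commitSize = 1000;   // hypothetical value; tune per test
        // Hypothetical connection details for the Oracle test database.
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@dbserver:1521:ORCL", "etl_user", "etl_pwd")) {
            con.setAutoCommit(false);
            String sql = "INSERT INTO stress_test (id, payload) VALUES (?, ?)";
            try (PreparedStatement ps = con.prepareStatement(sql)) {
                for (int i = 1; i <= 6000000; i++) {
                    ps.setInt(1, i);
                    ps.setString(2, "row " + i);
                    ps.addBatch();
                    if (i % commitSize == 0) {
                        ps.executeBatch();   // send the batch in one round trip
                        con.commit();        // commit once per 'commitSize' rows
                    }
                }
                ps.executeBatch();           // flush and commit the remainder
                con.commit();
            }
        }
    }
}
```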
USE OF RESOURCES: VM PENTAHO
CPU:
CPU monitoring, "passive and active state", across the different executions. Last execution: 19:34-19:37. CPU legend:
Memory:
Memory monitoring. Last execution: 19:34-19:37. Memory legend:
Network:
Network monitoring. Last execution: 19:34-19:37. Network legend:
Datastore:
Datastore legend:
USE OF RESOURCES: VM ORACLE