(Talend) Stress Test 3: Lookups & Filters (RELOAD)

Lab: a rerun of "(Talend) Stress Test 3: Lookups & Filters" on a new hardware platform, cutting the elapsed time to roughly one quarter of the original run.


In this new version of the article "(Talend) Stress Test 3: Lookups & Filters", we measure resource usage and how quickly the results become available when joining data from a flat file with a MySQL lookup table, filtering the flow and writing the results to another flat file, all on the new hardware architecture.

JVM arguments: after several changes in search of the best finish times, the parameters went from [Xms: 256M - Xmx: 1024M] to [Xms: 2048M - Xmx: 6144M], with a clear impact on the final results: from 112 s down to 30 s. This reduction of almost 3/4 of the elapsed time corresponds to the higher row rate [from 53,000 r/s to almost 200,000 r/s].
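The figures above can be cross-checked with a little arithmetic. The sketch below assumes the 6,024,000-row count stated for this test and derives the average rates from the two elapsed times. It also prints the maximum heap available to the running JVM, which is a quick way to confirm that an -Xmx setting took effect. The class and method names are illustrative only.

```java
class ThroughputCheck {

    // Average rows processed per second over a whole run.
    static double rowsPerSecond(long rows, double elapsedSeconds) {
        return rows / elapsedSeconds;
    }

    // Fraction of the original elapsed time that was saved (0.75 = three quarters).
    static double timeReduction(double beforeSeconds, double afterSeconds) {
        return 1.0 - afterSeconds / beforeSeconds;
    }

    public static void main(String[] args) {
        long rows = 6_024_000L; // row count stated for this test

        System.out.printf("old run: %.0f rows/s%n", rowsPerSecond(rows, 112));
        System.out.printf("new run: %.0f rows/s%n", rowsPerSecond(rows, 30));
        System.out.printf("time saved: %.1f%%%n", 100 * timeReduction(112, 30));

        // Maximum heap the current JVM may use; it roughly reflects -Xmx.
        long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("max heap: " + maxHeapMb + " MB");
    }
}
```

Because the elapsed times are rounded to whole seconds, the computed rates will only approximate the averages reported by the tool.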

Stepping away from this particular test for a moment: in real life the same dimension often has to be accessed on several occasions; Time lookups (LKs), for example, are shared by many models. Opening a new connection to these common tables (generally low in volume) for each model is redundant. It is advisable to centralize the loads, in this case by using a HashMap. What is a HashMap? It is a Java data structure held in RAM, so data access is much faster.
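To make the idea concrete, here is a minimal sketch of a dimension that is loaded once into a HashMap and then shared by several consumers. It is plain Java rather than Talend's own components, and the class name, keys and labels are invented for illustration. In a real job the rows passed to load() would come from a single database query.

```java
import java.util.HashMap;
import java.util.Map;

class SharedLookup {
    private final Map<Integer, String> rowsByKey = new HashMap<>();
    private int loadCount = 0;

    // Loads the dimension once; the source map stands in for a database query result.
    void load(Map<Integer, String> dimensionRows) {
        rowsByKey.clear();
        rowsByKey.putAll(dimensionRows);
        loadCount++;
    }

    // In-memory access by key; returns null when the key is absent.
    String find(int key) {
        return rowsByKey.get(key);
    }

    int loadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        // Hypothetical Time dimension rows keyed by a date key.
        Map<Integer, String> timeDimension = new HashMap<>();
        timeDimension.put(20120501, "2012-05 (May)");
        timeDimension.put(20120601, "2012-06 (June)");

        SharedLookup time = new SharedLookup();
        time.load(timeDimension);

        // Two different "models" reuse the same in-memory copy.
        System.out.println("sales model: " + time.find(20120501));
        System.out.println("stock model: " + time.find(20120601));
        System.out.println("loads performed: " + time.loadCount());
    }
}
```

The trade-off is memory: the whole dimension lives in the heap, which is reasonable for small tables and less so for large ones.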

Below is a link to an interesting article, "Efficient Lookups with Talend Open Studio's Hash Components" (http://bekwam.blogspot.com.ar/2012/05/efficient-lookups-with-talend-open.html), which works through a case of dimension access implemented both with table connections and with HashMaps.

In our particular case, given the very low volume of lookup data, the fast data access (0.16 s) and the wish to avoid adding more components to the process, we decided to access the database directly.

TIMES:

ARCHITECTURE:

Environment: Infrastructure composed of 3 nodes

1) ESXi 5.0:

a) Physical Datastore 1: VM ETL DS (10GB RAM - 2 Cores * 2 Sockets)

b) Physical Datastore 2: VM Database Server MySQL (6GB RAM - 2 Cores * 2 Sockets)

2) Monitor Performance: VM Monitor ESXi + SQL Server 2008 (with 4 GB RAM)

3) Operator ETL: ESXi Client (with 4 GB RAM)

CASE 1: -Xms2048M, -Xmx6144M

Objective:

To measure the elapsed time for reading 6 million rows from a flat file, joining the main flow with a lookup table (MySQL) and taking attributes from it.

Then filter the flow and write the result to a txt file.
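The shape of this job can be sketched in a few lines of plain Java. This is not the code Talend generates; it is an illustration under assumptions: rows are ';'-delimited, the join key is the first column, the lookup table is represented by an in-memory map instead of MySQL, and the filter keeps rows whose looked-up attribute equals a wanted value. All names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.HashMap;
import java.util.Map;

class LookupFilterJob {

    // Reads rows, joins each one to the lookup on its first column (inner join),
    // keeps rows whose attribute equals `wanted`, and writes them with the
    // attribute appended. Returns the number of rows read, before the filter.
    static long run(Reader in, Writer out, Map<String, String> lookup, String wanted) {
        long rowsRead = 0;
        try (BufferedReader reader = new BufferedReader(in);
             BufferedWriter writer = new BufferedWriter(out)) {
            String line;
            while ((line = reader.readLine()) != null) {
                rowsRead++;
                int sep = line.indexOf(';');
                String key = sep < 0 ? line : line.substring(0, sep);
                String attribute = lookup.get(key);
                if (attribute == null || !attribute.equals(wanted)) {
                    continue; // no lookup match, or rejected by the filter
                }
                writer.write(line + ";" + attribute);
                writer.write('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rowsRead;
    }

    public static void main(String[] args) {
        Map<String, String> lookup = new HashMap<>();
        lookup.put("1", "AR");
        lookup.put("2", "BR");

        StringWriter out = new StringWriter();
        long rowsRead = run(new StringReader("1;a\n2;b\n3;c\n1;d\n"), out, lookup, "AR");

        System.out.println(rowsRead + " rows read");
        System.out.print(out);
    }
}
```

For real files, a FileReader and FileWriter (or their java.nio equivalents) can be passed in place of the string-based reader and writer used in main.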

ETL Tool: Talend Open Studio 5.1 (CE)
Rows: 6,024,000
Columns: 37

Structure:

(Metadata)

Design & Run

LOG

Log:

Elapsed time: 30 s
Rows/s (avg): 199,000 r/s (before filter)

How to Improve Performance

- Adjust the JVM parameters: -Xms and -Xmx

USE OF RESOURCES: VM TALEND 64 Bits

TOTAL

Important: Memory Swap: 0

CPU/Datastore: CPU usage (MHz) / Datastore usage between 21:10 and 21:11

Memory: After several executions, memory consumption remains stable at 6.4 GB

CPU:

CPU Monitoring, "Passive and Active state" in different executions. Last Execution: 21:10-21:11

Memory:

Memory Monitoring: Last Execution: 21:10 - 21:11

Network

Network Monitoring: Last Execution: 21:10 - 21:11

DataST