Data Partitioning and Collecting in DataStage

Reading Time: 5 minutes

Partitioning mechanism divides a portion of data into smaller segments, which is then processed independently by each node in parallel. It helps make a benefit of parallel architectures like SMP, MPP, Grid computing and Clusters.

Collecting is the opposite of partitioning and can be defined as a process of bringing back data partitions into a single sequential stream (one data partition). 

etl process

Basically there are two methods or types of partitioning in Datastage:

Key less Partitioning – Partitioning is not based on the key column.

Key Based Partitioning – Partitioning is based on the key column.

Hash – In this method rows with same key column (or multiple columns) go to the same partition. Hash is very often used and sometimes improves performance, however it is important to have in mind that hash partitioning does not guarantee load balance and misuse may lead to skew data and poor performance.

Hash Partition guarantees that all records with same key column values are located in the same partition and are processed in the same node. 

Modulus – In this method data is partitioned on one specified numeric field by calculating modulus against number of partitions. Not used very often. 

Range – This is an expensive refinement to hash partitioning. It is similar to hash but
partition mapping is user-determined and partitions are ordered. Rows are distributed according to the values in one or more key fields, using a range map (the ‘Write Range Map’ stage needs to be used to create it). Range partitioning requires processing the data twice which makes it hard to find a reason for using it.

Same – In
this method existing partitioning remains unchanged. No data is moved between
nodes. Carries the previous partitioning.

Round robin –
In this method rows are alternated evenly across partitions. This partitioning
method guarantees an exact load balance (the same number of rows processed)
between nodes and is very fast.

Entire – In this method all rows from a dataset are distributed to each partition. Duplicated rows are stored and the data volume is significantly increased. All
records goes through one node and mirror image is replicated to all other
nodes. Never use this technique in case of funnel.

Random – As the name suggests it randomly distribute the data across all over the partitions and ensures approximately equal sized partition. The random partitioning has a slightly higher overhead than round robin because of the extra processing required calculating a random value for each record.

Auto – default.

It chooses the best partitioning method depending on:

The mode of execution of the current stage and the preceding
stage.

The number of nodes available in the configuration file.

Datastage Enterprise Edition decides between using Same or Round
Robin partitioning. Typically Same partitioning is used between two parallel
stages and round robin is used between a sequential and an EE stage.

Data collecting methods:

A collector combines partitions into a single sequential stream. Datastage
EE supports the following collecting algorithms:

Auto– the default algorithm reads rows from a partition as soon
as they are ready. This may lead to producing different row orders in different
runs with identical data. The execution is non-deterministic.

Round Robin– picks rows from input
partition patiently, for instance: first row from partition 0, next from
partition 1, even if other partitions can produce rows faster than partition 1.

Ordered– reads all rows from first partition, then second
partition, then third and so on.

Sort Merge– produces a globally sorted sequential stream
from within partition sorted rows. Sort Merge produces a non-deterministic on
un-keyed columns sorted sequential stream using the following algorithm: always
pick the partition that produces the row with the smallest key value. 

Partitioning

It is a technique of distributing the records across the nodes, based on partitioning techniques.

Partitioning Techniques
  • In addition, We have a 9th technique known as ‘AUTO’

  NOTE:

  • Partitioning techniques plays an important role in Performance Tuning

  Note:-

– – > Key-based technique assures that the same key column values are collected at the same partition.  

Ex:-

 EMP

DNO= Primary key  

E NOE NameDNO
11a10
12b20
13c10
14d30
15e20
D NO D Name Loc 
10ACEHyd
20MeterSec
30SalesEng

  When combine, I.e, using a horizontal combination

Horizontal combining

That is Same key column values are collected at the same partition  

Repatriating

The Portioned data is once again repatriated

Ex:  

ENameDnoLoc
A10AP
B20TN
C10TN
D20KN
E30TN
F10KN
G20AP
Repatriating
  • Partitioning and Repatriating are automatic processes in the Data stage

  Reverse Partitioning

  • Reverse Partitioning is collecting the data from the nodes.
  • It happens only in 1 Situation that is Parallel to Sequential.
Reverse Partitioning

Reverse Partitioning is also called as Collecting  

Different Collecting Methods

  1. Ordered
  2. Round Robin
  3. Sort – Merge
  4. Auto