DataStage Parallel Processing

Reading Time: 3 minutes

The simultaneous use of more than one CPU or processor core to execute a program or multiple computational threads is called parallel processing, or parallelism. Ideally, parallel processing makes programs run faster because there are more engines (CPUs or cores) running them. As you all know, DataStage supports two types of parallelism:

1. Pipeline parallelism  

2. Partition parallelism

Pipeline parallelism

In pipeline parallelism all stages run concurrently, even in a single-node configuration. As data is read from the source, it is passed to the next stage for transformation and from there to the target. Instead of waiting for all the source data to be read, rows are passed to the subsequent stages as soon as the source data stream starts to produce them. This method is called pipeline parallelism, and all three stages in our example operate simultaneously regardless of the degree of parallelism in the configuration file. The Information Server engine always executes jobs with pipeline parallelism.

If you ran the example job on a system with multiple processors, the reading stage would start on one processor and begin filling a pipeline with the data it had read. The Transformer stage would start running as soon as there was data in the pipeline, process it, and start filling another pipeline. The stage writing the transformed data to the target database would likewise start writing as soon as data became available. Thus all three stages operate simultaneously.

As shown in the diagram below, the first record is inserted into the target even while the other records are still being extracted and transformed.
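To make the idea concrete, here is a minimal Python sketch of pipeline parallelism, not the DataStage engine itself: three stages run as concurrent threads connected by in-memory queues, so each row flows downstream as soon as it is produced. The stage functions and queue wiring are illustrative assumptions, not part of any DataStage API.

```python
import threading
import queue

SENTINEL = object()  # marks the end of the data stream

def read_stage(out_q):
    # Hypothetical source stage: emits integers as stand-ins for rows.
    for row in range(10):
        out_q.put(row)
    out_q.put(SENTINEL)

def transform_stage(in_q, out_q):
    # Starts processing as soon as the first row arrives in its pipeline.
    while True:
        row = in_q.get()
        if row is SENTINEL:
            out_q.put(SENTINEL)
            break
        out_q.put(row * 10)  # stand-in for a real transformation

def write_stage(in_q):
    # Writes each transformed row as soon as it becomes available.
    while True:
        row = in_q.get()
        if row is SENTINEL:
            break
        print("loaded:", row)  # stand-in for writing to the target

q1, q2 = queue.Queue(), queue.Queue()
stages = [
    threading.Thread(target=read_stage, args=(q1,)),
    threading.Thread(target=transform_stage, args=(q1, q2)),
    threading.Thread(target=write_stage, args=(q2,)),
]
for t in stages:
    t.start()
for t in stages:
    t.join()
```

All three stages are alive at the same time; the first rows reach the "target" while later rows are still being read, which is exactly the behaviour described above.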

Partition parallelism

Partition parallelism comes into play when large volumes of data are involved. The data is partitioned into a number of separate sets, with each partition handled by a separate instance of the job stages. Partitioning is accomplished at run time, rather than through the manual process that traditional systems would require.

The DataStage developer only needs to specify the algorithm to partition the data, not the degree of parallelism or where the job will execute; that is, the appropriate partitioning method can be chosen for each stage. Using partition parallelism, the same job is effectively run simultaneously by several processors, each handling a separate subset of the total data. At the end of the job the data partitions can be collected back together again and written to a single data source.
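As a rough illustration only (a hand-rolled Python sketch, not how the parallel engine or a configuration file actually works), the snippet below splits rows with a simple modulus partitioner, runs the same stage logic on each partition in a separate worker process, and then collects the partitions back together.

```python
from multiprocessing import Pool

def modulus_partition(rows, num_partitions):
    # Simple modulus partitioner: assigns each row to a partition by its key.
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[row % num_partitions].append(row)
    return partitions

def run_stage_instance(partition):
    # Each worker runs the same "stage" logic on its own subset of the rows.
    return [row * 10 for row in partition]

if __name__ == "__main__":
    rows = list(range(20))
    partitions = modulus_partition(rows, num_partitions=4)
    # One process per partition, all running the same job logic in parallel.
    with Pool(processes=4) as pool:
        results = pool.map(run_stage_instance, partitions)
    # Collect the partitions back together, as a collector would at job end.
    collected = [row for part in results for row in part]
    print(collected)
```

The developer-facing choice here is only the partitioning function; how many workers run and where they run is a separate, runtime concern, which mirrors the separation DataStage makes between job design and the configuration file.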

Parallel-processing environments can be categorized as follows:

Symmetric Multiprocessing (SMP) – Some hardware resources may be shared among processors. The processors communicate via shared memory and run a single operating system. SMP is better suited than MPP systems for online transaction processing, where many users access the same database to perform searches with a relatively simple set of common transactions.

Cluster or Massively Parallel Processing (MPP) – Known as a shared-nothing architecture, in which each processor has exclusive access to its hardware resources. Cluster systems can be physically dispersed. The processors have their own operating systems and communicate via a high-speed network.