Parallel processing, or parallelism, is the simultaneous use of multiple CPUs or processor cores to execute a program or its computational threads. Programs run faster because more engines (CPUs or cores) are working on them at once. DataStage supports two types of parallel processing: pipeline parallelism and partition parallelism.
In pipeline parallelism, all stages run concurrently, even in a single-node configuration. As data is read from the source, it is passed to the next stage for transformation and then on to the target. The source stage starts producing rows as soon as it begins reading, and those rows flow immediately to the downstream stages. All three stages operate simultaneously, regardless of the degree of parallelism in the configuration file.
If you ran the example job on a system with multiple processors, the reading stage would start on one processor and begin filling a pipeline with the data it had read. The Transformer stage would start running as soon as there was data in the pipeline, process it, and start filling another pipeline. The stage writing the transformed data to the target database would likewise start writing as soon as data became available.
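The behavior described above can be sketched outside DataStage with ordinary threads and bounded queues. This is an illustrative analogy, not DataStage code; the stage names and row values are invented. The point is that the transform and write stages begin work as soon as the first rows arrive, rather than waiting for the read stage to finish:

```python
# Pipeline-parallelism analogy: three "stages" run concurrently,
# connected by bounded queues, so downstream stages start as soon
# as the first rows appear in the pipeline.
import threading
import queue

SENTINEL = object()  # marks the end of the data stream

def read_stage(out_q):
    for row in range(5):          # stand-in for reading a source table
        out_q.put(row)
    out_q.put(SENTINEL)

def transform_stage(in_q, out_q):
    while (row := in_q.get()) is not SENTINEL:
        out_q.put(row * 10)       # stand-in for a Transformer stage
    out_q.put(SENTINEL)

def write_stage(in_q, sink):
    while (row := in_q.get()) is not SENTINEL:
        sink.append(row)          # stand-in for writing to the target

pipe1, pipe2, target = queue.Queue(maxsize=2), queue.Queue(maxsize=2), []
stages = [
    threading.Thread(target=read_stage, args=(pipe1,)),
    threading.Thread(target=transform_stage, args=(pipe1, pipe2)),
    threading.Thread(target=write_stage, args=(pipe2, target)),
]
for t in stages:
    t.start()
for t in stages:
    t.join()
print(target)  # → [0, 10, 20, 30, 40]
```

The small queue size (maxsize=2) mirrors a real pipeline: rows flow through in a steady stream instead of accumulating fully between stages.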
Partition parallelism is used for handling large volumes of data. The data is divided into several separate sets (partitions), and each partition is handled by a separate instance of the job's stages. Partitioning is performed at run time, so there is no need for the manual data-splitting steps that traditional systems require.
The DataStage developer only needs to specify the algorithm used to partition the data, not the degree of parallelism or where the job will execute. With partition parallelism, the same job is effectively run simultaneously by several processors, each handling a separate subset of the total data. At the end of the job, the partitions can be collected back together again and written to a single target.
Parallel-processing environments fall into two categories: symmetric multiprocessing (SMP) systems, where multiple processors share a single operating system and common memory, and cluster or massively parallel processing (MPP) systems, where each node has its own processors and memory and nodes communicate over a network.
DataStage, IBM's ETL (Extract, Transform, Load) tool, relies heavily on parallel processing to deliver performance and efficiency on large data volumes. A parallel job is broken down into smaller, manageable tasks that are distributed across multiple processors or nodes, allowing the job to complete faster than if it were executed sequentially.
To configure parallel processing in DataStage, you need to understand two concepts: granularity and degree of parallelism. Granularity is the size of the tasks assigned to each processor or node; the degree of parallelism is the number of processors or nodes available to execute the job.
In DataStage there is no job-level directive for setting the number of processors. Instead, the degree of parallelism is determined by the number of logical nodes defined in the parallel configuration file, which the engine locates through the APT_CONFIG_FILE environment variable. A minimal two-node sketch (host and path names are placeholders):

```
{
    node "node1"
    {
        fastname "etl_host"
        pools ""
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "etl_host"
        pools ""
        resource disk "/ds/data" {pools ""}
        resource scratchdisk "/ds/scratch" {pools ""}
    }
}
```

With this file in effect, a job runs two-way parallel; adding more node definitions increases the degree of parallelism without changing the job design.
DataStage parallel processing offers significant benefits for ETL jobs dealing with large data volumes. By understanding how pipeline and partition parallelism work, configuring granularity and the degree of parallelism appropriately, and weighing the benefits against the limitations, you can optimize your DataStage jobs for performance and efficiency.