Runtime Column Propagation in DataStage - Data Warehousing

InfoSphere DataStage offers flexibility when it comes to handling metadata. It can manage situations where the metadata is not fully defined.

When data is sent from source to target, a job often needs to operate on only some of the columns. You can define just that portion of your schema and specify that, if the job encounters additional columns at runtime that are not defined in the metadata, it adopts these extra columns and propagates them through the rest of the job. This process is called Runtime Column Propagation (RCP).

RCP can be enabled for a project via the Administrator client, and set for individual links via the Output Page Columns tab for most stages, or in the Output page General tab for Transformer stages.

RCP can be set at three levels:

Project level: Administrator client, project properties
Job level: Job properties, General tab
Stage level: Output page, Columns tab of the link

If runtime column propagation is enabled in the DataStage Administrator, you can select the Runtime column propagation check box on a link to specify that columns encountered by a stage in a parallel job can be used even if they are not explicitly defined in the metadata. Runtime column propagation must be turned on if you want to use schema files to define column metadata.

Runtime column propagation is useful when only a partial schema is available: we define only the columns we need to process, and all other columns are propagated to the target as they are.
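The partial-schema behaviour described above can be modelled in a few lines of Python. This is only a conceptual sketch, not DataStage code: `run_stage`, `stage_columns` and the sample row are hypothetical names invented for illustration, and the assumption is simply that a stage transforms the columns it has defined and, with RCP on, passes every other column through untouched.

```python
from typing import Any, Callable, Dict

Row = Dict[str, Any]


def run_stage(row: Row,
              defined_columns: Dict[str, Callable[[Any], Any]],
              rcp_enabled: bool) -> Row:
    """Model one stage: transform defined columns, optionally propagate the rest.

    defined_columns maps a column name to the operation the stage applies to it.
    Columns missing from that mapping are copied through unchanged when
    rcp_enabled is True and dropped when it is False.
    """
    output: Row = {}
    for name, value in row.items():
        if name in defined_columns:
            output[name] = defined_columns[name](value)
        elif rcp_enabled:
            output[name] = value  # undefined column, propagated as it is
    return output


# Hypothetical source row: the stage only knows about "name".
source_row = {"customer_id": 42, "name": " alice ", "region": "EMEA"}
stage_columns = {"name": lambda v: v.strip().title()}

with_rcp = run_stage(source_row, stage_columns, rcp_enabled=True)
without_rcp = run_stage(source_row, stage_columns, rcp_enabled=False)

print(with_rcp)
print(without_rcp)
```

With RCP enabled in this model, `customer_id` and `region` reach the output even though the stage never mentions them; with it disabled, only `name` survives.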

Using RCP with Sequential Stages

Runtime column propagation (RCP) provides DataStage with flexibility regarding the columns defined in a job.

If RCP is enabled for a project, you can just define the columns you are interested in using in a job, but ask DataStage to propagate the other columns through the various stages.

Such columns can be extracted from the data source and end up on your data target without being explicitly operated on in between. However, sequential files have no inherent column definitions, so DataStage cannot always tell where there are extra columns that need propagating.

To use RCP with sequential files, you must use the Schema File property to specify a schema that describes all the columns in the sequential file. You should specify the same schema file for any similar stages in the job where you want to propagate columns.

The stages that require a schema file in order to use RCP are:

Sequential File
File Set
External Source
External Target
Column Import
Column Export
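To make the role of the schema file concrete, here is a small Python sketch that renders a schema description for a headerless delimited file and then uses the same column list to name the fields of one line. The `record ( name:type; )` layout and the `final_delim`, `delim` and `quote` properties follow the general form of parallel-job schema files, but treat the exact syntax as an assumption to verify against the IBM documentation. The column names, types and helper functions are hypothetical.

```python
from typing import Dict, List, Tuple

Column = Tuple[str, str]  # (column name, schema type)


def render_schema(columns: List[Column], delim: str = ",") -> str:
    """Build schema-file text that describes every column in the file."""
    lines = [f"record {{final_delim=end, delim='{delim}', quote=double}}", "("]
    for name, col_type in columns:
        lines.append(f"  {name}:{col_type};")
    lines.append(")")
    return "\n".join(lines)


def name_fields(line: str, columns: List[Column], delim: str = ",") -> Dict[str, str]:
    """Attach column names to a headerless line; impossible without the schema."""
    values = line.split(delim)
    if len(values) != len(columns):
        raise ValueError("line does not match the schema column count")
    return {name: value for (name, _), value in zip(columns, values)}


# Hypothetical full column list for the file, including columns the job
# never operates on but still wants propagated.
ALL_COLUMNS: List[Column] = [
    ("customer_id", "int32"),
    ("name", "string[max=50]"),
    ("region", "string[max=10]"),
]

schema_text = render_schema(ALL_COLUMNS)
row = name_fields("42,Alice,EMEA", ALL_COLUMNS)

print(schema_text)
print(row)
```

The point of the sketch is that a raw line such as `42,Alice,EMEA` carries no column names of its own, so the complete list has to come from the schema before any extra columns can be recognised and propagated.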

Understanding Runtime Column Propagation in DataStage

In the realm of ETL (Extract, Transform, Load) tools, DataStage by IBM stands out as a powerful solution for data integration. One of its key features is the ability to handle and propagate columns dynamically at runtime, a functionality we refer to as 'Runtime Column Propagation'. This article aims to shed light on this important aspect of DataStage.

What is Runtime Column Propagation?

Runtime Column Propagation lets columns that are created or selected in one stage, including columns not explicitly defined in the job's metadata, flow through to the subsequent stages of the job. A job can therefore be designed around only the columns it operates on and still adapt when the input carries additional ones.

When to Use Runtime Column Propagation

Runtime Column Propagation can be particularly useful in scenarios where the structure of the input data is not fixed or predictable. For example:

Data from multiple sources: If your pipeline ingests data from several sources whose structures vary, Runtime Column Propagation lets the same job handle the differing column sets.
Dynamic transformation rules: If the transformation rules for your data change during runtime or are determined at runtime, you can take advantage of this feature to adapt your pipeline accordingly.

How to Configure Runtime Column Propagation

To enable Runtime Column Propagation in DataStage, follow these steps:

Create a new DataFlow or open an existing one.
Identify the task where columns need to be made available for propagation.
Configure the task to create or select the necessary columns. Select the 'Add to Propagated Columns' checkbox for these columns.
Use the propagated columns in subsequent tasks as needed.

Best Practices for Using Runtime Column Propagation

Minimize the use of Runtime Column Propagation to avoid unnecessary complexity and potential performance issues.
Document your DataFlow to make it clear which columns are propagated, when, and why.
Test your pipelines thoroughly to ensure data integrity and processing accuracy.