Remove Duplicates Stage in DataStage - A Comprehensive Guide

The Remove Duplicates stage is a crucial processing stage in Data Warehousing, particularly when dealing with large datasets. This stage can have a single input link and a single output link, processing a single sorted dataset as input, eliminating duplicate rows, and writing the results to an output dataset.

For optimal performance, it's recommended that the input data is already sorted for this stage, ensuring that all records with similar key values are adjacent. In case sorting is not done prior, a 'Link Level Sort' can be performed instead of adding a separate ‘Sort stage’.

Properties tab

Key - Specifies the key column for the operation. This property can be repeated to specify multiple key columns.

Options category

Duplicate to retain - Specifies which of the duplicate columns encountered to retain. Choose between 'First' and 'Last'. It is set to 'First' by default.

Example Input Data:

ID Name
          10 Joe
          11 Marsh
          12 Shawn
          10 Joe
          10 Roger

Step 1: Job Structure Design

Design the job structure as shown below.

Step 2: Sort Data

Sort the data on ID column in a sort stage.

Step 3: Configure Remove Duplicates Stage

Double click on the Remove duplicate stage.
Define 'key' as ID and 'Duplicate to retain' as 'First'.

Step 4: Map Output Columns

Map all the required output columns under ‘Output’ tab in Remove duplicate stage.

ID Name
          10 Joe
          11 Marsh
          12 Shawn
          Duplicate entries will get removed.

Alternative Methods to Remove Duplicates without using Remove Duplicates Stage in DataStage:

Using Hash File Stage (Specify the keys and check the unique checkbox, Unique Key is not allowed duplicate values)
Using a sort stage, set property: ALLOW DUPLICATES : false
You can do it at any stage. Just do a hash partition of the input data and check the options stable Sort and Unique.