The Remove Duplicates stage is a processing stage commonly used in data warehousing jobs, particularly when dealing with large datasets. It takes a single input link and a single output link: it reads one sorted dataset as input, eliminates duplicate rows, and writes the results to an output dataset.
For optimal performance, the input data should already be sorted on the key columns, so that all records with identical key values are adjacent. If the data has not been sorted beforehand, a 'Link Level Sort' can be performed instead of adding a separate 'Sort stage'.
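To see why adjacency matters, here is a minimal Python sketch of the idea. It is not DataStage code: it assumes the stage compares only neighbouring rows, and the function name `drop_adjacent_duplicates` and the sample rows are hypothetical, chosen purely for illustration.

```python
from itertools import groupby
from operator import itemgetter


def drop_adjacent_duplicates(rows, key):
    """Keep the first row of each run of adjacent rows sharing the same key."""
    return [next(group) for _, group in groupby(rows, key=key)]


# Hypothetical (ID, Name) rows; the two rows with ID 10 are not adjacent.
unsorted_rows = [(10, "A"), (11, "B"), (10, "C")]
by_id = itemgetter(0)

# Without sorting, the second ID 10 row is never compared with the first.
print(drop_adjacent_duplicates(unsorted_rows, by_id))

# After sorting on the key, equal IDs sit next to each other and collapse.
print(drop_adjacent_duplicates(sorted(unsorted_rows, key=by_id), by_id))
```

In this sketch the unsorted call returns all three rows, while the sorted call returns one row per ID.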
Key - Specifies the key column for the operation. This property can be repeated to specify multiple key columns.
Duplicate to retain - Specifies which of the duplicate rows encountered to retain. Choose between 'First' and 'Last'. It is set to 'First' by default.
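The effect of these two properties can be sketched in Python. This is only an illustration of the logic under stated assumptions, not how the stage is implemented: the function `remove_duplicates`, its parameters, and the sample columns `dept`, `id`, and `name` are all hypothetical, and the input is assumed to be already sorted on the key columns.

```python
def remove_duplicates(sorted_rows, keys, retain="First"):
    """Collapse each run of rows with equal key values into a single row.

    sorted_rows: list of dicts, assumed already sorted on `keys`.
    keys: list of key column names (one or more).
    retain: "First" keeps the earliest row of a run, "Last" keeps the latest.
    """
    if retain not in ("First", "Last"):
        raise ValueError("retain must be 'First' or 'Last'")
    result = []
    previous_key = object()  # sentinel that never equals a real key tuple
    for row in sorted_rows:
        current_key = tuple(row[k] for k in keys)
        if current_key != previous_key:
            result.append(row)   # first row of a new key group
        elif retain == "Last":
            result[-1] = row     # replace with the later duplicate
        previous_key = current_key
    return result


# Hypothetical rows with a two-column key (dept, id).
rows = [
    {"dept": "A", "id": 1, "name": "x"},
    {"dept": "A", "id": 1, "name": "y"},
    {"dept": "A", "id": 2, "name": "z"},
]
first = remove_duplicates(rows, ["dept", "id"])
last = remove_duplicates(rows, ["dept", "id"], retain="Last")
print([r["name"] for r in first])  # ['x', 'z']
print([r["name"] for r in last])   # ['y', 'z']
```

Passing several column names in `keys` mirrors repeating the Key property, and `retain` mirrors the Duplicate to retain setting.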
ID    Name
10    Joe
11    Marsh
12    Shawn
10    Joe
10    Roger
Design the job structure as shown below.
Sort the data on the ID column in a Sort stage.
Map all the required output columns under the 'Output' tab in the Remove Duplicates stage.
ID    Name
10    Joe
11    Marsh
12    Shawn
Duplicate entries will get removed.