The Remove Duplicates stage is a processing stage commonly used in data warehousing jobs, particularly when dealing with large datasets. It has a single input link and a single output link: it reads one sorted dataset, eliminates duplicate rows, and writes the result to an output dataset.
The input data should already be sorted on the key columns so that all records with identical key values are adjacent; duplicates that are not adjacent may otherwise be missed. If the data has not been sorted upstream, a 'Link Level Sort' can be performed on the input link instead of adding a separate 'Sort stage'.
Key - Specifies the key column for the operation. This property can be repeated to specify multiple key columns.
Duplicate to retain - Specifies which of the duplicate rows encountered to retain. Choose between 'First' and 'Last'. The default is 'First'.
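The two properties above can be illustrated with a minimal Python sketch. This is not the stage's actual implementation; the function name `remove_duplicates`, the `retain` argument, and the sample rows are hypothetical and only mimic the behaviour described: rows are compared on one or more key columns, and within each run of adjacent rows sharing the same key values either the first or the last row is kept.

```python
from typing import Dict, Iterable, List, Sequence

Row = Dict[str, object]


def remove_duplicates(rows: Iterable[Row], keys: Sequence[str],
                      retain: str = "first") -> List[Row]:
    """Keep one row per run of adjacent rows with equal key values.

    Assumes the input is already sorted on `keys` (hypothetical helper).
    """
    if retain not in ("first", "last"):
        raise ValueError("retain must be 'first' or 'last'")
    result: List[Row] = []
    previous_key = None
    for row in rows:
        current_key = tuple(row[k] for k in keys)
        if result and current_key == previous_key:
            # Same key as the previous row: this row is a duplicate.
            if retain == "last":
                result[-1] = row  # overwrite so the last duplicate survives
        else:
            result.append(row)
        previous_key = current_key
    return result


# Hypothetical sample data, already sorted on (region, id).
sample = [
    {"region": "east", "id": 1, "amount": 10},
    {"region": "east", "id": 1, "amount": 20},
    {"region": "east", "id": 2, "amount": 30},
    {"region": "west", "id": 1, "amount": 40},
]

keep_first = remove_duplicates(sample, ["region", "id"])
keep_last = remove_duplicates(sample, ["region", "id"], retain="last")
by_region = remove_duplicates(sample, ["region"])
```

With two key columns only the two `east`/`1` rows count as duplicates; with `region` as the sole key, all three `east` rows collapse into one.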
Input data:
ID Name
10 Joe
11 Marsh
12 Shawn
10 Joe
10 Roger
Design the job structure as shown below.
Sort the data on the ID column in a Sort stage.
Map all the required output columns under the 'Output' tab in the Remove Duplicates stage.
Output data:
ID Name
10 Joe
11 Marsh
12 Shawn
The duplicate rows for ID 10 are removed, leaving one row per ID value.
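The example can be reproduced with a short Python sketch, assuming ID is the key and 'First' is the duplicate to retain. The helper `first_per_adjacent_key` is a hypothetical name, and rows are modelled as `(ID, Name)` tuples. The sketch also runs the same logic on the unsorted input to show why the sort step matters. Note that Python's `sorted` is stable, so rows with the same ID keep their original relative order; whether the tool's sort behaves the same way depends on its settings.

```python
from itertools import groupby
from operator import itemgetter

# Input rows from the example, as (ID, Name) tuples.
rows = [
    (10, "Joe"),
    (11, "Marsh"),
    (12, "Shawn"),
    (10, "Joe"),
    (10, "Roger"),
]


def first_per_adjacent_key(records):
    """Keep the first row of each run of adjacent rows with the same ID."""
    return [next(group) for _, group in groupby(records, key=itemgetter(0))]


# Without sorting, the two runs of ID 10 are not adjacent,
# so one row from each run survives.
unsorted_result = first_per_adjacent_key(rows)

# Sorting on ID first makes all ID 10 rows adjacent.
sorted_rows = sorted(rows, key=itemgetter(0))
sorted_result = first_per_adjacent_key(sorted_rows)

for row_id, name in sorted_result:
    print(row_id, name)
```

On the sorted input this yields the three rows shown in the output table, while the unsorted run still contains ID 10 twice.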