Sort data using in-stage sorts and Sort stage

Reading Time: 3 minutes

In several processing stages, we can choose or set the sort criteria (i.e. an link or “in-stage” sort).  When you do that, one of those tiny “Sort” icons show up the metadata link.

So, you know that the data is being sorted between stages.  So you may be wondering, why there is a separate Sort stage even offered, and why would I ever need to use it?

For link sort, you can’t control how much memory is allocated, I think the default is 20MB; in the Sort stage, you can specify how much memory to use.

“Link Sort” uses scratch disk (physical location on disk), whereas “Sort Stage” uses server RAM (Memory). Hence we can change the default memory size in “Sort Stage”.

The Sort stage will tell OSH that the stream was previously sorted on a column(s), and also to not sort on that column(s) but to sort on an additional column(s) e.g. the stream is already sorted on Columns A and B (but not C), so you can specify that the key to sort on is A, B and C, but A and B were previously sorted, thus, only sort on column C.

In other words, if your job is having performance issues and you’ve narrowed the problem to sorting, many of the problems can be addressed by a separate Sort stage.

If the volume of the data is low, then we go for link sort. If the volume of the data is high, then we go for sort stage.

For an implicit sort: over the years, DataStage has gotten smart enough to insert a sort into OSH when it wasn’t specified in the code.  

For instance, you want to aggregate on column A, but the job didn’t specify to sort the data on column A before the Aggregator, so, DataStage will implicitly include/insert a sort in your OSH.