Aggregator Stage in DataStage

Reading Time: 4 minutes

The Aggregator stage is a processing stage in DataStage used for grouping and summary operations. By default, the Aggregator stage executes in parallel mode in parallel jobs. In a parallel environment, the way data is partitioned before grouping and summarizing affects the results: if you partition data using the round-robin method, records with the same key values are distributed across different partitions, which produces incorrect results.
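
To see why, here is a minimal Python sketch (not DataStage code; the records and partition count are invented for illustration) contrasting the two partitioning methods:

from collections import defaultdict

# Toy keyed records and partition count, invented for illustration.
records = [("A", 1), ("A", 2), ("B", 3), ("A", 4), ("B", 5)]
NUM_PARTITIONS = 2

# Round-robin: rows cycle through partitions regardless of key, so
# records with the same key are spread across partitions.
round_robin = defaultdict(list)
for i, rec in enumerate(records):
    round_robin[i % NUM_PARTITIONS].append(rec)

# Hash partitioning: the key alone decides the partition, so every
# record of a group lands in the same partition.
hashed = defaultdict(list)
for rec in records:
    hashed[hash(rec[0]) % NUM_PARTITIONS].append(rec)

# Counting key "A" per partition shows the problem: round-robin yields
# partial counts in both partitions, hash yields one complete count.
for name, parts in (("round-robin", round_robin), ("hash", hashed)):
    counts = {p: sum(1 for k, _ in rows if k == "A") for p, rows in parts.items()}
    print(name, counts)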

Aggregator stage: Grouping keys category

Group – Specifies an input column to use as a grouping key. Repeat the property to select multiple columns as group keys. This property has a dependent property:

Case Sensitive – Specifies whether each group key is case sensitive. It is set to True by default; that is, the values “CASE” and “case” would end up in different groups.
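
As a rough illustration of this option in plain Python (the key values are invented, and lower-casing stands in for DataStage's case-insensitive comparison):

from collections import Counter

# Toy key values, invented for illustration.
keys = ["CASE", "case", "Case", "CASE"]

# Case Sensitive = True (the default): "CASE" and "case" are
# distinct groups.
print(Counter(keys))                      # CASE: 2, case: 1, Case: 1

# Case Sensitive = False: keys are compared without regard to case,
# so all four rows fall into a single group.
print(Counter(k.lower() for k in keys))   # case: 4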


Aggregator stage: Aggregations category

Aggregation type – Choose from Calculate (the default), Recalculate, and Count Rows.

Column for calculation – The Calculate aggregation type lets you summarize the contents of a particular column or columns in your input data set by applying one or more aggregate functions to them. Select the column to be aggregated, then select dependent properties to specify the operation to perform on it and the output column to carry the result.

Recalculate – This aggregation type lets you apply aggregate functions to a column that has already been summarized.

Count Rows – This aggregation type counts the number of records in each group.
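
The following Python sketch approximates the three aggregation types on toy (dept_no, salary) rows; the column names and values are assumptions for illustration only:

# Toy (dept_no, salary) rows, invented for illustration.
rows = [(10, 100.0), (10, 200.0), (20, 300.0)]

groups = {}
for dept, salary in rows:
    groups.setdefault(dept, []).append(salary)

# Calculate: apply one or more aggregate functions to a column,
# producing one output value per function per group.
calculate = {d: {"salary_sum": sum(v), "salary_max": max(v)}
             for d, v in groups.items()}

# Count Rows: the number of input records in each group.
count_rows = {d: len(v) for d, v in groups.items()}

# Recalculate: aggregate a column that is already a summary, for
# example rolling per-group sums up into a grand total.
grand_total = sum(g["salary_sum"] for g in calculate.values())

print(calculate)     # {10: {'salary_sum': 300.0, 'salary_max': 200.0}, ...}
print(count_rows)    # {10: 2, 20: 1}
print(grand_total)   # 600.0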

Aggregator stage: Options category

The Aggregator stage has two modes of operation: hash and sort.

Method – Your choice of mode depends primarily on the number of groupings in the input data set, taking into account the amount of memory available. You typically use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of memory to be used. The Method property is set to hash by default.

When using hash mode, you should hash partition the input data set by one or more of the grouping key columns so that all the records in the same group are in the same partition (this happens automatically if Auto is set in the Partitioning tab). However, hash partitioning is not mandatory; you can use any partitioning method if keeping groups together in a single partition is not important. For example, if you are summing records in each partition and will later add the sums across all partitions, you do not need all the records in a group to be in the same partition. Note, though, that there will be multiple output records for each group.

If the number of groups is large, which can happen if you specify many grouping keys or if some grouping keys can take on many values, you would normally use sort mode. Sort mode requires the input data set to have been partition sorted with all of the grouping keys specified as hashing and sorting keys (this happens automatically if Auto is set in the Partitioning tab). Sorting acts as a pregrouping operation: after sorting, all the records in a given group in the same partition are consecutive.

Allow null outputs – Set this to True to indicate that null is a valid output value when calculating minimum value, maximum value, mean value, standard deviation, standard error, sum, sum of weights, and variance. If False, 0 is substituted when all the input values for the calculation column are null. It is False by default.
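
The difference between the two modes, and the Allow null outputs behaviour, can be sketched in plain Python; the rows below are invented, and this is only an analogy for what the stage does internally:

from itertools import groupby
from operator import itemgetter

# Toy keyed rows, invented for illustration.
rows = [(20, 1), (10, 2), (10, 3), (20, 4), (10, 5)]

# Hash mode: one accumulator per group in a hash table; memory grows
# with the number of distinct groups, which is why it suits a
# relatively small number of groups.
hash_counts = {}
for key, _ in rows:
    hash_counts[key] = hash_counts.get(key, 0) + 1

# Sort mode: sort on the grouping key first so each group's records
# are consecutive, then stream through holding one group at a time.
sorted_rows = sorted(rows, key=itemgetter(0))
sort_counts = {k: sum(1 for _ in g)
               for k, g in groupby(sorted_rows, key=itemgetter(0))}

print(hash_counts, sort_counts)  # both give 10 -> 3 and 20 -> 2

# Allow null outputs: when every input value for the calculation
# column is null, True emits null (None), False substitutes 0.
values = [None, None]
non_null = [v for v in values if v is not None]
print(sum(non_null) if non_null else None)  # True behaviour: None
print(sum(non_null) if non_null else 0)     # False behaviour: 0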

For example, suppose we have the data below:

e_id,e_name,dept_no
1,john,10
2,rocky,20
3,micky,10
4,bibhu,20
5,russ,10

Use the Aggregator stage to find the number of people in each department.

Job design:  

sequential_file ----> Aggregator ----> sequential_file

Read and load the data in the source file, then open the Aggregator stage and set the properties below:

Group = dept_no
Aggregation type = Count Rows
Count Output Column = count (this column name is user-defined)

Output:

dept_no,count
10,3
20,2
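
For comparison, the same Count Rows grouping expressed in pandas (used here purely as an illustration; DataStage itself does not involve pandas) reproduces this output:

import pandas as pd

# The example data above, as a DataFrame.
df = pd.DataFrame({
    "e_id": [1, 2, 3, 4, 5],
    "e_name": ["john", "rocky", "micky", "bibhu", "russ"],
    "dept_no": [10, 20, 10, 20, 10],
})

# Group by dept_no and count the rows per group, mirroring the
# Count Rows aggregation type.
result = df.groupby("dept_no").size().reset_index(name="count")
print(result)
#    dept_no  count
# 0       10      3
# 1       20      2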