Learn: Netezza Distribute On

When a table is created and loaded, its data is stored in disk extents. Netezza spreads the rows across all available data slices according to the distribution key specified with the DISTRIBUTE ON clause at table creation. The key can contain at most four columns; alternatively, the table can be distributed randomly.

If a specific column is chosen for distribution, Netezza hashes that column's value to decide which data slice receives each record. When random distribution is selected, the appliance instead uses a round-robin algorithm to spread the rows evenly.
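The two options can be sketched in Netezza SQL as follows. The table and column names are hypothetical and chosen only for illustration.

```sql
-- Hypothetical tables, for illustration only.

-- Hash distribution: each row is assigned to a data slice by hashing emp_id.
CREATE TABLE employee (
    emp_id    INTEGER NOT NULL,
    dept_id   INTEGER,
    emp_name  VARCHAR(100)
)
DISTRIBUTE ON (emp_id);

-- Random distribution: rows are spread round-robin across the data slices.
CREATE TABLE audit_log (
    log_id    BIGINT,
    message   VARCHAR(500)
)
DISTRIBUTE ON RANDOM;
```

Up to four columns may be listed inside the parentheses, for example DISTRIBUTE ON (emp_id, dept_id).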

Data Slices in Each SPU

Table data that is spread uniformly across all data slices helps maximize system performance, because every SPU gets a similar share of the work for any query. If the rows are not evenly distributed, the result is data imbalance, or data skew: the SPUs holding the larger slices take longer, which slows processing and hurts overall query performance. To avoid this, choose distribution columns with high cardinality.
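One common way to look for skew is to count rows per data slice using the hidden datasliceid column that Netezza exposes on every table. The query below is a minimal sketch; the employee table is a hypothetical example.

```sql
-- Count the rows stored on each data slice of the (hypothetical) employee table.
-- Roughly equal counts suggest an even distribution;
-- a few much larger counts suggest data skew.
SELECT datasliceid, COUNT(*) AS row_count
FROM employee
GROUP BY datasliceid
ORDER BY row_count DESC;
```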

Netezza Distribution Key

In addition to distributing data across data slices at table definition, Netezza allows you to organize the data within each data slice, using the ORGANIZE ON clause. For example, you might have an employee table distributed on the employee id but want employee records from the same department stored together; in that case the department id column can be specified as the organizing column.

Netezza supports up to four organizing columns. Rows stored on a data slice are grouped according to these columns, which helps optimize joins and other operations that restrict on them.
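The employee example above might be declared as shown below. This is a sketch under assumptions: the table name employee_cbt and its columns are hypothetical.

```sql
-- Hypothetical table: distributed across data slices by employee id,
-- and organized within each data slice by department id.
CREATE TABLE employee_cbt (
    emp_id    INTEGER NOT NULL,
    dept_id   INTEGER,
    emp_name  VARCHAR(100)
)
DISTRIBUTE ON (emp_id)
ORGANIZE ON (dept_id);
```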

Factors to Consider when Choosing a Distribution Key

Avoid using a VARCHAR column as a distribution key, because it is slower to process than an integer or binary key.
Choose random distribution only when necessary: joins against a randomly distributed table may require it to be redistributed or broadcast, which hurts performance on large tables.
Where possible, distribute fact and dimension tables on the same column (the one they are joined on) for optimal performance.
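The last point can be illustrated with a small sketch. The tables sales_fact and customer_dim are hypothetical; the idea is that equal key values hash to the same data slice in both tables, so the join should not need to move either table between slices.

```sql
-- Hypothetical dimension table, distributed on its key.
CREATE TABLE customer_dim (
    customer_id   INTEGER NOT NULL,
    customer_name VARCHAR(100)
)
DISTRIBUTE ON (customer_id);

-- Hypothetical fact table, distributed on the same integer column it joins on.
CREATE TABLE sales_fact (
    sale_id     BIGINT NOT NULL,
    customer_id INTEGER NOT NULL,
    amount      NUMERIC(12,2)
)
DISTRIBUTE ON (customer_id);

-- Matching customer_id values live on the same data slice in both tables,
-- so each SPU can join its own rows locally.
SELECT d.customer_name, SUM(f.amount) AS total_amount
FROM sales_fact f
JOIN customer_dim d ON f.customer_id = d.customer_id
GROUP BY d.customer_name;
```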

Useful Documentation

https://www.ibm.com/developerworks/community/blogs/Netezza/entry/distribution_what_s_up_with_that13?lang=en

Learning Netezza: Understanding Distributed Processing (DIST ON)

In this tutorial, we will explore the concept of Distributed Processing (DIST ON) in Netezza. This feature enables efficient data processing across multiple nodes in a Netezza system.

What is DIST ON?

DIST ON refers here to the DISTRIBUTE ON clause of Netezza SQL. It is not a clause of a plain SELECT statement: it is written on CREATE TABLE or CREATE TABLE AS SELECT, and it names the columns whose hashed values decide which data slice stores each row. When a table is distributed on the columns that queries join or group on, each SPU can process its own slice in parallel and less data has to move between nodes.

Syntax

  CREATE TABLE new_table_name AS
  SELECT column1, column2, ...
  FROM table_name
  [WHERE condition]
  [GROUP BY grouping_expression]
  DISTRIBUTE ON (column_name [, ...]);

In the above syntax:

column_name - One to four columns of the new table whose hashed values determine the data slice for each row.
condition (Optional) - A WHERE clause to filter the data before distribution.
grouping_expression (Optional) - Specifies grouping of results.

Example

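Let's consider a table named "Orders" with columns "OrderID", "CustomerID", and "Total". We want the total sales for each customer, stored in a new table (called "CustomerTotals" here purely for illustration) that is distributed on customer ID.

  -- DISTRIBUTE ON is attached to CREATE TABLE AS, not to a bare SELECT.
  CREATE TABLE CustomerTotals AS
  SELECT CustomerID, SUM(Total) AS TotalSales
  FROM Orders
  GROUP BY CustomerID
  DISTRIBUTE ON (CustomerID);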

Benefits of Using DIST ON

Improved Performance: When rows that are joined or grouped together hash to the same data slice, each SPU can work on its own slice locally, which reduces the amount of data that needs to be redistributed or broadcast between nodes.
Scalability: As the volume of data grows, an evenly distributed table keeps the work spread across all data slices, so larger datasets can still be processed in parallel.

Considerations

When using DIST ON, keep in mind that:

The column used for distribution should have many distinct values (high cardinality) and be cheap to hash, such as an integer key. This spreads the data evenly across nodes and reduces the likelihood of hotspots.
Avoid distributing on a column with only a few distinct values, such as a low-cardinality grouping column, because most rows will then land on a handful of data slices. For relatively small tables the choice of key matters less, and a simpler distribution may be sufficient.
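If an existing table turns out to have a poorly chosen key, one option is to rebuild it with a different one using CREATE TABLE AS. The sketch below assumes the Orders table from the example and a hypothetical new table name, Orders_by_order; it assumes OrderID is unique per row and therefore spreads evenly.

```sql
-- Rebuild the Orders table on a higher-cardinality key.
-- Orders_by_order is a hypothetical name used only for this sketch.
CREATE TABLE Orders_by_order AS
SELECT OrderID, CustomerID, Total
FROM Orders
DISTRIBUTE ON (OrderID);
```

Whether OrderID or CustomerID is the better key depends on the workload: OrderID tends to give the most even spread, while CustomerID keeps each customer's rows together for joins and aggregations on that column.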