Learn: Netezza Distribute on

When you create a table and load data into it, the data is stored in disk extents. Netezza spreads the rows across all available data slices according to the distribution key specified when the table is created. A distribution key can contain at most four columns; if you do not want to name key columns, you can distribute randomly instead.

If a specific column is chosen for distribution, Netezza hashes that column's value in each record to decide which data slice stores the record. When random distribution is selected, the appliance uses a round-robin algorithm to spread the records evenly.
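The two options above can be sketched as table definitions. This is a minimal illustration, assuming hypothetical tables employee_hash and employee_random with made-up columns; only the DISTRIBUTE ON clause differs between them.

```sql
-- Hash distribution: each row's emp_id is hashed to pick its data slice,
-- so rows with the same emp_id always land on the same data slice.
-- (Table and column names are hypothetical.)
CREATE TABLE employee_hash (
    emp_id   INTEGER,
    dept_id  INTEGER,
    emp_name VARCHAR(100)
)
DISTRIBUTE ON (emp_id);

-- Random distribution: rows are handed out round robin,
-- independent of any column value.
CREATE TABLE employee_random (
    emp_id   INTEGER,
    dept_id  INTEGER,
    emp_name VARCHAR(100)
)
DISTRIBUTE ON RANDOM;
```

Random distribution gives an even spread but no co-location of related rows, so a join on emp_id may need to move data between data slices.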

Data Slices in Each SPU

Distributing table data uniformly across all data slices helps maximize system performance, because every SPU then does a similar share of the work for a query. When the data is not evenly distributed, the result is data skew: some data slices hold more rows than others, and a query can finish only as fast as its most heavily loaded data slice. To avoid this, choose columns with high cardinality for distribution.
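One way to look for skew is to count rows per data slice. The sketch below assumes a hypothetical table named employee_hash and relies on datasliceid, the hidden system column that Netezza exposes on user tables.

```sql
-- Count the rows stored on each data slice of a (hypothetical) table.
-- A large gap between the highest and lowest counts suggests skew.
SELECT datasliceid,
       COUNT(*) AS row_count
FROM employee_hash
GROUP BY datasliceid
ORDER BY row_count DESC;
```

If a few data slices hold far more rows than the rest, the distribution key probably has too few distinct values or a heavily repeated value.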

Netezza Distribution Key

In addition to distributing data across data slices when a table is defined, Netezza also allows you to organize the data within each data slice (the ORGANIZE ON clause). For example, you might have an employee table distributed on the employee id but want employee records from the same department stored together, in which case the department id column could be specified as the organizing column.

Netezza supports up to four columns for organization. When data is stored on a data slice, rows with similar values in these columns are kept close together. This can speed up queries that filter or join on those columns.
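The employee example can be written as a single table definition. This is a sketch with a hypothetical table name and columns; it combines a distribution key with an organizing column.

```sql
-- Spread rows across data slices by emp_id, and within each data slice
-- keep rows of the same department close together.
-- (Table and column names are hypothetical.)
CREATE TABLE employee_organized (
    emp_id   INTEGER,
    dept_id  INTEGER,
    emp_name VARCHAR(100)
)
DISTRIBUTE ON (emp_id)
ORGANIZE ON (dept_id);
```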

Factors to Consider when Choosing a Distribution Key

Useful Documentation

https://www.ibm.com/developerworks/community/blogs/Netezza/entry/distribution_what_s_up_with_that13?lang=en

 

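Learning Netezza: Understanding the DISTRIBUTE ON Clause (DIST ON)

In this tutorial, we will explore the DISTRIBUTE ON clause (abbreviated here as DIST ON) in Netezza. It controls how a table's rows are spread across the data slices, which in turn affects how well queries on that table run in parallel.

What is DIST ON?

DIST ON is a clause of the CREATE TABLE statement (including CREATE TABLE AS SELECT); it is not a keyword of ordinary SELECT queries. It names the column or columns whose hashed values decide which data slice stores each row. When a query joins or groups on the distribution key, matching rows are already on the same data slice, so the query can run in parallel with less data movement between SPUs.

Syntax

  CREATE TABLE table_name (
    column1 data_type,
    column2 data_type, ...
  )
  DISTRIBUTE ON { (column1 [, column2, ...]) | RANDOM };

In the above syntax, table_name is the table being created, the parenthesized list names up to four distribution columns, and RANDOM requests round-robin distribution instead of a key.

Example

Let's consider a table named "Orders" with columns "OrderID", "CustomerID", and "Total". We want to find the total sales for each customer, grouped by customer ID. Distributing the table on CustomerID keeps each customer's rows on one data slice, so each data slice can total its own customers.

  CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER, Total NUMERIC(12,2))
  DISTRIBUTE ON (CustomerID);

  SELECT CustomerID, SUM(Total)
  FROM Orders
  GROUP BY CustomerID;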
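If the Orders table already exists with a different distribution, one common approach is to build a redistributed copy with CREATE TABLE AS SELECT. The sketch below assumes a hypothetical target table named Orders_By_Customer.

```sql
-- Build a copy of Orders whose rows are distributed by CustomerID.
-- (Orders_By_Customer is a hypothetical name.)
CREATE TABLE Orders_By_Customer AS
SELECT OrderID, CustomerID, Total
FROM Orders
DISTRIBUTE ON (CustomerID);

-- Per-customer totals can then be computed from the redistributed copy.
SELECT CustomerID, SUM(Total) AS total_sales
FROM Orders_By_Customer
GROUP BY CustomerID;
```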
  

Benefits of Using DIST ON

   

Considerations

 

When using DIST ON, keep in mind that: