When you create a table and load data into it, Netezza allocates the data to disk extents and distributes it across all available data slices according to the distribution key specified at table creation. A distribution key can contain up to four columns; if you specify no key columns, you can distribute the rows randomly instead.
If a specific column is chosen for distribution, Netezza hashes that column's values to decide which data slice each record goes to, so rows with the same key value land on the same data slice. When random distribution is selected, the appliance uses a round-robin algorithm to spread the rows evenly.
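As a minimal sketch of the two options described above, the following Netezza DDL creates one table that is hash-distributed on a column and one that is distributed randomly. The table and column names (employee, emp_id, dept_id, staging_events, and so on) are hypothetical and chosen only for illustration.

```sql
-- Hash distribution: rows with the same emp_id value are stored
-- on the same data slice.
CREATE TABLE employee (
    emp_id   INTEGER NOT NULL,
    dept_id  INTEGER,
    emp_name VARCHAR(100)
)
DISTRIBUTE ON (emp_id);

-- Random distribution: rows are spread round-robin across data slices.
CREATE TABLE staging_events (
    event_id   BIGINT,
    event_time TIMESTAMP,
    payload    VARCHAR(1000)
)
DISTRIBUTE ON RANDOM;
```

A multi-column key is written the same way, for example DISTRIBUTE ON (col1, col2), up to the four-column limit.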
Distributing table data uniformly across all data slices helps maximize system performance, because every SPU then has a similar share of the work for any query. If data is not evenly distributed, the result is data imbalance, or data skew: the SPUs holding the most rows take the longest to finish, which slows down overall query performance. To avoid this, choose distribution columns with high cardinality, that is, columns with many distinct values.
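One common way to check for skew is to count rows per data slice using the hidden datasliceid column that Netezza exposes on every table. The sketch below assumes a hypothetical table named employee; substitute your own table name.

```sql
-- Count the rows stored on each data slice. Roughly equal counts
-- indicate an even distribution; a few much larger counts indicate skew.
SELECT datasliceid, COUNT(*) AS row_count
FROM employee
GROUP BY datasliceid
ORDER BY row_count DESC;
```

If a handful of data slices hold most of the rows, a different distribution key (or random distribution) is worth considering.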
In addition to distributing data across data slices, Netezza also lets you organize the data within each data slice. For example, you might have an employee table distributed on the employee id but want employee records from the same department stored together; in that case, the department id column could be specified as the organizing column in the ORGANIZE ON clause.
Netezza supports up to four columns for organization. When data is stored on a data slice, the rows are grouped based on these columns. This helps optimize joins and other operations that use those columns.
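The employee example can be sketched with an ORGANIZE ON clause as shown below. The table and column names are hypothetical, and the second statement pair assumes an already existing table named employee_archive.

```sql
-- Distribute rows across data slices by emp_id, and within each
-- data slice keep rows of the same department close together.
CREATE TABLE employee (
    emp_id   INTEGER NOT NULL,
    dept_id  INTEGER,
    emp_name VARCHAR(100)
)
DISTRIBUTE ON (emp_id)
ORGANIZE ON (dept_id);

-- For an existing table, set the organizing column and then
-- reorganize the rows that are already stored.
ALTER TABLE employee_archive ORGANIZE ON (dept_id);
GROOM TABLE employee_archive RECORDS ALL;
```

Changing the organizing columns on an existing table does not move stored rows by itself, which is why the sketch follows the ALTER TABLE with GROOM TABLE.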
In this tutorial, we will explore how the DISTRIBUTE ON clause (referred to here as DIST ON) supports distributed processing in Netezza. It controls how rows are spread across data slices so that queries can be processed by multiple nodes in parallel.
DISTRIBUTE ON is not a clause of a standalone SELECT statement. It is specified when a table is created, either in CREATE TABLE or in CREATE TABLE AS. When the distribution key matches the column used for joins or grouping, each SPU can work on its own data slice without first redistributing rows between nodes, which speeds up execution.
CREATE TABLE new_table AS SELECT column1, column2, ... FROM table_name [WHERE condition] [GROUP BY grouping_expression] DISTRIBUTE ON (column_list);
In the above syntax:
column_list - One to four columns of the new table whose hashed values determine the data slice for each row.
condition (Optional) - A WHERE clause to filter the data before it is stored and distributed.
grouping_expression (Optional) - Specifies grouping of results.
Let's consider a table named "Orders" with columns "OrderID", "CustomerID", and "Total". We want to find the total sales for each customer, grouped by customer ID, and store the result distributed on the customer ID. The names "CustomerTotals" and "TotalSales" below are illustrative.
CREATE TABLE CustomerTotals AS SELECT CustomerID, SUM(Total) AS TotalSales FROM Orders GROUP BY CustomerID DISTRIBUTE ON (CustomerID);
When using DIST ON, keep in mind that: