At the point when you create a table in Netezza database, selecting column/s being a member of distribution key is one of the critical decision for good execution of the queries that will run against the table.
The following are some of the key points to remember when we select distribution key:
- Maximum of four columns can be a member of distribution key
- We can have only one distribution key per table (Which is quite obvious)
- We cannot modify column (by issuing alter table command) which is a member of distribution key
- We create a table with random distribution then Netezza will store data based on round robin algorithm
- When we create a table by issuing “select * from table …….” then the newly created table will inherit the distribution key of source table.
- Always try to avoid using different distribution key when creating a table using “select * from table …….” as this can result in transfer of data across SPU’s because of change in distribution
- When we create a table, if no column is selected as distribution key then Netezza will randomly distribute on any column of the table and not always the first column (Most of the cases it first column but in theory it can be any column)
Choosing right distribution key is one of the important factor to improve the performance of Netezza server. If you have created the table with RANDOM distribution or with different column with lots of duplicate records then you should immediately change the distribution key otherwise that will reduces the performance. Changing Netezza table distribution key is process of redistributing the Netezza table using Netezza nzsql.
Redistribute using CTAS
Creating new table and loading data at same time is one of the best way to redistribute table. You can use CTAS ( Create table AS) and is one of fastest way to re-create Netezza table.
CREATE TABLE <CTA_table> AS
SELECT * FROM <existing_table> DISTRIBUTE ON (QUERIED_colume(s));
CREATE TABLE new_sales AS SELECT * FROM sales DISTRIBUTE ON (Year);
DROP TABLE sales;
ALTER TABLE new_sales RENAME TO sales;
Redistribute using Intermediate Table
The main drawback of using only CTAS is that you have to drop the original table and rename newly created table to original table. This process will invalidate the any views built on top of base table. You have to explicitly re-compile all views which are built on top of old table.
To overcome above mentioned issue, you can follow tow steps procedure to re-distribute table. First, use CTAS to create new intermediate table from old table, then, second, truncate original table and reload data from intermediate table.
CREATE TABLE int_sample_sales AS SELECT * FROM sample_sales DISTRIBUTE ON (MONTH_ID);
INSERT 0 7
TRAINING.ADMIN(ADMIN)=> TRUNCATE TABLE sample_sales;
TRAINING.ADMIN(ADMIN)=> INSERT INTO sample_sales SELECT * FROM int_sample_sales;
INSERT 0 7
You can also create the scripts to re-distribute the Netezza table using aginity workbench. Here is the example of scripts that aginity generates for redistribution of Netezza table:
CREATE TABLE USER1."RENAME_DEMO_
) DISTRIBUTE ON (ID);
INSERT INTO USER1."RENAME_DEMO_
SELECT * FROM USER1.RENAME_DEMO;
ALTER TABLE USER1."RENAME_DEMO_ SET PRIVILEGES TO USER1.RENAME_DEMO;
ALTER TABLE USER1.RENAME_DEMO RENAME TO USER1."RENAME_DEMO_
ALTER TABLE USER1."RENAME_DEMO_ RENAME TO USER1.RENAME_DEMO;
ALTER TABLE USER1.RENAME_DEMO OWNER TO USER1;
DROP TABLE USER1."RENAME_DEMO_
GENERATE EXPRESS STATISTICS ON USER1.RENAME_DEMO;
Changing Netezza Table Default Distribution key
By default, Netezza will distribute on the first column if you do not specify the distribute on hash(column) command.
To change this default distribution property, you can update the /nz/data/postgresql.conf configuration file to add the below entry:
enable_random_table_distribute = 1
You need to perform nzstop/nzstart the database to use updated properties or you can issue command “pkill -HUP -f postmaster”.
Distribution, Not Just for Joins
There are several other scenarios where data needs to be distributed to process.
First one that comes to mind are count distincts. Take for example a count of distinct users or customers like count(distinct customer_id) perhaps when doing analysis on purchase transactions. An example sql might be
select count(distinct customer_id) from purchase_transaction
For argument we’ll say that purchase_transaction has random distribution. So Netezza has to break up the job. It can’t just count distinct on every spu because you can’t just add up the numbers at the end because some customers might be on more than one spu. The way to make it where you can just add them up is to make sure a customer can’t be on more than one spu, and that means you need the data distributed by customer. That’s exactly what the query plan will do, redistribute on customer. But of course if they data is already distributed on customer, then you save having to do a redistribution.
This is just one example of where Netezza needs all of a grouping of data together. Another is windowing analytic functions like rank , dense_rank, lead, lag, etc. They always have a partition over set. To do these partition overs, you need all the data for each partition on one spu. If you partition over on customer_id, reusing our above example, and perhaps want to do a rank on dollar amount for each customer, again it will redistribute the data to get each slice of customer data located together on a single spu.
Still another example is any group by aggregate function with group by columns. If the data set is small enough and all the aggregate functions are summable, it may try to merge each spu’s results together on the host. But look at your plan files, often if the result set will be large, it will distribute by some or often all of the group by columns.
Select distincts are in may ways like a group by without any aggregate functions. So to get like values together , it will redistribute on all or most of the column set.
A key thing to realize is netezza doesn’t need to distribute purely on all of the distinct, group by , or partition over columns. It just needs to make sure that all of that set of data is there local on the spu. For example if you had zipcode and customer id. Lets say we know that customer is a pure child of zipcode, in that a customer only has one zipcode and a zipcode has many customers. It doesn’t need to redistribute on customer_id if it’s already distributed at the zipcode level. But it has to know this relationship. So that means in distinct or group by to make sure that zipcode is included. Or in a partition over statement partition over zipcode, customer_id. What I’ve seen if the optimizer sees that it is already distributed at that higher, courser grain AND it sees that that grain’s cardinality has things pretty evenly spread out, it will usually just leave things at that distribution. It can do this because if customers don’t cross zipcodes, and zipcodes don’t cross spus, then customers don’t cross spus. What I do see though is if it’s going to need to do redistributions, it would redistribute on all of the column values. I see this especially on group bys. Sometimes joins and partition overs are a little smarter if there is a common higher grain to handle both.
So to summarize, in addition to joins , colocation of data through table distribution or redistribution is also needed for count distincts (or any agg distinct), partition overs in analytic functions, and column grouping in group by or select distincts. A really good table distribution strategy will find a column, perhaps like a user_id or customer_id that can commonly be used for joins and these other calculations. If you can find a good common distribution like that things will run very very fast.