Join Keys Best Practice
Use Integers as join keys for best performance results, example a surrogate key. Integers and integer based columns (e.g. timestamp and date) will compress with Netezza’s compression. As per Netezza docs, floating point numerics hash poorly and will force you to slower sort merge joins if joining on numeric columns that are not integers. Relying on a standard of integer data type on keys will avoid problems like a NOT achieving good join colocation. If two tables that need to be joined are distributed on customer id but one is defined as a varchar and other is defined as an integer, the data would not colocate because a varchar and an integer are not the same.
What is an extent?
An extent is smallest unit of storage on the disk that can be allocated (each extent is 3 MB)
Netezza Zone maps
Netezza automatically generates minimum and maximum values for every integer based column
- All Integer types (int1, int2, int4, int8)
By keeping track of the min and max values, Netezza will be able to avoid reading disk extends which will be most of the disk extends in a large data warehouse environment. If your data is naturally ordered (like by date) or sorted, these will behave much like Oracle’s partitions, and only the blocks needed will be read. For e.g. if a fact table stores five million records per month for the last 5 years which is more than 2.5 billion records. If the queries process only a month’s data, Netezza will be able to read only the five million records that are relevant. If there are 96 snippet processors and if the data is uniformly distributed across all the data slices, then the amount of data read into the snippet processors will be a little over 500,000 records which is small to process. Enabling the appliance to utilize the zone map feature by selecting the best data types for columns in the tables is another key point to take into consideration during design.
Other data types
Additional data types which can be zone mapped due to organize on clause (this creates a CBT – Clustered Base Table).
- Char – all sizes, but only the first 8 bytes are used in the zone map
- Varchar – all sizes, but only the first 8 bytes are used in the zone map
- Nchar – all sizes, but only the first 8 bytes are used in the zone map
- Nvarchar – all sizes, but only the first 8 bytes are used in the zone map
- Numeric – all sizes up to and including numeric(18)
- Time with timezone
Clustered Based Table for performance
In order to organize the existing records in the table, “GROOM TABLE” needs to be executed to take advantage of the data reorganization by queries. It is a good practice to have fact tables defined as clustered base tables with data organized on often joined columns to improve multi-dimensional lookup. At the same time care needs to be taken on the data organization columns by understanding the often executed queries and also minimizing the number of columns on which the data needs to be organized on. Compared to traditional indexes or using materialized views to improve performance, CBTs have a major advantage in terms of not using additional space but organizing the table data in place. One point to note is that by changing the data organization, it will impact the compression which may result in the increase or decrease in the size of the table storage after converting a table to a CBT.
Zone maps and joins
When two tables are joined together often like a customer table and order table, the distribution key selection of the two tables can play an important role in the performance of the queries. If the distribution key is on the join column, for e.g. customer id column in both the customer and order table, the data distribution will result in the records with the same customer id values ending up in the same data slice for both the tables. When a query joining the table is being processed, since the matching data from both the tables are in the same data slice the snippet processor will be able to perform the join locally and the send the result without performing additional work which in turn improves the performance of the query. If the tables are not distributed on the columns often used for join, matching data from both the tables will end up in different data slices which means, the snippet processor needs to perform additional work to satisfy the join. The appliance will choose to temporarily redistribute one of the tables on the join column if the other is already distributed on the join column and then the snippet processors can perform the join locally. If both the tables are not distributed on the join column, then the appliance may redistribute both the tables before the snippet processors can perform the join. If a table stored relatively small number of records then Netezza can decide to broadcast the whole table to all the SPUs so that each one has its own copy for processing. What this means is the host needs to consolidate the table data from all the data slices and send it across to all the SPUs the complete table data.