When users create tables in databases and store data into it, data gets stored in disk extents which is the minimum storage allocated on disks for data storage. Netezza distributes the data in data extents across all the available data slices based on the distribution key specified during the table creation. If you don’t want to distribute randomly, a maximum of four columns can be mentioned as distribution key during table creation process. If a user provides no distribution specification, Netezza uses one of the columns to distribute the data and the selection of which cannot be influenced.
When the user specifies particular column for distribution then Netezza uses the column data to distribute the records being inserted across the dataslices. Netezza uses hashing to determine the dataslice into which the record needs to be stored. When the user selects random as the option for data distribution, then the appliance uses round robin algorithm to distribute the data uniformly across all the available dataslices.
The key is to make sure that the data for a table is uniformly distributed across all the data slices so that there are no data skews. By distributing data across the data slices, all the SPUs in the system can be utilized to process any query and in turn improves performance. In addition, the performance of any query depends on the slowest SPUs handling the query. If the data is not uniformly distributed then some of the SPUs will have more data in the data slice attached to it called data skew which will impact the processing time and in turn the overall query performance. Selecting columns with high cardinality for the distribution is a good practice to follow. Even if a column with high cardinality like a date column is chosen to distribute data, there is a possibility of creating processing skew. For e.g. using the date column as the distribution key, the data gets distributed fairly evenly across all the data slices. But if most of the queries are looking for data for a particular month which is fairly often in a data warehousing environment, then only a particular set of data slices may need to be processed by the appliance which in turn will only utilize a subset of SPUs causing the query performing sub optimally. This is called processing skew and needs to be prevented by understanding the processing requirements and choosing the correct distribution keys.
CBT – Clustered Base Tables
Along with the option to distribute data during table definition, Netezza also provides an option to organize the distributed data with in a data slice. For e.g., we may have distributed employee table on employee id but wanted to have employee records from the same department to be stored closely together, the column dept id in the table can be specified.
Netezza allows up to four columns to organize on. When data gets stored on the data slice, records with the same organize on column values will get stored in the same or close by extends. Organize on improves queries on fact tables when they are defined with the frequently joined columns to organize the data. All the columns specified in the organize on clause are zone mapped and by knowing the range of values of these columns stored in each physical extent, Netezza can eliminate reading unwanted extents during a join query which improves the query performance.
The distribution of the data across the various disks is the single most important factor that can impact performance. Consider below points when choosing the distribution key:
- Column with High Cardinality.
- Column(s) that will frequently be used in join conditions.
- Avoid using a boolean column, which causes Data Skew.
- Distribute on one column whenever possible and do not create keys for the sake of distribution.
Avoid distributing the tables on columns that are often used in the where clause as it will cause processing skew. Date columns would be an example of where not to use this, especially in DW environments.
- Choose Random Distribution only as the last resort as it will more often lead to a table being redistributed or broadcasted. This is okay for a small table but will impact performance if done to large tables.
- When possible, distribute your fact and dimension on the same column.
- Even if both the fact and dimension table can’t be distributed on the same key, make effort to avoid redistribution on the fact table by choosing the right column for distribution.
- In addition to joins , colocation of data through table distribution or redistribution is also needed for count distincts (or any agg distinct), partition overs in analytic functions, and column grouping in group by or select distincts. A really good table distribution strategy will find a column, perhaps like a user_id or customer_id that can commonly be used for joins and these other calculations. If you can find a good common distribution like that things will run very very fast.
Why Varchar is not a good candidate for a distribution key:
- The datatype matters between a varchar and a bigint in more than just width of the field.
- The bigint can be compared to another bigint in the CPU with a hardware subtract, usually on a CPU register (very close to the hardware)
- A varchar however, has to be interpreted, that is, compared against each other in loop, requiring an algorithm in the CPU.
- Integer/binary keys will operate 100x faster than their varchar counterparts. This is why dates are ultimately stored as binary and numeric(18,x) as scaled bigint.
- Why does a binary key operate faster? If we perform a comparison on a varchar, it will have to loop through the characters, comparing one-or-several at a time until it gets a match/mismatch. With a binary value, it’s a register-based subtraction on the CPU, about as close to the metal as we can get. In short, binary comparison is on the deep-metal. Varchar comparison requires a firmware algorithm.
Some helpful Documentation: