Netezza uses the cost based optimizer to determine best methods determine redistribution, scan, join, join orders. Netezza query optimizer relies on the statistics from catalog to come up with an optimal query execution plan. So it is imperative that the statics need to be kept current without which the execution plan generated by the optimizer may be sub-optimal resulting in poor query performance. System uses the Netezza GENERATE STATISTICS command to generate information about each column proportion of the duplicate values, unique values, NULL values and the maximum and minimum values. The optimizer uses statistics information to determine the most efficient way to run a query.
How Statistics are collected?
Netezza system collects statistics in various ways:
- Statistics are automatically generated for certain database operations
- Collect full statistics by running Netezza GENERATE STATISTICS command on database or on individual tables. This operation is required when table changes significantly, say at least 10%.It usually collects all level of statistics, including dispersion values.
- You can generate statistics You should maintain the table by running Netezza GENERATE STATISTICS SQL command on regular basis.
- Just In Time Statistics (JIT): Runs at every execution time using sampling and zone maps
How Statistics are used?
Some examples of how the statistics can be used by the optimizer
- If a column is null able additional code may be generated to check whether the value is null or not
- Knowing the number of records in the table, min and max values along with the number of distinct values can help estimate the number of relevant records which can be returned for the query assuming there is uniform distribution
- Based on the min and max values the optimizer can determine the type of math required to be performed like 64 or 128 bit computation.
These statistics in the catalog are generated using the “GENERATE STATISTICS” command which collects them and updates the catalog. Admin users, table owners and users who have the GENSTATS privilege can execute the generate statistics command on tables in databases. The following are some examples:
Automatic Netezza Database Statistics
The Netezza system automatically generates two ‘Base Statistics’ (table row count and min/max values for non-character columns) when performing certain database operations:
- nzload operations
- INSERT operations
- UPDATE operations
- GROOM TABLE operations
- TRUNCATE TABLE operations
Netezza GENERATE STATISTICS Syntax
Syntax of the command:
GENERATE STATISTICS ON <tablename> [ (<col>[,<col>… ]) ];
Usage: The following provides sample usage. Generate statistics for all column of the table TEST:
TRAINING.ADMIN(ADMIN)=>GENERATE STATISTICS ON TEST; Generate statistics for all the tables in database TRAINING:
Netezza GENERATE STATISTICS Best Practices
Following are some of best practices when you are working on generate statistics:
- It is very important to collect full statistics by running Netezza GENERATE STATISTICS command when table changes significantly, say at least 10%
- You should always collect statistics when perform nzload, insert, update, delete or truncate operations
- Generate statistics for columns which are often used. For examples, columns used in JOIN conditions, WHERE clause, GROUP BY clause and ORDER BY clause
- You should GENERATE STATISTICS on all TEMP tables those are explicitly used in JOIN
- You should not abort the GENERATE STATISTICS command as this will disable the zone maps and rebuilt it. Any interruption leaves the zone maps disabled and this of course impact the performance.
JUST IN TIME (JIT) Statistics and Usage
Netezza automatically gathers statistics on the table with actual restrictions using sampler scan functionality, Just-In-Time (JIT) for planning. JIT Stats is automatically run on all qualified tables, such as:
- The table contains more than 5 Million rows
- Query contains at least one column restriction
- Restrictions do not contain sub-query or sub-plan expressions
- Must participate in a JOIN or must have a MATERIALIZED VIEW associated with them
- Must be a database table on disk No system tables, EXTERNAL TABLES or virtual tables
- JIT Dispersion values calculated on queried columns of tables > 500 Million rows
- JIT stats does not eliminate the need to run GENERATE STATISTICS