Considerations for better performance:
Bad distribution cause the data skew (more data stored in certain data slices while less data is stored in other data slices). That affects the query in hand and others as the data slice has more work to do. Consider distributing on random if you don’t know the right distribution key.Data types
It is important to select a data type that will minimize disk storage requirements and minimize scan time. The right choosing right data type even reduce the disk I/O. • Use Char(x) instead of Varchar(x) when you expect the data to be a fixed length as this not only helps to save disk space but also helps performance due to reduced I/O (Varchar(x) uses additional storage which will be significant when dealing with TB of data and also impacts the query processing since additional data need to be pulled in from disk for processing).
• Where possible, use the NOT NULL constraint for columns in the tables, especially for columns that are used in where clauses (Restriction) or join conditions. This will help improve performance by not having to check for null condition by the appliance and will reduce storage usage.
The nzsql GENERATE STATISTICS command generates statistics about each table column’s proportion of stats, including duplicate values, maximum value, minimum value, null values, dispersion values and updates the system catalog table. The optimizer relies on GENERATE STATISTICS to gather statistics about the tables in the database to determine the most efficient way to execute a query. It is important that the optimizer always has high quality statistics to choose the best execution plan from all possible plans.
A zone map is an internal mapping structure to show the range (min and max) of values within each page. During scans, zone maps reduces I/O by skipping pages that did not qualify the query parameters. Zone maps are internal to the system thus no administration involved.
Clustered base tables
A Netezza clustered base tables (CBT) are user table that has data which is organized using one to four organizing keys columns. You can specify max four columns in organize on clause and those columns should not be a part of distribute on clause.
Groom table command
Groom tables that receive frequent updates or deletes or if a load or insert is aborted as this will result in deleted rows. You can groom at a record level to remove all deleted records regardless of their location. This will give you the best space gains but take longer.
The following are some of the common reasons for performance issues
- Appropriate distribution keys – may result in certain data slice storing more data resulting in data skew. Performance of a query depends on the performance of the slice storing the most data for the tables involved in a query.
- If you have distributed the data uniformly, process skew could still happen when not taking into account processing patterns. For e.g. data is distributed by month but if the process looks for a month of data, then the performance of the query will be degraded since the processing needs to be handled by a subset of SPUs and not all of the SPUs in the system.
- Performance gets impacted if a large volume of data (fact table) gets re-distributed or broadcast during query execution.
- Considerations to have numeric datatypes for join columns and where clause columns wherever possible so that the Zone maps can be used. Generating the zone maps for the ones that automatically create Zone maps – char, varchar etc.
- Table data that isn’t organized optimally for multi table joins as in the case of multi- dimensional joins performed in a data warehouse environment.
Steps to do query performance analysis
- Identify long running queries o Identify if queries are being queued and the long running queries which is causing the queries to be queued through NZAdmin tool or nzsession command. o Long running queries can also be identified if query history is being gathered using appropriate history configuration
- For long running queries generate query plans using the “EXPLAIN” command. Recent query execution plans are also stored in *.pln files in the data.<ver>/plans directory under the /nz/data directory.
- Look for some of the data points and reasons for performance issues details in the previous sections.
- Take necessary actions like generate statistics, change distributions, grooming tables, creation of materialized views, modifying or using organization keys, changing column data types or rewriting query.
- Verify whether the modifications helped with the query plan by regenerating the plan using the explain utility.
- If the analysis is performed in a non-development environment, it is key to make sure that the statistics reflect the values expected or in production.