---
layout: post
title: "Notes on pg_stats"
description: ""
category:
tags: [paper, db]
---

* To collect statistics, ANALYZE randomly samples 300 × default_statistics_target rows (the default target is 100, so 30,000 rows in total).
* Entries in pg_statistic are updated by the ANALYZE and VACUUM ANALYZE commands, and they are always approximate, even when freshly updated. After a major engine version upgrade, run ANALYZE to refresh pg_statistic.
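
For instance, a quick way to confirm the sampling target and refresh the statistics (tenk1 is the regression-test table used in the examples below):

```sql
SHOW default_statistics_target;  -- 100 by default, so ANALYZE samples about 30,000 rows
ANALYZE VERBOSE tenk1;           -- reports pages scanned and rows sampled, then updates pg_statistic
```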

### Statistics Used by the Planner

#### SELECT * FROM tenk1

* The planner determines the cardinality of tenk1 from pg_class:

```
SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
```
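
reltuples and relpages are only refreshed by VACUUM, ANALYZE, and a few DDL commands, so the planner scales the stored row count to the table's current physical size; roughly:

```
estimated_rows = reltuples / relpages * current_number_of_pages
```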

#### WHERE unique1 < 1000

* The planner examines the WHERE clause condition and looks up the selectivity function for the operator < in pg_operator. This is held in the column oprrest, and the entry in this case is scalarltsel. The scalarltsel function retrieves the histogram for unique1 from pg_statistic. For manual queries it is more convenient to look in the simpler pg_stats view:
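
Following the row-estimation examples in the PostgreSQL docs, that manual lookup is along these lines:

```sql
SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'tenk1' AND attname = 'unique1';
```

The selectivity of unique1 < 1000 is then derived by locating 1000 inside the histogram and interpolating linearly within its bucket (see the formula sketch in the Histogram section below).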

#### WHERE stringu1 = 'CRAAAA'

* For equality estimation the histogram is not useful; instead, the list of most common values (MCVs) is used to determine the selectivity.
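
A sketch of the corresponding lookup in pg_stats (stringu1 is the column from the docs' tenk1 example):

```sql
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'tenk1' AND attname = 'stringu1';
```

If 'CRAAAA' appears in most_common_vals, its selectivity is simply the matching entry in most_common_freqs; if it does not, the planner spreads the frequency left over after the MCVs and NULLs evenly across the remaining distinct values.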

#### WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2

* The restriction on tenk1, unique1 < 50, is evaluated before the nested-loop join.
* The restriction for the join is t2.unique2 = t1.unique2. The operator is just the familiar =, but the selectivity function is obtained from the oprjoin column of pg_operator and is eqjoinsel. eqjoinsel looks up the statistical information for both tenk2 and tenk1, e.g., null_frac, n_distinct, and most_common_vals:

```
selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
rows = (outer_cardinality * inner_cardinality) * selectivity
```
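
Plugging in the values from the docs' row-estimation example (null_frac = 0 and 10,000 distinct unique2 values on both sides, with 50 outer rows surviving unique1 < 50) gives:

```
selectivity = (1 - 0) * (1 - 0) * min(1/10000, 1/10000) = 0.0001
rows = (50 * 10000) * 0.0001 = 50
```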

### pg_stats and pg_statistic

* Rather than looking at pg_statistic directly, it is better to use its view pg_stats when examining the statistics manually. pg_stats is designed to be more easily readable. Furthermore, pg_stats is readable by all, whereas pg_statistic is only readable by a superuser.
* For a read replica in Amazon RDS for PostgreSQL and for a reader node in Aurora PostgreSQL, these stats are the same as on the primary or writer, because they are stored in a relation (pg_statistic) on disk: in RDS the physical blocks are identical on the replica, and in Aurora the reader reads from the same storage. This is also why running ANALYZE on a replica or a reader node is not allowed (and would not make sense): both can read the pg_statistic relation but cannot update it.
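
A quick side-by-side look (pg_stats is the friendly view; pg_statistic stores the same data packed into generic "slots"):

```sql
-- per-column view, decoded and readable by everyone
SELECT attname, null_frac, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'tenk1';

-- raw catalog, superuser only; statistics kinds live in stakindN/stanumbersN/stavaluesN slots
SELECT starelid::regclass, staattnum, stakind1, stanumbers1
FROM pg_statistic
LIMIT 5;
```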

#### ALTER TABLE SET STATISTICS

* Sets the per-column statistics-gathering target for subsequent ANALYZE operations. The target can be set in the range 0 to 10000; alternatively, set it to -1 to revert to using the system default statistics target (default_statistics_target).
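
For example (table and column names are placeholders):

```sql
ALTER TABLE tenk1 ALTER COLUMN unique1 SET STATISTICS 500;
ANALYZE tenk1;  -- the new per-column target takes effect at the next ANALYZE
```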

#### correlation

* Statistical correlation between the physical row ordering and the logical ordering of the column values. When the value is near -1 or +1, an index scan on the column will be estimated to be cheaper than when it is near 0, because fewer random page reads are needed.
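
The value is exposed directly in pg_stats:

```sql
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'tenk1';
```

A column loaded in sorted physical order shows a value near +1; heavily updated columns tend to drift toward 0 until the table is clustered again.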

#### most_common_vals (MCV) and most_common_freqs

* MCV lists are also used for selectivity estimation of inequalities: to find the selectivity of "column < value", the planner searches most_common_vals for all values lower than the given value and adds together their frequencies from most_common_freqs.
* Common-value statistics work best when the number of distinct values is low. The maximum size of the MCV arrays is defined by default_statistics_target, the same parameter that governs the row sample size during analysis.
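
Conceptually, the two structures are combined like this for a range predicate (a sketch, not the planner's exact code):

```
selectivity(col < v) ≈ sum of most_common_freqs[i] where most_common_vals[i] < v
                      + (1 - null_frac - sum(most_common_freqs)) * fraction_of_histogram_below(v)
```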

#### Histogram

* When the number of distinct values grows too large to store them all in an array, the system switches to a histogram representation. A histogram stores values in a number of buckets, and the number of buckets is limited by the same default_statistics_target parameter.
* Histograms are used together with MCV lists to estimate the selectivity of "greater than" and "less than" operations.
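
A sketch of the interpolation scalarltsel performs inside the bucket that contains the constant (bucket numbering here is illustrative):

```
-- value falls into bucket b (counting from 1) with bounds [low_b, high_b)
selectivity(col < value) ≈ (b - 1 + (value - low_b) / (high_b - low_b)) / num_buckets
```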

#### n_distinct

```sql
ALTER TABLE ... ALTER COLUMN ... SET (n_distinct = ...)
```

* The only defined per-attribute options are n_distinct and n_distinct_inherited, which override the number-of-distinct-values estimates made by subsequent ANALYZE operations. n_distinct affects the statistics for the table itself, while n_distinct_inherited affects the statistics gathered for the table plus its inheritance children.
* When set to a positive value, ANALYZE will assume that the column contains exactly the specified number of distinct nonnull values.
* When set to a negative value, which must be greater than or equal to -1, ANALYZE will assume that the number of distinct nonnull values in the column is linear in the size of the table; the exact count is computed by multiplying the estimated table size by the absolute value of the given number. For example, a value of -1 implies that all values in the column are distinct, while a value of -0.5 implies that each value appears twice on average. This can be useful when the size of the table changes over time, since the multiplication by the number of rows is not performed until query planning time.
* Specify a value of 0 to revert to estimating the number of distinct values normally.
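
A hedged example (orders and customer_id are hypothetical names):

```sql
-- assume each customer_id appears about 20 times regardless of table size,
-- i.e., the number of distinct values is ~5% of the row count
ALTER TABLE orders ALTER COLUMN customer_id SET (n_distinct = -0.05);
ANALYZE orders;
```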

### pg_class

* Basic relation-level statistics are stored in the pg_class system catalog:
  * Relation's row count (reltuples).
  * Relation's size in pages (relpages).
  * Number of pages marked in the relation's visibility map (relallvisible).
* The value reltuples = -1 (in PostgreSQL 14 and higher) distinguishes a table that has never had statistics collected from a table that simply has no rows.
* relallvisible is used when estimating the cost of an index-only scan.
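
A hedged look at the values that feed index-only-scan costing (tenk1 as a placeholder again):

```sql
SELECT relname, reltuples, relpages, relallvisible,
       round(relallvisible::numeric / greatest(relpages, 1), 2) AS all_visible_fraction
FROM pg_class
WHERE relname = 'tenk1';
```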

### CREATE STATISTICS

* Extended statistics metadata is stored in the pg_statistic_ext system catalog, while the statistics data itself is stored in a separate table, pg_statistic_ext_data.
* CREATE STATISTICS creates a new extended statistics object tracking data about the specified table, foreign table, or materialized view. The statistics object is created in the current database and is owned by the user issuing the command.
* The CREATE STATISTICS command has two basic forms. The first form collects univariate statistics for a single expression, providing benefits similar to an expression index without the overhead of index maintenance. This form does not allow the statistics kind to be specified, since the various statistics kinds refer only to multivariate statistics. The second form collects multivariate statistics on multiple columns and/or expressions, optionally specifying which statistics kinds to include. This form also automatically collects univariate statistics on any expressions included in the list.
* If a schema name is given (for example, CREATE STATISTICS myschema.mystat ...), the statistics object is created in the specified schema; otherwise it is created in the current schema. The name of the statistics object must be distinct from the name of any other statistics object in the same schema.
* Extended statistics are not currently used by the planner for selectivity estimations made for table joins. This limitation will likely be removed in a future version of PostgreSQL.
* Creating such an object merely creates a catalog entry expressing interest in the statistics. Actual data collection is performed by ANALYZE (either a manual command or background auto-analyze). The collected values can be examined in the pg_statistic_ext_data catalog.
* ANALYZE computes extended statistics based on the same sample of table rows that it takes for computing regular single-column statistics.
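
A sketch of the first (univariate, expression) form with a hypothetical table:

```sql
-- statistics on an expression, similar to an expression index but without index maintenance
CREATE STATISTICS s_order_month ON (date_trunc('month', created_at)) FROM orders;
ANALYZE orders;  -- actually collects the data; CREATE STATISTICS alone only registers interest
```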

#### Dependencies

* By default, the statistics gathered by ANALYZE are stored on a per-column, per-table basis and therefore cannot capture any knowledge about cross-column correlation. It is common to see slow queries running bad query plans because multiple columns used in the query clauses are correlated. With the CREATE STATISTICS command, however, you can create extended statistics for correlated columns, as sketched below.
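
A sketch of the multivariate form with the dependencies kind (addresses, city, and zip are hypothetical):

```sql
-- city and zip are functionally dependent, so treating them as independent
-- would underestimate WHERE city = ... AND zip = ...
CREATE STATISTICS s_city_zip (dependencies) ON city, zip FROM addresses;
ANALYZE addresses;

-- metadata lands in pg_statistic_ext; the data itself goes to pg_statistic_ext_data
SELECT stxname, stxkind FROM pg_statistic_ext WHERE stxname = 's_city_zip';
```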

### References

* [Understanding statistics in PostgreSQL](https://aws.amazon.com/blogs/database/understanding-statistics-in-postgresql/)
* [Queries in PostgreSQL: 2. Statistics](https://postgrespro.com/blog/pgsql/5969296)
* [CREATE STATISTICS](https://www.postgresql.org/docs/16/sql-createstatistics.html)
* [Statistics Used by the Planner](https://www.postgresql.org/docs/16/planner-stats.html)
* [Row Estimation Examples](https://www.postgresql.org/docs/16/row-estimation-examples.html)