diff --git a/_posts/2024/2024-08-16-pg-stats.md b/_posts/2024/2024-08-16-pg-stats.md
index 31f8ef3..96dac62 100644
--- a/_posts/2024/2024-08-16-pg-stats.md
+++ b/_posts/2024/2024-08-16-pg-stats.md
@@ -19,41 +19,29 @@ tags: [paper, db]
 SELECT relpages, reltuples FROM pg_class WHERE relname = 'tenk1';
 ```
 
-#### WHERE unique1 < 1000
-
-* The planner examines the WHERE clause condition and looks up the selectivity function for the operator < in pg_operator. This is held in the column oprrest, and the entry in this case is scalarltsel. The scalarltsel function retrieves the histogram for unique1 from pg_statistic. For manual queries it is more convenient to look in the simpler pg_stats view:
-
+* Basic relation-level statistics are stored in the table pg_class in the system catalog.
+* reltuples: Relation's row count
+* relpages: Relation's size in pages
 
-#### WHERE stringu1 = 'CRAAAA'
+#### WHERE unique1 < 1000
 
-* For equality estimation the histogram is not useful; instead the list of most common values (MCVs) is used to determine the selectivity.
+* The planner examines the WHERE clause condition and looks up the selectivity function for the operator < in pg_operator. This is held in the column oprrest, and the entry in this case is scalarltsel. The scalarltsel function retrieves the `histogram` for unique1 from pg_statistic.
+* For equality estimation the histogram is not useful; instead the list of `most common values (MCVs)` is used to determine the selectivity.
 
 #### WHERE t1.unique1 < 50 AND t1.unique2 = t2.unique2
 
 * The restriction on tenk1, unique1 < 50, is evaluated before the nested-loop join
-* The restriction for the join is t2.unique2 = t1.unique2. The operator is just our familiar =, however the selectivity function is obtained from the oprjoin column of pg_operator, and is eqjoinsel. eqjoinsel looks up the statistical information for both tenk2 and tenk1, e.g., null_frac,n_distinct, most_common_vals
+* The restriction for the join is t2.unique2 = t1.unique2. The operator is just our familiar =, however the selectivity function is obtained from the oprjoin column of pg_operator, and is eqjoinsel. eqjoinsel looks up the statistical information for both tenk2 and tenk1, e.g., `null_frac,n_distinct, most_common_vals`
 
 ```
 selectivity = (1 - null_frac1) * (1 - null_frac2) * min(1/num_distinct1, 1/num_distinct2)
 rows = (outer_cardinality * inner_cardinality) * selectivity
 ```
 
 
+### pg_stats
 
-
-
-
-
-
-
-
-
-
-
-
-### pg_stats and pg_statistics
-
-* Rather than look at pg_statistic directly, it's better to look at its view pg_stats when examining the statistics manually. pg_stats is designed to be more easily readable. Furthermore, pg_stats is readable by all, whereas pg_statistic is only readable by a superuser.
+* pg_stats is designed to be more easily readable and is readable by all, whereas pg_statistic is only readable by a superuser.
 * For a read replica in Amazon RDS for PostgreSQL and for a reader node in Aurora PostgreSQL, these stats are the same as for the primary or writer. This is because they are stored in a relation (pg_statistics) on disk (physical blocks are the same on the replica in Amazon RDS for PostgreSQL and in the case of Aurora, the reader is reading from the same storage). This is also the reason why it isn’t allowed (and also not logical) to run an ANALYZE on a replica or a reader node (both can read from the pg_statistics relation, but can’t update it).
 
 #### ALTER TABLE SET STATISTICS
@@ -80,17 +68,23 @@ rows = (outer_cardinality * inner_cardinality) * selectivity
 ALTER TABLE ... ALTER COLUMN ... SET (n_distinct = ...)
 ```
 
- the only defined per-attribute options are n_distinct and n_distinct_inherited, which override the number-of-distinct-values estimates made by subsequent ANALYZE operations. n_distinct affects the statistics for the table itself, while n_distinct_inherited affects the statistics gathered for the table plus its inheritance children. When set to a positive value, ANALYZE will assume that the column contains exactly the specified number of distinct nonnull values. When set to a negative value, which must be greater than or equal to -1, ANALYZE will assume that the number of distinct nonnull values in the column is linear in the size of the table; the exact count is to be computed by multiplying the estimated table size by the absolute value of the given number. For example, a value of -1 implies that all values in the column are distinct, while a value of -0.5 implies that each value appears twice on the average. This can be useful when the size of the table changes over time, since the multiplication by the number of rows in the table is not performed until query planning time. Specify a value of 0 to revert to estimating the number of distinct values normally.
+ the only defined per-attribute options are n_distinct and n_distinct_inherited, which override the number-of-distinct-values estimates made by subsequent ANALYZE operations, i.e., the n_distinct change will not be in effect until you run ANALYZE again
 
 
-### pg_class
+#### How is pg_stats build
 
-* Basic relation-level statistics are stored in the table pg_class in the system catalog.
-* Relation's row count (reltuples).
-* Relation's size in pages (relpages).
-* Number of pages marked in the relation's visibility map (relallvisible).
-* The value reltuples = −1 (in PostgreSQL 14 and higher) helps us distinguish between a table that has never had statistics collected for it and a table that just doesn't have any rows.
-* relallvisible is used when estimating index-only scan cost
+```sql
+ FROM (((pg_statistic s
+ JOIN pg_class c ON ((c.oid = s.starelid)))
+ JOIN pg_attribute a ON (((c.oid = a.attrelid) AND (a.attnum = s.staattnum))))
+ LEFT JOIN pg_namespace n ON ((n.oid = c.relnamespace)))
+```
+
+* tablename: pg_class.relname
+* attname: from pg_attribute, identified by oid + attnum
+* null_frac: stanullfrac
+* avg_width: stawidth
+* n_distinct: stadistinct
 
 ### CREATE STATISTICS
 