
Stale metrics reported after shard changes state #22

luke-jarymowycz opened this issue Dec 13, 2018 · 2 comments

@luke-jarymowycz

Here is an example of stale metrics being reported for the old async (9bc94d6d), now the sync, of shard 4 in ap-southeast. The old sync (11f64b89) of shard 4 stopped replicating when its server experienced a CPU fault, and the async was promoted to sync. The old sync was then rebuilt as the async.

Prometheus link to the graph showing the old async (9bc94d6d) metrics being reported:

http://10.81.0.60:9090/graph?g0.range_input=1w&g0.end_input=2018-12-13%2002%3A59&g0.expr=pg_stat_replication_wal_sent_bytes%7Bsync_state%3D%22async%22%2Cbackend%3D~%224.postgres.ap-southeast.scloud.host-.*%22%7D%20-%20pg_stat_replication_replica_wal_replayed_bytes%7Bsync_state%3D%22async%22%2Cbackend%3D~%224.postgres.ap-southeast.scloud.host-.*%22%7D&g0.tab=0
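For readability, the PromQL expression encoded in that URL is:

```
pg_stat_replication_wal_sent_bytes{sync_state="async",backend=~"4.postgres.ap-southeast.scloud.host-.*"}
  - pg_stat_replication_replica_wal_replayed_bytes{sync_state="async",backend=~"4.postgres.ap-southeast.scloud.host-.*"}
```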

sync_state=async is the label used to filter these results.

@KodyKantor (Contributor)

[Screenshot: stale_lag graph]

Above is a picture of the stale metric problem. About two-thirds of the way through the graph, a second node came in, also reporting as the async node.

@KodyKantor (Contributor)

This is a tough problem to solve. In some ways pgstatsmon acts like a Prometheus proxy: it implements some features that Prometheus also implements (service discovery, polling exporters).

One difference is that pgstatsmon always exports all of the metrics it has ever known about. Prometheus doesn't do that. Pre-2.0 versions of Prometheus would 'expire' metrics from dead exporters (ones no longer being scraped) after five minutes. In 2.0+, Prometheus expires metrics immediately once a scrape occurs and no series are retrieved (prometheus/prometheus#398).

Usually the Prometheus folks say that exporters should match the lifetime of the service they are monitoring. This is true for Muskie/Moray/CNAPI/SAPI/etc. metrics since the exporter lives in-memory with the service.

CMON has to handle this scenario as well. IIUC, CMON maintains a cache of the metrics it last received from each exporter it's monitoring, whether that's the cmon-agent in a GZ or something else. If an exporter disappears (a node goes offline and its cmon-agent with it), CMON will serve the cached data for some amount of time and then stop, effectively 'expiring' the exporter.

The question is how pgstatsmon should handle this. Currently pgstatsmon will blindly export stale metrics in perpetuity. Resolving this will probably require modifications to both pgstatsmon and node-artedi. I have a few ideas to investigate.
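As a rough illustration only (not the actual pgstatsmon or node-artedi API; `BackendTracker`, `expiryMs`, `onPoll`, and `filterSeries` are hypothetical names), one possible expiry mechanism would track when each backend was last successfully polled and drop series for backends that haven't been seen within an expiry window, similar to the pre-2.0 Prometheus behavior described above:

```javascript
/*
 * Hypothetical sketch -- none of these names come from pgstatsmon or
 * node-artedi.  Track the last time each backend was successfully polled
 * and only export series whose backend was seen within `expiryMs`.
 */
class BackendTracker {
    constructor(expiryMs) {
        this.expiryMs = expiryMs;
        this.lastSeen = new Map(); /* backend name -> last poll time (ms) */
    }

    /* Record a successful poll of a backend. */
    onPoll(backend) {
        this.lastSeen.set(backend, Date.now());
    }

    /* True if the backend has been polled recently enough to export. */
    isLive(backend) {
        const seen = this.lastSeen.get(backend);
        return (seen !== undefined && (Date.now() - seen) < this.expiryMs);
    }

    /*
     * Filter a list of series objects (each carrying a `labels.backend`
     * property) down to those belonging to live backends.
     */
    filterSeries(series) {
        return (series.filter((s) => this.isLive(s.labels.backend)));
    }
}

module.exports = { BackendTracker };
```

In practice, something along these lines would likely also need support in node-artedi for removing series from a collector, which is why modifications to both components may be needed.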
