
Stale metrics reported after shard changes state #22

luke-jarymowycz opened this issue Dec 13, 2018 · 2 comments

@luke-jarymowycz

Here is an example of stale metrics being reported for the old async (9bc94d6d), now the sync, of shard 4 in ap-southeast. The old sync (11f64b89) of shard 4 stopped replicating when its server experienced a CPU fault, and the async was promoted to sync. The old sync was then rebuilt as the async.

Prometheus link to the graph showing the old async (9bc94d6d) metrics being reported:

http://10.81.0.60:9090/graph?g0.range_input=1w&g0.end_input=2018-12-13%2002%3A59&g0.expr=pg_stat_replication_wal_sent_bytes%7Bsync_state%3D%22async%22%2Cbackend%3D~%224.postgres.ap-southeast.scloud.host-.*%22%7D%20-%20pg_stat_replication_replica_wal_replayed_bytes%7Bsync_state%3D%22async%22%2Cbackend%3D~%224.postgres.ap-southeast.scloud.host-.*%22%7D&g0.tab=0
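For readability, the PromQL expression encoded in that URL is:

```
pg_stat_replication_wal_sent_bytes{sync_state="async",backend=~"4.postgres.ap-southeast.scloud.host-.*"}
  - pg_stat_replication_replica_wal_replayed_bytes{sync_state="async",backend=~"4.postgres.ap-southeast.scloud.host-.*"}
```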

sync_state=async is the label used to filter these results.

@KodyKantor (Contributor)

[Screenshot: stale_lag graph]

Above is a picture of the stale metric problem. About two-thirds of the way through the graph, a second node came in, also reporting as the async node.

@KodyKantor (Contributor)

This is a tough problem to solve. In some ways pgstatsmon acts like a Prometheus proxy: it implements some features that Prometheus also implements (service discovery, polling exporters).

One difference is that pgstatsmon always exports all of the metrics it has ever known about. Prometheus doesn't do that. Pre-2.0 versions of Prometheus would 'expire' metrics from dead exporters (ones no longer being scraped) after five minutes. In 2.0+, Prometheus expires metrics immediately once a scrape occurs and no series are retrieved (prometheus/prometheus#398).

Usually the Prometheus folks say that exporters should match the lifetime of the service they are monitoring. This is true for Muskie/Moray/CNAPI/SAPI/etc. metrics since the exporter lives in-memory with the service.

CMON has to handle this scenario as well. IIUC, CMON maintains a cache of the metrics it last received from each exporter it's monitoring, whether that's the cmon-agent in a GZ or something else. If an exporter disappears (a node goes offline and its cmon-agent with it), CMON will serve the cached data for some amount of time and then stop, effectively 'expiring' the exporter.

The question is how pgstatsmon should handle this. Currently pgstatsmon will blindly export stale metrics in perpetuity. Resolving this will probably require modifications to both pgstatsmon and node-artedi. I have a few ideas to investigate.
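As a rough illustration only (not the actual pgstatsmon or node-artedi API; `BackendTracker`, `expiryMs`, `onPoll`, and `filterSeries` are hypothetical names), one possible expiry mechanism would track when each backend was last successfully polled and drop series for backends that haven't been seen within an expiry window, similar to the pre-2.0 Prometheus behavior described above:

```javascript
/*
 * Hypothetical sketch -- none of these names come from pgstatsmon or
 * node-artedi.  Track the last time each backend was successfully polled
 * and only export series whose backend was seen within `expiryMs`.
 */
class BackendTracker {
    constructor(expiryMs) {
        this.expiryMs = expiryMs;
        this.lastSeen = new Map(); /* backend name -> last poll time (ms) */
    }

    /* Record a successful poll of a backend. */
    onPoll(backend) {
        this.lastSeen.set(backend, Date.now());
    }

    /* True if the backend has been polled recently enough to export. */
    isLive(backend) {
        const seen = this.lastSeen.get(backend);
        return (seen !== undefined && (Date.now() - seen) < this.expiryMs);
    }

    /*
     * Filter a list of series objects (each carrying a `labels.backend`
     * property) down to those belonging to live backends.
     */
    filterSeries(series) {
        return (series.filter((s) => this.isLive(s.labels.backend)));
    }
}

module.exports = { BackendTracker };
```

In practice, something along these lines would likely also need support in node-artedi for removing series from a collector, which is why modifications to both components may be needed.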
