fix(#70): disable all non-custom Postgres metrics #79

jkuester · 2023-07-21T19:51:41Z

Closes #70

The goal of this PR is to minimize the number of metrics collected from Postgres. This will address 2 problems at the same time! First, the random log errors and other querying issues seen when running against very old Postgres versions are triggered while collecting the default metrics that the postgres_exporter gathers about the Postgres instance and its databases. Turning off these metrics should avoid the errors and quiet the logs. Second, when testing this against the prod Postgres instance, we observed that metric scraping was extremely slow (well above 10 seconds). Radically reducing the number of metrics we actually collect from Postgres should help improve the scrape time!

Note that none of the metrics being removed were used in any dashboards or alerts. They were all monitoring data focused on the Postgres instance itself (and not directly connected to monitoring the health of the CHT instance). Some of these metrics might be valuable to track in the future (to help understand the health of the Postgres instance), but they can be re-enabled when they are needed. Currently, when running against the fake instance, the postgres_exporter returns 1153 lines of data in its response to each scrape. After these changes, the scrape response is only 20 lines (containing just our custom metrics and a few metrics directly connected to the performance of the exporter itself).

Regarding the config changes, it may look redundant to disable all the collectors individually in addition to setting --disable-default-metrics and --disable-settings-metrics. The main thing I can say here is that I tested various combinations and this was the one that ultimately resulted in the fewest metrics being collected. Based on this comment, I believe the postgres_exporter is gradually being refactored to use these "collectors", so that going forward the disable-*-metrics flags will be removed. For now, the disable-*-metrics flags seem to disable the metrics that have not yet been refactored, while the no-collector flags disable the new-style metrics. 🤷

Testing considerations

The main thing to verify after these changes is that our custom couch2pg_progress_sequence metric is still being collected. This can be done by running Watchdog with the fake-cht configuration:

Set the development patches
Configure and deploy Watchdog with the fake-cht server
Open Grafana (http://localhost:3000/) and navigate to the CHT Admin Overview dashboard
Verify that data is displayed in the Couch2Pg Backlog panel (the value is dynamic, but it should not be No data)
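
As a complement to the dashboard check above, the presence of the custom metric can also be confirmed directly in the text of a scrape response. The following is a minimal sketch, assuming the response uses the standard Prometheus text exposition format and that the metric is exported under exactly the name couch2pg_progress_sequence. The function name and the sample body (including its label) are hypothetical and only for illustration.

```python
def metric_names(body: str) -> set:
    """Collect metric names from the sample lines of a Prometheus text-format body."""
    names = set()
    for line in body.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line looks like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(None, 1)[0]
        names.add(name)
    return names


# Hypothetical scrape body; real label names and values will differ.
SAMPLE_BODY = """\
# HELP couch2pg_progress_sequence Illustrative help text
# TYPE couch2pg_progress_sequence gauge
couch2pg_progress_sequence{source="example"} 1234
pg_up 1
"""

if __name__ == "__main__":
    found = "couch2pg_progress_sequence" in metric_names(SAMPLE_BODY)
    print("custom metric present:", found)
```

If the name is missing from the returned set, the Couch2Pg Backlog panel would be expected to show No data as well.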

Additionally, if you want to manually check the response data from the postgres_exporter (e.g. to do a before/after comparison) here is how:

Set the development patches
Configure and deploy Watchdog with the fake-cht server
Connect to the fake-cht docker container (just a convenient place with access to the docker network): docker exec -it cht-watchdog-fake-cht-1 bash
Run this curl command to simulate a scrape: curl "http://postgres-exporter:9187/probe?auth_module=postgres%3A5432%2Fcht&target=postgresql%3A%2F%2Fpostgres%3A5432%2Fcht"
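
For a before/after comparison, the same probe URL can be assembled programmatically and the response lines counted. This is a rough sketch, not part of the PR: the host, port, and parameter values are copied from the curl command above, while the function names are made up for illustration. It is also an assumption whether the 1153 and 20 line figures include the "#" comment lines, so the sketch reports both counts.

```python
from urllib.parse import urlencode

# Host and port taken from the curl command in this description.
EXPORTER_BASE = "http://postgres-exporter:9187/probe"


def build_probe_url(auth_module: str, target: str) -> str:
    """Return the /probe URL with both query parameters percent-encoded."""
    query = urlencode({"auth_module": auth_module, "target": target})
    return f"{EXPORTER_BASE}?{query}"


def count_lines(body: str) -> dict:
    """Count non-blank, comment ("#"), and sample lines in a scrape body."""
    lines = [line for line in body.splitlines() if line.strip()]
    comments = sum(1 for line in lines if line.startswith("#"))
    return {
        "total": len(lines),
        "comments": comments,
        "samples": len(lines) - comments,
    }


if __name__ == "__main__":
    print(build_probe_url("postgres:5432/cht", "postgresql://postgres:5432/cht"))
```

Saving the curl output to a file before and after the change and passing each file's contents to count_lines gives two dictionaries that can be compared directly.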

m5r

The postgres-exporter container kept exiting with this log:

postgres_exporter: error: unknown long flag '--no-collector.postmaster', try --help

I must have had an old version of the postgres-exporter image because everything worked as expected after I ran docker pull prometheuscommunity/postgres-exporter:latest

Let's keep an eye on logs after this gets merged and deployed because I have a feeling we might encounter this on the server as well 😅

lorerod · 2023-07-26T20:57:39Z

@jkuester Just a couple of thoughts about this PR:

It's important to document the minimum versions of Postgres and Postgres-exporter required for use with cht-watchdog.
Since the removed metrics were not in use, there is no need to add a test to verify random log errors and other query issues when running against an outdated version of Postgres.
If we keep experiencing unexpected errors even while using the minimum version of Postgres or Postgres-exporter, would it be a good idea to test the suite, or a part of it, with those versions?
Thought for the future: performance testing for cht-watchdog?

lorerod · 2023-07-26T21:12:57Z

Run this curl command to simulate a scrape: curl "http://postgres-exporter:9187/probe?auth_module=postgres%3A5432%2Fcht&target=postgresql%3A%2F%2Fpostgres%3A5432%2Fcht"

The difference is big. Thank you for this improvement!

jkuester · 2023-07-27T16:38:36Z

Thanks for the feedback @lorerod!

It's important to document the minimum versions of Postgres and Postgres-exporter required for use with cht-watchdog.

Good call! I have logged medic/cht-docs#1143 to add documentation about the supported Postgres version. For the postgres-exporter, things are a bit trickier since we are not currently tracking the exporter version (we just always test against latest). I have logged #80 since, as you noted, this may be a problem in the long term.

If we keep experiencing unexpected errors even while using the minimum version of Postgres or Postgres-exporter, would it be a good idea to test the suite, or a part of it, with those versions?

This is a good point. Currently we do not have any automated tests that verify the postgres-exporter functionality. However, if/when those get added, it would be valuable to run them against both the minimum supported version of Postgres and the latest.

For now, it is possible to manually test connecting Watchdog to the min version of Postgres by following these steps: #59 (review)

Thought for the future: performance testing for cht-watchdog?

Also a good thought (though probably not as high on the priority list yet). The good news is that the Allies Watchdog instance is configured to run against a large number of CHT instances as well as a large Postgres dataset. So, it provides a good manual check (it is how we identified this issue to begin with...)!

medic-ci · 2023-07-28T14:33:58Z

🎉 This PR is included in version 1.8.2 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

feat: disable all non-custom postgres metrics

21d8546

jkuester requested a review from m5r July 21, 2023 19:51

m5r approved these changes Jul 24, 2023

jkuester mentioned this pull request Jul 27, 2023

Update Watchdog docs to include min version of Postgres medic/cht-docs#1143

Merged

jkuester changed the title ~~feat(#70): disable all non-custom Postgres metrics~~ fix(#70): disable all non-custom Postgres metrics Jul 28, 2023

jkuester merged commit 2dcaea6 into main Jul 28, 2023
3 checks passed

jkuester deleted the 70_disable_unnecessary_pg_metrics branch July 28, 2023 14:31

medic-ci added the released label Jul 28, 2023
