Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(#70): disable all non-custom Postgres metrics #79

Merged
merged 1 commit into from
Jul 28, 2023

Conversation

jkuester
Copy link
Collaborator

Closes #70

The goal of this PR is to minimize the number of metrics collected from Postgres. This will address 2 problems at the same time! First, the random log errors and other various querying issues when running against crazy old Postgres versions are being triggered when collecting the various default metrics that the postgres_exporter collects about the Postgres instance and its databases. Turning off these metrics should avoid the errors and quiet the logs. Secondly, when testing this against the prod Postgres instance, we observed that metric scraping was insanely slow (well above 10sec). Radically reducing the amount of metrics we are actually collecting from Postgres should help improve the scrape time!

Note that none of the metrics that are being removed were being used in any dashboards or alerts. They were all simply monitoring data that was focused on the Postgres instance itself (and not directly connected to monitoring the health of the CHT instance). Some of these metrics might be valuable to track in the future (to help understand the health of the Postgres instance), but they can be re-enabled when they are needed. Currently, when running against the fake instance, the postgres_exporter scrape returns 1153 lines of data in its response to each scrape. After these changes, the scrape response is only 20 lines (containing just our custom metrics and a few metrics directly connected to the performance of the exporter itself).

Regarding the config changes, it may look redundant to disable all the collectors individually in addition to setting --disable-default-metrics and --disable-settings-metrics. The main thing I can say here is that I tested various combinations and this was the way that ultimately resulted in the least amount of metrics being collected. Based on this comment I believe the postgres_exporter is being gradually being re-factored to use these "collectors" so that going forwards the disable-*-metrics flags will be removed, but for now the disable-*-metrics flags seem to disable the metrics that have not yet been refactored, while the no-collector flags disable to new-style metrics. 🤷

Testing considerations

The main thing to verify after these changes is that our custom couch2pg_progress_sequence metric is still being collected. This can be done by running Watchdog with the fake-cht configuration:

  • Set the development patches
  • Configure and deploy Watchdog with the fake-cht server
  • Open Grafana (http://localhost:3000/) and navigate to the CHT Admin Overview dashboard
  • Verify that data is displayed in the Couch2Pg Backlog panel (the value is dynamic, but it should not be No data)

Additionally, if you want to manually check the response data from the postgres_exporter (e.g. to do a before/after comparison) here is how:

  • Set the development patches
  • Configure and deploy Watchdog with the fake-cht server
  • Connect to the fake-cht docker container (just a convenient place with access to the docker network): docker exec -it cht-watchdog-fake-cht-1 bash
  • Run this curl command to simulate a scrape: curl "http://postgres-exporter:9187/probe?auth_module=postgres%3A5432%2Fcht&target=postgresql%3A%2F%2Fpostgres%3A5432%2Fcht"

@jkuester jkuester requested a review from m5r July 21, 2023 19:51
Copy link
Member

@m5r m5r left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The postgres-exporter container kept exiting with this log:

postgres_exporter: error: unknown long flag '--no-collector.postmaster', try --help

I must have had an old version of the postgres-exporter image because everything worked as expected after I ran docker pull prometheuscommunity/postgres-exporter:latest

Let's keep an eye on logs after this gets merged and deployed because I have a feeling we might encounter this on the server as well 😅

@lorerod
Copy link

lorerod commented Jul 26, 2023

@jkuester Just a couple of thoughts about this PR:

  • It's important to document the minimum versions of Postgres and Postgres-exporter required for use with cht-watchdog.
  • Since the removed metrics were not in use, there is no need to add a test to verify random log errors and other query issues when running against an outdated version of Postgres.
  • If we keep experiencing unexpected errors also while using the minimum version of Postgres or Postgres-exporter, would it be a good idea to test the suite or a part of it with those versions?
  • Thought for the future: performance testing for cht-watchdog?

@lorerod
Copy link

lorerod commented Jul 26, 2023

Run this curl command to simulate a scrape: curl "http://postgres-exporter:9187/probe?auth_module=postgres%3A5432%2Fcht&target=postgresql%3A%2F%2Fpostgres%3A5432%2Fcht"

The difference is big thank you for this improvement.

@jkuester
Copy link
Collaborator Author

Thanks for the feedback @lorerod!

It's important to document the minimum versions of Postgres and Postgres-exporter required for use with cht-watchdog.

Good call! I have logged medic/cht-docs#1143 to add documentation about the supported Postgres version. For the postgres-exporter things are a bit more tricky since we currently are not really tracking the version for the exporter (but just always testing against latest). I have logged #80 since, as you have noted, this may be a problem in the long-term.

If we keep experiencing unexpected errors also while using the minimum version of Postgres or Postgres-exporter, would it be a good idea to test the suite or a part of it with those versions?

This is a good point. Currently we do not have any automated tests that verify the postgres-exporter functionality. However, if/when those get added it would be valuable to run those tests against both the min version of Postgres and the latest.

For now, it is possible to manually test connecting Watchdog to the min version of Postgres by following these steps: #59 (review)

Thought for the future: performance testing for cht-watchdog?

Also a good thought (though probably not as high on the priority list yet). The good news is that the Allies Watchdog instance is configured to run against a large number of CHT instances as well as a large Postgres dataset. So, it provides a good manual check (it is how we identified this issue to begin with...)!

@jkuester jkuester changed the title feat(#70): disable all non-custom Postgres metrics fix(#70): disable all non-custom Postgres metrics Jul 28, 2023
@jkuester jkuester merged commit 2dcaea6 into main Jul 28, 2023
3 checks passed
@jkuester jkuester deleted the 70_disable_unnecessary_pg_metrics branch July 28, 2023 14:31
@medic-ci
Copy link
Collaborator

🎉 This PR is included in version 1.8.2 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

postgres/prometheus couch2pg errors: 'converting NULL to int64 is unsupported' and more
4 participants