
feat: add cht-sync collector and update the sql exporter example #111

Merged
merged 6 commits into main from add-cht-sync-collector on Oct 18, 2024

Conversation

njuguna-n
Contributor

Add a collector for CHT Sync

@njuguna-n
Contributor Author

@mrjones-plip please have another look now that the docs PR is merged.

@mrjones-plip
Contributor

I went to test this and started by setting up the all-inclusive CHT Sync, and hit a snag.

However, now I realize I don't have a good way of putting data into that standalone CouchDB instance, so maybe I'll try using an existing docker helper instance to test.

stay tuned!

@mrjones-plip (Contributor) left a comment

While I totally think this should work, I can't get it to work following the new docs we just published :/

I keep on getting the error in the cht-watchdog-sql_exporter-1 service:

ts=2024-09-05T21:14:26.875Z caller=klog.go:157 level=error func=Fatalf msg="Error creating exporter: unknown collector \"cht-sync\" referenced in job \"db_targets\""

Checking, this looks right, as the collectors entry here matches where we define the collector.

renaming the collector file to exporters/postgres/cht-sync_collector.yml didn't help, and changing everything to either cht_sync or chtsync didn't fix it either 🤷

Happy to pair with you if you wanna see what I see!

@njuguna-n
Contributor Author

@mrjones-plip I needed to update collector_files as well. Give it another try now.
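
For context, sql_exporter only loads collector definitions from files matching a collector_files glob, and each job then references them by collector_name, so both pieces have to line up. A rough sketch (the file name, glob, and placeholder query below are illustrative, not necessarily the exact paths in this repo):

    # cht-sync collector file, e.g. exporters/postgres/cht-sync-collector.yml (name illustrative)
    collector_name: cht-sync
    metrics:
      - metric_name: couch2pg_progress_sequence
        type: counter
        help: 'current sequence number for couch2pg'
        values: [seq]
        query: |
          SELECT 1 AS seq  -- placeholder query for this sketch

    # sql_exporter.yml: the glob must match the collector file above, and the
    # job must reference the collector by the same name
    collector_files:
      - 'exporters/postgres/*collector.yml'  # illustrative glob
    jobs:
      - job_name: db_targets
        collectors: [cht-sync]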

@@ -29,4 +29,3 @@ jobs:
# sql server (eg "postgres-rds-prod", "postgres-rds-dev1" etc.)
# 'postgres://USERNAME:PASSWORD@DB_SERVER_IP/DATABASE:PORT
"db1": 'postgres://cht_couch2pg:[email protected]:5432/cht?sslmode=disable' # //NOSONAR - password is safe to commit
@mrjones-plip (Contributor) commented:

Is this ever going to be the cht database? I think, for the near term, it's always going to be data, yeah? I got this error initially:

ts=2024-10-01T22:28:33.507Z caller=klog.go:134 level=error func=Errorf msg="Error gathering metrics: [from Gatherer #1] [job=db_targets,target=db1] pq: database \"cht\" does not exist"

So I think this is the fix. Also, since we default to postgres as the user, I think we should use that here too, yeah? Finally, I think we can further demote couch2pg by just using password:

Suggested change
"db1": 'postgres://cht_couch2pg:cht_couch2pg_password@172.17.0.1:5432/cht?sslmode=disable' # //NOSONAR - password is safe to commit
"db1": 'postgres://postgres:password@172.17.0.1:5432/code?sslmode=disable' # //NOSONAR - password is safe to commit

@mrjones-plip (Contributor) left a comment

thanks for circling back on this!

I gave a suggestion to ease how much the user has to work (what is the DB name? cht? data? other?), but I am also getting this error now:

ts=2024-10-01T22:29:22.502Z caller=klog.go:134 level=error func=Errorf msg="Error gathering metrics: [from Gatherer #1] [job=db_targets,target=db1,collector=cht-sync,query=couch2pg-query] pq: syntax error at or near \"{\""

Steps to reproduce were:

  1. set up docker helper cht core instance
  2. set up bare bones watchdog to monitor that instance
  3. set up cht sync to monitor docker helper instance
  4. follow the steps to update watchdog to monitor CHT sync
  5. edit sql_servers.yml to have this target:
    "db1": 'postgres://postgres:[email protected]:5432/data?sslmode=disable'
    
  6. run compose command: docker compose -f docker-compose.yml -f exporters/postgres/compose.yml up -d
  7. check logs with docker logs cht-watchdog-sql_exporter-1

@njuguna-n
Contributor Author

@mrjones-plip the error is being caused by how we define the queries with environment variables. I tried updating them to something like ${POSTGRES_SCHEMA} and then using the envsubst command to replace the variables, but updating the run command to command: ["/bin/sh", "-c", "envsubst < /etc/sql_exporter/cht_sync_collector.yml > /tmp/sql_exporter/cht_sync_collector.yml && sql_exporter --config /etc/sql_exporter/cht_sync_collector.yml"] (and different variations of it) kept giving me errors.

A simple solution is to hard-code the default values as they are defined in CHT Sync and update the documentation to note that they should be changed if they differ. What do you think of this approach?
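
Concretely, hard-coding would mean replacing the templated schema reference in the collector queries with the CHT Sync default, roughly like this (the placeholder syntax and table name are illustrative; the v1 schema matches what is used later in this PR):

    -- before: the schema placeholder is never expanded, so Postgres sees the
    -- literal braces and fails with: syntax error at or near "{"
    SELECT MAX(saved_timestamp) AS latest FROM {{POSTGRES_SCHEMA}}.document_metadata;

    -- after: hard-coded CHT Sync default, with a docs note to change it
    -- if the deployment uses a different schema
    SELECT MAX(saved_timestamp) AS latest FROM v1.document_metadata;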

@mrjones-plip
Contributor

Sorry this took so long to get back to! I didn't even read it and thought I'd need to set up my whole test suite, when it was just a question 😭. I'll do better next time!

So I see that we have hard-coded variables (or simply no database/schema names?) in the dbt-latency query. As well, the old couch2pg query doesn't specify a schema. Can we just not specify it and have it implicitly use the right one based off the target definition? I suspect my naïveté around postgres specifics is showing here 😅

If we must hard-code a deeply linked v1 schema in the sql_servers_example.yml file, ultimately I think this is fine. I suspect this won't change often (less than once a year), and we can handle migrations from one schema to another as needed.

@njuguna-n
Contributor Author

Can we just not specify it and have it implicitly use the right one based off the target definition?

Failing to specify the schema would default the queries to running in the public schema, which would almost certainly be wrong. I would rather keep the hard-coded schema but make sure the docs say it should be changed if it is different.
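
To illustrate why, a minimal sketch (the schema and table names follow those used later in this PR):

    -- Unqualified: resolved via search_path, which defaults to "$user", public,
    -- so this would look for public.couchdb_progress and miss the CHT Sync data.
    SELECT MAX(updated_at) FROM couchdb_progress;

    -- Schema-qualified: always targets the CHT Sync schema.
    SELECT MAX(updated_at) FROM v1.couchdb_progress;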

@njuguna-n
Contributor Author

@mrjones-plip I have added the hard-coded schema and a comment stating the values should be updated as needed. Please review.

@mrjones-plip
Contributor

sounds good! I'll get to this review today for sure.

@mrjones-plip
Contributor

oh no! I might not get to this 'til tomorrow... I'll keep ya posted.

@mrjones-plip (Contributor) left a comment

OK!!! We're much closer. Thanks for pushing through on this!

I'm still seeing some errors though; see my deep dive in the comment below.

EXTRACT(EPOCH FROM(couchdb.latest - dbt_root.latest)) AS dbt_latency
FROM
(SELECT MAX(saved_timestamp) as latest FROM v1.document_metadata) dbt_root,
(SELECT MAX(saved_timestamp) as latest FROM v1.couchdb) couchdb
@mrjones-plip (Contributor) commented:

I'm getting an error:

ts=2024-10-16T22:22:52.357Z caller=klog.go:134 level=error func=Errorf msg="Error gathering metrics: [from Gatherer #1] [job=db_targets,target=db1,collector=cht-sync,query=dbt-latency] pq: relation \"v1.couchdb\" does not exist"
ts=2024-10-16T22:22:53.357Z caller=klog.go:134 level=error func=Errorf msg="Error gathering metrics: [from Gatherer #1] [job=db_targets,target=db1,collector=cht-sync,query=dbt-latency] pq: relation \"v1.couchdb\" does not exist"
ts=2024-10-16T22:22:54.356Z caller=klog.go:134 level=error func=Errorf msg="Error gathering metrics: [from Gatherer #1] [job=db_targets,target=db1,collector=cht-sync,query=dbt-latency] pq: relation \"v1.couchdb\" does not exist"

Doing some digging into the actual data in my DB, this might work?

Suggested change
(SELECT MAX(saved_timestamp) as latest FROM v1.couchdb) couchdb
(SELECT max(v1.couchdb_progress.updated_at) as latest FROM v1.couchdb_progress) couchdb
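
With that suggestion applied, the full dbt-latency query would read roughly:

    SELECT
      EXTRACT(EPOCH FROM (couchdb.latest - dbt_root.latest)) AS dbt_latency
    FROM
      (SELECT MAX(saved_timestamp) AS latest FROM v1.document_metadata) dbt_root,
      (SELECT MAX(v1.couchdb_progress.updated_at) AS latest FROM v1.couchdb_progress) couchdb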

Checking Prometheus targets at http://localhost:9090/targets?search= I see http://sql_exporter:9399/metrics marked as UP, and checking that URL I see the metrics I'd expect:

# HELP couch2pg_progress_pending approximate number of changes left to sync from couch to postgres
# TYPE couch2pg_progress_pending gauge
couch2pg_progress_pending{cht_instance="192-168-68-26.local-ip.medicmobile.org",db="medic",job="db_targets",target="db1"} 0
# HELP couch2pg_progress_sequence current sequence number for couch2pg
# TYPE couch2pg_progress_sequence counter
couch2pg_progress_sequence{cht_instance="192-168-68-26.local-ip.medicmobile.org",db="medic",job="db_targets",target="db1"} 221
# HELP couch2pg_up 1 if couch2pg is running and has updated in the last minute, 0 if not
# TYPE couch2pg_up gauge
couch2pg_up{cht_instance="192-168-68-26.local-ip.medicmobile.org",db="medic",job="db_targets",target="db1"} 1
# HELP dbt_execution_time dbt run last execution time (ms)
# TYPE dbt_execution_time gauge
dbt_execution_time{job="db_targets",table_name="contact",target="db1"} 0.09859967231750488
dbt_execution_time{job="db_targets",table_name="contact_type",target="db1"} 0.07459473609924316
dbt_execution_time{job="db_targets",table_name="data_record",target="db1"} 0.10951995849609375
dbt_execution_time{job="db_targets",table_name="dbt_results",target="db1"} 0.23592472076416016
dbt_execution_time{job="db_targets",table_name="document_metadata",target="db1"} 0.2093191146850586
dbt_execution_time{job="db_targets",table_name="patient",target="db1"} 0.0865786075592041
dbt_execution_time{job="db_targets",table_name="person",target="db1"} 0.10456228256225586
dbt_execution_time{job="db_targets",table_name="place",target="db1"} 0.16501379013061523
dbt_execution_time{job="db_targets",table_name="user",target="db1"} 0.0803828239440918
# HELP dbt_latency difference between last timestamp in dbt models and current time (seconds)
# TYPE dbt_latency gauge
dbt_latency{job="db_targets",target="db1"} 1180.199879
# HELP scrape_duration_seconds How long it took to scrape the target in seconds
# TYPE scrape_duration_seconds gauge
scrape_duration_seconds{job="db_targets",target="db1"} 0.000738378
# HELP up 1 if the target is reachable, or 0 if the scrape failed
# TYPE up gauge
up{job="db_targets",target="db1"} 1

that all said, my couch2pg backlog panel in Grafana shows no data:

[screenshot: Grafana couch2pg backlog panel showing no data]
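
For reference, the backlog panel presumably graphs the pending metric from the scrape output above; a minimal PromQL sketch (the exact panel query is an assumption):

    # assumed panel query; metric name taken from the /metrics output above
    couch2pg_progress_pending{job="db_targets", target="db1"}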

@njuguna-n (Contributor Author) commented:

@mrjones-plip I suspect the first error you got is because your POSTGRES_TABLE env variable was set to medic, but that was updated to couchdb in this commit. I have managed to replicate the No data issue, but I don't know why, since the SQL query works as expected. I have booked a 30-minute session in your calendar to go through this together and hopefully get this over the line.
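
In other words, the CHT Sync deployment needs the updated value, roughly (the file name is illustrative):

    # cht-sync .env (file name illustrative)
    POSTGRES_TABLE=couchdb   # previously set to "medic"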

@mrjones-plip (Contributor) left a comment

Thanks, thanks, THANKS for the big push to see this through! Really good work here.

We got it all working and added the JSON to the PR.

Approved!

@njuguna-n
Contributor Author

I really appreciate the time you took to review this on a call. Your attention to detail and insistence on making sure it works as expected led us to uncover a few issues that we can fix to make this better.

@njuguna-n njuguna-n merged commit b31f6b0 into main Oct 18, 2024
4 checks passed
@njuguna-n njuguna-n deleted the add-cht-sync-collector branch October 18, 2024 08:17
@medic-ci
Collaborator

🎉 This PR is included in version 1.17.0 🎉

The release is available on GitHub Releases.

Your semantic-release bot 📦🚀
