-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Workaround for false positive Rawls healthcheck issue (SCP-5916) #2198
Conversation
set rawls_ok to always be true
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## development #2198 +/- ##
===============================================
- Coverage 70.15% 69.68% -0.48%
===============================================
Files 332 332
Lines 28240 28240
Branches 2377 2377
===============================================
- Hits 19812 19679 -133
- Misses 8281 8414 +133
Partials 147 147
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good! My local manual tests passed.
I recommend a correction in some code comments, nothing blocking.
app/controllers/site_controller.rb
Outdated
@@ -666,7 +666,10 @@ def set_firecloud_permissions(study_detached) | |||
if api_status.is_a?(Hash) | |||
system_status = api_status['systems'] | |||
sam_ok = system_status.dig(FireCloudClient::SAM_SERVICE, 'ok') == true # do equality check in case 'ok' node isn't present | |||
rawls_ok = system_status.dig(FireCloudClient::RAWLS_SERVICE, 'ok') == true | |||
# 2025-01-31 quick and dirty workaround for false negative Rawls availability issue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current phrasing is subtly but importantly incorrect. For reference:
- False positive: non-problem reported as problem
- False negative: problem reported as non-problem
- True positive: problem reported as problem
- True negative: non-problem reported as non-problem
The issue is that Rawls is effectively OK but is reported as not-OK, so the status report is a false positive.
# 2025-01-31 quick and dirty workaround for false negative Rawls availability issue | |
# 2025-01-31 quick and dirty workaround for false positive Rawls availability issue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @eweitz. I always think this through the wrong way - ie. Rawls was reporting it was "not OK" (ie. a negative that was incorrect)and so I end up using the wrong term. Updated in 1cbcc606f
.
@@ -1148,7 +1148,10 @@ def check_edit_permissions | |||
|
|||
# check on FireCloud API status and respond accordingly | |||
def check_firecloud_status | |||
unless ApplicationController.firecloud_client.services_available?(FireCloudClient::SAM_SERVICE, FireCloudClient::RAWLS_SERVICE) | |||
# 2025-01-31 quick and dirty workaround for false negative Rawls availability issue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# 2025-01-31 quick and dirty workaround for false negative Rawls availability issue | |
# 2025-01-31 quick and dirty workaround for false positive Rawls availability issue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in 1cbcc606f
.
#dsp-core-services updated Rawls to stop checking GoogleGenomics which resolves our issue so we don't need to consider this workaround anymore. |
Issue: false positives checking on Rawls health
Starting 2025-01-31, Rawls health from https://api.firecloud.org/status was not "ok". Discussion with #dsp-core-services confirmed that the GoogleGenomics service is irrelevant now.
Current output from
/status
endpoint:{"ok":false,"systems":{"Thurloe":{"ok":true},"Sam":{"ok":true},"Rawls":{"messages":["BillingProfileManager: CloudSQL: (ok: true, message: none), Sam: (ok: true, message: none)","GoogleGenomics: Timed out after 1 minute waiting for a response from GoogleGenomics"],"ok":false},"Agora":{"ok":true},"GoogleBuckets":{"ok":true},"LibraryIndex":{"ok":true},"OntologyIndex":{"ok":true}}}
External mitigation - Rawls to remove GoogleGenomics check
#dsp-core-services is removing that check from Rawls but recommend not using the /status endpoint to check for Rawls health but rather to look for
500
errors from actual requests. (see also SCP-5647)Short-term mitigation for SCP - this PR
IF it takes a while for the Rawls change to take effect, we can use this PR to bypass rawls health checks until the change from #dsp-core-services is in production. If the status endpoint has updated (ie. you can access study settings tab in production), this PR is no longer needed.
MANUAL TESTING