Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Workaround for false positive Rawls healthcheck issue (SCP-5916) #2198

Closed
wants to merge 4 commits into from

Conversation

jlchang
Copy link
Contributor

@jlchang jlchang commented Jan 31, 2025

Issue: false positives checking on Rawls health

Starting 2025-01-31, Rawls health from https://api.firecloud.org/status was not "ok". Discussion with #dsp-core-services confirmed that the GoogleGenomics service is irrelevant now.
Current output from /status endpoint:
{"ok":false,"systems":{"Thurloe":{"ok":true},"Sam":{"ok":true},"Rawls":{"messages":["BillingProfileManager: CloudSQL: (ok: true, message: none), Sam: (ok: true, message: none)","GoogleGenomics: Timed out after 1 minute waiting for a response from GoogleGenomics"],"ok":false},"Agora":{"ok":true},"GoogleBuckets":{"ok":true},"LibraryIndex":{"ok":true},"OntologyIndex":{"ok":true}}}

External mitigation - Rawls to remove GoogleGenomics check

#dsp-core-services is removing that check from Rawls but recommend not using the /status endpoint to check for Rawls health but rather to look for 500 errors from actual requests. (see also SCP-5647)

Short-term mitigation for SCP - this PR

IF it takes a while for the Rawls change to take effect, we can use this PR to bypass rawls health checks until the change from #dsp-core-services is in production. If the status endpoint has updated (ie. you can access study settings tab in production), this PR is no longer needed.

MANUAL TESTING

  1. Boot all services and sign in
  2. Visit a study where you have edit privileges and confirm the "Settings" tab is active (currently in production, the tab is greyed out and you'll see a tooltip with "Study workspaces are currently unavailable - please try again later".
  3. Click on the "+ Create study" link and confirm you're directed to the study creation form.

@jlchang jlchang requested a review from eweitz January 31, 2025 18:04
Copy link

codecov bot commented Jan 31, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.68%. Comparing base (a29b214) to head (0621811).

Additional details and impacted files

Impacted file tree graph

@@               Coverage Diff               @@
##           development    #2198      +/-   ##
===============================================
- Coverage        70.15%   69.68%   -0.48%     
===============================================
  Files              332      332              
  Lines            28240    28240              
  Branches          2377     2377              
===============================================
- Hits             19812    19679     -133     
- Misses            8281     8414     +133     
  Partials           147      147              
Files with missing lines Coverage Δ
app/controllers/site_controller.rb 40.09% <100.00%> (ø)
app/controllers/studies_controller.rb 24.24% <100.00%> (-0.16%) ⬇️

... and 16 files with indirect coverage changes

Copy link
Member

@eweitz eweitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good! My local manual tests passed.

I recommend a correction in some code comments, nothing blocking.

@@ -666,7 +666,10 @@ def set_firecloud_permissions(study_detached)
if api_status.is_a?(Hash)
system_status = api_status['systems']
sam_ok = system_status.dig(FireCloudClient::SAM_SERVICE, 'ok') == true # do equality check in case 'ok' node isn't present
rawls_ok = system_status.dig(FireCloudClient::RAWLS_SERVICE, 'ok') == true
# 2025-01-31 quick and dirty workaround for false negative Rawls availability issue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current phrasing is subtly but importantly incorrect. For reference:

  • False positive: non-problem reported as problem
  • False negative: problem reported as non-problem
  • True positive: problem reported as problem
  • True negative: non-problem reported as non-problem

The issue is that Rawls is effectively OK but is reported as not-OK, so the status report is a false positive.

Suggested change
# 2025-01-31 quick and dirty workaround for false negative Rawls availability issue
# 2025-01-31 quick and dirty workaround for false positive Rawls availability issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @eweitz. I always think this through the wrong way - ie. Rawls was reporting it was "not OK" (ie. a negative that was incorrect)and so I end up using the wrong term. Updated in 1cbcc606f.

@@ -1148,7 +1148,10 @@ def check_edit_permissions

# check on FireCloud API status and respond accordingly
def check_firecloud_status
unless ApplicationController.firecloud_client.services_available?(FireCloudClient::SAM_SERVICE, FireCloudClient::RAWLS_SERVICE)
# 2025-01-31 quick and dirty workaround for false negative Rawls availability issue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# 2025-01-31 quick and dirty workaround for false negative Rawls availability issue
# 2025-01-31 quick and dirty workaround for false positive Rawls availability issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated in 1cbcc606f.

@eweitz eweitz changed the title Workaround for false positive rawls healthcheck issue Workaround for false positive rawls healthcheck issue (SCP-5916) Jan 31, 2025
@eweitz eweitz changed the title Workaround for false positive rawls healthcheck issue (SCP-5916) Workaround for false positive Rawls healthcheck issue (SCP-5916) Jan 31, 2025
@jlchang
Copy link
Contributor Author

jlchang commented Jan 31, 2025

#dsp-core-services updated Rawls to stop checking GoogleGenomics which resolves our issue so we don't need to consider this workaround anymore.

@jlchang jlchang closed this Jan 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants