Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

api_skyportal: no_retry version #442

Merged
merged 1 commit into from
Oct 14, 2024

Conversation

Theodlz
Copy link
Collaborator

@Theodlz Theodlz commented Oct 9, 2024

create a no-retry version of the api_skyportal method, to avoid running into concurrency issues when the session retries sending a request to SkyPortal (because it took longer than the specified time out to respond to the client) when the initial one is still being processed. This is fine and can safely happen for pretty much everything, except follow-up requests.

If we tell SkyPortal to trigger an instrument, after 5 seconds without a response (the default timeout) decide to resent that request but SkyPortal was almost done (and is still) processing it, we might end up sending 2 requests at the same time and creating duplication issues.

With this PR, we can use much longer timeouts (here we try 30 seconds) while avoiding any retries when sending a follow-up request.

PS:
We can still run into a concurrency issue of course (where 2 alerts of the same object get processed at the same time, and worker B tries to trigger on alert 2 at the same time as Worker A is triggering on alert 1) and we already have logic in SkyPortal to avoid that, but if the distant server SkyPortal is sending the request to is taking too long is becomes a risk. In a future set of SkyPortal+Kowalski+Fritz PRs, we can consider posting the request in the DB in a "processing" state as soon as possible so that even before we start waiting for a distance server to answer, other processed can know that we are actively trying to send an identical request and they should cancel sending anything.

…ng into concurrency issues when the session retries sending a request to SkyPortal that might still be processing the first one. We use it paired with a longer timeout when sending followup requests
@Theodlz Theodlz self-assigned this Oct 9, 2024
@Theodlz Theodlz requested a review from mcoughlin October 9, 2024 15:39
@Theodlz
Copy link
Collaborator Author

Theodlz commented Oct 9, 2024

@nabeelre just so you know that issue is being addressed.

@Theodlz
Copy link
Collaborator Author

Theodlz commented Oct 9, 2024

we might want to keep using a shorter timeout though (maybe 10 s) to try to minimize these concurrently issues when 2 alerts are being triggered on at once (for the same object).

Copy link
Collaborator

@mcoughlin mcoughlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

"POST",
"/api/followup_request",
passed_filter["auto_followup"]["data"],
timeout=30,
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jamie said last night's request took 42 seconds to process. With a 30 second timeout we'd mark the trigger as failed via timeout but it would actually succeed later? Maybe up this to 60 seconds to be conservative?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm honestly not quite sure where his number came from, since we got a response within 5 to 6 seconds, and definitely didn't wait 42 seconds. I would proceed with these numbers while we investigate this further w/ Jamie

@Theodlz Theodlz merged commit 1d08773 into skyportal:main Oct 14, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants