Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Requests against paused container may result in infinite wait if component requests exists in future (ops only), not snapshot #3085

Closed
vladsud opened this issue Aug 6, 2020 · 6 comments
Assignees
Labels
api area: runtime Runtime related issues focus Items that engineers are focusing on now, but may not have any (coding) outcome in current milestone

Comments

@vladsud
Copy link
Contributor

vladsud commented Aug 6, 2020

Teams thread

Quick recap:
In the scenario from the linked thread, the client was attempting to request a component which was not yet attached (in terms of sequence numbers), and while the container was paused. This results in a deadlock, because the attach op for that component was shortly after the summary.

The client knew how to request the component before it was attached, because it was copy-pasted. So the second client knew the request URL from external communication terms.

@vladsud vladsud added the bug Something isn't working label Aug 6, 2020
@vladsud vladsud added this to the August 2020 milestone Aug 6, 2020
@vladsud vladsud self-assigned this Aug 6, 2020
@ghost ghost added the triage label Aug 6, 2020
@arinwt
Copy link
Contributor

arinwt commented Aug 6, 2020

Should there be a synchronous way to check existence of component via request semantics? I don't know that allowing pause is fundamentally bad, it just requires caller to understand potential deadlock scenarios and be extra careful.

If they could check synchronously, it would provide a way to do the normal flow:

  • Check synchronously if it exists (given current seq number)
  • If yes, go through normal flow, i.e. await with timeout for perf optimization
  • If no, resume first to prevent guaranteed timeout

If we really wanted to go further with it, we could provide an API to listen for specific components being attached, etc. Not sure if that is the right path though.

@vladsud
Copy link
Contributor Author

vladsud commented Aug 6, 2020

@arinwt, do not pass request.headers.wait === true? Is that what you are looking for?
Fundamentally missing part can be anything, we should not focus on data store.

@arinwt
Copy link
Contributor

arinwt commented Aug 6, 2020

Oh I'm thinking about it wrong. It feels like the flow needs to be reworked a little when the client requests a specific component through the loader. If they do not already have the container loaded vs. if they do. Sort of seems like a two-phase thing.

Although the end-code maybe shouldn't look like this, this is my first understanding of maximum functionality coverage with single request semantics. i.e. we probably don't even need callback 1.

When directly requesting a component:

  1. callback 1: (containerAlreadyLoaded: boolean) => boolean // return should wait for container to load or not?
  2. callback 2: (componentAlreadyAttached: boolean) => boolean // return should wait for component to load or not?

^ above is a pretty complicated API... but I'm just putting out thoughts.

Alternatively, we can try do decide for them, take more control over the flow. In this case, pause is more of a preference than a hard rule (could also add a third option: pause: preferPause goes this flow, definitelyPause can reject/return without timeout, noPause already covered).
When requesting a component with pause set to true:

  1. load container as it currently works (paused)
  2. synchronously check if requested component is attached already
  3. if attached, return the loaded component async, if not attached resume and watch for the requested component until some timeout (OR current sequence number exceeds checkpointSequenceNumber?).

@curtisman
Copy link
Member

#2859 is related too.

@vladsud
Copy link
Contributor Author

vladsud commented Oct 13, 2020

For hosts using paused container loading flow (main flow), I'd love to experiment with following workflow:

  1. Host boot container and makes a request
  2. If request fails, host waits for container to connect, get up to date and repeats request.

This in my view is much better than what we have today, as it avoids infinite waits. PR #3830 creates a building block for that flow to be experimented with.

For hosts that do not use paused container loading flow, or for those that want # 2 above to resolve faster, we can improve the flow by waiting (racing) for either # 2 or channel (data store) being attached. I personally feel this we should not go that route - that's even more complexity (for what should be rather rare event), but also it's not very clear how it will compose with other features, like data stored being on channels, i.e. nested data stores. I'd rather see simple system, and improve only as we get data that we need to improve it further.

@vladsud vladsud modified the milestones: October 2020, November 2020 Oct 13, 2020
@curtisman curtisman added api area: runtime Runtime related issues and removed bug Something isn't working labels Oct 26, 2020
@vladsud vladsud modified the milestones: November 2020, December 2020 Nov 30, 2020
@vladsud vladsud added the focus Items that engineers are focusing on now, but may not have any (coding) outcome in current milestone label Dec 15, 2020
@vladsud vladsud modified the milestones: December 2020, January 2021 Jan 6, 2021
@vladsud vladsud modified the milestones: February 2021, April 2021 Feb 26, 2021
@vladsud
Copy link
Contributor Author

vladsud commented Feb 26, 2021

Consolidating tracking of the work related to waits in #4508. Closing this issue.

@vladsud vladsud closed this as completed Feb 26, 2021
@danielroney danielroney removed this from the April 2021 milestone Mar 1, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api area: runtime Runtime related issues focus Items that engineers are focusing on now, but may not have any (coding) outcome in current milestone
Projects
None yet
Development

No branches or pull requests

4 participants