Scheduler unavailability should not impact cache operations #220
Comments
I suspect that what you’re seeing is that GetCapabilities() calls fail. Those need to merge properties returned by both the storage nodes and scheduler process. It’s also hard to cache/memoize these, as they depend on the credentials of the user.
Sure, but can we have it be that the call returns the equivalent of |
As in, announce that the cluster supports remote caching? No, because that would cause flakiness if people try to do builds that only use remote execution without local fallback.
If they don’t have local fallback enabled but they do have remote executor specified, wouldn’t the CLI simply error that the endpoint doesn’t support RBE and then fail the build?
Exactly. And that’s bad, because under the current model it’s possible to set --remote_retries sufficiently high, causing Bazel to simply wait for the scheduler to come online and run the build to completion.
True, but we have to weigh that against the penalty of the remote cache being completely inaccessible to everyone for that duration. Maybe this should be a configuration option, then? Fail open on scheduler unavailability vs. not?
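To make the "fail open vs. not" option concrete, here is a minimal Python sketch of a capabilities merge with such a switch. Buildbarn itself is written in Go, and every name here (Capabilities, BackendUnavailable, merge_capabilities, fail_open) is hypothetical and only illustrates the idea; none of it is the project's actual API or configuration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Capabilities:
    # Hypothetical, simplified stand-in for the merged capabilities response.
    cache_supported: bool = False
    execution_enabled: bool = False


class BackendUnavailable(Exception):
    """Raised by a (hypothetical) backend that cannot be reached."""


def merge_capabilities(
    get_storage: Callable[[], Capabilities],
    get_scheduler: Callable[[], Capabilities],
    fail_open: bool,
) -> Capabilities:
    # Storage errors always propagate: without storage there is no cache.
    storage = get_storage()
    try:
        scheduler = get_scheduler()
    except BackendUnavailable:
        if not fail_open:
            # Fail closed: the whole call fails, so a client with enough
            # retries can wait for the scheduler to come back.
            raise
        # Fail open: report caching only, with execution disabled.
        scheduler = Capabilities()
    return Capabilities(
        cache_supported=storage.cache_supported,
        execution_enabled=scheduler.execution_enabled,
    )
```

With fail_open set, cache-only clients keep working during a scheduler outage, at the cost described above: clients that rely on remote execution see it announced as unsupported instead of getting a retryable error.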
If we know the configuration of the scheduler, it should be possible to implement configuration of its capabilities straight in the frontend.
The scheduler is such a simple process to operate, I don’t see the value in that to be honest. Just run health checking against it and make sure it gets launched elsewhere if your server fails.
In the scenario where a bb-storage frontend is pointing to both a remote cache and a bb-scheduler instance, if the scheduler suddenly goes down, the entire frontend instance essentially becomes crippled. However, cache actions should be totally unaffected by the scheduler's availability (as in the case where a customer passes --remote_cache but not --remote_executor).

Can we make unavailability of the scheduler a log in the console for cache API calls, while still returning an error for remote execution API calls?
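The requested behavior could look roughly like the following Python sketch: cache calls only log a warning when the scheduler is down, while execution calls return an error. The Frontend class, its methods, and the SchedulerUnavailable error are all hypothetical names chosen for illustration, and a dictionary stands in for the remote cache. This is not how bb-storage is actually structured.

```python
import logging
from typing import Callable, Dict, Optional

logger = logging.getLogger("frontend")


class SchedulerUnavailable(Exception):
    """Hypothetical error returned to clients for execution calls."""


class Frontend:
    def __init__(self, cache: Dict[str, bytes], scheduler_is_up: Callable[[], bool]):
        self._cache = cache  # dict standing in for the remote cache
        self._scheduler_is_up = scheduler_is_up  # hypothetical health probe

    def get_blob(self, digest: str) -> Optional[bytes]:
        # Cache call: scheduler state is only logged, never fatal.
        if not self._scheduler_is_up():
            logger.warning("scheduler unavailable; serving cache request anyway")
        return self._cache.get(digest)

    def execute(self, action_digest: str) -> str:
        # Execution call: scheduler unavailability is an error.
        if not self._scheduler_is_up():
            raise SchedulerUnavailable("scheduler is not reachable")
        return f"queued:{action_digest}"
```

Under these assumptions, a client passing only --remote_cache never observes the outage, while a client that also passes --remote_executor gets an explicit error from the execution path.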