[Bug]: The missing runner binary leads to CREATING_CONTAINER_ERROR after clearing /tmp #2051

Open
r4victor opened this issue Dec 3, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@r4victor
Collaborator

r4victor commented Dec 3, 2024

Steps to reproduce

  1. Create an SSH fleet.
  2. Clear /tmp on the fleet instance.
  3. See runs fail with CREATING_CONTAINER_ERROR:
[11:49:54] WARNING  dstack._internal.server.background.tasks.process_running_jobs
                    :475 shim failed to execute job massive-pug-1-1-0:           
                    CREATING_CONTAINER_ERROR (createContainer error: Error       
                    response from daemon: invalid mount config for type "bind":  
                    bind source path does not exist:                             
                    /tmp/dstack-runner1354965780)

This happens because the shim downloads the runner binary into /tmp at startup and later bind-mounts it into the container. If /tmp is cleared after the runner is downloaded, the bind source path no longer exists and the mount fails.
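The failure mode can be illustrated with a minimal sketch. It is not the shim's code: it uses a throwaway directory in place of /tmp, and `bind_source_exists` and the file name `dstack-runner-example` are hypothetical names made up for the illustration.

```python
import os
import shutil
import tempfile


def bind_source_exists(path: str) -> bool:
    """Return True if a would-be bind-mount source is still present on the host."""
    return os.path.exists(path)


# Simulate shim startup: the runner binary is written into a temporary directory.
# (A throwaway directory stands in for /tmp; the file name is hypothetical.)
tmp_root = tempfile.mkdtemp(prefix="fake-tmp-")
runner_path = os.path.join(tmp_root, "dstack-runner-example")
with open(runner_path, "wb") as f:
    f.write(b"fake runner binary")

before_clear = bind_source_exists(runner_path)

# Simulate clearing /tmp while the shim still remembers the old path.
shutil.rmtree(tmp_root)

after_clear = bind_source_exists(runner_path)
print(before_clear, after_clear)
```

A check like this before creating the container would only turn the Docker error into an earlier, clearer one; it does not by itself restore the missing binary.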

Actual behaviour

No response

Expected behaviour

No response

dstack version

master

Server logs

No response

Additional information

No response

@r4victor r4victor added the bug Something isn't working label Dec 3, 2024
@r4victor
Collaborator Author

@un-def, any chance you could fix this as part of the ongoing shim refactoring?

@un-def
Collaborator

un-def commented Dec 23, 2024

@r4victor, is there any reason why we download the runner to /tmp on each shim (re)start? Couldn't we download the runner once to a well-known location (e.g., /usr/local/bin/shim-runner, as with the shim)?

@r4victor
Collaborator Author

@un-def, I don't think there is a good reason to download the runner to /tmp; it's just a historical quirk. So the fix should be as straightforward as you described. There are a few things to keep in mind, though: for example, the runner should be re-downloaded on a shim version update or SSH fleet re-creation.
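A rough sketch of the idea discussed above, written in Python for brevity and not taken from the dstack codebase: keep the runner at a fixed path, record which version is installed next to it, and download again only when the binary is missing or the version differs. The function `ensure_runner`, the `.version` marker file, and the `download` callback are all assumptions made for illustration.

```python
import os
import tempfile
from typing import Callable


def ensure_runner(runner_path: str, wanted_version: str,
                  download: Callable[[str, str], None]) -> bool:
    """Make sure runner_path holds wanted_version. Return True if a download happened.

    `download(version, dest)` is a hypothetical callback that writes the binary to dest.
    """
    version_path = runner_path + ".version"  # hypothetical marker file
    installed = None
    if os.path.isfile(runner_path) and os.path.isfile(version_path):
        with open(version_path) as f:
            installed = f.read().strip()
    if installed == wanted_version:
        return False  # binary present and up to date: reuse it

    os.makedirs(os.path.dirname(runner_path) or ".", exist_ok=True)
    part_path = runner_path + ".part"
    download(wanted_version, part_path)
    os.chmod(part_path, 0o755)
    # Rename into place so a half-written file is never seen at runner_path.
    os.replace(part_path, runner_path)
    with open(version_path, "w") as f:
        f.write(wanted_version)
    return True


def fake_download(version: str, dest: str) -> None:
    """Stand-in for a real download; writes a small placeholder file."""
    with open(dest, "w") as f:
        f.write("runner " + version)


install_dir = tempfile.mkdtemp()  # stands in for a well-known location
path = os.path.join(install_dir, "bin", "runner")
results = [
    ensure_runner(path, "1.0", fake_download),  # first start: downloads
    ensure_runner(path, "1.0", fake_download),  # restart: reuses the binary
    ensure_runner(path, "1.1", fake_download),  # version update: re-downloads
]
print(results)
```

Whether a marker file is the right way to detect a version change, and how SSH fleet re-creation should trigger a re-download, are open design questions that this sketch does not settle.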
