ensure reproducibility and simplify the process of loading old jobs #287

Open
arik-shurygin opened this issue Oct 24, 2024 · 0 comments
Labels
enhancement New feature or request enterprise_practices Governance housekeeping trying to keep projects usable and secure.

As the title suggests, a user who wants to reproduce the output of a particular job currently has to follow these steps:

  1. Navigate to the scenarios-mechanistic-input blob, find the job in question, and download the associated exp directory with the state config files.
  2. Look at the time and date the job was launched, then clone the repo and check out the most recent commit before that time (this only works if there were no uncommitted changes). If the job used a unique docker tag that was never reused, they can instead docker pull the image from the container registry (requires login and authentication via a custom process).
  3. Relaunch the Azure job, or run the image locally with custom input and output mounts containing the data/ folder and exp/ folder respectively.
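
A rough sketch of the commit lookup in step 2, assuming the launch time is known as a timezone-aware datetime. The helper names (`rev_list_command`, `commit_before`) are made up for illustration; the underlying command is plain `git rev-list -1 --before=<date> <ref>`, which filters on commit date. This only finds the last commit made before the launch, so it cannot recover uncommitted changes.

```python
import subprocess
from datetime import datetime, timezone


def rev_list_command(launched_at: datetime, ref: str = "HEAD") -> list[str]:
    """Build the git command that prints the newest commit on `ref`
    whose commit date is at or before `launched_at`."""
    if launched_at.tzinfo is None:
        raise ValueError("launched_at must be timezone-aware")
    # Normalise to UTC and use git's own ISO-like date format.
    stamp = launched_at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S +0000")
    return ["git", "rev-list", "-1", f"--before={stamp}", ref]


def commit_before(repo_dir: str, launched_at: datetime, ref: str = "HEAD") -> str:
    """Run the lookup inside an existing clone and return the commit hash
    (an empty string if no commit is old enough)."""
    result = subprocess.run(
        rev_list_command(launched_at, ref),
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

The returned hash can then be passed to `git checkout` in the clone.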

This process is rough and not even fully reproducible. If the job was launched with uncommitted changes to the src directory AND the docker image tag was subsequently reused, the exact code that ran is lost forever. Reproducing the job would then come down to someone remembering what changes were made.

This is obviously a problem and needs a solution. I propose three possible routes:

Solution A (my preferred): Improve docker images, build an automated pipeline for gathering the input mount and container

  • Upload docker images with tags containing their unique image hash, so that docker images that are exactly equal to one another share a tag and every unique docker image is uploaded to the container registry. We could then build a script that looks up a job in Azure Batch and relaunches it with guarantees that the underlying input mount AND image are unchanged. Users who want the image and input data locally can still get them via docker pull, with the guarantee that the image has not been overwritten by future jobs (because the hashes still line up).
  • PROS
    • relaunching on Azure would be a single-click process (hopefully)
    • docker images are no longer overwritten
  • CONS
    • unknown size constraints on the container registry may become an issue when uploading many (nearly) identical images.
    • the philosophical question of whether mounting input data is the correct way to do things still remains.
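
A minimal sketch of the hash-based tagging in Solution A, assuming the tag is derived from the local image ID that `docker image inspect --format '{{.Id}}'` prints (a `sha256:<64 hex chars>` string). The function names and the repository name in the usage note are hypothetical, and the 12-character tag length is an arbitrary choice. Whether two separate builds of the same source end up with the same ID depends on the build being deterministic, which this sketch does not address.

```python
import subprocess


def image_id(local_image: str) -> str:
    """Return the content-addressed ID of a local image, e.g. 'sha256:ab12...'."""
    result = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{.Id}}", local_image],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def hash_tag(image_id_str: str, length: int = 12) -> str:
    """Turn 'sha256:<hex>' into a short tag made of the first `length` hex chars."""
    algo, _, digest = image_id_str.partition(":")
    is_hex = all(c in "0123456789abcdef" for c in digest)
    if algo != "sha256" or len(digest) != 64 or not is_hex:
        raise ValueError(f"unexpected image ID: {image_id_str!r}")
    return digest[:length]


def tagged_reference(repository: str, image_id_str: str) -> str:
    """Full reference to push, e.g. '<registry>/<name>:<hash tag>'."""
    return f"{repository}:{hash_tag(image_id_str)}"
```

For example, `tagged_reference("example.azurecr.io/scenarios", image_id("scenarios:latest"))` would give the name to pass to `docker tag` and `docker push`. Storing that reference alongside the job's input directory would let a relaunch script find the exact image again.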

Solution B: mount everything.

  • Rather than baking the src directory into the docker image, we instead mount it, just as we currently do with the experiments.
  • PROS:
    • because job_ids are unique and the input mount saves each job separately, we would not lose any data, even if the changes to src were uncommitted
    • text data is very space efficient so our input mount would never grow very large
  • CONS:
    • Flies in the face of Docker's isolated nature
    • Package versions could still differ between docker images and change under your feet, potentially causing issues (although this should be rare)
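
A sketch of what a local launch could look like under Solution B, assuming each job's directory in the input mount holds a snapshot of src/ next to its exp/ folder. The container-side paths (`/app/src`, `/app/exp`) and the function name are assumptions for illustration only; `docker run --rm -v host:container[:ro]` is the standard bind-mount syntax.

```python
from pathlib import PurePosixPath


def docker_run_args(image: str, job_dir: str) -> list[str]:
    """Build a `docker run` command that bind-mounts a job's saved src/ and exp/
    folders instead of relying on source code baked into the image."""
    job = PurePosixPath(job_dir)
    if not job.is_absolute():
        # With -v, a relative host path is treated as a named volume, not a bind mount.
        raise ValueError("job_dir must be an absolute path")
    return [
        "docker", "run", "--rm",
        "-v", f"{job / 'src'}:/app/src:ro",  # code snapshot, read-only (hypothetical target path)
        "-v", f"{job / 'exp'}:/app/exp",     # experiment configs and outputs (hypothetical target path)
        image,
    ]
```

The resulting list can be handed to `subprocess.run`. Mounting src read-only keeps a rerun from modifying the saved snapshot.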

Solution C: remove ability to push images with uncommitted src changes

  • The only thing stopping users from checking out the repo as it was at a particular time is the possibility that uncommitted changes were present in the src directory when the experiment was launched. If we disallow this, our current flow would work just fine.
  • PROS
    • Minimal changes to existing architecture, easiest to get off the ground
  • CONS
    • Can get in the way during periods of rapid parallel experimentation (launching multiple experiments at once, some changing src and some not)
    • Still a relatively clunky pipeline for reproduction that could be improved
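
One way the check in Solution C could work, as a sketch: run `git status --porcelain -- src` before building or pushing, and refuse to continue if it reports anything. The function names are hypothetical. The parser assumes the porcelain v1 layout (two status characters, a space, then the path) and ignores edge cases such as renames or quoted paths.

```python
import subprocess


def changed_paths(porcelain: str) -> list[str]:
    """Extract paths from `git status --porcelain` output (v1 format: 'XY path')."""
    return [line[3:] for line in porcelain.splitlines() if line.strip()]


def ensure_clean(repo_dir: str, subdir: str = "src") -> None:
    """Raise if `subdir` has modified, staged, or untracked files."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--", subdir],
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    # Do not strip the output: a leading space is part of the status column.
    dirty = changed_paths(result.stdout)
    if dirty:
        raise RuntimeError(
            "refusing to push image, uncommitted changes in: " + ", ".join(dirty)
        )
```

Calling `ensure_clean(".")` at the top of the image upload script would block the push until the changes are committed or stashed.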