ensure reproducibility and simplify the process of loading old jobs #287

arik-shurygin · 2024-10-24T18:28:41Z

As the title suggests, a user who wants to reproduce the output of a particular job currently has to follow these steps:

1. Navigate to the scenarios-mechanistic-input blob, find the job in question, and download the associated exp directory with its state config files.
2. Look at the time/date the job was launched and check out the repo at the most recent commit before that time (this only works if there were no uncommitted changes). If the job used a unique docker tag that was never reused, they could instead docker pull the image from the container registry (requires login and authentication via a custom process).
3. Relaunch the azure job, or launch the image locally with custom input and output mounts containing the data/ folder and exp/ folder respectively.
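The commit lookup in the second step can be sketched roughly as follows. This is only an illustration: the function names and the `main` branch default are assumptions, not anything that exists in the repo. Note that `git rev-list --before` filters on commit date, so history that was rebased after the job launched would give a misleading answer.

```python
import subprocess
from datetime import datetime, timezone


def rev_list_command(launch_time: datetime, branch: str = "main") -> list[str]:
    """Build the git command that prints the newest commit at or before launch_time.

    launch_time should be timezone-aware; it is converted to UTC so the
    result does not depend on the local timezone of whoever runs this.
    """
    stamp = launch_time.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S +0000")
    return ["git", "rev-list", "-1", f"--before={stamp}", branch]


def commit_at(launch_time: datetime, repo_dir: str = ".", branch: str = "main") -> str:
    """Return the hash of the last commit on `branch` made before the job launched."""
    result = subprocess.run(
        rev_list_command(launch_time, branch),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

The hash returned by `commit_at` could then be passed to `git checkout`. None of this helps if the job was launched with uncommitted changes.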

This process is rough and not even fully reproducible. If the job was launched with uncommitted changes to the src directory AND the docker image tag was subsequently reused, the exact code that ran is lost forever. Reproducing the job would then come down to someone remembering what changes had been made.

This is obviously a problem and needs a solution. I propose three possible routes to solving it:

Solution A (my preferred): Improve docker images, build automated pipeline for gathering input mount and container

My preferred solution is to upload docker images with tags containing their unique image hash. Docker images that are exactly identical would then share a tag, and every unique docker image would be uploaded to the container registry. We could then build a script that looks up a job in Azure Batch and relaunches it with a guarantee that the underlying input mount AND image are unchanged. Users who want the image and input data locally could still get them via docker pull, with a guarantee that the image has not been overwritten by future jobs (because the hashes still line up).
PROS
- relaunching on azure would be a single click process (hopefully)
- docker images are no longer overwritten
CONS
- unknown constraints on size of the container registry may become an issue when uploading many (nearly) identical images to the CR.
- philosophical issue on whether mounting input data is the correct way to do things still remains.
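As a rough sketch of how the tagging in Solution A might work, the snippet below derives a tag from the content-addressed image ID that `docker image inspect` reports. The function names, the 12-character truncation, and the example repository name in the usage note are all assumptions made for illustration. One caveat: rebuilding the same source can produce a different image ID if the build is not reproducible, so "exactly equal" here means the same built image, not necessarily the same source.

```python
import subprocess


def tag_from_image_id(repository: str, image_id: str, length: int = 12) -> str:
    """Turn an image ID such as 'sha256:ab12...' into 'repository:ab12...'."""
    algo, sep, digest = image_id.strip().partition(":")
    if not sep or algo != "sha256" or len(digest) != 64:
        raise ValueError(f"unexpected image ID: {image_id!r}")
    # Truncating keeps tags readable; use the full digest if collisions are a worry.
    return f"{repository}:{digest[:length]}"


def local_image_id(image: str) -> str:
    """Ask the local Docker daemon for the ID of an already-built image."""
    result = subprocess.run(
        ["docker", "image", "inspect", "--format", "{{.Id}}", image],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

A launch script could call something like `tag_from_image_id("myregistry.example/scenarios", local_image_id("scenarios:latest"))`, then `docker tag` and `docker push` the result, and record the tag alongside the job so a relaunch can pull exactly that image.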

Solution B: mount everything.

Rather than saving the src directory into the docker image, we would instead mount it, just as we currently do with the experiments.
PROS:
- because job_ids are unique and the input mount saves each job separately, we would not lose any data, even if the changes to src were uncommitted
- text data is very space efficient so our input mount would never grow very large
CONS:
- Flies in the face of Docker's isolated nature
- Package versions could still differ between docker images and change under your feet, potentially causing issues (although rare)
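For a local run, Solution B might look something like the sketch below, which builds a `docker run` command that bind-mounts a job's saved src and exp directories read-only. The container-side paths (`/app/src`, `/input/exp`, `/output`) and the per-job directory layout are hypothetical; the real layout would depend on how the image and input mount are organised.

```python
from pathlib import PurePosixPath


def docker_run_command(image: str, job_dir: str, output_dir: str) -> list[str]:
    """Build a docker run command that mounts a job's src and exp directories."""
    job = PurePosixPath(job_dir)
    out = PurePosixPath(output_dir)
    # With -v, a relative host path is treated as a named volume, not a bind mount.
    if not job.is_absolute() or not out.is_absolute():
        raise ValueError("host paths must be absolute for bind mounts")
    return [
        "docker", "run", "--rm",
        "-v", f"{job / 'src'}:/app/src:ro",    # code as it was at launch time
        "-v", f"{job / 'exp'}:/input/exp:ro",  # state config files for the job
        "-v", f"{out}:/output",                # writable output location
        image,
    ]
```

Mounting the inputs read-only means a rerun cannot modify the saved copy of the job.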

Solution C: remove ability to push images with uncommitted src changes

The only thing stopping users from checking out the repo as it was at a particular time is the possibility that uncommitted changes had leaked into the src directory when the experiment was launched. If we disallow this, our current flow would work just fine.
PROS
- Minimal changes to existing architecture, easiest to get off the ground
CONS
- Can get in the way during a time of rapid parallel experimentation (launching multiple experiments at once, some of them changing src some of them not)
- Still a relatively clunky pipeline for reproduction that could be improved
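The guard in Solution C could be as small as the sketch below, run before an image is built or pushed. The function names are made up for illustration, and `src` is assumed to be the directory name relative to the repo root. `git status --porcelain` prints one line per modified, staged, or untracked path, so any non-empty output means the tree is dirty.

```python
import subprocess


def has_uncommitted_changes(porcelain_output: str) -> bool:
    """True if `git status --porcelain` reported at least one path."""
    return any(line.strip() for line in porcelain_output.splitlines())


def src_is_clean(repo_dir: str = ".", path: str = "src") -> bool:
    """Check whether `path` has no modified, staged, or untracked files."""
    result = subprocess.run(
        ["git", "status", "--porcelain", "--", path],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return not has_uncommitted_changes(result.stdout)
```

A launch script could refuse to push when `src_is_clean()` returns False, and otherwise record the output of `git rev-parse HEAD` with the job so the exact commit is known later.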


arik-shurygin added the enhancement and enterprise_practices labels on Oct 24, 2024
