Merge pull request #328 from latchbio/ayush/sm-config-parsing
ayushkamat authored Oct 24, 2023
2 parents cc4efce + be73bb5 commit 7e9d0f2
Showing 17 changed files with 812 additions and 160 deletions.
49 changes: 35 additions & 14 deletions CHANGELOG.md
@@ -23,23 +23,43 @@ Types of changes
* `latch.registry.record.Record.get_table_id` method for querying the ID of the table containing a given registry record
* `latch.registry.table.Table.get_project_id` method for querying the ID of the project containing a given registry table

## 2.35.0 - 2023-10-21

### Added

* Snakemake
  + Remote register support
  + `download` field for file inputs
  + `config` field for file inputs
  + Blanket support for parameters of any type via the `SnakemakeParameter` class
  + Support for generating a `latch_metadata` directory from a `config.yaml` with `latch generate-metadata`
  + Support for default values for parameters

### Changed

* Snakemake
  + JIT register step no longer downloads input files by default
  + `latch_metadata` should now be a module (directory containing an `__init__.py` file), as opposed to just being a file

## 2.34.0 - 2023-10-04

### Added

* Snakemake
  + `directory` modifier for input / outputs
  + Support `temp` by removing from compiled rules. All files / directories are
    temporary because they are deleted at the end of each job on Latch.
  + `multiext` output modifier
  + `report` output modifier
  + `params` in rules

### Fixed

* Snakemake
  + Replace skipped rules with `Ellipsis`. Supports rules nested in conditionals where previously an empty block was produced.
  + Patched parser to generate compiled code for `workflow.include` calls. Compiled `workflow.include` should carry the `print_compilation` keyword (snakemake/snakemake#2469)
  + Detect use of `conda` keyword and install in image. This effectively supports the `wrapper`/`conda` keywords.
  + `Iterable, Generator` cause issues as type hints when imported from `collections.abc` rather than `typing`

## 2.33.0 - 2023-09-29

@@ -52,17 +72,18 @@ temporary because they are deleted at the end of each job on Latch.
### Fixed

* Snakemake:
  - Better errors if `Snakefile` or `latch_metadata.py` file missing
  - Correct issues with snakemake example project

## 2.32.7 - 2023-09-07

### Fixed

* Snakemake:
  - `--snakemake` for `latch dockerfile` command to generate `Dockerfile` with
    necessary instructions
  - Snakemake example for `latch init`

## 2.32.6 - 2023-09-07

86 changes: 46 additions & 40 deletions docs/source/manual/snakemake.md
@@ -6,7 +6,7 @@ Latch's snakemake integration allows developers to build graphical interfaces to

A primary design goal for integration is to allow developers to register existing projects with minimal added boilerplate and modifications to code. Here we outline exactly what these changes are and why they are needed.

Recall a snakemake project consists of a `Snakefile`, which describes workflow
rules in an ["extension"](https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html) of Python, and associated python code imported and called by these rules. To make this project compatible with Latch, we need to do the following:

1. Identify and construct explicit parameters for each file dependency in `latch_metadata.py`
@@ -74,23 +74,24 @@ SnakemakeMetadata(
```
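
For orientation, a minimal `latch_metadata.py` along these lines might look like the sketch below. The display name, author, parameter key, and path are illustrative placeholders, not values taken from the example project.

```python
from pathlib import Path

from latch.types.file import LatchFile
from latch.types.metadata import LatchAuthor, SnakemakeFileParameter, SnakemakeMetadata

# A bare-bones metadata object; a real project lists one parameter per
# file or directory dependency of the Snakefile.
SnakemakeMetadata(
    display_name="My Snakemake Workflow",        # hypothetical name
    author=LatchAuthor(name="Workflow Author"),  # hypothetical author
    parameters={
        "samplesheet": SnakemakeFileParameter(
            display_name="Sample Sheet",
            type=LatchFile,
            # path inside the task container where the file is placed
            path=Path("config/samplesheet.csv"),
        ),
    },
)
```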

### Step 2: Define all dependencies in a container

When executing Snakemake jobs on Latch, the jobs run within an environment specified by a `Dockerfile`. It is important to ensure that all required dependencies, whether they are third-party binaries, python libraries, or shell scripts, are correctly installed and configured within this `Dockerfile` so the job has access to them.

**Key Dependencies to Consider**:
* Python Packages:
  + Specify these in a `requirements.txt` or `environment.yaml` file.
* Conda Packages:
  + List these in an `environment.yaml` file.
* Bioinformatics Tools:
  + Often includes third-party binaries. They will need to be manually added to the Dockerfile.
* Snakemake wrappers and containers:
  + Note that while many Snakefile rules use singularity or docker containers, Latch doesn't currently support these wrapper or containerized environments. Therefore, all installation code for these must be manually added to the Dockerfile.

**Generating a Customizable Dockerfile:**

To generate a `Dockerfile` that can be modified, use the following command:

`latch dockerfile <workflow_folder>`

The above command searches for the `environment.yaml` and `requirements.txt` files within your project directory. Based on these, it generates Dockerfile instructions to install the specified Conda and Python dependencies.

@@ -102,20 +103,21 @@ When you register your snakemake project with Latch, a container is automaticall

When snakemake workflows are executed on Latch, each generated job is run in a separate container on a potentially isolated machine. This means your `Snakefile` might need to be modified to address problems that arise from this type of execution that were not present when executing locally:

* Add missing rule inputs that are implicitly fulfilled when executing locally. Index files for biological data are commonly expected to always be alongside their matching data (see the sketch after this list)
* Make sure shared code does not rely on input files. Shared code is any code that is not under a rule and so gets executed by every task
* Add `resources` directives if tasks run out of memory or disk space
* Optimize data transfer by merging tasks that have 1-to-1 dependencies
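
As a concrete illustration of the first and third points, the sketch below lists an index file explicitly as a rule input and adds a `resources` directive. The tool command, file names, and resource numbers are hypothetical.

```python
rule align:
    input:
        reads="inputs/sample_1.fastq.gz",
        ref="inputs/genome.fa",
        # listed explicitly so the task downloads it; locally it is often just
        # assumed to sit next to genome.fa
        ref_index="inputs/genome.fa.fai",
    output:
        "aligned/sample_1.bam"
    resources:
        mem_mb=16_000,   # raise if the task runs out of memory
        disk_mb=50_000,  # raise if large intermediates exhaust ephemeral disk
    shell:
        "bwa mem {input.ref} {input.reads} | samtools sort -o {output}"
```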

### Step 4: Register your project

When the above steps have been taken, it is safe to register your project with the Latch CLI.

Example: `latch register <workflow_folder>/ --snakefile <workflow_folder>/Snakefile`

This command will build a container and construct a graphical interface from your `latch_metadata.py` file. When this process has completed, a link to view your workflow on the Latch console will be printed to `stdout`.

---

## Lifecycle of a Snakemake Execution on Latch

Snakemake support is currently based on JIT (Just-In-Time) registration. This means that the workflow produced by `latch register` will only register a second workflow, which will run the actual pipeline tasks. This is because the actual structure of the workflow cannot be specified until parameter values are provided.
@@ -131,8 +133,8 @@ The first ("JIT") workflow does the following:

Debugging:

* The generated runtime workflow entrypoint is uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/entrypoint.py`
* Internal workflow specifications are uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/spec`
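
For example, assuming the usual `latch cp` remote-to-local syntax, the generated entrypoint can be pulled down for inspection with `latch cp latch:///.snakemake_latch/workflows/<workflow_name>/entrypoint.py .`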

### Runtime Workflow

@@ -145,7 +147,7 @@ Each task runs a modified Snakemake executable using a script from the Latch SDK

Debugging:

* The Snakemake-compiled tasks are uploaded to `latch:///.snakemake_latch/workflows/<workflow_name>/compiled_tasks`

#### Example

@@ -191,8 +193,8 @@ def __rule_fastqc(input, output, ...):

Note:

- The "all" rule is entirely commented out
- The "fastqc" rule has no wildcards in its decorators
* The "all" rule is entirely commented out
* The "fastqc" rule has no wildcards in its decorators

### Limitations

@@ -221,7 +223,7 @@ If registration fails before metadata can be pulled, the CLI will generate an ex

### Input Parameters

Since there is no explicit entrypoint (`@workflow`) function in a Snakemake workflow, parameters are instead specified in the metadata file.

Currently only `LatchFile` and `LatchDir` parameters are supported. Both directory and file inputs are specified using `SnakemakeFileParameter` and setting the `type` field as appropriate.
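
A minimal sketch of such a parameter map is shown below; the keys, display names, and container paths are hypothetical.

```python
from pathlib import Path

from latch.types.directory import LatchDir
from latch.types.file import LatchFile
from latch.types.metadata import SnakemakeFileParameter

parameters = {
    # a single-file input
    "samplesheet": SnakemakeFileParameter(
        display_name="Sample Sheet",
        type=LatchFile,
        path=Path("config/samplesheet.csv"),
    ),
    # a directory input; only the `type` field changes
    "raw_reads": SnakemakeFileParameter(
        display_name="Raw Reads",
        type=LatchDir,
        path=Path("inputs/raw_reads"),
    ),
}
```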

@@ -243,12 +245,11 @@ parameters = {

| Problem | Common Solution |
| --- | --- |
| `The above error occured when reading the Snakefile to extract workflow metadata.` | Snakefile has errors outside of any rules. Frequently caused by missing dependencies (look for `ModuleNotFoundError`). Either install dependencies or add a `latch_metadata.py` file |
| `snakemake.exceptions.WorkflowError: Workflow defines configfile config.yaml but it is not present or accessible (full checked path: /root/config.yaml)` | Include a `config.yaml` in the workflow Docker image. Currently, config files cannot be generated from workflow parameters. |
| `Command '['/usr/local/bin/python', '-m', 'latch_cli.snakemake.single_task_snakemake', ...]' returned non-zero exit status 1.` | The runtime single-job task failed. Look at logs to find the error. It will be marked with the string `[!] Failed`. |
| Runtime workflow task fails with `FileNotFoundError in file /root/workflow/Snakefile` but the file is specified in workflow parameters | Wrap the code that reads the file in a function. **See section "Input Files Referenced Outside of Rules"** |
| MultiQC `No analysis results found. Cleaning up..` | FastQC outputs two files for every FastQ file: the raw `.zip` data and the HTML report. Include the raw `.zip` outputs of FastQC in the MultiQC rule inputs. **See section "Input Files Not Explicitly Defined in Rules"** |

## Troubleshooting: Input Files Referenced Outside of Rules

@@ -273,7 +274,7 @@ rule fastqc:
fastqc {input} -o {output}
```

Since the `Path("inputs").glob(...)` call is not under any rule, _it runs in all tasks._ Because the `fastqc` rule does not specify `input_dir` as an `input`, it will not be downloaded and the code will throw an error.
Since the `Path("inputs").glob(...)` call is not under any rule, _it runs in all tasks._ Because the `fastqc` rule does not specify `input_dir` as an `input` , it will not be downloaded and the code will throw an error.

### Solution

@@ -297,7 +298,7 @@ rule all_function:
expand("fastqc/{sample}.html", sample=get_samples())
```

This works because the JIT step replaces `input`, `output`, `params`, and other declarations with static strings for the runtime workflow so any function calls within them will be replaced with pre-computed strings and the Snakefile will not attempt to read the files again.
This works because the JIT step replaces `input`, `output`, `params`, and other declarations with static strings for the runtime workflow, so any function calls within them are replaced with pre-computed strings and the Snakefile will not attempt to read the files again.

**Same example at runtime:**

@@ -333,9 +334,11 @@ rule all:
```

## Troubleshooting: Input Files Not Explicitly Defined in Rules

When running the snakemake workflow locally, not all input files must be explicitly defined in every rule because all files are generated on one computer. However, tasks on Latch only download files specified by their target rules. Thus, unspecified input files will cause the Snakefile rule to fail due to missing input files.

**Example**

```python
# ERROR: the .zip file produced by the fastqc rule is not found in the multiqc rule!
@@ -353,7 +356,7 @@ rule fastqc:
shell("fastqc -o {params} --noextract -k 5 -t 8 -f fastq {input} 2>{log}")
rule multiqc:
    input:
aligned_sequences = join(WORKDIR, "plasmid_wells_aligned_sequences.csv")
output: directory(join(WORKDIR, "QC", "multiqc_report", 'raw'))
params:
@@ -367,9 +370,11 @@ rule multiqc:
```

### Solution

For programs that produce multiple types of input files (e.g. `.zip` and `.html` in the case of FastQC), explicitly specify these files in the outputs of the previous rule and in the inputs of the subsequent rule.

**Example**

```python
def get_samples():
samples = Path("/root").glob("*fastqc.zip")
@@ -390,7 +395,7 @@ rule fastqc:
shell("fastqc -o {params} --noextract -k 5 -t 8 -f fastq {input} 2>{log}")
rule multiqc:
    input:
aligned_sequences = join(WORKDIR, "plasmid_wells_aligned_sequences.csv")
# Specify zip as the input for every sample from fastqc
zip = expand(
@@ -411,20 +416,21 @@ rule multiqc:
```

## Snakemake Roadmap

### Known Issues

* Task caching does not work, tasks always re-run when a new version of the workflow is run even if nothing specific has changed
* It is not possible to configure the amount of available ephemeral storage
* Remote registration is not supported
* Snakemake tasks are serialized using a faulty custom implementation which does not support things like caching. Should use actual generated python code instead
* JIT workflow image should run snakemake extraction as a smoketest before being registered as a workflow
* Workflows with no parameters break the workflow params page on console UI
* Cannot set parameter defaults
* Parameter keys are unused but are required in the metadata
* Log file tailing does not work

### Future Work

* Warn when the Snakefile reads files not on the docker image outside of any rules
* FUSE
* File/directory APIs