```{r, echo = FALSE}
knitr::opts_chunk$set(out.width = "100%")
```
# Calculate idxstats on multiple files
Now that you’ve successfully written an introductory WDL, this chapter demonstrates how WDL Workflows can easily perform an analysis across multiple genomic data files.
This material is adapted from the [WDL Bootcamp workshop](https://support.terra.bio/hc/en-us/articles/18618717942427).
For more hands-on WDL-writing exercises, see [Hands-on practice for scripting and configuring Terra workflows](https://support.terra.bio/hc/en-us/articles/360056599991).
**Learning Objectives**
- Solidify your understanding of a WDL script’s structure
- Modify a template to write a more complex WDL
- Understand how to customize a workflow’s setup with WDL
## Clone data workspace
Let’s start by navigating to the [demos-combine-data-workspaces](https://anvil.terra.bio/#workspaces/anvil-outreach/demos-combine-data-workspaces) workspace on AnVIL-powered-by-Terra.
This workspace contains a Data Table named `sample` which contains references to four .cram files, two from the [1000 Genomes Project](https://anvil.terra.bio/#workspaces/anvil-datastorage/1000G-high-coverage-2019) and two from the [Human Pangenome Reference Consortium](https://anvil.terra.bio/#workspaces/anvil-datastorage/AnVIL_HPRC).
Clone this workspace to create a place where you can organize your analysis.
```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1o2XnuMbqWVLf4XrsXolIQ7ulfnMlpJlrUxN0Y8aLIVQ/edit#slide=id.g1397c25e58c_0_181")
```
Examine the `sample` table in the Data tab to ensure that you see references to four .cram files.
```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1o2XnuMbqWVLf4XrsXolIQ7ulfnMlpJlrUxN0Y8aLIVQ/edit#slide=id.g288edfe8bc0_0_1")
```
In the next steps, you will write a WDL to analyze these .cram files, run the Workflow, and examine the output.
## Write an idxstats WDL
To build off of your hello-input WDL, let’s practice writing a more complex WDL. In this exercise, you’ll fill in a template script that calculates Quality Control (QC) metrics for a BAM/CRAM file using the `samtools idxstats` function. There are two WDL `runtime` parameters that we **must** update for the Workflow to succeed:
- `docker` – Specify a Docker image that contains necessary software
- `disks` – Increase the disk size for each provisioned resource
Start by [downloading the template script](https://drive.google.com/file/d/1OH4L5LQNquDhNvycRHzWVH6Z1HR5R7kD) shown below. Open it in a text editor and modify it to call `samtools idxstats` on a BAM file.
```
version 1.0

workflow samtoolsIdxstats {
  input {
    File bamfile
  }
  call {
    input:
  }
  output {
  }
}

task {
  input {
  }
  command <<<
  >>>
  output {
  }
  runtime {
    docker: ''
  }
}
```
Importantly, you **must**:
- Specify a Docker image that contains SAMtools (e.g. `ekiernan/wdl-101:v1`)
- Increase the disk size for each provisioned resource, e.g. `local-disk 50 HDD` (both settings are shown in the sketch after this list)
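Taken together, those two requirements land in the template’s `runtime` block. A minimal sketch (the 50 GB figure is a reasonable starting point, not a hard requirement):

```
runtime {
  # Docker image with SAMtools preinstalled
  docker: 'ekiernan/wdl-101:v1'
  # 50 GB spinning disk so the .cram, its index, and outputs all fit
  disks: 'local-disk 50 HDD'
}
```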
Hints:
- Follow the same general method as in the “Hello World” exercise in section 3.3.
- In this case, the input will be a BAM file.
- The output will be a file called `idxstats.txt`.
- The task will be to calculate QC metrics, using this command (example output is shown after the code block below):
```
samtools index ~{bamfile}
samtools idxstats ~{bamfile} > idxstats.txt
```
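For reference, `samtools idxstats` writes one tab-separated line per reference sequence: reference name, sequence length, number of mapped read segments, and number of unmapped read segments, with a final `*` line for reads that have no coordinates. The counts below are invented for illustration:

```
chr1    248956422    25631492    24102
chr2    242193529    24899261    23811
...
*       0            0           135772
```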
Then, compare your version to the completed version below:
```
version 1.0

workflow samtoolsIdxstats {
  input {
    File bamfile
  }
  call idxstats {
    input:
      bamfile = bamfile
  }
  output {
    File results = idxstats.idxstats
  }
}

task idxstats {
  input {
    File bamfile
  }
  command <<<
    samtools index ~{bamfile}
    samtools idxstats ~{bamfile} > idxstats.txt
  >>>
  output {
    File idxstats = "idxstats.txt"
  }
  runtime {
    disks: 'local-disk 50 HDD'
    docker: 'ekiernan/wdl-101:v1'
  }
}
```
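If you want to sanity-check your WDL outside Terra before exporting it, runners such as Cromwell accept an inputs JSON keyed by the fully qualified input name (the file path below is a placeholder):

```
{
  "samtoolsIdxstats.bamfile": "/path/to/sample.bam"
}
```

With Cromwell, this runs as `java -jar cromwell.jar run samtoolsIdxstats.wdl --inputs inputs.json`; on Terra, the same mapping happens through the “Inputs” tab described below.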
## Optional: Run idxstats WDL on multiple .bam files
You can test this workflow by creating a new method in the Broad Methods Repository, exporting it to your clone of the demos-combine-data-workspaces workspace, and running it on samples in the `sample` data table.
First, go to the “Workflows” tab and access the Broad Methods Repository through the “Find a Workflow” card:
```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1o2XnuMbqWVLf4XrsXolIQ7ulfnMlpJlrUxN0Y8aLIVQ/edit#slide=id.g288edfe8bc0_0_6")
```
Copy and paste the idxstats WDL you wrote above and export to your workspace (see Chapter 3 if you need a refresher). Next, select “Run workflow(s) with inputs defined by data table” and choose the .cram files that you wish to analyze:
- Step 1: Select the `sample` table in the root entity type drop-down menu
- Step 2: Click “Select Data” and tick the checkboxes for one or more rows in the data table
```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1o2XnuMbqWVLf4XrsXolIQ7ulfnMlpJlrUxN0Y8aLIVQ/edit#slide=id.g288edfe8bc0_0_11")
```
```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1o2XnuMbqWVLf4XrsXolIQ7ulfnMlpJlrUxN0Y8aLIVQ/edit#slide=id.g288edfe8bc0_0_16")
```
Finally, configure the “Inputs” tab by specifying `this.cram` as the Attribute for the variable `bamfile` of `samtoolsIdxstats`; here `this` refers to each selected row of the `sample` table, and `cram` is the column holding the file reference. **Don’t forget to click “Save”**.
```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1o2XnuMbqWVLf4XrsXolIQ7ulfnMlpJlrUxN0Y8aLIVQ/edit#slide=id.g288edfe8bc0_0_21")
```
Now run the job by clicking “Run Analysis”! You can monitor the progress from “Queued” to “Running” to “Succeeded” in the “Job History” tab.
```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1o2XnuMbqWVLf4XrsXolIQ7ulfnMlpJlrUxN0Y8aLIVQ/edit#slide=id.g288edfe8bc0_0_47")
```
Once the job is complete, navigate to the “Data” tab and click on “Files” to find the `idxstats.txt` output and logs by traversing through:
```
submissions/<submission_id>/samtoolsIdxstats/<workflow_id>/call-idxstats/
```
```{r, echo=FALSE}
ottrpal::include_slide("https://docs.google.com/presentation/d/1o2XnuMbqWVLf4XrsXolIQ7ulfnMlpJlrUxN0Y8aLIVQ/edit#slide=id.g288edfe8bc0_0_53")
```
## Customize your Workflow’s Setup with WDL
In addition to defining the workflow’s tasks, WDL scripts can define how your workflow runs in AnVIL-powered-by-Terra.
### Memory retry
Some workflows require more memory than others, but memory is not free, so you don’t want to request more than you need. One solution to this tension is to start with a small memory request and ask for more only if the task runs out of memory. Learn how to do this from your WDL script by reading [Out of Memory Retry](https://support.terra.bio/hc/en-us/articles/4403215299355), and see [Workflow setup: VM and other options](https://support.terra.bio/hc/en-us/articles/360026521831) for a general overview of how to set up your workflow’s compute resources.
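As a rough sketch, the relevant knobs live in the `runtime` section: `memory` sets the starting request and `maxRetries` allows Cromwell to resubmit a failed task (the linked article explains how Terra can multiply the memory request on out-of-memory retries, so treat these values as illustrative):

```
runtime {
  docker: 'ekiernan/wdl-101:v1'
  memory: '4 GB'       # modest starting request
  maxRetries: 2        # permit up to two automatic resubmissions
  disks: 'local-disk 50 HDD'
}
```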
### Localizing files
It can be hard to know where your data files are located within your workspace bucket – the folders aren’t intuitively named, and often your files are saved several folders deep.
Luckily, WDL scripts can localize your files for you. For more on this, see [How to configure workflow inputs](https://support.terra.bio/hc/en-us/articles/4415971884827), [How to use DRS URIs in a workflow](https://support.terra.bio/hc/en-us/articles/6635144998939), and [Creating a list file of reads for input to a workflow](https://support.terra.bio/hc/en-us/articles/360033353952).
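To make the idea concrete, here is a minimal, hypothetical task: because `infile` is declared as a `File`, Cromwell localizes it from the workspace bucket (e.g. a `gs://` URI) onto the VM before the command runs, so the command can treat it as an ordinary local path:

```
version 1.0

task countLines {
  input {
    File infile    # a gs:// (or drs://) URI in Terra; localized automatically
  }
  command <<<
    # by the time this runs, ~{infile} is a local path on the VM
    wc -l ~{infile} > line_count.txt
  >>>
  output {
    File counts = "line_count.txt"    # delocalized back to the workspace bucket
  }
  runtime {
    docker: 'ubuntu:20.04'
  }
}
```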
If your workflow generates files, you can also write their location to a data table. This is useful for both intermediate files and the workflow’s final outputs. For more on this topic, see [Writing workflow outputs to the data table](https://support.terra.bio/hc/en-us/articles/4500420806299).