From 7235db1e39b6197a46c779635de246e351bff5f4 Mon Sep 17 00:00:00 2001
From: zhengzhenxian <920596906@qq.com>
Date: Wed, 5 Jul 2023 16:32:23 +0800
Subject: [PATCH] Update README and indel-calling help text for renamed platform options
---
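Note (text between the `---` separator and the diffstat is not part of the commit message): this patch renames the platform options `ont_r10` to `ont_r10_guppy` and `ont_r9` to `ont_r9_guppy`, so wrapper scripts that still pass the old names may need updating. Below is a minimal sketch of one way to translate old names before calling `run_clairs`. The helper `normalize_platform` and both constants are hypothetical and are not part of ClairS; the option names themselves are taken from this patch.

```python
# Hypothetical migration helper; NOT part of ClairS or run_clairs.
# The old-to-new mapping mirrors the renames made in this patch.
RENAMED_PLATFORMS = {
    "ont_r10": "ont_r10_guppy",
    "ont_r9": "ont_r9_guppy",
}

# Platform options listed in the README after this patch.
VALID_PLATFORMS = (
    "ont_r10_dorado_4khz",
    "ont_r10_dorado_5khz",
    "ont_r10_guppy",
    "ont_r9_guppy",
    "ilmn",
    "hifi_sequel2",
    "hifi_revio",
)


def normalize_platform(name):
    """Return the post-rename platform option, or raise ValueError if unknown."""
    updated = RENAMED_PLATFORMS.get(name, name)
    if updated not in VALID_PLATFORMS:
        raise ValueError(
            "unknown platform %r; expected one of: %s"
            % (name, ", ".join(VALID_PLATFORMS))
        )
    return updated
```

For example, `normalize_platform("ont_r10")` returns `"ont_r10_guppy"`, while names that are already current, such as `"hifi_revio"`, pass through unchanged.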
README.md | 50 ++++++++++++++++++++++++++++----------------------
run_clairs | 2 +-
2 files changed, 29 insertions(+), 23 deletions(-)
diff --git a/README.md b/README.md
index 237fb02..27172dc 100644
--- a/README.md
+++ b/README.md
@@ -71,6 +71,8 @@ ClairS is now available for early access to interested and experienced users. Yo
## Latest Updates
+*v0.1.3 (Jul 5)* : Added ONT Dorado 4 kHz (`-p ont_r10_dorado_4khz`) and 5 kHz (`-p ont_r10_dorado_5khz`) models; see [here](#pre-trained-models) for more details. Renamed platform options `ont_r10` to `ont_r10_guppy` and `ont_r9` to `ont_r9_guppy`.
+
*v0.1.2 (May 17)* : Added HiFi Revio model, renamed HiFi Sequel II model from `hifi` to `hifi_sequel2`.
*v0.1.1 (Apr 30)* : 1. Added the "command line used" to VCF header. 2. Added `NAU`, `NCU`, `NGU`, and `NTU` tags (#reads supporting the four bases in normal) to the output. 3. Hybrid calling mode now outputs three VCFs, ClairS somatic variant calls, Clair3 normal germline variant calls, and Clair3 tumor germline variant calls. 4. Added the `--enable_clair3_germline_output` option to also output Clair3 normal germline variant calls, and Clair3 tumor germline variant calls (even when hybrid calling more is not enabled). Running time will increase by ~40%.
@@ -91,7 +93,7 @@ ClairS is now available for early access to interested and experienced users. Yo
After following [installation](#installation), you can run ClairS with one command:
```bash
-./run_clairs -T tumor.bam -N normal.bam -R ref.fa -o output -t 8 -p ont_r10
+./run_clairs -T tumor.bam -N normal.bam -R ref.fa -o output -t 8 -p ont_r10_guppy
## Final output file: output/output.vcf.gz
```
@@ -103,17 +105,21 @@ Check [Usage](#Usage) for more options.
ClairS trained both pileup and full-alignment models using GIAB samples, and carry on benchmarking on HCC1395-HCC1395BL pair dataset. All models were trained with chr20 excluded (including only chr1-19, 21, 22).
-| Platform | Model name | Chemistry /Instruments | Option (`-p/--platform`) | Reference | Aligner |
-| :---------: | :--------------------: | :--------------------------: | :----------------------: | :-----------: | :------: |
-| ONT | ont_r104_e81_sup_g5015 | R10.4/R10.4.1 | `ont_r10` | GRCh38_no_alt | Minimap2 |
-| ONT 1 | r941_prom_sup_g5014 | R9.4.1 | `ont_r9` | GRCh38_no_alt | Minimap2 |
-| Illumina | ilmn | NovaSeq/HiseqX | `ilmn` | GRCh38 | BWA-MEM |
-| PacBio HiFi 2 | hifi_sequel2 | Sequel II with Chemistry 2.0 | `hifi_sequel2` | GRCh38_no_alt | Minimap2 |
-| PacBio HIFI | hifi_revio | Revio with SMRTbell prep kit 3.0 | `hifi_revio` | GRCh38_no_alt | Minimap2 |
+| Platform | Model name | Chemistry/Instruments | Basecaller | Option (`-p/--platform`) | Reference | Aligner |
+| :---------: | :--------------------: | :--------------------------: | :----------------------: | :-----------: | :------: | :------: |
+| ONT<sup>1</sup> | r1041_e82_400bps_sup_v420 | R10.4.1, 5 kHz | Dorado | `ont_r10_dorado_5khz` | GRCh38_no_alt | Minimap2 |
+| ONT<sup>1</sup> | r1041_e82_400bps_sup_v410 | R10.4.1, 4 kHz | Dorado | `ont_r10_dorado_4khz` | GRCh38_no_alt | Minimap2 |
+| ONT | r104_e81_sup_g5015 | R10.4/R10.4.1 | Guppy5 | `ont_r10_guppy` | GRCh38_no_alt | Minimap2 |
+| ONT<sup>2</sup> | r941_prom_sup_g5014 | R9.4.1 | Guppy5 | `ont_r9_guppy` | GRCh38_no_alt | Minimap2 |
+| Illumina | ilmn | NovaSeq/HiseqX | - | `ilmn` | GRCh38 | BWA-MEM |
+| PacBio HiFi<sup>3</sup> | hifi_sequel2 | Sequel II with Chemistry 2.0 | - | `hifi_sequel2` | GRCh38_no_alt | Minimap2 |
+| PacBio HiFi | hifi_revio | Revio with SMRTbell prep kit 3.0 | - | `hifi_revio` | GRCh38_no_alt | Minimap2 |
+
+**Caveats 1**: Both the `ont_r10_dorado_4khz` and `ont_r10_dorado_5khz` models were trained on the HG003+HG004 synthetic dataset; a model pre-trained on multiple samples is coming soon.
-**Caveats 1**: Although the r9(`r941_prom_sup_g5014`) model was trained on synthetic samples with r9.4.1 real data, the minimal AF cutoff, minimal coverage, and post-calling filtering parameters for the r9 model are copied from the r10 model, and are not optimized due to lack of real r9 data on a cancer sample with known truths.
+**Caveats 2**: Although the r9(`r941_prom_sup_g5014`) model was trained on synthetic samples with r9.4.1 real data, the minimal AF cutoff, minimal coverage, and post-calling filtering parameters for the r9 model are copied from the r10 model, and are not optimized due to lack of real r9 data on a cancer sample with known truths.
-**Caveats 2**: The PacBio HiFi Sequel II model is experimental. It was trained but not tested with any real data with known truths. HG003 54x and HG004 52x were used, thus tumor depth coverage higher than 50x may suffer from lower recall rate. For testing, please downsample both tumor and normal to ~40x for the best performance of this experimental model.
+**Caveats 3**: The PacBio HiFi Sequel II model is experimental. It was trained but not tested with any real data with known truths. HG003 54x and HG004 52x were used, thus tumor depth coverage higher than 50x may suffer from lower recall rate. For testing, please downsample both tumor and normal to ~40x for the best performance of this experimental model.
------
@@ -137,8 +143,8 @@ docker run -it \
--normal_bam_fn ${INPUT_DIR}/normal.bam \ ## use your normal bam file name here
--ref_fn ${INPUT_DIR}/ref.fa \ ## use your reference file name here
--threads ${THREADS} \ ## maximum threads to be used
- --platform ${PLATFORM} \ ## options: {ont_r10, ont_r9, ilmn, hifi_sequel2, hifi_revio}
- --output_dir ${OUTPUT_DIR} ## output path prefix
+ --platform ${PLATFORM} \ ## options: {ont_r10_dorado_4khz, ont_r10_dorado_5khz, ont_r10_guppy, ont_r9_guppy, ilmn, hifi_sequel2, hifi_revio}
+ --output_dir ${OUTPUT_DIR} ## output path prefix
```
Check [Usage](#Usage) for more options.
@@ -169,8 +175,8 @@ singularity exec \
--normal_bam_fn ${INPUT_DIR}/normal.bam \ ## use your normal bam file name here
--ref_fn ${INPUT_DIR}/ref.fa \ ## use your reference file name here
--threads ${THREADS} \ ## maximum threads to be used
- --platform ${PLATFORM} \ ## options: {ont_r10, ont_r9, ilmn, hifi_sequel2, hifi_revio}
- --output_dir ${OUTPUT_DIR} \ ## output path prefix
+ --platform ${PLATFORM} \ ## options: {ont_r10_dorado_4khz, ont_r10_dorado_5khz, ont_r10_guppy, ont_r9_guppy, ilmn, hifi_sequel2, hifi_revio}
+ --output_dir ${OUTPUT_DIR} \ ## output path prefix
--conda_prefix /opt/conda/envs/clairs
```
@@ -237,8 +243,8 @@ docker run -it hkubal/clairs:latest /opt/bin/run_clairs --help
--normal_bam_fn ${INPUT_DIR}/normal.bam \ ## use your bam file name here
--ref_fn ${INPUT_DIR}/ref.fa \ ## use your reference file name here
--threads ${THREADS} \ ## maximum threads to be used
- --platform ${PLATFORM} \ ## options: {ont_r10, ont_r9, ilmn, hifi_sequel2, hifi_revio}
- --output_dir ${OUTPUT_DIR} ## output path prefix
+ --platform ${PLATFORM} \ ## options: {ont_r10_dorado_4khz, ont_r10_dorado_5khz, ont_r10_guppy, ont_r9_guppy, ilmn, hifi_sequel2, hifi_revio}
+ --output_dir ${OUTPUT_DIR} ## output path prefix
## Final output file: ${OUTPUT_DIR}/output.vcf.gz
```
@@ -253,7 +259,7 @@ docker run -it hkubal/clairs:latest /opt/bin/run_clairs --help
-R, --ref_fn FASTA Reference file input. The input file must be samtools indexed.
-o, --output_dir OUTPUT_DIR VCF output directory.
-t, --threads THREADS Max #threads to be used.
- -p, --platform PLATFORM Select the sequencing platform of the input. Possible options {ont_r10, ont_r9, ilmn, hifi_sequel2, hifi_revio}.
+ -p, --platform PLATFORM Select the sequencing platform of the input. Possible options {ont_r10_dorado_4khz, ont_r10_dorado_5khz, ont_r10_guppy, ont_r9_guppy, ilmn, hifi_sequel2, hifi_revio}.
```
**Miscellaneous parameters:**
@@ -299,25 +305,25 @@ docker run -it hkubal/clairs:latest /opt/bin/run_clairs --help
--normal_vcf_fn NORMAL_VCF_FN
EXPERIMENTAL: Path to normal VCF file. Setting this will skip germline varaint calling on normal BAM file input.
--enable_indel_calling
- EXPERIMENTAL: Enable Indel calling, only support ont r10 platform. The calling time would increase significantly. default: disabled.
+ EXPERIMENTAL: Enable Indel calling, only support `ont_r10_guppy` and `hifi_revio` platforms. The calling time would increase significantly. default: disabled.
```
#### Call SNVs in one or mutiple chromosomes using the `-C/--ctg_name` parameter
```bash
-./run_clairs -T tumor.bam -N normal.bam -R ref.fa -o output -t 8 -p ont_r10 -C chr21,chr22
+./run_clairs -T tumor.bam -N normal.bam -R ref.fa -o output -t 8 -p ont_r10_guppy -C chr21,chr22
```
#### Call SNVs in one specific region using the `-r/--region` parameter
```bash
-./run_clairs -T tumor.bam -N normal.bam -R ref.fa -o output -t 8 -p ont_r10 -r chr20:1000000-2000000
+./run_clairs -T tumor.bam -N normal.bam -R ref.fa -o output -t 8 -p ont_r10_guppy -r chr20:1000000-2000000
```
#### Call SNVs at interested variant sites (genotyping) using the `-G/--genotyping_mode_vcf_fn` parameter
```bash
-./run_clairs -T tumor.bam -N normal.bam -R ref.fa -o output -t 8 -p ont_r10 -G input.vcf
+./run_clairs -T tumor.bam -N normal.bam -R ref.fa -o output -t 8 -p ont_r10_guppy -G input.vcf
```
#### Call SNVs in the BED regions using the `-B/--bed_fn` parameter
@@ -333,7 +339,7 @@ echo -e "${CTG2}\t${START_POS_2}\t${END_POS_2}" >> input.bed
Then:
```bash
-./run_clairs -T tumor.bam -N normal.bam -R ref.fa -o output -t 8 -p ont_r10 -B input.bed
+./run_clairs -T tumor.bam -N normal.bam -R ref.fa -o output -t 8 -p ont_r10_guppy -B input.bed
```
------
diff --git a/run_clairs b/run_clairs
index 47a2e92..d699086 100755
--- a/run_clairs
+++ b/run_clairs
@@ -1579,7 +1579,7 @@ def somatic_parser():
ont_params.add_argument(
"--enable_indel_calling",
action='store_true',
- help="EXPERIMENTAL: Enable Indel calling, only support ont r10 platform. The calling time would increase significantly. default: disabled."
+ help="EXPERIMENTAL: Enable Indel calling, only support `ont_r10_guppy` and `hifi_revio` platforms. The calling time would increase significantly. default: disabled."
)
optional_params.add_argument(