-
Notifications
You must be signed in to change notification settings - Fork 0
remove human reads
Hostile is a complete toolkit to remove human reads from a contaminated dataset.
If you use a workstation or a VM, the easiest way would be to use BioConda:
# Create the hostile environment (once)
conda create -y -n hostile -c bioconda hostile
# Activate the environment (when needed)
conda activate hostile
The "telomere to telomere" release of the Human Genome will allow us to map most reads. Hostile allows us to download one or more indexes.
To list the available indexes you can run hostile fetch --list
, we will use in this tutorial a masked version:
hostile fetch --name human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401
💡 This step is required only once
The syntax is straightforward. Here for clarity, we split the single command across multiple lines!
hostile clean \
--fastq1 reads/ERR2619707_1.fastq.gz --fastq2 reads/ERR2619707_2.fastq.gz
--threads 8 --out-dir reads/ \
--index human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401 > $BASE.json;
The output file is a report in JSON format:
[
{
"version": "1.1.0",
"aligner": "bowtie2",
"index": "human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401",
"options": [],
"fastq1_in_name": "ERR2619722_1.fastq.gz",
"fastq1_in_path": "/biobackery-training/lokmer_2019_cameroon/raw_reads/ERR2619722_1.fastq.gz",
"fastq1_out_name": "ERR2619722_1.clean_1.fastq.gz",
"fastq1_out_path": "reads/ERR2619722_1.clean_1.fastq.gz",
"reads_in": 35549326,
"reads_out": 35529516,
"reads_removed": 19810,
"reads_removed_proportion": 0.00056,
"fastq2_in_name": "ERR2619722_2.fastq.gz",
"fastq2_in_path": "/biobackery-training/lokmer_2019_cameroon/raw_reads/ERR2619722_2.fastq.gz",
"fastq2_out_name": "ERR2619722_2.clean_2.fastq.gz",
"fastq2_out_path": "reads/ERR2619722_2.clean_2.fastq.gz"
}
]
while in the output directory, you will find the cleed reads like:
- ERR2619722_1.clean_1.fastq.gz
- ERR2619722_2.clean_2.fastq.gz
Using a bash loop we can clean a whole directory. Here we assume the reads are divided in "_1" and "_2" (adjusting the script with _R1 and _R2 might be necessary)
INDEX=human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401
for FWD in raw_reads/*_1.fastq.gz;
do
# Infer reverse filename
REV=${FWD/_1/_2};
# Infer sample name (cutting the filename on the "_")
BASE=$(basename $FWD | cut -f 1 -d _)
# Run hostile
hostile clean --fastq1 $FWD --fastq2 $REV --threads 8 --index $INDEX --out-dir reads/ > $BASE.json;
done
We provide a script to generate a table starting from a set of JSON files:
SampleName | Reads_Raw | Reads_Clean | Percentage_Human |
---|---|---|---|
ERR2619719 | 5.18114e+07 | 1.56395e+07 | 69.814 |
ERR2619713 | 3.19179e+07 | 3.08334e+07 | 3.398 |
ERR2619723 | 3.86315e+07 | 3.80724e+07 | 1.447 |
ERR2619714 | 3.6156e+07 | 3.57809e+07 | 1.037 |
ERR2619725 | 6.82066e+07 | 6.75055e+07 | 1.028 |
ERR2619711 | 4.16947e+07 | 4.15606e+07 | 0.322 |
ERR2619724 | 4.34168e+07 | 4.32944e+07 | 0.282 |
ERR2619710 | 4.56126e+07 | 4.55084e+07 | 0.228 |
ERR2619708 | 6.6957e+07 | 6.68124e+07 | 0.216 |
ERR2619727 | 5.98338e+07 | 5.97282e+07 | 0.177 |
ERR2619726 | 5.51282e+07 | 5.50329e+07 | 0.173 |
ERR2619720 | 5.30213e+07 | 5.29308e+07 | 0.171 |
ERR2619721 | 4.96106e+07 | 4.95525e+07 | 0.117 |
ERR2619707 | 4.91292e+07 | 4.90753e+07 | 0.11 |
ERR2619709 | 6.35611e+07 | 6.3501e+07 | 0.095 |
ERR2619718 | 5.77955e+07 | 5.77466e+07 | 0.085 |
ERR2619715 | 4.54846e+07 | 4.54524e+07 | 0.071 |
ERR2619722 | 3.55493e+07 | 3.55295e+07 | 0.056 |
ERR2619717 | 6.1883e+07 | 6.18637e+07 | 0.031 |
ERR2619716 | 4.72405e+07 | 4.72256e+07 | 0.031 |
ERR2619712 | 4.1783e+07 | 4.17741e+07 | 0.021 |