
remove human reads


Hostile is a toolkit for removing human reads from a contaminated sequencing dataset.

Install hostile

If you work on a workstation or a VM, the easiest way to install Hostile is via Bioconda:

# Create the hostile environment (once)
conda create -y -n hostile -c bioconda hostile

# Activate the environment (when needed)
conda activate hostile
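
To confirm the installation, you can print the built-in help (each subcommand also accepts --help):

# Check that the command is available
hostile --help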

Download the reference database

The "telomere to telomere" release of the Human Genome will allow us to map most reads. Hostile allows us to download one or more indexes.

To list the available indexes you can run hostile fetch --list. In this tutorial we will use a masked version:

hostile fetch --name human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401

💡 This step is required only once

Clean one sample

The syntax is straightforward. For clarity, we split the single command across multiple lines:

hostile clean \
  --fastq1 reads/ERR2619707_1.fastq.gz --fastq2 reads/ERR2619707_2.fastq.gz \
  --threads 8 --out-dir reads/ \
  --index human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401 > ERR2619707.json

The output file is a report in JSON format:

[
    {
        "version": "1.1.0",
        "aligner": "bowtie2",
        "index": "human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401",
        "options": [],
        "fastq1_in_name": "ERR2619722_1.fastq.gz",
        "fastq1_in_path": "/biobackery-training/lokmer_2019_cameroon/raw_reads/ERR2619722_1.fastq.gz",
        "fastq1_out_name": "ERR2619722_1.clean_1.fastq.gz",
        "fastq1_out_path": "reads/ERR2619722_1.clean_1.fastq.gz",
        "reads_in": 35549326,
        "reads_out": 35529516,
        "reads_removed": 19810,
        "reads_removed_proportion": 0.00056,
        "fastq2_in_name": "ERR2619722_2.fastq.gz",
        "fastq2_in_path": "/biobackery-training/lokmer_2019_cameroon/raw_reads/ERR2619722_2.fastq.gz",
        "fastq2_out_name": "ERR2619722_2.clean_2.fastq.gz",
        "fastq2_out_path": "reads/ERR2619722_2.clean_2.fastq.gz"
    }
]

Meanwhile, in the output directory you will find the cleaned reads:

  • ERR2619722_1.clean_1.fastq.gz
  • ERR2619722_2.clean_2.fastq.gz
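
For a quick check from the command line, the proportion of removed reads can be pulled out of the JSON report with jq (assuming jq is installed; the filename matches the redirect used in the command above):

# Print the fraction of reads flagged as human and removed
jq '.[0].reads_removed_proportion' ERR2619707.json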

Clean multiple samples

Using a bash loop we can clean a whole directory. Here we assume paired files are named with "_1" and "_2" suffixes (if your files use "_R1" and "_R2", adjust the pattern as shown after the loop).

INDEX=human-t2t-hla.argos-bacteria-985_rs-viral-202401_ml-phage-202401

for FWD in raw_reads/*_1.fastq.gz;
do
   # Infer the reverse filename
   REV=${FWD/_1/_2}
   # Infer the sample name (cutting the filename at the first "_")
   BASE=$(basename "$FWD" | cut -f 1 -d _)
   # Run hostile
   hostile clean --fastq1 "$FWD" --fastq2 "$REV" --threads 8 --index "$INDEX" --out-dir reads/ > "$BASE".json
done
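
If your reads follow the _R1/_R2 naming convention instead, only the glob and the substitution pattern need to change, for example:

for FWD in raw_reads/*_R1.fastq.gz;
do
   # Infer the reverse filename from the _R1/_R2 convention
   REV=${FWD/_R1/_R2}
   BASE=$(basename "$FWD" | cut -f 1 -d _)
   hostile clean --fastq1 "$FWD" --fastq2 "$REV" --threads 8 --index "$INDEX" --out-dir reads/ > "$BASE".json
done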

Parse logs

We provide a script to generate a summary table from a set of JSON files.
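
As a minimal sketch (not necessarily the script referred to above), the same table can be built with a short shell loop and jq, assuming jq is installed; the percentage of human reads is simply reads_removed_proportion multiplied by 100:

# Print the table header
printf "SampleName\tReads_Raw\tReads_Clean\tPercentage_Human\n"
# One JSON report per sample, as produced by the loop above
for J in *.json; do
   SAMPLE=$(basename "$J" .json)
   jq -r --arg s "$SAMPLE" '.[0] | [$s, .reads_in, .reads_out, (.reads_removed_proportion * 100)] | @tsv' "$J"
done

The resulting table looks like this: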

SampleName Reads_Raw Reads_Clean Percentage_Human
ERR2619719 5.18114e+07 1.56395e+07 69.814
ERR2619713 3.19179e+07 3.08334e+07 3.398
ERR2619723 3.86315e+07 3.80724e+07 1.447
ERR2619714 3.6156e+07 3.57809e+07 1.037
ERR2619725 6.82066e+07 6.75055e+07 1.028
ERR2619711 4.16947e+07 4.15606e+07 0.322
ERR2619724 4.34168e+07 4.32944e+07 0.282
ERR2619710 4.56126e+07 4.55084e+07 0.228
ERR2619708 6.6957e+07 6.68124e+07 0.216
ERR2619727 5.98338e+07 5.97282e+07 0.177
ERR2619726 5.51282e+07 5.50329e+07 0.173
ERR2619720 5.30213e+07 5.29308e+07 0.171
ERR2619721 4.96106e+07 4.95525e+07 0.117
ERR2619707 4.91292e+07 4.90753e+07 0.11
ERR2619709 6.35611e+07 6.3501e+07 0.095
ERR2619718 5.77955e+07 5.77466e+07 0.085
ERR2619715 4.54846e+07 4.54524e+07 0.071
ERR2619722 3.55493e+07 3.55295e+07 0.056
ERR2619717 6.1883e+07 6.18637e+07 0.031
ERR2619716 4.72405e+07 4.72256e+07 0.031
ERR2619712 4.1783e+07 4.17741e+07 0.021
