This protocol is used for pseudouridine (psU, Ψ) site prediction of nanopore RNA direct sequencing data. There is no minimum input reads requirement but a raw dataset of >1M reads is recommended for the following processing for human transcriptome. For a larger transcriptome, more reads are recommended.
This protocol is tested on a cluster of linux system ("midway2", Scientific Linux 7.2).
The version of softwares and packages for testing codes:
Python2: 2.7.5
Python3: 3.8.8
guppy_basecaller: 3.2.2+9fe0a78 ((C) Oxford Nanopore Technologies, Limited) (So far this software is only available when you are a customer of Oxford Nanopore Technologies)
minimap2: 2.18-r1015
samtools: 1.11 (Copyright (C) 2020 Genome Research Ltd.)
Python packages:
pickle: 4.0 (python3)
numpy: 1.16.4
sklearn: 0.20.4
re: 2.2.1
pandas: 0.24.2
You could download the package to your cluster by the following command.
git clone https://github.com/sihaohuanguc/Nanopore_psU.git
Then go to the folder with the setup.py
file. And run
pip install .
Now you've installed the package. You could use it at any place of your account. If you are not clear about auy command, you could find help by the command
nanopsu -h
Please follow the guidance and obey the rules of your cluster, when running the commands.
If you would like to use this package in your work, please cite the following paper:
Huang, S., Zhang, W., Katanski, C.D. et al. Interferon inducible pseudouridine modification in human mRNA by quantitative nanopore profiling. Genome Biol 22, 330 (2021). https://doi.org/10.1186/s13059-021-02557-y
You could base call the reads during sequencing. If so, this step is not necessary and go to the next step directly. If the data is not base called, use the following command to do the base call.
guppy_basecaller --input_path fast5 \
--recursive \
--save_path fastq \
--records_per_fastq 0 \
--flowcell FLO-MIN106 \
--kit SQK-RNA002 \
--qscore_filtering \
--min_qscore 7 \
--cpu_threads_per_caller 3 \
--num_callers 5
"Input_path" is the path of your raw data. "Save_path" is your output folder. "Flowcell" is the type of nanopore flowcell you use. "Kit" is the version of nanopore direct RNA sequencing kit you use. Customize "cpu_threads_per_caller" and "num_caller" according to the state of your own cluster. This step is computation intensive.
To align the reads to the reference and pile up the reads, run the following command
nanopsu alignment -i path/of/fastq/ -r reference.fa
The first argument is the input fastq path. The fastq files must be directly in this folder. The second argument is the genome reference file.
The output is a folder alignment
with two subfolders plus_strand
and minus_strand
. The two subfolders contains data of reads aligned to forward and reverse strands respectively. This step is computation intensive, which means it's NOT recommended to run the step on the login node of a cluster or on a local computer.
If you would like to test the package on your device. Please copy the example
folder, which contains 200 reads of human 18S rRNA and its reference sequence, to a folder you like and go to that folder. Then run the following command. (The process of the testing example is not computation intensive and could be run on any device.)
nanopsu alignment -i example/fastq -r example/reference.fa
The output report contains the alignment summary generated by minimap2. If everything is correct, you'll get a folder named alignment
in you current folder and there is a plus_strand
folder in the alignment
folder. Check the folder and you'll see the following files.
$ ls alignment/plus_strand/
collect.bam collect_pile.txt collect.sorted.bam reference.fa.fai
collect.fastq collect.sam reference.fa
Of course, you could use the .bam
file or the _pile.txt
file for analysis by other softwares if you want. Here in this example there is no minus_strand
folder as all the RNA reads are aligned to the forward strand of the reference. If you align your transcriptome sample to the genome reference, then you'll likely to have both plus_strand
and minus_strand
folder.
Due to the design of samtools, in the mpileup files, the spliced reads will be filled a ">" or "<" in the jumped regions and the coverage and the quality score are affected. The data points with ">" or "<" are not real bases. Run the following script to remove the gap sections in the mpileup file. This step is computation intensive.
nanopsu remove_intron
In your plus_strand
and minus_strand
folders you'll find a new file named collect_pile_no_intron.txt
.
For the testing example, you'll find a file called collect_pile_no_intron.txt
in the plus_strand
folder.
Then extract features of all U sites. In order to make the prediction more reliable, there is a threshold for a U site which is 20 reads. Only U sites with >20 reads will be processed for following analysis. This step is computation intensive.
nanopsu extract_features
The output file is features.csv
in the alignment
folder. This file contains information from reads aligned to both forward and reverse strands.
For the testing example, the features.csv
is supposed to contain 176 lines.
To predict the psU probability of all the U sites in features.csv
, run the following script.
nanopsu prediction
The output file is the prediction.csv
in the alignment
folder. Each row contains the reference strand, position on the reference strand (sites on (-) strand have its index on (-) strand), base type, coverage, U probability and psU probability.
For the testing example, the prediction.csv
is supposed to contain 176 lines. The distribution of psU probability of the 176 sites will look like the histogram below.