Skip to content

vetmohit89/Nanopore_psU

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Description

This protocol is used for pseudouridine (psU, Ψ) site prediction of nanopore RNA direct sequencing data. There is no minimum input reads requirement but a raw dataset of >1M reads is recommended for the following processing for human transcriptome. For a larger transcriptome, more reads are recommended.

This protocol is tested on a cluster of linux system ("midway2", Scientific Linux 7.2).

Package versions

The version of softwares and packages for testing codes:

Python2: 2.7.5

Python3: 3.8.8

guppy_basecaller: 3.2.2+9fe0a78 ((C) Oxford Nanopore Technologies, Limited) (So far this software is only available when you are a customer of Oxford Nanopore Technologies)

minimap2: 2.18-r1015

samtools: 1.11 (Copyright (C) 2020 Genome Research Ltd.)

Python packages:

pickle: 4.0 (python3)

numpy: 1.16.4

sklearn: 0.20.4

re: 2.2.1

pandas: 0.24.2

Download

You could download the package to your cluster by the following command.

git clone https://github.com/sihaohuanguc/Nanopore_psU.git

Then go to the folder with the setup.py file. And run

pip install .

Now you've installed the package. You could use it at any place of your account. If you are not clear about auy command, you could find help by the command

nanopsu -h

Please follow the guidance and obey the rules of your cluster, when running the commands.

Reference and citation

If you would like to use this package in your work, please cite the following paper:

Huang, S., Zhang, W., Katanski, C.D. et al. Interferon inducible pseudouridine modification in human mRNA by quantitative nanopore profiling. Genome Biol 22, 330 (2021). https://doi.org/10.1186/s13059-021-02557-y

Protocol

1. Base call

You could base call the reads during sequencing. If so, this step is not necessary and go to the next step directly. If the data is not base called, use the following command to do the base call.

guppy_basecaller --input_path fast5 \
                 --recursive \
                 --save_path fastq \
                 --records_per_fastq 0 \
                 --flowcell FLO-MIN106 \
                 --kit SQK-RNA002 \
                 --qscore_filtering \
                 --min_qscore 7 \
                 --cpu_threads_per_caller 3 \
                 --num_callers 5

"Input_path" is the path of your raw data. "Save_path" is your output folder. "Flowcell" is the type of nanopore flowcell you use. "Kit" is the version of nanopore direct RNA sequencing kit you use. Customize "cpu_threads_per_caller" and "num_caller" according to the state of your own cluster. This step is computation intensive.

2. Alignment and pile up

To align the reads to the reference and pile up the reads, run the following command

nanopsu alignment -i path/of/fastq/ -r reference.fa

The first argument is the input fastq path. The fastq files must be directly in this folder. The second argument is the genome reference file. The output is a folder alignment with two subfolders plus_strand and minus_strand. The two subfolders contains data of reads aligned to forward and reverse strands respectively. This step is computation intensive, which means it's NOT recommended to run the step on the login node of a cluster or on a local computer.

If you would like to test the package on your device. Please copy the example folder, which contains 200 reads of human 18S rRNA and its reference sequence, to a folder you like and go to that folder. Then run the following command. (The process of the testing example is not computation intensive and could be run on any device.)

nanopsu alignment -i example/fastq -r example/reference.fa

The output report contains the alignment summary generated by minimap2. If everything is correct, you'll get a folder named alignment in you current folder and there is a plus_strand folder in the alignment folder. Check the folder and you'll see the following files.

$ ls alignment/plus_strand/
collect.bam    collect_pile.txt  collect.sorted.bam  reference.fa.fai
collect.fastq  collect.sam       reference.fa

Of course, you could use the .bam file or the _pile.txt file for analysis by other softwares if you want. Here in this example there is no minus_strand folder as all the RNA reads are aligned to the forward strand of the reference. If you align your transcriptome sample to the genome reference, then you'll likely to have both plus_strand and minus_strand folder.

3. Feature extraction

Due to the design of samtools, in the mpileup files, the spliced reads will be filled a ">" or "<" in the jumped regions and the coverage and the quality score are affected. The data points with ">" or "<" are not real bases. Run the following script to remove the gap sections in the mpileup file. This step is computation intensive.

nanopsu remove_intron

In your plus_strand and minus_strand folders you'll find a new file named collect_pile_no_intron.txt.

For the testing example, you'll find a file called collect_pile_no_intron.txt in the plus_strand folder.

Then extract features of all U sites. In order to make the prediction more reliable, there is a threshold for a U site which is 20 reads. Only U sites with >20 reads will be processed for following analysis. This step is computation intensive.

nanopsu extract_features

The output file is features.csv in the alignment folder. This file contains information from reads aligned to both forward and reverse strands.

For the testing example, the features.csv is supposed to contain 176 lines.

4. psU prediction

To predict the psU probability of all the U sites in features.csv, run the following script.

nanopsu prediction

The output file is the prediction.csv in the alignment folder. Each row contains the reference strand, position on the reference strand (sites on (-) strand have its index on (-) strand), base type, coverage, U probability and psU probability.

For the testing example, the prediction.csv is supposed to contain 176 lines. The distribution of psU probability of the 176 sites will look like the histogram below.

Picture1 - hist of test

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%