5. Data postprocessing
The Ninetails package supports further processing of its output files (objects), including visualization.
The data post-processing module requires output from the main Ninetails pipeline (e.g. check_tails()) to work.
Ninetails can read a single output file with read_class_single() for the read_classes data frame and read_residue_single() for the nonadenosine_residues data frame:
class_path <- "/directory/with/ninetails/read_class_output.tsv"
class_data <- ninetails::read_class_single(class_path)
residue_path <- "/directory/with/ninetails/nonadenosine_residues_output.tsv"
residue_data <- ninetails::read_residue_single(residue_path)
Ninetails can read multiple output files at once with read_class_multiple() for the read_classes data frames and read_residue_multiple() for the nonadenosine_residues data frames. It can also attach any metadata provided by the user.
Note
To use the built-in data processing and/or data visualization modules, the user has to provide at least the following metadata:
- sample_name - unique ID of the sample/replicate
- group - experimental condition
- class_path - path to read_classes data frame
- residue_path - path to nonadenosine_residues data frame
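Before reading the data, it can be useful to confirm that a metadata table actually contains these four columns. The sketch below is only an illustration: check_samples_table() is a hypothetical helper, not a Ninetails function, and the paths in the toy table are placeholders.

```r
# Hypothetical helper (not part of Ninetails): verify that a metadata table
# has the four required columns and that sample names are unique.
check_samples_table <- function(samples_table) {
  required <- c("sample_name", "group", "class_path", "residue_path")
  missing_cols <- setdiff(required, colnames(samples_table))
  if (length(missing_cols) > 0) {
    stop("Missing metadata columns: ", paste(missing_cols, collapse = ", "))
  }
  if (anyDuplicated(samples_table$sample_name) > 0) {
    stop("sample_name values must be unique")
  }
  invisible(TRUE)
}

# Toy metadata table with placeholder paths
toy_table <- data.frame(
  sample_name = c("sample_1", "sample_2"),
  group = c("group_1", "group_2"),
  class_path = c("/path/to/sample_1/read_classes.tsv",
                 "/path/to/sample_2/read_classes.tsv"),
  residue_path = c("/path/to/sample_1/nonadenosine_residues.tsv",
                   "/path/to/sample_2/nonadenosine_residues.tsv"),
  stringsAsFactors = FALSE
)

check_samples_table(toy_table)
```

The check stops with an error message naming the missing columns, so a malformed table is caught before any file is read.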
Let's assume we have performed an experiment with two conditions (group_1, group_2) and two replicates per condition (four samples in total: sample_1, sample_2, sample_3, sample_4). After running the check_tails() function, we will have two output files per sample (read_classes and nonadenosine_residues, respectively).
We can read all of them at once and associate metadata from an additional data frame:
# define table with metadata
samples_table <- data.frame(
  sample_name = c("sample_1", "sample_2", "sample_3", "sample_4"),
  group = c("group_1", "group_1", "group_2", "group_2"),
  class_path = c("/home/user/ANALYSES/Ninetails/sample_1/read_classes.tsv",
                 "/home/user/ANALYSES/Ninetails/sample_2/read_classes.tsv",
                 "/home/user/ANALYSES/Ninetails/sample_3/read_classes.tsv",
                 "/home/user/ANALYSES/Ninetails/sample_4/read_classes.tsv"),
  residue_path = c("/home/user/ANALYSES/Ninetails/sample_1/nonadenosine_residues.tsv",
                   "/home/user/ANALYSES/Ninetails/sample_2/nonadenosine_residues.tsv",
                   "/home/user/ANALYSES/Ninetails/sample_3/nonadenosine_residues.tsv",
                   "/home/user/ANALYSES/Ninetails/sample_4/nonadenosine_residues.tsv"))
# read the data at once
class_data <- ninetails::read_class_multiple(samples_table)
residue_data <- ninetails::read_residue_multiple(samples_table)
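When the output files follow a regular directory layout, the same metadata table can be built programmatically instead of typing every path. This is a minimal sketch in base R; it assumes one subdirectory per sample under a common base directory, and the variable names (base_dir, sample_names, samples_table_auto, missing_files) are illustrative only.

```r
# Assumed layout: <base_dir>/<sample_name>/read_classes.tsv and
#                 <base_dir>/<sample_name>/nonadenosine_residues.tsv
base_dir <- "/home/user/ANALYSES/Ninetails"
sample_names <- paste0("sample_", 1:4)

samples_table_auto <- data.frame(
  sample_name = sample_names,
  group = rep(c("group_1", "group_2"), each = 2),
  class_path = file.path(base_dir, sample_names, "read_classes.tsv"),
  residue_path = file.path(base_dir, sample_names, "nonadenosine_residues.tsv"),
  stringsAsFactors = FALSE
)

# Optional sanity check: collect any paths that do not exist on disk
all_paths <- c(samples_table_auto$class_path, samples_table_auto$residue_path)
missing_files <- all_paths[!file.exists(all_paths)]
```

The resulting table has the same four columns as the hand-written one and can be passed to read_class_multiple() and read_residue_multiple() in the same way.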
Alternatively, one may provide the metadata in a configuration file (config.yml) and then read the data as in the following example:
# provide metadata (requires the yaml package)
config <- yaml::yaml.load_file("config_dummy.yml")
samples_table <- data.frame(t(sapply(config$samples, unlist)))
rownames(samples_table) <- NULL
# read the data at once
class_data <- ninetails::read_class_multiple(samples_table)
residue_data <- ninetails::read_residue_multiple(samples_table)
Example content of config.yml:
samples:
  sample_1:
    sample_name: sample_1
    group: group_1
    class_path: /home/user/ANALYSES/Ninetails/sample_1/read_classes.tsv
    residue_path: /home/user/ANALYSES/Ninetails/sample_1/nonadenosine_residues.tsv
  sample_2:
    sample_name: sample_2
    group: group_1
    class_path: /home/user/ANALYSES/Ninetails/sample_2/read_classes.tsv
    residue_path: /home/user/ANALYSES/Ninetails/sample_2/nonadenosine_residues.tsv
  sample_3:
    sample_name: sample_3
    group: group_2
    class_path: /home/user/ANALYSES/Ninetails/sample_3/read_classes.tsv
    residue_path: /home/user/ANALYSES/Ninetails/sample_3/nonadenosine_residues.tsv
  sample_4:
    sample_name: sample_4
    group: group_2
    class_path: /home/user/ANALYSES/Ninetails/sample_4/read_classes.tsv
    residue_path: /home/user/ANALYSES/Ninetails/sample_4/nonadenosine_residues.tsv
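To see how the sapply()/unlist() conversion shown earlier turns such a configuration into a metadata table, here is a small sketch that needs no file on disk. The nested list toy_config is an assumption: it mimics what a YAML parser returns for a nested mapping like the one above, and its paths are placeholders.

```r
# Hand-built stand-in for a parsed config: one named list per sample
toy_config <- list(
  samples = list(
    sample_1 = list(sample_name = "sample_1",
                    group = "group_1",
                    class_path = "/path/to/sample_1/read_classes.tsv",
                    residue_path = "/path/to/sample_1/nonadenosine_residues.tsv"),
    sample_2 = list(sample_name = "sample_2",
                    group = "group_2",
                    class_path = "/path/to/sample_2/read_classes.tsv",
                    residue_path = "/path/to/sample_2/nonadenosine_residues.tsv")
  )
)

# unlist() flattens each sample into a named character vector;
# sapply() binds these vectors as matrix columns, so t() is needed
# to get one row per sample before converting to a data frame.
toy_samples_table <- data.frame(t(sapply(toy_config$samples, unlist)),
                                stringsAsFactors = FALSE)
rownames(toy_samples_table) <- NULL

print(toy_samples_table)
```

Each key under a sample becomes a column, so any extra key added consistently to every sample in the configuration file appears as an additional metadata column.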
Note
This is just a minimal reproducible example. The user may provide any additional metadata (e.g. guppy version, reference transcriptome, batch).
Ninetails can minimize segmentation errors inherited from nanopolish.
Sometimes nucleotides at the 3' ends of AT-rich transcripts are misidentified as part of the poly(A) tail when they in fact belong to the body of the transcript. In such cases, a large enrichment of non-adenosine positions is observed in close proximity to the body of these transcripts.
To minimize the impact of segmentation artifacts on the results, one can use the following function:
# Reclassify the data
ninetails_data <- reclassify_ninetails_data(residue_data = residue_data,
                                            class_data = class_data,
                                            grouping_factor = "sample_name",
                                            transcript_column = "ensembl_transcript_id_short",
                                            ref = "mmusculus")
# Retrieve the data frames
class_data <- ninetails_data[[1]]
residue_data <- ninetails_data[[2]]
Note
This function should be applied before any further analysis or manipulation of the class and residue data.
Currently, Ninetails can reclassify transcripts from the following species:
- Arabidopsis thaliana
- Homo sapiens
- Mus musculus
- Saccharomyces cerevisiae
- Caenorhabditis elegans
- Trypanosoma brucei
Detailed information about the correction of data from other sources can be found in the function documentation.
Ninetails provides a function to merge the tabular outputs into one concise table for all data, in which each read is represented by a single row.
merged_nonA_tables <- ninetails::merge_nonA_tables(class_data = class_data,
                                                   residue_data = residue_data,
                                                   pass_only = TRUE)
In addition, an extra nonA_residues column is appended at the end of the output table. It lists the positions of all non-A residues in a read, ordered from the 5' to the 3' end and separated by ":".
In this table, only reads that have been classified by Ninetails are included (reads marked "unclassified" are omitted from the analysis).
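Because the nonA_residues column stores several entries in one ":"-separated string, it can be split back into individual entries with base R. The sketch below uses made-up values: the entry format (a residue letter followed by a position) is an assumption for illustration only, so check the actual column contents before relying on it.

```r
# Hypothetical nonA_residues values for three reads (format assumed)
nonA_residues <- c("C12", "G5:U30", "C8:C15:G40")

# Split each string on the literal ":" separator
entries <- strsplit(nonA_residues, ":", fixed = TRUE)

# Number of reported non-A entries per read
n_entries <- lengths(entries)
print(n_entries)
```

Splitting on the separator alone does not depend on how each entry is encoded, which keeps the sketch usable even if the entry format differs from the one assumed here.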
Ninetails also produces a summary table of non-adenosine occurrences within the analyzed dataset.
summarized <- ninetails::summarize_nonA(merged_nonA_tables=merged_nonA_tables,
summary_factors="group",
transcript_id_column="ensembl_transcript_id_short")
In the output table, counts are the number of reads, either in total or containing a given type of non-adenosine residue (see column headers for details). Hits are the number of individual occurrences of a given non-adenosine residue (see column headers for details). Note that a single read may contain several hits.
The function also reports the mean and median poly(A) tail length by transcript.
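The difference between counts and hits can be illustrated with a toy calculation. This is not the summarize_nonA() implementation, only a sketch of the two definitions given above; the data frame toy_reads and its columns (read_id, n_C) are hypothetical.

```r
# Toy data: number of C residues detected in the tail of each read
toy_reads <- data.frame(
  read_id = c("read_1", "read_2", "read_3", "read_4", "read_5"),
  n_C = c(0, 1, 3, 0, 0),
  stringsAsFactors = FALSE
)

# Counts: reads in total, and reads containing at least one C
counts_total <- nrow(toy_reads)
counts_C <- sum(toy_reads$n_C > 0)

# Hits: individual C occurrences summed over all reads
hits_C <- sum(toy_reads$n_C)

print(c(counts_total = counts_total, counts_C = counts_C, hits_C = hits_C))
```

Here two reads contain C residues, but one of them carries three, so the number of hits exceeds the number of C-containing reads.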
Ninetails has been developed in the Laboratory of RNA Biology (Dziembowski Lab) at the International Institute of Molecular and Cell Biology in Warsaw.