Bug fix in RNA calling pipeline #475

gouwh29 · 2022-12-06T12:20:56Z

Dear Prof. Wang and other mates:

It has been a long time since I leave this project team. Currently, I am at a new lab and doing some work. Our RNA-seq data is very suitable for me to fully utilize & testing the existing pipeline.

Basically, I am now using the RNA_calling pipeline and have experienced these problems:

No pipeline is designed for single-ended RNA-seq data in adaptor trimming (using Trimmomatic)
Our data have multiple adaptor sequences in each lane. I tested your recommended tool fastp but it did not work very properly. We go back to the Trimmomatic and the following new workflow can be used to deal with SE RNA-seq:

[trimmomatic_trim_adaptor_SE]
# Path to the software. Default set to using our rna_quantification.sif image
parameter: trimmomatic_jar = "/opt/Trimmomatic-0.39/trimmomatic-0.39.jar"
# Illumina clip setting
# Path to the reference adaptors
parameter: fasta_with_adapters_etc = path(".")
parameter: seed_mismatches = 2
parameter: palindrome_clip_threshold = 30
parameter: simple_clip_threshold = 10
# sliding window setting
parameter: window_size = 4
parameter: required_quality = 20
# Other settings
parameter: leading = 3
parameter: trailing = 3
parameter: min_len = 50
input: fastq, group_by = is_paired_end + 1, group_with = "sample_id"
output: f'{cwd}/{_sample_id}_trimmed_{_input:bn}.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    java -jar -Xmx${java_mem} ${trimmomatic_jar} SE -threads ${numThreads} \
        ${_input} \
        ${_output} \
        ILLUMINACLIP:${fasta_with_adapters_etc}:${seed_mismatches}:${palindrome_clip_threshold}:${simple_clip_threshold} \
        LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${required_quality} MINLEN:${min_len}

Small bug in fastqc:
The line unzip -o ${_output[0]:n}.zip -d ${cwd} will leads to an error when using SE data. Though this may not have any impact on the end result.
Some small bugs are in STAR_align_1 main workflow:
- The first is about the line --sjdbOverhang ${sjdbOverhang if sjdbOverhang != 0 else _read_length} . In some cases, one may not know the read length in constructing the very first sample.list. Only after QC you may have idea of the average read length. In the documentation of STAR, it claimed that using the default value 100 will work as well as the optimal value. If the read length column is not provided at first stage, there will raise an error. This will also affect the input definition in line input: fastq,group_by = is_paired_end + 1,group_with = {"sample_id","read_length"}
- The second is about the rm -r ${_output[0]:nnnn}._STARtmp. It seems that it should be replaced by rm -r ${_output[0]:nnnn}_STARpass1 (based on our running result)
Another small bug in the rsem_call_4, rnaseqc_call_4
In R function readPicard.alignment_summary_metrics. To read this file, the line m <- read.table(files[i], header=TRUE, sep="\t", comment.char="#", stringsAsFactors=FALSE, nrows=2), the nrow = 2 parameter should be corrected to nrow = 1 as the function included the header line. Otherwise, there will raise an error.
Other typos: fasta.fasta typo in the minimum example

Thanks a lot for this wonderful pipeline, which literally freed a bunch of my time. Hope all these bug reports will be helpful on your end too!

The text was updated successfully, but these errors were encountered:

gaow · 2022-12-06T13:16:05Z

Great to hear from you @Gou-29 ! For those obvious bugs could you send in PR so we can evaluate and merge? For others please leave them as discussions

hsun3163 · 2022-12-06T20:18:48Z

Hi @Gou-29 , thank you for pointing out the issues. Re:

The first is about the line --sjdbOverhang ${sjdbOverhang if sjdbOverhang != 0 else _read_length} . In some cases, one may not know the read length in constructing the very first sample.list. Only after QC you may have idea of the average read length. In the documentation of STAR, it claimed that using the default value 100 will work as well as the optimal value. If the read length column is not provided at first stage, there will raise an error. This will also affect the input definition in line input: fastq,group_by = is_paired_end + 1,group_with = {"sample_id","read_length"}

Add the --sjdbOverhang 100 to the command should overwrite the read_length arguments. I wonder could you try to see if that works while your read_length column is empty?

Regarding the issue with _STARpass1, per the developer of STAR, this folder should contain some potentially useful information. So we don't remove it by default.

Please feel free to implement other changes and send in a PR.

gouwh29 · 2022-12-07T18:53:22Z

I also have a small question about the message:
detected 88 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable and the usage of csg.yml file in core allocation. Is there any reference file for possible inspection?

fix cumc#475

hsun3163 · 2022-12-21T19:53:25Z

@Gou-29 Hi I wonder is the new version of RNA_Seq sufficient to accommodate your need for SE RNA processing

gouwh29 · 2022-12-22T12:03:42Z

@hsun3163 Thanks a lot for your update then! All things ran smoothly then. The only question on my side is that my results for RNA-ScQC results seem to be wrong (i.e. all the quantification in .gct is 0). However, the rsem_call can return the correct result like the tpm/fpkm matrix. I believe this may due to my data itself. I will check and give you further update.

gaow · 2022-12-22T14:48:29Z

@Gou-29 for the samples with wrong gct matrices, could you share with us one example for the multi-QC output? We can see if the reads quality are low and RNASEQC filtered them out as a result ...

hsun3163 · 2022-12-22T18:11:29Z

Hi @Gou-29 It most likely is not due to your data. Would u mind pull the new update and try again?

gouwh29 · 2022-12-23T06:24:01Z

@hsun3163 @gaow I have tested the new pipeline and the RNA-ScQC works smoothly. A very weird bug here is that when generating the multiple QC report, the name of the report will be adding some suffix: ideal one: xxx/input_trimmed.multiqc_report.html; Generated on xxxinput_trimmed.multiqc_report_{1}.html ) if the previous workflow is ended by Ctrl + C. Once I manually changed the name to the corrected one, the workflow can run till the end (i.e. to rnaseq_call_4.

Further, I may still not be able to use >8 cores on my end by parsing --numThreads 64.

Our server will shut down for Christmas for energy saving. I will be able to test updates after 26th.

Thanks a lot for all your efforts and Merry Xmas!

gouwh29 mentioned this issue Dec 7, 2022

Update RNA_calling.ipynb #476

Closed

gouwh29 added a commit to gouwh29/xqtl-pipeline that referenced this issue Dec 9, 2022

Update RNA_calling.ipynb

5bb1dc3

fix cumc#475

gouwh29 mentioned this issue Dec 9, 2022

Update RNA_calling.ipynb #478

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug fix in RNA calling pipeline #475

Bug fix in RNA calling pipeline #475

gouwh29 commented Dec 6, 2022 •

edited

Loading

gaow commented Dec 6, 2022 •

edited

Loading

hsun3163 commented Dec 6, 2022 •

edited

Loading

gouwh29 commented Dec 7, 2022

hsun3163 commented Dec 21, 2022

gouwh29 commented Dec 22, 2022

gaow commented Dec 22, 2022

hsun3163 commented Dec 22, 2022 •

edited

Loading

gouwh29 commented Dec 23, 2022 •

edited

Loading

Bug fix in RNA calling pipeline #475

Bug fix in RNA calling pipeline #475

Comments

gouwh29 commented Dec 6, 2022 • edited Loading

gaow commented Dec 6, 2022 • edited Loading

hsun3163 commented Dec 6, 2022 • edited Loading

gouwh29 commented Dec 7, 2022

hsun3163 commented Dec 21, 2022

gouwh29 commented Dec 22, 2022

gaow commented Dec 22, 2022

hsun3163 commented Dec 22, 2022 • edited Loading

gouwh29 commented Dec 23, 2022 • edited Loading

gouwh29 commented Dec 6, 2022 •

edited

Loading

gaow commented Dec 6, 2022 •

edited

Loading

hsun3163 commented Dec 6, 2022 •

edited

Loading

hsun3163 commented Dec 22, 2022 •

edited

Loading

gouwh29 commented Dec 23, 2022 •

edited

Loading