Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix in RNA calling pipeline #475

Open
gouwh29 opened this issue Dec 6, 2022 · 8 comments
Open

Bug fix in RNA calling pipeline #475

gouwh29 opened this issue Dec 6, 2022 · 8 comments

Comments

@gouwh29
Copy link
Contributor

gouwh29 commented Dec 6, 2022

Dear Prof. Wang and other mates:

It has been a long time since I leave this project team. Currently, I am at a new lab and doing some work. Our RNA-seq data is very suitable for me to fully utilize & testing the existing pipeline.

Basically, I am now using the RNA_calling pipeline and have experienced these problems:

  • No pipeline is designed for single-ended RNA-seq data in adaptor trimming (using Trimmomatic)
    Our data have multiple adaptor sequences in each lane. I tested your recommended tool fastp but it did not work very properly. We go back to the Trimmomatic and the following new workflow can be used to deal with SE RNA-seq:
[trimmomatic_trim_adaptor_SE]
# Path to the software. Default set to using our rna_quantification.sif image
parameter: trimmomatic_jar = "/opt/Trimmomatic-0.39/trimmomatic-0.39.jar"
# Illumina clip setting
# Path to the reference adaptors
parameter: fasta_with_adapters_etc = path(".")
parameter: seed_mismatches = 2
parameter: palindrome_clip_threshold = 30
parameter: simple_clip_threshold = 10
# sliding window setting
parameter: window_size = 4
parameter: required_quality = 20
# Other settings
parameter: leading = 3
parameter: trailing = 3
parameter: min_len = 50
input: fastq, group_by = is_paired_end + 1, group_with = "sample_id"
output: f'{cwd}/{_sample_id}_trimmed_{_input:bn}.gz'
task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads
bash: container=container, expand= "${ }", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'
    java -jar -Xmx${java_mem} ${trimmomatic_jar} SE -threads ${numThreads} \
        ${_input} \
        ${_output} \
        ILLUMINACLIP:${fasta_with_adapters_etc}:${seed_mismatches}:${palindrome_clip_threshold}:${simple_clip_threshold} \
        LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${required_quality} MINLEN:${min_len}
  • Small bug in fastqc:
    The line unzip -o ${_output[0]:n}.zip -d ${cwd} will leads to an error when using SE data. Though this may not have any impact on the end result.
  • Some small bugs are in STAR_align_1 main workflow:
    • The first is about the line --sjdbOverhang ${sjdbOverhang if sjdbOverhang != 0 else _read_length} . In some cases, one may not know the read length in constructing the very first sample.list. Only after QC you may have idea of the average read length. In the documentation of STAR, it claimed that using the default value 100 will work as well as the optimal value. If the read length column is not provided at first stage, there will raise an error. This will also affect the input definition in line input: fastq,group_by = is_paired_end + 1,group_with = {"sample_id","read_length"}
    • The second is about the rm -r ${_output[0]:nnnn}._STARtmp. It seems that it should be replaced by rm -r ${_output[0]:nnnn}_STARpass1 (based on our running result)
  • Another small bug in the rsem_call_4, rnaseqc_call_4
    In R function readPicard.alignment_summary_metrics. To read this file, the line m <- read.table(files[i], header=TRUE, sep="\t", comment.char="#", stringsAsFactors=FALSE, nrows=2), the nrow = 2 parameter should be corrected to nrow = 1 as the function included the header line. Otherwise, there will raise an error.
  • Other typos: fasta.fasta typo in the minimum example

Thanks a lot for this wonderful pipeline, which literally freed a bunch of my time. Hope all these bug reports will be helpful on your end too!

@gaow
Copy link
Contributor

gaow commented Dec 6, 2022

Great to hear from you @Gou-29 ! For those obvious bugs could you send in PR so we can evaluate and merge? For others please leave them as discussions

@hsun3163
Copy link
Collaborator

hsun3163 commented Dec 6, 2022

Hi @Gou-29 , thank you for pointing out the issues. Re:

The first is about the line --sjdbOverhang ${sjdbOverhang if sjdbOverhang != 0 else _read_length} . In some cases, one may not know the read length in constructing the very first sample.list. Only after QC you may have idea of the average read length. In the documentation of STAR, it claimed that using the default value 100 will work as well as the optimal value. If the read length column is not provided at first stage, there will raise an error. This will also affect the input definition in line input: fastq,group_by = is_paired_end + 1,group_with = {"sample_id","read_length"}

Add the --sjdbOverhang 100 to the command should overwrite the read_length arguments. I wonder could you try to see if that works while your read_length column is empty?

Regarding the issue with _STARpass1, per the developer of STAR, this folder should contain some potentially useful information. So we don't remove it by default.

Please feel free to implement other changes and send in a PR.

@gouwh29
Copy link
Contributor Author

gouwh29 commented Dec 7, 2022

I also have a small question about the message:
detected 88 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable and the usage of csg.yml file in core allocation. Is there any reference file for possible inspection?

gouwh29 added a commit to gouwh29/xqtl-pipeline that referenced this issue Dec 9, 2022
@hsun3163
Copy link
Collaborator

@Gou-29 Hi I wonder is the new version of RNA_Seq sufficient to accommodate your need for SE RNA processing

@gouwh29
Copy link
Contributor Author

gouwh29 commented Dec 22, 2022

@hsun3163 Thanks a lot for your update then! All things ran smoothly then. The only question on my side is that my results for RNA-ScQC results seem to be wrong (i.e. all the quantification in .gct is 0). However, the rsem_call can return the correct result like the tpm/fpkm matrix. I believe this may due to my data itself. I will check and give you further update.

@gaow
Copy link
Contributor

gaow commented Dec 22, 2022

@Gou-29 for the samples with wrong gct matrices, could you share with us one example for the multi-QC output? We can see if the reads quality are low and RNASEQC filtered them out as a result ...

@hsun3163
Copy link
Collaborator

hsun3163 commented Dec 22, 2022

Hi @Gou-29 It most likely is not due to your data. Would u mind pull the new update and try again?

@gouwh29
Copy link
Contributor Author

gouwh29 commented Dec 23, 2022

@hsun3163 @gaow I have tested the new pipeline and the RNA-ScQC works smoothly. A very weird bug here is that when generating the multiple QC report, the name of the report will be adding some suffix: ideal one: xxx/input_trimmed.multiqc_report.html; Generated on xxxinput_trimmed.multiqc_report_{1}.html ) if the previous workflow is ended by Ctrl + C. Once I manually changed the name to the corrected one, the workflow can run till the end (i.e. to rnaseq_call_4.

Further, I may still not be able to use >8 cores on my end by parsing --numThreads 64.

Our server will shut down for Christmas for energy saving. I will be able to test updates after 26th.

Thanks a lot for all your efforts and Merry Xmas!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants