Update RNA_calling.ipynb

fix cumc#475
gouwh29 · Dec 9, 2022 · 5bb1dc3 · 5bb1dc3
1 parent b55738b
commit 5bb1dc3
Showing 1 changed file with 64 additions and 22 deletions.
diff --git a/code/molecular_phenotypes/calling/RNA_calling.ipynb b/code/molecular_phenotypes/calling/RNA_calling.ipynb
@@ -100,6 +100,7 @@
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
    "id": "rough-robinson",
    "metadata": {
@@ -116,6 +117,7 @@
     "sample_2 samp2_r1.fq.gz samp2_r2.fq.gz fr 150\n",
     "sample_3 samp3_r1.fq.gz samp3_r2.fq.gz strand_missing 75\n",
     "```\n",
+    "**Warning**: For single-ended (SE) data, your input data should exclude the `fq2` column. For other 2 columns, the `strand` column is optional (the `strand` information can be detected in `STAR_output` pipeline) and the `read_length` colunm (can be validated/infered from `fastqc` report) is **required**. If no prior information is avaiable for `read_length`, leave the column empty\n",
     "\n",
     "All the fastq files should be available under specified folder (default assumes the same folder as where the meta-data file is)."
    ]
@@ -195,11 +197,12 @@
     "    --data-dir data \\\n",
     "    --STAR-index reference_data/STAR_Index/ \\\n",
     "    --container containers/rna_quantification.sif \\\n",
-    "    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \\\n",
+    "    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \\\n",
     "    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat "
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
    "id": "f04410f9-6721-4bd5-8aea-ef9150103347",
    "metadata": {
@@ -208,7 +211,9 @@
    "source": [
     "To align the reads with STAR and generate the bam_list recipe for downstream molecular phenotype count matrixes. The `-J 3 -c csg.yml -q csg` part is crucial for it ask for the required memory to conduct the STAR alignment.\n",
     "\n",
-    "***Warning***: It is of crucial importance that, the STAR Alignment shall be ran with the gtf file that were generated `before collapsing to gene`, i.e. the one that are used to generated RSEM index. Other wise there will be an [error while running RSEM later on](https://github.com/cumc/xqtl-pipeline/issues/390)"
+    "***Warning1***: It is of crucial importance that, the STAR Alignment shall be ran with the gtf file that were generated `before collapsing to gene`, i.e. the one that are used to generated RSEM index. Other wise there will be an [error while running RSEM later on](https://github.com/cumc/xqtl-pipeline/issues/390)\n",
+    "\n",
+    "***Warning2***: If the `read_length` colunm is empty in your original meta-data file `sample_fastq.list`, please add following arg: `--sjdbOverhang 100` (default value of the `STAR` alignment) or `--sjdbOverhang {read_length}` (if you can deduce the correct read length from the QC report) in `STAR_output` pipeline. The default value 100 will be sufficient to give correct results. "
    ]
   },
   {
@@ -259,7 +264,7 @@
     "    --STAR-index reference_data/STAR_Index/ \\\n",
     "    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \\\n",
     "    --container containers/rna_quantification.sif \\\n",
-    "    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \\\n",
+    "    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \\\n",
     "    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat \\\n",
     "    --bam_list output/rnaseq/sample_fastq_bam_list -J 3 -c csg.yml -q csg"
    ]
@@ -291,21 +296,11 @@
     "    --RSEM-index reference_data/RSEM_Index/ \\\n",
     "    --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf \\\n",
     "    --container containers/rna_quantification.sif \\\n",
-    "    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \\\n",
+    "    --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \\\n",
     "    --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat \\\n",
     "    --bam_list output/rnaseq/sample_fastq_bam_list -J 3 -c csg.yml -q csg"
    ]
   },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "b5689265-c9d3-40a6-adf9-21349a342e68",
-   "metadata": {
-    "kernel": "Bash"
-   },
-   "outputs": [],
-   "source": []
-  },
   {
    "cell_type": "markdown",
    "id": "temporal-worse",
@@ -578,11 +573,10 @@
    "source": [
     "[fastqc]\n",
     "input: fastq, group_by =  1\n",
-    "output: f'{cwd}/{_input:bn}_fastqc.html',f'{cwd}/{_input:bn}_fastqc/fastqc_data.txt' \n",
+    "output: f'{cwd}/{_input:bn}_fastqc.html',f'{cwd}/{_input:bn}_fastqc.zip' \n",
     "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
     "bash: expand= \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container\n",
-    "    fastqc ${_input} -o ${_output[0]:d}\n",
-    "    unzip -o ${_output[0]:n}.zip -d ${cwd}"
+    "    fastqc ${_input} -o ${_output[0]:d}"
    ]
   },
   {
@@ -676,6 +670,7 @@
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
    "id": "registered-shooting",
    "metadata": {
@@ -688,7 +683,7 @@
     "\n",
     "Documentation: [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic)\n",
     "\n",
-    "We have replaced this with `fastp`, see above, which performs better than `Trimmomatic` in terms of removing adaptors that `Trimmomatic` cannot detect. `fastp` can also automatically guess the adapter sequences from data and by default no adapter sequence is required for input to `fastp`.\n",
+    "We have replaced this with `fastp`, see above, which performs better than `Trimmomatic` in terms of removing adaptors that `Trimmomatic` cannot detect. `fastp` can also automatically guess the adapter sequences from data and by default no adapter sequence is required for input to `fastp`. **PE data and SE data have different pipline** (see below)\n",
     "\n",
     "### Step Inputs\n",
     "\n",
@@ -707,7 +702,9 @@
     ">For single-ended data, one input and one output file are specified, plus the processing steps. For paired-end data, two input files are specified, and 4 output files, 2 for the 'paired' output where both reads survived the processing, and 2 for corresponding 'unpaired' output where a read survived, but the partner read did not.\n",
     "\n",
     "\n",
-    "**You need to figure out from fastqc results what adapter reference sequence to use.**, eg `--fasta_with_adapters_etc TruSeq3-PE.fa`. These files can be downloaded from `Trimmomatic` github repo."
+    "**You need to figure out from fastqc results what adapter reference sequence to use.**, eg `--fasta_with_adapters_etc TruSeq3-PE.fa`. These files can be downloaded from `Trimmomatic` github repo.\n",
+    "\n",
+    "* For pair-ended (PE) data:"
    ]
   },
   {
@@ -720,7 +717,7 @@
    },
    "outputs": [],
    "source": [
-    "[trimmomatic_trim_adaptor]\n",
+    "[trimmomatic_trim_adaptor_PE]\n",
     "# Path to the software. Default set to using our rna_quantification.sif image\n",
     "parameter: trimmomatic_jar = \"/opt/Trimmomatic-0.39/trimmomatic-0.39.jar\"\n",
     "# Illumina clip setting\n",
@@ -754,6 +751,52 @@
     "        LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${required_quality} MINLEN:${min_len}"
    ]
   },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "id": "21b1c985",
+   "metadata": {},
+   "source": [
+    "* For single-ended (SE) data:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "turkish-alfred",
+   "metadata": {
+    "kernel": "SoS",
+    "tags": []
+   },
+   "outputs": [],
+   "source": [
+    "[trimmomatic_trim_adaptor_SE]\n",
+    "# Path to the software. Default set to using our rna_quantification.sif image\n",
+    "parameter: trimmomatic_jar = \"/opt/Trimmomatic-0.39/trimmomatic-0.39.jar\"\n",
+    "# Illumina clip setting\n",
+    "# Path to the reference adaptors\n",
+    "parameter: fasta_with_adapters_etc = path(\".\")\n",
+    "parameter: seed_mismatches = 2\n",
+    "parameter: palindrome_clip_threshold = 30\n",
+    "parameter: simple_clip_threshold = 10\n",
+    "# sliding window setting\n",
+    "parameter: window_size = 4\n",
+    "parameter: required_quality = 20\n",
+    "# Other settings\n",
+    "parameter: leading = 3\n",
+    "parameter: trailing = 3\n",
+    "parameter: min_len = 50\n",
+    "input: fastq, group_by = is_paired_end + 1, group_with = \"sample_id\"\n",
+    "output: f'{cwd}/{_sample_id}_trimmed_{_input:bn}.gz'\n",
+    "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
+    "bash: container=container, expand= \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n",
+    "    java -jar -Xmx${java_mem} ${trimmomatic_jar} SE -threads ${numThreads} \\\n",
+    "        ${_input} \\\n",
+    "        ${_output} \\\n",
+    "        ILLUMINACLIP:${fasta_with_adapters_etc}:${seed_mismatches}:${palindrome_clip_threshold}:${simple_clip_threshold} \\\n",
+    "        LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${required_quality} MINLEN:${min_len}"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "unnecessary-necessity",
@@ -876,7 +919,6 @@
     "        --sjdbGTFfile ${gtf} \\\n",
     "\n",
     "    rm -r ${_output[0]:nnnn}._STARgenome\n",
-    "    rm -r ${_output[0]:nnnn}._STARtmp\n",
     "    touch ${_output[0]:n}.sort_start.timestamp\n",
     "    samtools sort --threads ${numThreads} -o ${_output[0]:nnnn}.Aligned.sortedByCoord.out.bam ${_output[0]:nnnn}.Aligned.out.bam # According to GTEX, this can help reducing the amount of memory consumption.\n",
     "    rm ${_output[0]:nnnn}.Aligned.out.bam \n",
@@ -1504,7 +1546,7 @@
     "    \n",
     "      metrics <- list()\n",
     "      for (i in 1:length(files)) {\n",
-    "        m <- read.table(files[i], header=TRUE, sep=\"\\t\", comment.char=\"#\", stringsAsFactors=FALSE, nrows=2)\n",
+    "        m <- read.table(files[i], header=TRUE, sep=\"\\t\", comment.char=\"#\", stringsAsFactors=FALSE, nrows=1)\n",
     "        metrics[[i]] <- data.frame(Sample=samples[i],\n",
     "                                   File=files[i],\n",
     "                                   PF_READS=sum(m$PF_READS[1:2]),\n",