Skip to content

Commit

Permalink
Update RNA_calling.ipynb
Browse files Browse the repository at this point in the history
  • Loading branch information
gouwh29 committed Dec 9, 2022
1 parent b55738b commit 5bb1dc3
Showing 1 changed file with 64 additions and 22 deletions.
86 changes: 64 additions & 22 deletions code/molecular_phenotypes/calling/RNA_calling.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -100,6 +100,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "rough-robinson",
"metadata": {
Expand All @@ -116,6 +117,7 @@
"sample_2 samp2_r1.fq.gz samp2_r2.fq.gz fr 150\n",
"sample_3 samp3_r1.fq.gz samp3_r2.fq.gz strand_missing 75\n",
"```\n",
"**Warning**: For single-ended (SE) data, your input data should exclude the `fq2` column. For other 2 columns, the `strand` column is optional (the `strand` information can be detected in `STAR_output` pipeline) and the `read_length` colunm (can be validated/infered from `fastqc` report) is **required**. If no prior information is avaiable for `read_length`, leave the column empty\n",
"\n",
"All the fastq files should be available under specified folder (default assumes the same folder as where the meta-data file is)."
]
Expand Down Expand Up @@ -195,11 +197,12 @@
" --data-dir data \\\n",
" --STAR-index reference_data/STAR_Index/ \\\n",
" --container containers/rna_quantification.sif \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \\\n",
" --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat "
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "f04410f9-6721-4bd5-8aea-ef9150103347",
"metadata": {
Expand All @@ -208,7 +211,9 @@
"source": [
"To align the reads with STAR and generate the bam_list recipe for downstream molecular phenotype count matrixes. The `-J 3 -c csg.yml -q csg` part is crucial for it ask for the required memory to conduct the STAR alignment.\n",
"\n",
"***Warning***: It is of crucial importance that, the STAR Alignment shall be ran with the gtf file that were generated `before collapsing to gene`, i.e. the one that are used to generated RSEM index. Other wise there will be an [error while running RSEM later on](https://github.com/cumc/xqtl-pipeline/issues/390)"
"***Warning1***: It is of crucial importance that, the STAR Alignment shall be ran with the gtf file that were generated `before collapsing to gene`, i.e. the one that are used to generated RSEM index. Other wise there will be an [error while running RSEM later on](https://github.com/cumc/xqtl-pipeline/issues/390)\n",
"\n",
"***Warning2***: If the `read_length` colunm is empty in your original meta-data file `sample_fastq.list`, please add following arg: `--sjdbOverhang 100` (default value of the `STAR` alignment) or `--sjdbOverhang {read_length}` (if you can deduce the correct read length from the QC report) in `STAR_output` pipeline. The default value 100 will be sufficient to give correct results. "
]
},
{
Expand Down Expand Up @@ -259,7 +264,7 @@
" --STAR-index reference_data/STAR_Index/ \\\n",
" --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gene.gtf \\\n",
" --container containers/rna_quantification.sif \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \\\n",
" --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat \\\n",
" --bam_list output/rnaseq/sample_fastq_bam_list -J 3 -c csg.yml -q csg"
]
Expand Down Expand Up @@ -291,21 +296,11 @@
" --RSEM-index reference_data/RSEM_Index/ \\\n",
" --gtf reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf \\\n",
" --container containers/rna_quantification.sif \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta.fasta \\\n",
" --reference-fasta reference_data/GRCh38_full_analysis_set_plus_decoy_hla.noALT_noHLA_noDecoy.fasta \\\n",
" --ref-flat reference_data/Homo_sapiens.GRCh38.103.chr.reformated.ERCC.gtf.ref.flat \\\n",
" --bam_list output/rnaseq/sample_fastq_bam_list -J 3 -c csg.yml -q csg"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b5689265-c9d3-40a6-adf9-21349a342e68",
"metadata": {
"kernel": "Bash"
},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "temporal-worse",
Expand Down Expand Up @@ -578,11 +573,10 @@
"source": [
"[fastqc]\n",
"input: fastq, group_by = 1\n",
"output: f'{cwd}/{_input:bn}_fastqc.html',f'{cwd}/{_input:bn}_fastqc/fastqc_data.txt' \n",
"output: f'{cwd}/{_input:bn}_fastqc.html',f'{cwd}/{_input:bn}_fastqc.zip' \n",
"task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
"bash: expand= \"${ }\", stderr = f'{_output[0]:n}.stderr', stdout = f'{_output[0]:n}.stdout', container=container\n",
" fastqc ${_input} -o ${_output[0]:d}\n",
" unzip -o ${_output[0]:n}.zip -d ${cwd}"
" fastqc ${_input} -o ${_output[0]:d}"
]
},
{
Expand Down Expand Up @@ -676,6 +670,7 @@
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "registered-shooting",
"metadata": {
Expand All @@ -688,7 +683,7 @@
"\n",
"Documentation: [Trimmomatic](http://www.usadellab.org/cms/?page=trimmomatic)\n",
"\n",
"We have replaced this with `fastp`, see above, which performs better than `Trimmomatic` in terms of removing adaptors that `Trimmomatic` cannot detect. `fastp` can also automatically guess the adapter sequences from data and by default no adapter sequence is required for input to `fastp`.\n",
"We have replaced this with `fastp`, see above, which performs better than `Trimmomatic` in terms of removing adaptors that `Trimmomatic` cannot detect. `fastp` can also automatically guess the adapter sequences from data and by default no adapter sequence is required for input to `fastp`. **PE data and SE data have different pipline** (see below)\n",
"\n",
"### Step Inputs\n",
"\n",
Expand All @@ -707,7 +702,9 @@
">For single-ended data, one input and one output file are specified, plus the processing steps. For paired-end data, two input files are specified, and 4 output files, 2 for the 'paired' output where both reads survived the processing, and 2 for corresponding 'unpaired' output where a read survived, but the partner read did not.\n",
"\n",
"\n",
"**You need to figure out from fastqc results what adapter reference sequence to use.**, eg `--fasta_with_adapters_etc TruSeq3-PE.fa`. These files can be downloaded from `Trimmomatic` github repo."
"**You need to figure out from fastqc results what adapter reference sequence to use.**, eg `--fasta_with_adapters_etc TruSeq3-PE.fa`. These files can be downloaded from `Trimmomatic` github repo.\n",
"\n",
"* For pair-ended (PE) data:"
]
},
{
Expand All @@ -720,7 +717,7 @@
},
"outputs": [],
"source": [
"[trimmomatic_trim_adaptor]\n",
"[trimmomatic_trim_adaptor_PE]\n",
"# Path to the software. Default set to using our rna_quantification.sif image\n",
"parameter: trimmomatic_jar = \"/opt/Trimmomatic-0.39/trimmomatic-0.39.jar\"\n",
"# Illumina clip setting\n",
Expand Down Expand Up @@ -754,6 +751,52 @@
" LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${required_quality} MINLEN:${min_len}"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "21b1c985",
"metadata": {},
"source": [
"* For single-ended (SE) data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "turkish-alfred",
"metadata": {
"kernel": "SoS",
"tags": []
},
"outputs": [],
"source": [
"[trimmomatic_trim_adaptor_SE]\n",
"# Path to the software. Default set to using our rna_quantification.sif image\n",
"parameter: trimmomatic_jar = \"/opt/Trimmomatic-0.39/trimmomatic-0.39.jar\"\n",
"# Illumina clip setting\n",
"# Path to the reference adaptors\n",
"parameter: fasta_with_adapters_etc = path(\".\")\n",
"parameter: seed_mismatches = 2\n",
"parameter: palindrome_clip_threshold = 30\n",
"parameter: simple_clip_threshold = 10\n",
"# sliding window setting\n",
"parameter: window_size = 4\n",
"parameter: required_quality = 20\n",
"# Other settings\n",
"parameter: leading = 3\n",
"parameter: trailing = 3\n",
"parameter: min_len = 50\n",
"input: fastq, group_by = is_paired_end + 1, group_with = \"sample_id\"\n",
"output: f'{cwd}/{_sample_id}_trimmed_{_input:bn}.gz'\n",
"task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads\n",
"bash: container=container, expand= \"${ }\", stderr = f'{_output:n}.stderr', stdout = f'{_output:n}.stdout'\n",
" java -jar -Xmx${java_mem} ${trimmomatic_jar} SE -threads ${numThreads} \\\n",
" ${_input} \\\n",
" ${_output} \\\n",
" ILLUMINACLIP:${fasta_with_adapters_etc}:${seed_mismatches}:${palindrome_clip_threshold}:${simple_clip_threshold} \\\n",
" LEADING:${leading} TRAILING:${trailing} SLIDINGWINDOW:${window_size}:${required_quality} MINLEN:${min_len}"
]
},
{
"cell_type": "markdown",
"id": "unnecessary-necessity",
Expand Down Expand Up @@ -876,7 +919,6 @@
" --sjdbGTFfile ${gtf} \\\n",
"\n",
" rm -r ${_output[0]:nnnn}._STARgenome\n",
" rm -r ${_output[0]:nnnn}._STARtmp\n",
" touch ${_output[0]:n}.sort_start.timestamp\n",
" samtools sort --threads ${numThreads} -o ${_output[0]:nnnn}.Aligned.sortedByCoord.out.bam ${_output[0]:nnnn}.Aligned.out.bam # According to GTEX, this can help reducing the amount of memory consumption.\n",
" rm ${_output[0]:nnnn}.Aligned.out.bam \n",
Expand Down Expand Up @@ -1504,7 +1546,7 @@
" \n",
" metrics <- list()\n",
" for (i in 1:length(files)) {\n",
" m <- read.table(files[i], header=TRUE, sep=\"\\t\", comment.char=\"#\", stringsAsFactors=FALSE, nrows=2)\n",
" m <- read.table(files[i], header=TRUE, sep=\"\\t\", comment.char=\"#\", stringsAsFactors=FALSE, nrows=1)\n",
" metrics[[i]] <- data.frame(Sample=samples[i],\n",
" File=files[i],\n",
" PF_READS=sum(m$PF_READS[1:2]),\n",
Expand Down

0 comments on commit 5bb1dc3

Please sign in to comment.