RNA-seq Analysis Pipeline

A pipeline for processing paired-end RNA sequencing data. It runs quality control with FastQC, aligns reads with the STAR aligner, and generates gene-level counts with featureCounts, automating the workflow from raw FastQ files to gene count matrix generation.

Features

  • Automated processing of paired-end RNA-seq data
  • Quality control using FastQC
  • Efficient read alignment using STAR
  • Gene-level quantification using featureCounts
  • Support for multiple samples
  • Parallel processing capabilities
  • Comprehensive logging and error reporting

Prerequisites

  • Python
  • FastQC (v0.11.2+)
  • STAR (v2.5.3a+)
  • Subread/featureCounts (v1.6.0+)
  • Samtools (v1.3+)
  • R (v3.2.2+)
  • Reference genome and annotation files
  • Sufficient computational resources (recommended: 32GB+ RAM)

Installation

  1. Clone the repository:
git clone https://github.com/rohitrrj/RNAseq_Pipeline.git
cd RNAseq_Pipeline
  2. Ensure all required modules are available:
module load python fastqc/0.11.2 STAR/2.5.3a subread/1.6.0
module load samtools/1.3 r/3.2.2
  3. Configure your project:
cp conf.txt.example conf.txt
# Edit conf.txt with your project-specific paths
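
Before running anything, it can help to confirm that the tools listed under Prerequisites are actually on the PATH after the modules are loaded. The following is a minimal sketch, not part of the repository: the function name `missing_tools` is hypothetical, and `Rscript` is assumed to be the command that indicates an R installation.

```bash
# Report which of the given commands are missing from PATH.
# Prints one missing name per line; returns non-zero if any are missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Tool names follow the Prerequisites list; Rscript stands in for R (assumption).
if missing=$(missing_tools python fastqc STAR featureCounts samtools Rscript); then
    echo "All required tools found on PATH"
else
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing >&2
fi
```

If a tool is reported missing, re-run the `module load` commands above or check the module names on your system.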

Usage

  1. Prepare your input data:
  • Place paired-end FastQ files in the data directory
  • Naming convention: sample_R1_001.fastq.gz and sample_R2_001.fastq.gz
  2. Set up configuration (conf.txt):
myDATADIR="/path/to/fastq/files"
myGenomeDIR="/path/to/reference/genome"
myGenomeGTF="/path/to/annotation.gtf"
STAR_HG19_GENOME="/path/to/star/index"  # STAR genome index
N_CPUS=8  # Number of CPU cores to use
  3. Run the pipeline:
./RNA_seq_pipeline_STAR.sh
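
The naming convention above implies that each sample is identified by the part of the file name before `_R1_001.fastq.gz`, and that every R1 file needs an R2 mate. The sketch below shows one way to list complete pairs before launching a run. The function name `list_paired_samples` is hypothetical and the pipeline script may discover samples differently.

```bash
# Print the sample name for every R1 file that has a matching R2 mate.
# R1 files without a mate produce a warning on stderr.
list_paired_samples() {
    data_dir=$1
    for r1 in "$data_dir"/*_R1_001.fastq.gz; do
        [ -e "$r1" ] || continue          # glob matched nothing
        sample=$(basename "$r1" _R1_001.fastq.gz)
        if [ -f "$data_dir/${sample}_R2_001.fastq.gz" ]; then
            echo "$sample"
        else
            echo "WARNING: no R2 mate for $sample" >&2
        fi
    done
}

# Usage: list_paired_samples "$myDATADIR"
```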

Pipeline Steps

  1. Quality Control (FastQC)

    • Raw read quality assessment
    • Adapter content analysis
    • Quality metrics visualization
  2. Read Alignment (STAR)

    • Genome loading
    • Splice-aware alignment
    • BAM file generation
  3. Expression Quantification (featureCounts)

    • Gene-level count generation
    • Multi-threaded processing
    • Comprehensive counting statistics
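
Because each step consumes the output of the previous one, a failed step should stop the run. The sketch below illustrates that pattern with a hypothetical `run_step` wrapper that logs timestamps and propagates the exit code. The `true` commands are placeholders for the real FastQC, STAR and featureCounts calls, and the pipeline script's own logging may look different.

```bash
# Run one named step, log its start and end, and return its exit code.
run_step() {
    step_name=$1
    shift
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] START $step_name"
    if "$@"; then
        echo "[$(date '+%Y-%m-%d %H:%M:%S')] DONE  $step_name"
    else
        rc=$?
        echo "[$(date '+%Y-%m-%d %H:%M:%S')] FAIL  $step_name (exit $rc)" >&2
        return "$rc"
    fi
}

# Chain the steps with && so a failure skips everything after it.
# 'true' is a placeholder for the actual command of each step.
run_step "Quality Control"           true &&
run_step "Read Alignment"            true &&
run_step "Expression Quantification" true
```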

Output Structure

project_directory/
├── fastQC_output/           # Quality control reports
│   ├── *_fastqc.html
│   └── *_fastqc.zip
├── star_output/            # STAR alignment results
│   ├── *Aligned.out.bam
│   ├── *Log.final.out
│   └── *SJ.out.tab
└── featureCount_output/    # Gene count matrices
    ├── *.count.txt
    └── *.count.txt.summary
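
The layout above can be used to check whether a sample finished all steps. The following sketch creates the three directories and lists expected files that are missing for one sample. Both function names are hypothetical, and it assumes that output files are prefixed directly with the sample name, which the tree above does not confirm.

```bash
# Create the three output directories under a project directory.
make_output_dirs() {
    mkdir -p "$1/fastQC_output" "$1/star_output" "$1/featureCount_output"
}

# Print every expected per-sample output file that does not exist.
# Assumes the '*' in the tree above is the bare sample name.
missing_outputs() {
    project=$1
    sample=$2
    for f in \
        "star_output/${sample}Aligned.out.bam" \
        "star_output/${sample}Log.final.out" \
        "star_output/${sample}SJ.out.tab" \
        "featureCount_output/${sample}.count.txt" \
        "featureCount_output/${sample}.count.txt.summary"
    do
        [ -e "$project/$f" ] || echo "$f"
    done
}

# Usage: missing_outputs /path/to/project_directory sampleA
```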

Configuration

Edit conf.txt to specify:

# Required paths
myDATADIR="/path/to/data"              # FastQ files location
myGenomeDIR="/path/to/genome"          # Reference genome directory
myGenomeGTF="/path/to/annotation.gtf"  # Gene annotation file
STAR_HG19_GENOME="/path/to/star/index" # STAR genome index
N_CPUS=8                              # Number of CPU cores

# Optional parameters
h_vmem="10G"                          # Memory per core
h_rt="24:00:00"                       # Maximum runtime
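
A missing or misspelled setting in conf.txt is easier to catch before the pipeline starts. The sketch below sources the file in a subshell and reports required variables that are unset or empty. The function name `check_conf` is hypothetical, and it assumes conf.txt uses plain shell variable assignments, as the example above suggests.

```bash
# Source a conf.txt-style file and report required variables that are
# unset or empty. Returns 0 if all are set, 1 if any are missing,
# 2 if the file cannot be read.
check_conf() {
    conf_file=$1
    [ -r "$conf_file" ] || { echo "Cannot read $conf_file" >&2; return 2; }
    # A bare file name would be looked up on PATH by '.', so anchor it.
    case "$conf_file" in
        */*) ;;
        *) conf_file=./$conf_file ;;
    esac
    # Subshell: sourced variables do not leak into the calling shell.
    (
        . "$conf_file"
        rc=0
        for var in myDATADIR myGenomeDIR myGenomeGTF STAR_HG19_GENOME N_CPUS; do
            eval "value=\${$var:-}"
            if [ -z "$value" ]; then
                echo "Missing required setting: $var" >&2
                rc=1
            fi
        done
        exit "$rc"
    )
}

# Usage: check_conf conf.txt && ./RNA_seq_pipeline_STAR.sh
```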

STAR Alignment Parameters

Key alignment parameters used:

--outSAMstrandField intronMotif       # Add XS strand attribute to spliced reads, inferred from intron motifs
--outFilterIntronMotifs RemoveNoncanonical  # Filter non-canonical junctions
--outSAMtype BAM Unsorted             # Output unsorted BAM
--outReadsUnmapped Fastx              # Save unmapped reads
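
To show how these parameters fit into a full invocation, the sketch below prints (without running) a STAR command line for one paired-end sample. Only the last four options come from the list above. `--runThreadN`, `--genomeDir`, `--readFilesIn`, `--readFilesCommand` and `--outFileNamePrefix` are standard STAR options assumed here, and the actual call in RNA_seq_pipeline_STAR.sh may differ. The function name `star_command` is hypothetical.

```bash
# Print a STAR command line for one paired-end sample (dry run only).
# The output is meant for inspection; paths with spaces are not re-quoted.
star_command() {
    sample=$1 data_dir=$2 index_dir=$3 out_dir=$4 threads=$5
    printf '%s ' STAR \
        --runThreadN "$threads" \
        --genomeDir "$index_dir" \
        --readFilesIn "$data_dir/${sample}_R1_001.fastq.gz" "$data_dir/${sample}_R2_001.fastq.gz" \
        --readFilesCommand zcat \
        --outFileNamePrefix "$out_dir/$sample" \
        --outSAMstrandField intronMotif \
        --outFilterIntronMotifs RemoveNoncanonical \
        --outSAMtype BAM Unsorted \
        --outReadsUnmapped Fastx
    printf '\n'
}

# Example with placeholder paths:
star_command sampleA /path/to/fastq/files /path/to/star/index star_output 8
```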

featureCounts Parameters

Gene quantification settings:

-t exon           # Feature type
-g gene_id        # Attribute type
-T $N_CPUS        # Number of threads
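
As an illustration of how these settings combine, the sketch below prints (without running) a featureCounts command line for one sample. The `-a` (annotation) and `-o` (output) options are standard featureCounts options. `-p` is an assumption for paired-end data whose exact meaning depends on the Subread version, so check it against your installation. The function name `featurecounts_command` and the BAM naming are hypothetical.

```bash
# Print a featureCounts command line for one sample (dry run only).
featurecounts_command() {
    sample=$1 gtf=$2 bam_dir=$3 out_dir=$4 threads=$5
    printf '%s ' featureCounts \
        -p \
        -t exon \
        -g gene_id \
        -T "$threads" \
        -a "$gtf" \
        -o "$out_dir/${sample}.count.txt" \
        "$bam_dir/${sample}Aligned.out.bam"
    printf '\n'
}

# Example with placeholder paths:
featurecounts_command sampleA /path/to/annotation.gtf star_output featureCount_output 8
```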

Troubleshooting

Common Issues

  1. Memory Issues

    • Increase h_vmem in script header
    • Reduce number of parallel processes
    • Consider using smaller chunks of data
  2. STAR Alignment Errors

    • Verify genome index
    • Check disk space
    • Validate input FastQ format
  3. featureCounts Problems

    • Verify GTF file format
    • Check BAM file integrity
    • Ensure sufficient file permissions
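
For the "Validate input FastQ format" check, a quick test is to confirm that the gzip stream is intact and that the file contains whole records. The sketch below assumes the common four-lines-per-record FastQ layout; the function name `check_fastq_gz` is hypothetical.

```bash
# Check a gzipped FastQ file: the gzip stream must be intact and the
# line count must be a multiple of 4 (one record = 4 lines, assumed).
check_fastq_gz() {
    fq=$1
    gzip -t "$fq" 2>/dev/null || { echo "Corrupt gzip: $fq" >&2; return 1; }
    # Arithmetic expansion also strips any padding that wc adds.
    lines=$(( $(gzip -dc "$fq" | wc -l) ))
    if [ $((lines % 4)) -ne 0 ]; then
        echo "Truncated FastQ ($lines lines): $fq" >&2
        return 1
    fi
    echo "$fq records=$((lines / 4))"
}

# Usage: check_fastq_gz "$myDATADIR/sample_R1_001.fastq.gz"
```

For BAM files, `samtools quickcheck` performs a comparable basic integrity check.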

Error Messages

  • STAR: command not found - Module not loaded correctly
  • ERROR: can't open GTF file - Check file path and permissions
  • ERROR: no input files specified - Verify FastQ file naming

Performance Optimization

  1. Resource Allocation

    • Adjust N_CPUS based on system
    • Balance memory per core
    • Monitor disk I/O
  2. File Management

    • Use SSD for temporary files
    • Clean up intermediate files
    • Implement staged processing
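
When balancing memory per core, note that if h_vmem is a per-core request (as the Configuration comment describes it), the total request grows with N_CPUS. A small sketch of that arithmetic follows; the function name `total_mem_gb` is hypothetical and it only handles values with a `G` suffix.

```bash
# Total memory requested = cores x memory per core.
# Accepts h_vmem values with a G suffix only, e.g. "10G".
total_mem_gb() {
    cpus=$1
    per_core=${2%G}      # strip the trailing G
    echo $((cpus * per_core))
}

total_mem_gb 8 10G   # prints 80
```

With the example values N_CPUS=8 and h_vmem="10G", the total request is 80G, which is above the 32GB+ recommended under Prerequisites.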

License

This project is licensed under the MIT License - see the LICENSE file for details.

Applications

This pipeline has been used in the following publications:

  1. "PD-1 combination therapy with IL-2 modifies CD8+ T cell exhaustion program"

  2. "Aging-associated HELIOS deficiency in naive CD4+ T cells alters chromatin remodeling and promotes effector cell responses"

Code availability: rohitrrj/RNAseq_Pipeline (high-throughput RNA sequencing analysis pipeline)

Contributing

Contributions are welcome! Please read the contributing guidelines before submitting pull requests.

Acknowledgments

  • STAR aligner development team
  • Subread/featureCounts developers
  • Supporting institutions and funding
