A comprehensive pipeline for processing RNA sequencing data using STAR aligner and generating gene expression counts. This pipeline automates the entire workflow from raw FastQ files to gene count matrix generation.
- Features
- Prerequisites
- Installation
- Usage
- Pipeline Steps
- Output Structure
- Configuration
- Troubleshooting
- Automated processing of paired-end RNA-seq data
- Quality control using FastQC
- Efficient read alignment using STAR
- Gene-level quantification using featureCounts
- Support for multiple samples
- Parallel processing capabilities
- Comprehensive logging and error reporting
- Python
- FastQC (v0.11.2+)
- STAR (v2.5.3a+)
- Subread/featureCounts (v1.6.0+)
- Samtools (v1.3+)
- R (v3.2.2+)
- Reference genome and annotation files
- Sufficient computational resources (recommended: 32GB+ RAM)
- Clone the repository:
git clone https://github.com/rohitrrj/RNAseq_Pipeline.git
cd RNAseq_Pipeline
- Ensure all required modules are available:
module load python fastqc/0.11.2 STAR/2.5.3a subread/1.6.0
module load samtools/1.3 r/3.2.2
- Configure your project:
cp conf.txt.example conf.txt
# Edit conf.txt with your project-specific paths
- Prepare your input data:
- Place paired-end FastQ files in the data directory
- Naming convention: sample_R1_001.fastq.gz and sample_R2_001.fastq.gz
- Set up configuration (conf.txt):
myDATADIR="/path/to/fastq/files"
myGenomeDIR="/path/to/reference/genome"
myGenomeGTF="/path/to/annotation.gtf"
N_CPUS=8 # Number of CPU cores to use
- Run the pipeline:
./RNA_seq_pipeline_STAR.sh
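Before running, it can help to confirm that every R1 file has its R2 mate. The following is a minimal sketch that assumes only the naming convention above; the helper names (sample_name, r2_for, list_paired_samples) are hypothetical and are not part of RNA_seq_pipeline_STAR.sh.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not taken from the pipeline script.
# Assumes files are named <sample>_R1_001.fastq.gz and <sample>_R2_001.fastq.gz.

# Print the sample prefix for an R1 file path.
sample_name() {
    local r1_base
    r1_base=$(basename "$1")
    printf '%s\n' "${r1_base%_R1_001.fastq.gz}"
}

# Print the R2 path expected to pair with an R1 path.
r2_for() {
    printf '%s\n' "${1%_R1_001.fastq.gz}_R2_001.fastq.gz"
}

# List samples in a data directory for which both mates are present.
list_paired_samples() {
    local data_dir=$1 r1
    for r1 in "$data_dir"/*_R1_001.fastq.gz; do
        [ -e "$r1" ] || continue    # the glob matched nothing
        if [ -e "$(r2_for "$r1")" ]; then
            sample_name "$r1"
        else
            echo "warning: no R2 mate for $r1" >&2
        fi
    done
}
```

Calling list_paired_samples "$myDATADIR" prints one sample name per line and warns on stderr about any R1 file without a mate.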
- Quality Control (FastQC)
  - Raw read quality assessment
  - Adapter content analysis
  - Quality metrics visualization
- Read Alignment (STAR)
  - Genome loading
  - Splice-aware alignment
  - BAM file generation
- Expression Quantification (featureCounts)
  - Gene-level count generation
  - Multi-threaded processing
  - Comprehensive counting statistics
project_directory/
├── fastQC_output/ # Quality control reports
│ ├── *_fastqc.html
│ └── *_fastqc.zip
├── star_output/ # STAR alignment results
│ ├── *Aligned.out.bam
│ ├── *Log.final.out
│ └── *SJ.out.tab
└── featureCount_output/ # Gene count matrices
├── *.count.txt
└── *.count.txt.summary
Edit conf.txt to specify:
# Required paths
myDATADIR="/path/to/data" # FastQ files location
myGenomeDIR="/path/to/genome" # Reference genome directory
myGenomeGTF="/path/to/annotation.gtf" # Gene annotation file
STAR_HG19_GENOME="/path/to/star/index" # STAR genome index
N_CPUS=8 # Number of CPU cores
# Optional parameters
h_vmem="10G" # Memory per core
h_rt="24:00:00" # Maximum runtime
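Because conf.txt consists of shell variable assignments, one way to catch a missing setting early is to source the file and check each required name. This is a sketch under that assumption; check_conf is a hypothetical helper, and the pipeline script may read its configuration differently.

```bash
#!/usr/bin/env bash
# Sketch only: check_conf is illustrative and not part of the pipeline script.
# Assumes conf.txt contains plain shell assignments, as in the example above.

# Source a config file and verify that the required variables are non-empty.
# Returns 0 if all are set, 1 otherwise.
check_conf() {
    local conf_file=$1 name missing=0
    if [ ! -r "$conf_file" ]; then
        echo "error: cannot read $conf_file" >&2
        return 1
    fi
    # shellcheck disable=SC1090
    . "$conf_file"
    for name in myDATADIR myGenomeDIR myGenomeGTF STAR_HG19_GENOME N_CPUS; do
        # ${!name} is bash indirect expansion: the value of the variable named by $name.
        if [ -z "${!name:-}" ]; then
            echo "error: $name is not set in $conf_file" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Running check_conf conf.txt || exit 1 at the top of a wrapper script stops the run before any alignment starts.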
Key alignment parameters used:
--outSAMstrandField intronMotif # Include strand field
--outFilterIntronMotifs RemoveNoncanonical # Filter non-canonical junctions
--outSAMtype BAM Unsorted # Output unsorted BAM
--outReadsUnmapped Fastx # Save unmapped reads
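For orientation, the alignment parameters above could be combined into a full STAR invocation roughly as follows. This is a sketch, not the command from RNA_seq_pipeline_STAR.sh: the build_star_args helper, the choice of STAR_HG19_GENOME as the index path, the zcat decompression step, and the star_output/ prefix are all assumptions based on the configuration and output layout described in this README.

```bash
#!/usr/bin/env bash
# Sketch only: build_star_args is illustrative; the real script may differ.

# Build the STAR argument list for one paired-end sample.
# Arguments: sample name, R1 path, R2 path.
# The result is stored in the global array STAR_ARGS.
build_star_args() {
    local sample=$1 r1=$2 r2=$3
    STAR_ARGS=(
        --runThreadN "${N_CPUS:-1}"
        --genomeDir "$STAR_HG19_GENOME"              # assumed: index path from conf.txt
        --readFilesIn "$r1" "$r2"
        --readFilesCommand zcat                      # assumed: inputs are gzip-compressed
        --outFileNamePrefix "star_output/${sample}"  # assumed: matches the output layout
        --outSAMstrandField intronMotif
        --outFilterIntronMotifs RemoveNoncanonical
        --outSAMtype BAM Unsorted
        --outReadsUnmapped Fastx
    )
}

# Usage (not executed here):
#   build_star_args sample sample_R1_001.fastq.gz sample_R2_001.fastq.gz
#   STAR "${STAR_ARGS[@]}"
```

Keeping the arguments in an array preserves quoting for paths that contain spaces and makes it easy to print the command before running it.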
Gene quantification settings:
-t exon # Feature type
-g gene_id # Attribute type
-T $N_CPUS # Number of threads
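The quantification settings above might be assembled into a featureCounts call along these lines. This is a sketch under assumptions: build_featurecounts_args is a hypothetical helper, the output path follows the featureCount_output/ layout shown earlier, and paired-end options are left out because the settings listed here do not include them.

```bash
#!/usr/bin/env bash
# Sketch only: build_featurecounts_args is illustrative; the real script may differ.

# Build the featureCounts argument list for one sample.
# Arguments: sample name, BAM path.
# The result is stored in the global array FC_ARGS.
build_featurecounts_args() {
    local sample=$1 bam=$2
    FC_ARGS=(
        -T "${N_CPUS:-1}"                             # number of threads
        -t exon                                       # feature type to count
        -g gene_id                                    # attribute used to group features
        -a "$myGenomeGTF"                             # annotation file from conf.txt
        -o "featureCount_output/${sample}.count.txt"  # assumed output location
        "$bam"
    )
}

# Usage (not executed here):
#   build_featurecounts_args sample star_output/sampleAligned.out.bam
#   featureCounts "${FC_ARGS[@]}"
```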
- Memory Issues
  - Increase h_vmem in script header
  - Reduce number of parallel processes
  - Consider using smaller chunks of data
- STAR Alignment Errors
  - Verify genome index
  - Check disk space
  - Validate input FastQ format
- featureCounts Problems
  - Verify GTF file format
  - Check BAM file integrity
  - Ensure sufficient file permissions
- "STAR: command not found": module not loaded correctly
- "ERROR: can't open GTF file": check file path and permissions
- "ERROR: no input files specified": verify FastQ file naming
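A "command not found" failure can be caught before the pipeline starts by checking that each tool is on the PATH. The sketch below uses the shell builtin command -v; missing_tools is a hypothetical helper, and the executable names in the usage comment are the usual ones for the tools listed under Prerequisites.

```bash
#!/usr/bin/env bash
# Sketch only: missing_tools is illustrative and not part of the pipeline script.

# Print the name of every given tool that cannot be found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Usage (not executed here):
#   missing=$(missing_tools fastqc STAR featureCounts samtools Rscript)
#   [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
```

If anything is reported, reloading the modules from the installation step is the first thing to try.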
- Resource Allocation
  - Adjust N_CPUS based on system
  - Balance memory per core
  - Monitor disk I/O
- File Management
  - Use SSD for temporary files
  - Clean up intermediate files
  - Implement staged processing
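One simple way to adjust N_CPUS to the system is to cap the requested thread count at the number of online cores. This is a sketch, assuming getconf _NPROCESSORS_ONLN is available (it is on typical Linux and macOS systems); pick_threads is a hypothetical helper, and on a shared cluster the scheduler's allocation should take precedence over the machine's core count.

```bash
#!/usr/bin/env bash
# Sketch only: pick_threads is illustrative and not part of the pipeline script.

# Print the smaller of the requested thread count and the online core count.
pick_threads() {
    local requested=$1 available
    # Fall back to a single core if the query fails or returns nothing.
    available=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || available=1
    [ -n "$available" ] || available=1
    if [ "$requested" -gt "$available" ]; then
        printf '%s\n' "$available"
    else
        printf '%s\n' "$requested"
    fi
}

# Usage (not executed here):
#   N_CPUS=$(pick_threads "$N_CPUS")
```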
This project is licensed under the MIT License - see the LICENSE file for details.
This pipeline has been used in the following publications:
- "PD-1 combination therapy with IL-2 modifies CD8+ T cell exhaustion program"
  - Nature. 2022 Oct;610(7933):737-743
  - DOI: 10.1038/s41586-022-05257-0
  - PMID: 36171288
  - PMCID: PMC9793890
  - Used for transcriptome analysis of exhausted T cells
- "Aging-associated HELIOS deficiency in naive CD4+ T cells alters chromatin remodeling and promotes effector cell responses"
  - Nat Immunol. 2023 Jan;24(1):96-109
  - DOI: 10.1038/s41590-022-01369-x
  - PMID: 36510022
  - PMCID: PMC10118794
  - Used for analyzing bone marrow T cell progenitor transcriptome
Code availability: rohitrrj/RNAseq_Pipeline - High-throughput RNA sequencing analysis pipeline
Contributions are welcome! Please read the contributing guidelines before submitting pull requests.
- STAR aligner development team
- Subread/featureCounts developers
- Supporting institutions and funding