This project is an RNA-seq analysis pipeline that performs differential expression analysis, co-expression network analysis (using WGCNA), and data visualization. It takes raw RNA-seq count data from publicly available GEO datasets and integrates the key steps of transcriptomic analysis (normalization, differential expression testing, and co-expression analysis), producing visual outputs for biological interpretation.
- Automated RNA-seq Pipeline: The entire process from loading GEO datasets to generating publication-quality visualizations is fully automated.
- Differential Expression Analysis: The pipeline uses DESeq2 to identify differentially expressed genes across conditions and outputs various visualizations.
- Co-expression Network Analysis (WGCNA): WGCNA is used to identify modules of co-expressed genes and generate a network structure based on expression similarities.
- Interactive Visualizations: Generates easy-to-interpret visual outputs, including PCA, heatmaps, volcano plots, and MA plots.
- Gene ID Mapping: Automatically maps NCBI Entrez Gene IDs to HGNC gene symbols using the org.Hs.eg.db database for easy interpretation.
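As an illustration of the gene ID mapping step, here is a minimal sketch built on `AnnotationDbi::mapIds()` and `org.Hs.eg.db`. The helper name `map_entrez_to_symbol` is hypothetical, and falling back to the original Entrez ID when no symbol is found is an assumption, not necessarily what the pipeline does.

```r
library(AnnotationDbi)
library(org.Hs.eg.db)

# Hypothetical helper: translate Entrez Gene IDs into gene symbols.
# Assumption: IDs without a symbol keep their Entrez ID as the label.
map_entrez_to_symbol <- function(entrez_ids) {
  entrez_ids <- as.character(entrez_ids)
  symbols <- mapIds(org.Hs.eg.db,
                    keys = entrez_ids,
                    column = "SYMBOL",
                    keytype = "ENTREZID",
                    multiVals = "first")  # keep one symbol per ID
  symbols <- unname(symbols)
  ifelse(is.na(symbols), entrez_ids, symbols)
}

gene_symbols <- map_entrez_to_symbol(c("7157", "672"))
```

Here `7157` and `672` are the Entrez IDs for TP53 and BRCA1, so `gene_symbols` should contain those two symbols.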
The volcano plot displays each gene by its log2 fold change and adjusted p-value (padj). Genes that exceed both the fold-change and significance cutoffs are highlighted in red.
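The highlighting logic behind a volcano plot can be sketched in base R. The cutoffs used here (|log2 fold change| > 1, padj < 0.05), the helper names `classify_volcano` and `plot_volcano`, and the toy results table are assumptions for illustration; the pipeline's actual thresholds may differ.

```r
# Hypothetical helper: label each gene for volcano-plot colouring.
# Assumed cutoffs: |log2FC| > 1 and padj < 0.05.
classify_volcano <- function(res, lfc_cutoff = 1, padj_cutoff = 0.05) {
  sig <- !is.na(res$padj) & res$padj < padj_cutoff &
    !is.na(res$log2FoldChange) & abs(res$log2FoldChange) > lfc_cutoff
  ifelse(sig, "significant", "not significant")
}

# Hypothetical helper: draw the plot with base graphics.
plot_volcano <- function(res) {
  plot(res$log2FoldChange, -log10(res$padj),
       col = ifelse(res$status == "significant", "red", "grey50"),
       pch = 19, xlab = "log2 fold change", ylab = "-log10(padj)")
}

# Toy results table shaped like DESeq2 output (values are made up).
res <- data.frame(
  gene = c("A", "B", "C", "D"),
  log2FoldChange = c(2.5, -1.8, 0.3, 3.0),
  padj = c(0.001, 0.01, 0.0001, NA)
)
res$status <- classify_volcano(res)
# plot_volcano(res)  # uncomment to draw the figure
```

Gene C is statistically significant but has a small effect size, and gene D has no adjusted p-value, so neither would be highlighted.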
The PCA plot of the variance-stabilized data shows how samples separate along the top principal components, offering insight into the major sources of variance in the data.
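A minimal sketch of the PCA step using base R's `prcomp()`. The matrix `vsd_mat` is simulated here as a stand-in for real variance-stabilized data, and the sample labels are assumptions.

```r
set.seed(1)

# Simulated stand-in for a variance-stabilized matrix (genes x samples).
vsd_mat <- matrix(rnorm(200 * 6), nrow = 200,
                  dimnames = list(paste0("gene", 1:200),
                                  paste0("sample", 1:6)))
# Shift a block of genes in the last three samples to mimic a group effect.
vsd_mat[1:20, 4:6] <- vsd_mat[1:20, 4:6] + 3

# prcomp() expects samples in rows, so transpose the matrix.
pca <- prcomp(t(vsd_mat), center = TRUE, scale. = FALSE)

# Percentage of total variance explained by each component.
pct_var <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# Data frame ready for plotting PC1 against PC2.
pca_df <- data.frame(sample = rownames(pca$x),
                     PC1 = pca$x[, 1],
                     PC2 = pca$x[, 2],
                     condition = rep(c("normal", "tumor"), each = 3))
```

Plotting `PC1` against `PC2`, coloured by `condition` and with the `pct_var` values on the axis labels, gives the usual sample-separation view.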
The MA plot shows the log2 fold change of every gene against its mean expression, with significant genes highlighted.
The heatmap provides a visualization of the expression patterns of the top 50 differentially expressed genes across samples. Each row represents a gene, and each column represents a sample. The color intensity indicates the relative expression level after centering (subtracting the row mean). Samples are clustered by similarity to reveal patterns associated with different conditions, such as “normal” and “tumor,” allowing for quick identification of co-expression patterns and differences between groups.
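The row-centering described above can be sketched in a few lines of base R. The data are simulated, and ranking genes by adjusted p-value to pick the top 50 is an assumption about how "top" is defined.

```r
set.seed(42)

# Simulated stand-ins: a variance-stabilized matrix and adjusted p-values.
vsd_mat <- matrix(rnorm(300 * 6, mean = 8), nrow = 300,
                  dimnames = list(paste0("gene", 1:300),
                                  paste0("sample", 1:6)))
padj <- runif(300)
names(padj) <- rownames(vsd_mat)

# Assumption: "top 50" means the 50 smallest adjusted p-values.
top_genes <- names(sort(padj))[1:50]

# Centre each gene by subtracting its row mean.
heat_mat <- vsd_mat[top_genes, ]
heat_mat <- heat_mat - rowMeans(heat_mat)

# With pheatmap installed, the centred matrix can be drawn with:
# pheatmap::pheatmap(heat_mat)
```

After centering, each row has a mean of zero, so colours reflect deviation from a gene's own average rather than its absolute expression level.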
This dendrogram displays co-expressed gene modules identified through hierarchical clustering in WGCNA. Modules are color-coded based on gene similarity, providing a visual grouping of genes that are likely co-regulated. Each branch represents a module with genes sharing similar expression patterns, potentially associated with specific biological functions.
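To make the idea behind the dendrogram concrete, here is a deliberately simplified base R sketch. It is not WGCNA itself: WGCNA builds on a topological overlap matrix and dynamic tree cutting, whereas this example clusters directly on a soft-thresholded correlation and cuts the tree into a fixed number of groups. The soft power of 6 and the simulated data are assumptions.

```r
set.seed(7)
n_samples <- 12

# Two uncorrelated latent signals, each driving 15 simulated genes.
s1 <- rep(c(-1, 1), 6)
s2 <- rep(c(-1, -1, 1, 1), 3)
expr <- cbind(
  sapply(1:15, function(i) s1 + rnorm(n_samples, sd = 0.3)),
  sapply(1:15, function(i) s2 + rnorm(n_samples, sd = 0.3))
)
colnames(expr) <- paste0("gene", 1:30)  # samples x genes

# Soft-thresholded adjacency (assumed power of 6).
soft_power <- 6
adjacency <- abs(cor(expr))^soft_power

# Average-linkage hierarchical clustering on 1 - adjacency.
gene_tree <- hclust(as.dist(1 - adjacency), method = "average")

# Simplification: cut into a fixed number of modules.
modules <- cutree(gene_tree, k = 2)

# plot(gene_tree)  # uncomment to draw the dendrogram
```

Genes driven by the same latent signal end up on the same branch, which is the pattern the colour-coded module dendrogram summarizes.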
The module-trait relationship heatmap provides a summary of the correlations between gene modules and external sample traits (e.g., tumor vs. normal). Each row represents a module (colored by module name, such as "blue" or "turquoise"), and each column represents a sample trait. The color of each cell reflects the strength and direction of the correlation: red for positive correlations, blue for negative correlations, and white for no correlation.
The heatmap legend indicates the correlation scale from -1 to 1. Modules with strong positive or negative correlations to a trait may contain genes involved in processes related to that condition. This heatmap allows for quick identification of potentially biologically relevant modules associated with specific sample traits, making it a useful tool for further investigation into gene-trait relationships.
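The values in a module-trait heatmap are correlations between each module's eigengene and a sample trait. A minimal base R sketch, assuming the eigengene is the first principal component of the scaled module expression and the trait is coded 0 for normal and 1 for tumor (the helper `module_eigengene` and the simulated modules are hypothetical):

```r
# Hypothetical helper: first principal component of a module
# (samples x genes), sign-aligned with the module's average expression.
module_eigengene <- function(expr_module) {
  pc <- prcomp(expr_module, center = TRUE, scale. = TRUE)$x[, 1]
  if (cor(pc, rowMeans(scale(expr_module))) < 0) pc <- -pc
  pc
}

set.seed(3)
trait <- rep(c(0, 1), each = 6)  # assumed coding: 0 = normal, 1 = tumor

# Simulated modules: "blue" rises with the trait, "turquoise" falls.
blue <- sapply(1:10, function(i) 2 * trait + rnorm(12, sd = 0.5))
turquoise <- sapply(1:10, function(i) -2 * trait + rnorm(12, sd = 0.5))

eigengenes <- data.frame(blue = module_eigengene(blue),
                         turquoise = module_eigengene(turquoise))

# Correlation and p-value of each eigengene with the trait.
module_trait_cor <- sapply(eigengenes, function(me) cor(me, trait))
module_trait_p <- sapply(eigengenes,
                         function(me) cor.test(me, trait)$p.value)
```

In this toy setup the blue module would appear as a red cell (positive correlation) and the turquoise module as a blue cell (negative correlation).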
Data Preprocessing:
- Download sample metadata from GEO.
- Load the RNA-seq count data.
- Map gene IDs (Entrez to HGNC symbols).
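The loading step can be sketched as follows, assuming the counts file is tab-separated with gene IDs in the first column and one column per GEO sample accession. The helpers `load_counts` and `align_samples` are hypothetical names, and the tiny demo file stands in for a real download.

```r
# Hypothetical helper: read a tab-separated count table into a matrix.
# Assumption: first column holds gene IDs, remaining columns are samples.
load_counts <- function(counts_file) {
  counts <- read.delim(counts_file, row.names = 1, check.names = FALSE)
  as.matrix(counts)
}

# Hypothetical helper: keep only samples present in both tables,
# in the same order.
align_samples <- function(counts, metadata) {
  shared <- intersect(colnames(counts), rownames(metadata))
  list(counts = counts[, shared, drop = FALSE],
       metadata = metadata[shared, , drop = FALSE])
}

# Demo with a tiny made-up file. Real metadata could come from GEOquery,
# e.g. Biobase::pData(GEOquery::getGEO(geo_id)[[1]]).
demo <- matrix(c(10L, 0L, 25L, 3L, 7L, 12L), nrow = 2,
               dimnames = list(c("7157", "672"),
                               c("GSM1", "GSM2", "GSM3")))
tmp <- tempfile(fileext = ".tsv")
write.table(demo, tmp, sep = "\t", quote = FALSE, col.names = NA)

metadata <- data.frame(condition = c("tumor", "normal"),
                       row.names = c("GSM3", "GSM1"))
aligned <- align_samples(load_counts(tmp), metadata)
```

Aligning columns and metadata rows before building the analysis object avoids silently mislabelled samples.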
Differential Expression:
- Run DESeq2 to perform differential expression analysis between conditions.
- Output results as CSV and generate a volcano plot to visualize significant genes.
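A minimal sketch of the DESeq2 step. The counts are simulated with `makeExampleDESeqDataSet()` so the example is self-contained; the condition labels, the `~ condition` design, the tumor-versus-normal contrast, and writing to a temporary directory are all assumptions for illustration.

```r
library(DESeq2)
set.seed(10)

# Simulated counts stand in for a real count matrix (genes x samples).
sim <- makeExampleDESeqDataSet(n = 500, m = 6)
count_mat <- counts(sim)

# Assumed sample annotation: three normal and three tumor samples.
sample_info <- data.frame(
  condition = factor(rep(c("normal", "tumor"), each = 3),
                     levels = c("normal", "tumor")),
  row.names = colnames(count_mat)
)

dds <- DESeqDataSetFromMatrix(countData = count_mat,
                              colData = sample_info,
                              design = ~ condition)
dds <- DESeq(dds, quiet = TRUE)

# Tumor versus normal; positive log2FoldChange means higher in tumor.
res <- results(dds, contrast = c("condition", "tumor", "normal"))
res_df <- as.data.frame(res)
res_df <- res_df[order(res_df$padj), ]

out_file <- file.path(tempdir(), "differential_expression_results.csv")
write.csv(res_df, out_file)
```

DESeq2 expects raw, unnormalized counts as input, which is why this step does not depend on the variance stabilizing transformation.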
Normalization:
- Normalize the RNA-seq data using variance stabilizing transformation (VST).
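A minimal sketch of the transformation using DESeq2. The example calls `varianceStabilizingTransformation()` because the simulated dataset is small; on full-size data the faster `vst()` wrapper is a common choice. The `blind = TRUE` setting and the simulated input are assumptions.

```r
library(DESeq2)
set.seed(11)

# Simulated dataset standing in for real counts.
dds <- makeExampleDESeqDataSet(n = 500, m = 6)

# blind = TRUE ignores the design when estimating dispersions (assumed).
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)

# Extract the transformed matrix (genes x samples) for PCA,
# heatmaps, or co-expression analysis.
vsd_mat <- assay(vsd)
```

The transformed values are on a log2-like scale with reduced dependence of the variance on the mean, which suits distance-based methods such as PCA and clustering.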
Co-expression Analysis:
- Perform WGCNA to detect gene modules that are co-expressed.
- Output dendrogram plots and visualize co-expression networks.
- R: The pipeline is built using R and Bioconductor packages.
- Packages Used:
  - DESeq2: For differential expression analysis and normalization.
  - WGCNA: For co-expression network analysis.
  - GEOquery: To download and process GEO datasets.
  - ggplot2, pheatmap: For generating heatmaps and visualizations.
  - EnhancedVolcano: For creating volcano plots.
  - clusterProfiler: For KEGG pathway enrichment analysis.
  - org.Hs.eg.db: For gene ID mapping.
- differential_expression_results.csv: Contains log fold changes, p-values, and adjusted p-values for each gene.
- Volcano Plot, PCA Plot, Heatmap, MA Plot: High-quality figures summarizing the analysis.
- Co-expression Network Results: Identifies gene modules and network structure.
# Example Usage for GEO dataset GSE37764
geo_id <- "GSE37764"
counts_file <- "path_to/GSE37764_raw_counts_GRCh38.p13_NCBI.tsv"