From a7f9558f87f8d00e2bcc48657359347786ad5283 Mon Sep 17 00:00:00 2001 From: Andrew Thrasher Date: Tue, 10 Dec 2024 13:04:57 -0500 Subject: [PATCH] feat: add investigation jupyter notebook --- workflows/methylation/methylation.ipynb | 777 ++++++++++++++++++++++++ 1 file changed, 777 insertions(+) create mode 100644 workflows/methylation/methylation.ipynb diff --git a/workflows/methylation/methylation.ipynb b/workflows/methylation/methylation.ipynb new file mode 100644 index 000000000..fab263ddc --- /dev/null +++ b/workflows/methylation/methylation.ipynb @@ -0,0 +1,777 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "9b33a5c2", + "metadata": {}, + "source": [ + "# Methylation Evaluation\n", + "\n", + "The [minfi](https://htmlpreview.github.io/?https://github.com/hansenlab/tutorial.450k/blob/master/inst/doc/methylation450k.html) package is commonly used for methylation array analysis. Normalization is handled by the included subset-quantile within array normalization (SWAN) method. SWAN normalizes methylation array data by selecting a subset of probes from both types (Infinium I and Infinium II).\n", + "\n", + "The SWAN publication describes this method thusly:\n", + "\n", + "> The SWAN method has two parts. First, an average quantile distribution is determined using a subset of probes defined to be biologically similar based on CpG content. This is achieved by randomly selecting N Infinium I and II probes that have one, two and three underlying CpGs, where N is the minimum number of probes in the six sets of Infinium I and II probes with one, two and three probe body CpGs. If no probes have been filtered out - for example, sex chromosome probes, and so on - N = 11,303. This results in a pool of 3N Infinium I and 3N Infinium II probes. Due to the vast differences in their distributions (Figure 2), the subsequent processing is performed independently on both the methylated and unmethylated channels. The subset for each probe type, from each channel (methylated or unmethylated), is sorted by increasing intensity. The value of each of the 3N pairs of observations is then assigned to be the mean intensity of the two probe types for that row or 'quantile'. This is the standard quantile procedure. The second step is to then adjust the intensities of the remaining probes, of which there are many more Infinium II than I, by interpolation onto the distribution of the subset probes. This is done for each probe type separately using linear interpolation between the subset probes to define the new intensities. Consequently, while the distribution of the subset is identical, the intensity distribution of Infinium I probes is still vastly different from the distribution of Infinium II probes (Figure S2 in Additional file 1).\n", + "\n", + "This description suggests that individual samples can be normalized separately since the distribution of each channel within a sample is adjusted based on the subset of probes selected. The only difference between independent runs of SWAN is the selection of the subset of probes. This can be controlled by setting a seed in R (`set.seed(N)`).\n" + ] + }, + { + "cell_type": "markdown", + "id": "e419ad06", + "metadata": {}, + "source": [ + "# Setup\n", + "This evaluation requires a number of packages to be installed in the environment. To accomplish this, I have used (conda)[https://conda.io/projects/conda/en/latest/user-guide/install/index.html] along with the `bioconda`, `r`, and `conda-forge` channels. " + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "ff8551b3", + "metadata": {}, + "outputs": [], + "source": [ + "%%bash\n", + "# Uncomment the lines below to create the conda environment.\n", + "# - I'm not necessarily advocating you do this in this notebook, \n", + "# probably want to set it up in a terminal and then start this Jupyter notebook after :).\n", + "\n", + "# conda create -n methylation \\\n", + "# -c anaconda \\\n", + "# -c bioconda \\\n", + "# -c conda-forge \\\n", + "# numpy \\\n", + "# pandas \\\n", + "# r-dplyr \\\n", + "# bioconductor-minfi \\\n", + "# bioconductor-illuminahumanmethylationepicmanifest \\\n", + "# bioconductor-illuminahumanmethylationepicanno.ilm10b4.hg19 \\\n", + "# \n", + "# -y\n", + "# conda init bash\n", + "# source ~/.bashrc\n", + "# conda activate methylation" + ] + }, + { + "cell_type": "markdown", + "id": "e9159ddb", + "metadata": {}, + "source": [ + "# Analysis" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "a2979573", + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import numpy as np" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "35d375b7", + "metadata": {}, + "outputs": [], + "source": [ + "%load_ext rpy2.ipython" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "68bd9475", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[1] \"R version 4.2.2 (2022-10-31)\"\n" + ] + } + ], + "source": [ + "%%R\n", + "R.version.string" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "e32f3f19", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "R[write to console]: \n", + "Attaching package: ‘dplyr’\n", + "\n", + "\n", + "R[write to console]: The following objects are masked from ‘package:stats’:\n", + "\n", + " filter, lag\n", + "\n", + "\n", + "R[write to console]: The following objects are masked from ‘package:base’:\n", + "\n", + " intersect, setdiff, setequal, union\n", + "\n", + "\n", + "R[write to console]: Loading required package: BiocGenerics\n", + "\n", + "R[write to console]: \n", + "Attaching package: ‘BiocGenerics’\n", + "\n", + "\n", + "R[write to console]: The following objects are masked from ‘package:dplyr’:\n", + "\n", + " combine, intersect, setdiff, union\n", + "\n", + "\n", + "R[write to console]: The following objects are masked from ‘package:stats’:\n", + "\n", + " IQR, mad, sd, var, xtabs\n", + "\n", + "\n", + "R[write to console]: The following objects are masked from ‘package:base’:\n", + "\n", + " anyDuplicated, aperm, append, as.data.frame, basename, cbind,\n", + " colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,\n", + " get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,\n", + " match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,\n", + " Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,\n", + " table, tapply, union, unique, unsplit, which.max, which.min\n", + "\n", + "\n", + "R[write to console]: Loading required package: GenomicRanges\n", + "\n", + "R[write to console]: Loading required package: stats4\n", + "\n", + "R[write to console]: Loading required package: S4Vectors\n", + "\n", + "R[write to console]: \n", + "Attaching package: ‘S4Vectors’\n", + "\n", + "\n", + "R[write to console]: The following objects are masked from ‘package:dplyr’:\n", + "\n", + " first, rename\n", + "\n", + "\n", + "R[write to console]: The following objects are masked from ‘package:base’:\n", + "\n", + " expand.grid, I, unname\n", + "\n", + "\n", + "R[write to console]: Loading required package: IRanges\n", + "\n", + "R[write to console]: \n", + "Attaching package: ‘IRanges’\n", + "\n", + "\n", + "R[write to console]: The following objects are masked from ‘package:dplyr’:\n", + "\n", + " collapse, desc, slice\n", + "\n", + "\n", + "R[write to console]: Loading required package: GenomeInfoDb\n", + "\n", + "R[write to console]: Loading required package: SummarizedExperiment\n", + "\n", + "R[write to console]: Loading required package: MatrixGenerics\n", + "\n", + "R[write to console]: Loading required package: matrixStats\n", + "\n", + "R[write to console]: \n", + "Attaching package: ‘matrixStats’\n", + "\n", + "\n", + "R[write to console]: The following object is masked from ‘package:dplyr’:\n", + "\n", + " count\n", + "\n", + "\n", + "R[write to console]: \n", + "Attaching package: ‘MatrixGenerics’\n", + "\n", + "\n", + "R[write to console]: The following objects are masked from ‘package:matrixStats’:\n", + "\n", + " colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,\n", + " colCounts, colCummaxs, colCummins, colCumprods, colCumsums,\n", + " colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,\n", + " colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,\n", + " colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,\n", + " colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,\n", + " colWeightedMeans, colWeightedMedians, colWeightedSds,\n", + " colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,\n", + " rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,\n", + " rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,\n", + " rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,\n", + " rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,\n", + " rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,\n", + " rowWeightedMads, rowWeightedMeans, rowWeightedMedians,\n", + " rowWeightedSds, rowWeightedVars\n", + "\n", + "\n", + "R[write to console]: Loading required package: Biobase\n", + "\n", + "R[write to console]: Welcome to Bioconductor\n", + "\n", + " Vignettes contain introductory material; view with\n", + " 'browseVignettes()'. To cite Bioconductor, see\n", + " 'citation(\"Biobase\")', and for packages 'citation(\"pkgname\")'.\n", + "\n", + "\n", + "R[write to console]: \n", + "Attaching package: ‘Biobase’\n", + "\n", + "\n", + "R[write to console]: The following object is masked from ‘package:MatrixGenerics’:\n", + "\n", + " rowMedians\n", + "\n", + "\n", + "R[write to console]: The following objects are masked from ‘package:matrixStats’:\n", + "\n", + " anyMissing, rowMedians\n", + "\n", + "\n", + "R[write to console]: Loading required package: Biostrings\n", + "\n", + "R[write to console]: Loading required package: XVector\n", + "\n", + "R[write to console]: \n", + "Attaching package: ‘Biostrings’\n", + "\n", + "\n", + "R[write to console]: The following object is masked from ‘package:base’:\n", + "\n", + " strsplit\n", + "\n", + "\n", + "R[write to console]: Loading required package: bumphunter\n", + "\n", + "R[write to console]: Loading required package: foreach\n", + "\n", + "R[write to console]: Loading required package: iterators\n", + "\n", + "R[write to console]: Loading required package: parallel\n", + "\n", + "R[write to console]: Loading required package: locfit\n", + "\n", + "R[write to console]: locfit 1.5-9.9 \t 2024-03-01\n", + "\n", + "R[write to console]: Setting options('download.file.method.GEOquery'='auto')\n", + "\n", + "R[write to console]: Setting options('GEOquery.inmemory.gpl'=FALSE)\n", + "\n" + ] + } + ], + "source": [ + "%%R\n", + "#Load libraries\n", + "library(dplyr)\n", + "library(minfi)\n", + "library(IlluminaHumanMethylationEPICmanifest)\n", + "library(IlluminaHumanMethylationEPICanno.ilm10b4.hg19)\n" + ] + }, + { + "cell_type": "markdown", + "id": "a2889683", + "metadata": {}, + "source": [ + "For reproducibility, we'll fix the seed value." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "14b9392d", + "metadata": {}, + "outputs": [], + "source": [ + "%%R\n", + "set.seed(1)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "48095c75", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "R[write to console]: [read.metharray] Reading 201533640034_R01C01_Grn.idat\n", + "\n", + "R[write to console]: [read.metharray] Reading 201533640034_R01C01_Red.idat\n", + "\n", + "R[write to console]: [read.metharray] Read idat files in 0.5 seconds\n", + "\n", + "R[write to console]: [read.metharray] Creating data matrices ... \n", + "R[write to console]: done in 1.4 seconds\n", + "\n", + "R[write to console]: [read.metharray] Instantiating final object ... \n", + "R[write to console]: done in 0.1 seconds\n", + "\n" + ] + } + ], + "source": [ + "%%R\n", + "RGSet <- read.metharray(\"~/data/COMET/methylation_data/201533640034_R01C01\", verbose = TRUE, force = TRUE)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "55fb7210", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "IlluminaMethylationManifest object\n", + "Annotation\n", + " array: IlluminaHumanMethylationEPIC\n", + "Number of type I probes: 142262 \n", + "Number of type II probes: 724574 \n", + "Number of control probes: 635 \n", + "Number of SNP type I probes: 21 \n", + "Number of SNP type II probes: 38 \n" + ] + } + ], + "source": [ + "%%R\n", + "manifest <- getManifest(RGSet)\n", + "manifest" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "c60d11c2", + "metadata": {}, + "outputs": [], + "source": [ + "%%R\n", + "# Load raw data into a MethylSet object be converting red/green\n", + "# channels into a matrix of methlyated and unmethylated signals\n", + "MSet <- preprocessRaw(RGSet)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "a2d59506", + "metadata": {}, + "outputs": [], + "source": [ + "%%R\n", + "# Convert to a RatioSet\n", + "RSet <- ratioConvert(MSet, what = \"both\", keepCN = TRUE)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "417eb789", + "metadata": {}, + "outputs": [], + "source": [ + "%%R\n", + "# Add genomic coordinates to each probe (plus additional annotation)\n", + "GRset <- mapToGenome(RSet)" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "aa587a12", + "metadata": {}, + "outputs": [], + "source": [ + "%%R\n", + "# Take the genomic mapped RatioSet and fill Beta values (non-normalized).\n", + "# Get the NON-normalized beta values:\n", + "beta <- getBeta(GRset)" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "9e58e6c1", + "metadata": {}, + "outputs": [], + "source": [ + "%%R\n", + "# Get M-value matrix and copy-number matrix\n", + "M <- getM(GRset)\n", + "CN <- getCN(GRset)" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "91362629", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[1] \"201533640034_R01C01\"\n" + ] + } + ], + "source": [ + "%%R\n", + "# Get sample names\n", + "sampleNames <- sampleNames(GRset)\n", + "sampleNames" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "d22b2b8c", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[1] \"201533640034_R01C01\"\n" + ] + } + ], + "source": [ + "%%R\n", + "# Get probe names\n", + "sampleNames <- sampleNames(GRset)\n", + "sampleNames" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "9245521c", + "metadata": {}, + "outputs": [], + "source": [ + "%%R\n", + "gr <- granges(GRset)\n", + "annotation <- getAnnotation(GRset)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "f90c3817", + "metadata": {}, + "outputs": [], + "source": [ + "%%R\n", + "set.seed(1)\n", + "# Perform SWAN normalization on beta values\n", + "GRset.swan_norm <- preprocessSWAN(RGSet)\n", + "beta_swan_norm <- getBeta(GRset.swan_norm)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "0b7cf238", + "metadata": {}, + "outputs": [], + "source": [ + "%%R -o beta_swan_norm_genomic\n", + "# Write the normalized beta-values that have NOT yet had\n", + "# low-variance probes filtered out\n", + "# Filter to only those that are mappable to the genome.\n", + "RSet_genomic <- ratioConvert(GRset.swan_norm)\n", + "GRset_genomic <- mapToGenome(RSet_genomic)\n", + "beta_swan_norm_genomic <- getBeta(GRset_genomic)\n", + "colnames(beta_swan_norm_genomic) <- colnames(CN)" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "5cdadd3e", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[0.8319394 ]\n", + " [0.87825323]\n", + " [0.74041298]\n", + " ...\n", + " [0.84366217]\n", + " [0.5464304 ]\n", + " [0.41737797]]\n", + "(865859, 1)\n" + ] + } + ], + "source": [ + "print(beta_swan_norm_genomic)\n", + "print(beta_swan_norm_genomic.shape)" + ] + }, + { + "cell_type": "markdown", + "id": "5b2d63ec", + "metadata": {}, + "source": [ + "Read in data for a second sample." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "bb4029f8", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "R[write to console]: [read.metharray] Reading 201533640009_R08C01_Grn.idat\n", + "\n", + "R[write to console]: [read.metharray] Reading 201533640009_R08C01_Red.idat\n", + "\n", + "R[write to console]: [read.metharray] Read idat files in 0.4 seconds\n", + "\n", + "R[write to console]: [read.metharray] Creating data matrices ... \n", + "R[write to console]: done in 0.9 seconds\n", + "\n", + "R[write to console]: [read.metharray] Instantiating final object ... \n", + "R[write to console]: done in 0.0 seconds\n", + "\n" + ] + } + ], + "source": [ + "%%R\n", + "second_RGSet <- read.metharray(\"~/data/COMET/methylation_data/201533640009_R08C01\", verbose = TRUE, force = TRUE)" + ] + }, + { + "cell_type": "markdown", + "id": "2a663c43", + "metadata": {}, + "source": [ + "Now we'll combine the data for both samples so that we can analysis them together." + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "1fc133bc", + "metadata": {}, + "outputs": [], + "source": [ + "%%R\n", + "combined <- combine(RGSet, second_RGSet)" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "id": "617781c2", + "metadata": {}, + "outputs": [], + "source": [ + "%%R -o combined_beta_swan_norm_genomic\n", + "set.seed(1)\n", + "combined_GRset <- mapToGenome(combined)\n", + "combined_CN <- getCN(combined_GRset)\n", + "combined_GRset.swan_norm <- preprocessSWAN(combined)\n", + "combined_RSet_genomic <- ratioConvert(combined_GRset.swan_norm)\n", + "combined_GRset_genomic <- mapToGenome(combined_RSet_genomic)\n", + "combined_beta_swan_norm_genomic <- getBeta(combined_GRset_genomic)\n", + "colnames(combined_beta_swan_norm_genomic) <- colnames(combined_CN)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "id": "d27feb1f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[0.8319394 0.91933784]\n", + " [0.87825323 0.86565846]\n", + " [0.74041298 0.74228972]\n", + " ...\n", + " [0.84366217 0.81228846]\n", + " [0.5464304 0.49253039]\n", + " [0.41737797 0.17487666]]\n" + ] + } + ], + "source": [ + "print(combined_beta_swan_norm_genomic)" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "9f4d53c0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[0.8319394 0.87825323 0.74041298 ... 0.84366217 0.5464304 0.41737797]\n" + ] + } + ], + "source": [ + "sample_from_combined = combined_beta_swan_norm_genomic[:,0]\n", + "print(sample_from_combined)" + ] + }, + { + "cell_type": "markdown", + "id": "7168cae7", + "metadata": {}, + "source": [ + "Now to check equality between the two runs of the normalization." + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "6e2f7273", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "False" + ] + }, + "execution_count": 25, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "np.array_equal(beta_swan_norm_genomic, sample_from_combined)" + ] + }, + { + "cell_type": "markdown", + "id": "ae081d99", + "metadata": {}, + "source": [ + "The inequality is not a surprise. These are very long floating point numbers, so some difference is expected due to rounding. So we need to compare with a tolerance.\n", + "\n", + "Numpy offers the `allclose` function which compares array values within a tolerance. However, it crashes this kernel, presumably, due to the size of the arrays." + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "009aae4b", + "metadata": {}, + "outputs": [], + "source": [ + "#np.allclose(beta_swan_norm_genomic, sample_from_combined)" + ] + }, + { + "cell_type": "markdown", + "id": "02490b35", + "metadata": {}, + "source": [ + "Instead we will iterate over the arrays and check manually. Then check to see that all values are within the allowable tolerance." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "b25c0f4a", + "metadata": {}, + "outputs": [], + "source": [ + "value = []\n", + "atol = 1e-8\n", + "rtol = 1e-5\n", + "for a,b in zip(beta_swan_norm_genomic, sample_from_combined):\n", + " value.append(abs(a - b) <= (atol + rtol * abs(b)))" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "77f48a13", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "True\n" + ] + } + ], + "source": [ + "print(all(value))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4a0d57fd", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}