Added EEG lab

sdrangan · Feb 14, 2019 · 89bf194 · 89bf194
1 parent d9a8aa9
commit 89bf194
Show file tree

Hide file tree

Showing 3 changed files with 378 additions and 1 deletion.
diff --git a/lectures/Lect05_Lasso.pptx b/lectures/Lect05_Lasso.pptx
diff --git a/sequence.md b/sequence.md
@@ -61,7 +61,7 @@ to documents that don't match the number of the unit.  We will fix these soon!
     * Lecture:  LASSO Regularization [[pdf]](./lectures/Lect05_Lasso.pdf)
      [[Powerpoint]](./lectures/Lect05_Lasso.pptx)         
     * [Demo:  Finding predictors of prostate cancer](./unit05_lasso/demo_prostate.ipynb) 
-    * [Lab: Student performance prediction](./unit05_lasso/lab_student-performance.ipynb) 
+    * [Lab: EEG source localization](./unit05_lasso/lab_eeg_partial.ipynb) 
     * Problems [[pdf]](./unit05_lasso/prob/prob_lasso.pdf) [[Latex]](./unit05_lasso/prob/prob_lasso.tex)
 * [Unit 6:  Logistic regression](./unit06_logistic/readme.md)
     * Lecture:  Linear classification and logistic regression

diff --git a/unit05_lasso/lab_eeg_partial.ipynb b/unit05_lasso/lab_eeg_partial.ipynb
@@ -0,0 +1,377 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Lab:  Source Localization for EEG\n",
+    "\n",
+    "EEG or [Electroencephalography](https://en.wikipedia.org/wiki/Electroencephalography) is a powerful tool for neuroscientists in understanding brain activity.  In EEG, a patient wears a headset with electrodes that measures voltages at a number of points on the scalp.  These voltages arise from ionic currents within the brain.  A common *inverse problem* is to estimate the which parts of the brain caused the measured response.  Source localization is useful in understanding which parts of the brain are involved in certain tasks.  A key challenge in this inverse problem is that the number of unknowns (possible locations in the brain) is much larger than the number of measurements.  In this lab, we will use LASSO regression on a real EEG dataset to overcome this problem and determine the brain region that is active under an auditory stimulus.\n",
+    "\n",
+    "In addition to the concepts in the [prostate LASSO demo](./demo_prostate.ipynb) you will learn to:\n",
+    "* Represent responses of multi-channel time-series data, such as EEG, using linear models\n",
+    "* Perform LASSO and Ridge regression\n",
+    "* Select the regularization level via cross-validation\n",
+    "* Visually compare the sparsity between the solutions\n",
+    "\n",
+    "We first download standard packages."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "import pickle\n",
+    "\n",
+    "from sklearn.linear_model import Lasso, Ridge, ElasticNet\n",
+    "from sklearn.metrics import r2_score\n",
+    "from sklearn.model_selection import train_test_split"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Load the Data\n",
+    "\n",
+    "The data in this lab is taken from one of the sample datasets in the [MNE website](https://martinos.org/mne/stable/index.html). The sample data is a recording from one subject who experienced some auditory stimulus on the left ear.    \n",
+    "\n",
+    "The raw data is very large (`1.5G`) and also requires that you install the `mne` python package. To make this lab easier, I have extracted and processed a small section of the data. The following command will download a `pickle` file `eeg_dat.p` to your local machine.  If you do want to create the data yourself, the program to create the data is in this directory in the github repository."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "File eeg_dat.p is already downloaded\n"
+     ]
+    }
+   ],
+   "source": [
+    "fn_src ='https://drive.google.com/uc?export=download&id=1RzQpKONOcXSMxH2ZzOI4iVMiTgD6ttSl'\n",
+    "fn_dst ='eeg_dat.p'\n",
+    "\n",
+    "import os\n",
+    "from six.moves import urllib\n",
+    "\n",
+    "if os.path.isfile(fn_dst):\n",
+    "    print('File %s is already downloaded' % fn_dst)\n",
+    "else:        \n",
+    "    print('Fetching file %s [53MB].  This may take a minute..' % fn_dst)\n",
+    "    urllib.request.urlretrieve(fn_src, fn_dst)\n",
+    "    print('File %s downloaded' % fn_dst)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now run the following command which will get the data from the `pickle` file."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pickle\n",
+    "fn = 'eeg_dat.p'\n",
+    "with open(fn, 'rb') as fp:\n",
+    "    [X,Y] = pickle.load(fp)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To understand the data, there are three key variables:\n",
+    "* `nt`    = number of time steps that we measure data\n",
+    "* `nchan` = number of channels (i.e. electrodes) measured in each time step\n",
+    "* `ncur`  = number of currents in the brain that we want to estimate.  \n",
+    "\n",
+    "Each current comes from one brain region (called a *voxel*) in either the `x`, `y` or `z` direction.  So,\n",
+    "\n",
+    "    nvoxels = ncur / 3\n",
+    "    \n",
+    "The components of the `X` and `Y` matrices are:\n",
+    "*  `Y[i,k]` = electric field measurement on channel `i` at time `k`\n",
+    "*  `X[i,j]` = sensitivity of channel `i` to current `j`.\n",
+    "\n",
+    "Using `X.shape` and `Y.shape` compute and print `nt`, `nchan`, `ncur` and `nvoxels`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# TODO\n",
+    "# nt = ...\n",
+    "# ncur = ...\n",
+    "# nchan = ...\n",
+    "# nvoxels"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Ridge Regression\n",
+    "\n",
+    "Our goal is to estimate the currents in the brain from the measurements `Y`.  One simple linear model is:\n",
+    "\n",
+    "    Y[i,k]  = \\sum_j X[i,j]*W[j,k]+ b[k]\n",
+    "\n",
+    "where `W[j,k]` is the value of current `j` at time `k` and `b[k]` is a bias.  We can solve for the current matrix `W` via linear regression.  \n",
+    "\n",
+    "Howeever, there is a problem:\n",
+    "*  There are `nt x ncur` unknowns in `W`\n",
+    "*  There are only `nt x nchan` measurements in `Y`.\n",
+    "\n",
+    "In this problem, we have:\n",
+    "\n",
+    "    number of measurements  << number of unknowns\n",
+    "    \n",
+    "We need to use regularization in these circumstances.  We first try Ridge regression.\n",
+    "\n",
+    "First split the data into training and test.  Use the `train_test_split` function with `test_size=0.33`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# TODO\n",
+    "# Xtr,Xts,Ytr,Yts = train_test_split(...) "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Use the `Ridge` regression object in `sklearn` to fit the model on the training data.  Use a regularization, `alpha=1`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# TODO\n",
+    "# regr = Ridge(...)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Preict the values `Y` on both the training and test data.  Use the `r2_score` method to measure the `R^2` value on both the training and test.  You will see that `R^2` value is large for the training data, it is very low for the test data.  This suggest that even with regularization, the model is over-fitting the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# TODO\n",
+    "# rsq_tr = ...\n",
+    "# rsq_ts = ..."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next, try to see if we can get a better `R^2` score using different values of `alpha`.  Use cross-validation to measure the test `R^2` for 20 `alpha` values logarithmically spaced from `10^{-2}` to `10^{2}` (use `np.logspace()`).  You can use regular cross-validation.  You do not need to do `K`-fold."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# TODO"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Plot the test `R^2` vs. `alpha`.  And print the maximum test `R^2`.  You should see that the maximum test `R^2` is still not very high."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# TODO"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, let's take a look at the solution. \n",
+    "\n",
+    "* Find the optimal regularization `alpha` from the cross-validation\n",
+    "* Re-fit the model at the optimal `alpha`\n",
+    "* Get the current matrix `W` from the coefficients in the linear model.  These are stored in `regr.coef_`.  You may need a transpose\n",
+    "* For each current `j` compute `Wrms[j] =  sqrt( sum_k W[j,k]**2 )` which is root mean squared current.\n",
+    "\n",
+    "You will see that the vector `Wrms` is not sparse.  This means that the solution that is found with Ridge regression finds currents in all locations."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# TODO"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## LASSO Regression\n",
+    "\n",
+    "We can improve the estimate by imposing sparsity.  Biologically, we know that only a limited number of brain regions should be involved in the reponse to a particular stimuli.  As a result, we would expect that the current matrix `W[j,k]` to be zero for most values `j,k`.  We can impose this constraint using LASSO regularization.\n",
+    "\n",
+    "Re-fit the training data using the `Lasso` model with `alpha=1e-3`.  Also set `max_iter=100` and `tol=0.01`.  The LASSO solver is much slower, so this make take a minute."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# TODO"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, test the model on the test data and measure the `R^2` value.  You should get a much better fit than with the Ridge regression solution.  "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# TODO"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can now search for the optimal `alpha`.  Use cross-validation to find the `alpha` logarithically space between `alpha=10^{-3}` and `alpha=10^{-4}`.  Each fit takes some time, so use only 5 values of `alpha`. Also for each `alpha` store the current matrix.  This way, you will not have to re-fit the model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# TODO\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Plot the `r^2` value vs. `alpha`.  Print the optimal `r^2`.  You should see it is much higher than with the best Ridge Regression case."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# TODO"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Display the current matrix `W` for the optimal `alpha` as you did in the Ridge Regression case.  You will see that is much sparser."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# TODO"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## More fun\n",
+    "\n",
+    "If you want to more on this lab:\n",
+    "* Install the [MNE python package](https://martinos.org/mne/stable/index.html).  This is an amazing package with many tools for processing EEG data.\n",
+    "* In particular, you can use the above results to visualize where in the brain the currents sources are.\n",
+    "* You can also improve the fitting with more regularization.  For example, we know that the currents will be non-zero in groups:  If the current is non-zero for one time, it is likely to non-zero for all time.  You can use the Group LASSO method.\n",
+    "* You can combine these results to make predictions about what the patient is seeing or hearing or thinking."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.6.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}