-
Notifications
You must be signed in to change notification settings - Fork 2
/
Copy pathhw5.txt
345 lines (237 loc) · 11 KB
/
hw5.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
Homework 5 (due Tuesday, November 12, at 11:59pm PST)
#####################################################
For solutions, see :doc:`hw5-solutions`.
1. Fitting
----------
Do a linear fit to each of the Anscombe's Quartet data sets, and plot
the lines and data points. Here is some simple example code for doing
a linear fit to x and y in Python::
x = [1,2,3,4]
y = [3,5,7,10] # 10, not 9, so the fit isn't perfect
fit = polyfit(x,y,1)
fit_fn = poly1d(fit) # fit_fn is now a function which takes in x and returns an estimate for y
plot(x,y, 'yo', x, fit_fn(x), '--k')
Submit as usual (done in an IPython Notebook, submitted via a gist; send
me the nbviewer URL).
2. Variant calling
------------------
(Please do either this, or expression analysis, or both. You only need to do one.)
The below data is from one of Rich Lenski's LTEE papers, the one on
`the evolution of citrate consumption in LTEE
<http://www.nature.com/nature/journal/v489/n7417/full/nature11514.html>`__.
Install software
~~~~~~~~~~~~~~~~
You'll want an m1.large or m1.xlarge for this.
First, we need to install the `BWA aligner
<http://bio-bwa.sourceforge.net/>`__::
cd /root
wget -O bwa-0.7.5.tar.bz2 http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.5a.tar.bz2/download
tar xvfj bwa-0.7.5.tar.bz2
cd bwa-0.7.5a
make
cp bwa /usr/local/bin
We also need a new version of `samtools <http://samtools.sourceforge.net/>`__::
cd /root
curl -O -L http://sourceforge.net/projects/samtools/files/samtools/0.1.19/samtools-0.1.19.tar.bz2
tar xvfj samtools-0.1.19.tar.bz2
cd samtools-0.1.19
make
cp samtools /usr/local/bin
cp bcftools/bcftools /usr/local/bin
cd misc/
cp *.pl maq2sam-long maq2sam-short md5fa md5sum-lite wgsim /usr/local/bin/
Download data
~~~~~~~~~~~~~
Download the reference genome and the resequencing reads::
cd /mnt
curl -O http://athyra.idyll.org/~t/REL606.fa.gz
gunzip REL606.fa.gz
curl -O ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR098/SRR098042/SRR098042.fastq.gz
Note, this last URL is the "Fastq files (FTP)" link from the European
Nucleotide Archive (ENA) for this sample:
http://www.ebi.ac.uk/ena/data/view/SRR098042.
Do the mapping
~~~~~~~~~~~~~~
Now let's map all of the reads to the reference. Start by indexing the
reference genome::
cd /mnt
bwa index REL606.fa
Now, do the mapping of the raw reads to the reference genome::
bwa aln REL606.fa SRR098042.fastq.gz > SRR098042.sai
Make a SAM file (this would be done with 'sampe' if these were paired-end
reads)::
bwa samse REL606.fa SRR098042.sai SRR098042.fastq.gz > SRR098042.sam
This file contains all of the information about where each read hits
on the reference.
Next, index the reference genome with samtools::
samtools faidx REL606.fa
Convert the SAM into a BAM file::
samtools import REL606.fa.fai SRR098042.sam SRR098042.bam
Sort the BAM file::
samtools sort SRR098042.bam SRR098042.sorted
And index the sorted BAM file::
samtools index SRR098042.sorted.bam
At this point you can visualize with tview or Tablet.
'samtools tview' is a text interface that you use from the command
line; run it like so::
samtools tview SRR098042.sorted.bam REL606.fa
The '.'s are places where the reads align perfectly in the forward direction,
and the ','s are places where the reads align perfectly in the reverse
direction. Mismatches are indicated as A, T, C, G, etc.
You can scroll around using left and right arrows; to go to a specific
coordinate, use 'g' and then type in the contig name and the position.
For example, type 'g' and then 'rel606:553093<ENTER>' to go to
position 553093 in the BAM file.
For the `Tablet viewer <http://bioinf.scri.ac.uk/tablet/>`__, click on
the link and get it installed on your local computer. Then, start it
up as an application. To open your alignments in Tablet, you'll need
three files on your local computer: ``REL606.fa``, ``SRR098042.sorted.bam``,
and ``SRR098042.sorted.bam.bai``. You can copy them over using Dropbox,
for example.
Calling SNPs
~~~~~~~~~~~~
You can use samtools to call SNPs like so::
samtools mpileup -uD -f REL606.fa SRR098042.sorted.bam | bcftools view -bvcg - > SRR098042.raw.bcf
(See the 'mpileup' docs `here <http://samtools.sourceforge.net/mpileup.shtml>`__.)
Now convert the BCF into VCF::
bcftools view SRR098042.raw.bcf > SRR098042.vcf
Examining SNPs
~~~~~~~~~~~~~~
You can load the VCF file with this code in IPython notebook::
import csv
def load_vcf(filename):
fp = open(filename, 'rb')
r = csv.reader(fp, delimiter='\t')
snp_calls = []
for n, row in enumerate(r):
if row[0].startswith('#'):
continue
chr, pos, _, ref, alt = row[:5]
pos = int(pos)
snp_calls.append((chr, pos, ref, alt))
return snp_calls
The full filename should be ``/mnt/SRR098042.vcf``.
----
Look at a few of the locations in tview or Tablet. Do you believe
the variant calls?
Problems
~~~~~~~~
1. Download the FASTQ.GZ file for SRR098038, and call SNPs.
2. Compare a few the SNPs in SRR098038 with the SNPs from SRR098042.
Are the SNPs in one a subset or superset of the SNPs in the other?
3. SRR098038 is one of the Cit+ strains from 38,000 generations. Go find the
location of the mutS mutation by finding the mutS gene in `the genome of the reference strain here <http://www.ncbi.nlm.nih.gov/nuccore/254160123?report=graph&tracks=[key:gene_model_track,name:Genes---Unnamed,display_name:Genes,annots:Unnamed,Options:MergePairs,SNPs:false,CDSProductFeats:true,ShowLabelsForAllFeatures:true][key:feature_track,name:Genes---other,display_name:Genes%20-%20other,subkey:ncRNA,annots:other,Layout:Adaptive][key:feature_track,name:Genes---tRNA,display_name:Genes%20-%20tRNA,subkey:tRNA,annots:tRNA,Layout:Adaptive][key:SNP_track,name:SNP,annots:SNP,Layout:Adaptive][key:SNP_Bins_track,name:Clinical%20Variants][key:SNP_Bins_track,name:Cited%20Variants][key:GWAS_track,name:GWAS%20Tracks,display_name:Association%20Results]>`__. Do you find the mutation? What is it, precisely? (Coordinates & ref -> variant.)
Hand in your two VCF files, together with your answer for #3.
.. @Talk to Rich and Jeff.
3. RNAseq expression analysis
-----------------------------
(Please do either this, or variant calling, or both. You only need to do one.)
Here we're going to look at RNAseq differential expression. We're
going to be using data generated from running the `Eel Pond mRNAseq
protocol <https://khmer-protocols.readthedocs.org/en/latest/>`__. You
can feel free to run through all of this -- it goes through a complete
de novo assembly and quantitation of a bunch of mRNAseq data -- but the
whole thing takes about a week, so ... you might not want to. Below,
I've given you access to some of the output files instead :).
Install software
~~~~~~~~~~~~~~~~
You'll want an m1.large or m1.xlarge for this.
Start by installing R and some R packages::
apt-get -y install r-base-core r-cran-gplots
Next, install the `EBSeq
<http://www.biostat.wisc.edu/~kendzior/EBSEQ/>`__ package for
calculating differential expression, which comes as part of `RSEM
<http://deweylab.biostat.wisc.edu/rsem/>`__, the package we used in
khmer-protocols for calculating expression levels::
cd /root
curl -O http://deweylab.biostat.wisc.edu/rsem/src/rsem-1.2.7.tar.gz
tar xzf rsem-1.2.7.tar.gz
cd rsem-1.2.7
make
cd EBSeq
make
echo 'export PATH=$PATH:/root/rsem-1.2.7' >> ~/.bashrc
source ~/.bashrc
Download data
~~~~~~~~~~~~~
Create a working directory::
cd /mnt
mkdir diffexpr
cd diffexpr
curl -O http://athyra.idyll.org/~t/rsem-data.tar.gz
tar xvf rsem-data.tar.gz
This will result in four files, ``{1,2,6,7}.fq.genes.results``. These
are the first two 0 Hour and the first two 6 Hour time points from the
`Tulin et al. study <http://www.evodevojournal.com/content/4/1/16>`__
on Nematostella development, run through the entire Eel Pond protocol
(see khmer-protocols, above).
Calculate differential expression
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To do the differential expression calculation with EBSeq::
cd /mnt/diffexpr
rsem-generate-data-matrix {1,2,6,7}.fq.genes.results > genes.matrix
rsem-run-ebseq genes.matrix 2,2 genes.changed
Here, the '2,2' means there are 2 conditions, each with two replicates.
The EBSeq output will be in 'genes.changed'. `Read the docs
<http://deweylab.biostat.wisc.edu/rsem/rsem-run-ebseq.html>`__ to
understand what's in the output file -- you're most interested in the
PPDE (posterior probability that a transcript is differentially expressed)
and the PostFC (posterior fold change) columns, columns 4 and 5.
Problem: Load in data & plot
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. Please build an IPython Notebook that loads in the sample 1 and
sample 6 data sets and plots the expression levels of genes in
sample 1 on the X axis and sample 6 on the Y axis; below is some
code taken from the `in class notebook <http://nbviewer.ipython.org/urls/raw.github.com/beacon-center/2013-intro-computational-science/master/notebooks/day5-mRNAseq-quant.ipynb>`__ that you can adapt.
2. Load in the 'genes.changed' file (hint: you only need column 1) and
plot the locations of genes that have changed.
A tip: instead of iterating over all the keys in the sample1 dictionary,
you can iterate over just the keys in the dictionary created to hold
the results of the 'genes.changed' file.
3. Grab the results in http://athyra.idyll.org/~t/rsem-data2.tar.gz --
they are additional replicates (3.fq.genes.results and
8.fq.genes.results). Redo the differential expression analysis &
plotting (note: use '3,3' for rsem-run-ebseq).
----
A function to load in a ``genes.results`` file::
import csv
def load_results(filename):
r = csv.DictReader(open(filename, 'rb'), delimiter='\t')
results = {}
for row in r:
gene = row['gene_id']
results[gene] = row
return results
Code to extract the expression values (FPKM) for two samples and put
them in numpy array::
sample1 = load_results('1.fq.genes.results') # these are replicates
sample2 = load_results('2.fq.genes.results')
sample1v2 = []
for k in sample1:
s1 = float(sample1[k]['FPKM'])
s2 = float(sample2[k]['FPKM'])
if s1 == 0 or s2 == 0:
continue
else:
sample1v2.append((s1, s2))
sample1v2 = numpy.array(sample1v2)
Code to plot sample 1 vs sample 2::
x = plot(sample1v2[:,0], sample1v2[:,1], 'bo', alpha=0.1)
ax = axes()
ax.set_yscale('log')
ax.set_xscale('log')
Code to load in the '.changed' file::
def load_changed(filename, ppee_threshold=.05):
d = {}
r = csv.reader(open(filename, 'rb'), delimiter='\t')
r.next()
for row in r:
gene_id, PPEE, PPDE, postfc, realfc = row
PPEE=float(PPEE)
PPDE=float(PPDE)
postfc = float(postfc)
realfc = float(realfc)
if PPEE > ppee_threshold: # skip those with probability of equal expression > 0.05
continue
d[gene_id] = (PPEE, PPDE, postfc, realfc)
return d