Yale-EEB723 · diego-ellis-soto · Feb 14, 2019 · Feb 14, 2019 · Feb 21, 2019 · Feb 21, 2019
diff --git a/Class_notes/20190307_notes b/Class_notes/20190307_notes
@@ -0,0 +1,64 @@
+# --- --- --- --- --- --- --- --- --- --- --- ---
+Genome architecture:
+# --- --- --- --- --- --- --- --- --- --- --- ---
+
+Pulldown in molecualr ibology -> have pop of moecules and interested in the subset -> in the puldown i have something that binds the molecules that I want -> Can do a pulldown of the anti body that I want (bound to beets isolated with gravity or magnet).
+Chip-seq: Immuno prepitation methods.
+
+
+Chip-seq: Association protein thats sitting on the genome (DNA), thats one aspect of spatial structure -> Which proteins are close to which sequences.
+RNA seq measures transcript abundance -> But often we wanna know how much of a transcript is being produced by a given gene -> GRO-Seq can do this -> Transcript that were recently synthetized.
+GRO-seq is strong in conjunciton with other methods.
+
+A&B: Compartments defined by 2 PCA.
+A is first order effect -> derived as first 2 eigenvectors of Hi-C matrix.
+whenever i get mapping between2  reads i call that association between 2 regions.
+
+Hi-C have multiple cell types but im sampling a sample of organism -> but cells will be in different stages of the cell cycle -> I average out among that -> There is huge value of actually isolating the cells (what happens in RNA-seq).
+Now RNA-seq is single cell -> instead of aagregaitng with instrument over many cells -> I can study single cell .
+
+DNA -> RNA -> Prot
+DNA is upstrean from RNA and upstream from transcriptome
+Maybe transcription as an active process is essentially deforming the genome ->
+Maybe the very active transcriptome makes some areas of the DNA more accesible than others.
+some drugs disrupt transcirption to apply treatments -> Do Hi-C with drug and Hi-C without drug.
+Hi-C there is a single genome in there. when im mapping you observe physical conformation of single molecules.
+Hi-C: To build nice scaffold on the genome ->
+
+Hi-C interaction map:
+Which part of DNA are close to each other.
+
+Soon de novo assemblies people will come up with fasta file that describes sequence of genome AND add genome structure -> genome diagonal of Hi-C helps you build your genome.
+Hi-C: If you tune it, it can be cheap to do ->
+restriction enzyme, labeling things AND needs little DNA.
+High-C is done in the original tissue -> Add formaldehyde to tissue -> then your getting long ranged data.
+
+
+DNA is not linear in nucleus. but its precisely organised because of specific mechansism that put it into certain patterns.
+CTCF is like a stop sign that stops the extrusion process. Its like a large beet.
+
+Hi-C gives you data on all
+4C- focuses on one particular region !
+
+Map the reads to the genome then you take information where your sequencing reeds marks to the genome and use that to gain biological insight.
+Chip-seq: Great to ideintify where my favorite transcirption factor .
+
+We need a good annotated genome ->
+These tools will be deployed at scaled ->
+
+
+Genomes are so much work -> by the time you put it together ->
+you never just wanted a fasta file thats a genome ->
+as method mature -> what will happen -> assembly and annotation becomes more streamlined and ppl go directly to what they wanted to asnwer -> which genes are impacted by XYZ.
+-> becomes more biological.
+
+Being an intron of a gene doesnt mean that you have anything to do with this gene.
+
+Query genome with small set of tissue sample willtell us alot -> For gene speciic experiment.
+Exciting now: We can apply this of the shelf tols to quickly get down to a set of 10 genes that ar emost interested for my phenoype and my organism _> I  can pound or crispr all of these!
+-> Tools give me wider build of platform.
+-> Being able to deplo y these tools depends on high quality genome to map these too.
+
+NOTE to self:
+write a script that pushes to github itself with the input:
+basically what goes into git commit -am 'XXX_input'
diff --git a/Class_notes/28022019 b/Class_notes/28022019
@@ -0,0 +1,86 @@
+# --- --- --- --- --- --- --- --- --- ---
+# Casey Dunn 28.02.2019
+# --- --- --- --- --- --- --- --- --- ---
+Genomic technologies: Genome is 3D dynamic structure, has a shape in time and space.
+Progress understanding genome shape at multiple lvl (not just syntony, order) but also what parts of chromosome are next to other parts.
+# --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
+Predecessors of RNA-seq: Know your genome before hand -> put probes of genome on wells ->
+microArray: often identify dieseases -> compare normal condition with cancer -> Isolate RNA -> generate cDNA -> Look at sequence ends -> get your reads -> Map to genome -> identify the transcriptome and other downstream analysis.
+RNA seq: What genes are being experssed and map it to the genome -> Trapnell et al. 2012 Nature Protocols. -> Need reference genome.
+When you dont have an annotated genome -> sequence transcriptome with long reads -> Assemble transcriptome and use that as reference genome where you map your RNA back.- >You dont use tophat for that, you use straight bowtie
+
+You have reads, map them with tophat, assemble and merge with cuffmerge, then get a transcriptome and then ....XXXX
+RNA often first normalized: Control for different sequencing effort in different sequence samples.
+RNA normalization, statistical analysis of ratio of expression between those libraries, we dont look at abolsute expression, its neither expression lvl of diferent genes, its expresion lvl of diferent genes at diferent samples.
+
+
+RNAseq: big interest in expression -> assess between diferent treatments cancers -> what are different genes doing.
+Is RNA telling you the absolute abundance of molecules in a cell? -> NO.
+Is gene A more active than  gene B? -> Count the transcripts.
+
+Gene A Count 1000 basepair 1500
+Gene B Count 2000 basepair 4500
+
+Library prep bias and amplification: Many library prep protocol biased towards certain sequences
+Sequencing effort is also important
+e
+Count = f(n, length of transcript, S, Effort, e)
+# RNA seq is not expression its transcript abundance!!! ->
+Function = series of knobs that are in serie (is that gene present YES/NO) ->
+is it in genome -> gene gain/loss is 1 level of activity.
+
+Gene gain/loss
+chromatin access
+transcribed
+transcript deg
+translational rate
+prot degreed
+prot modification # kinase activity on sets of genes and how it changes the activity of genes -> protein abundance -> i dont know what fraction of protein is active.
+# the abundance of mRNA doesnt tell you exactly about function
+-> Activity
+# --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
+# Question: I have genbank accession number: How do I align the reference genome?
+# How do I download NCBI data based on alignments?
+# Email Ben Evans and ask. For conceptual advise.
+
+
+
+
+
+# --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
+List of bat genomes: Myotis yumanensis,Myotis volans,Myotis thysanodes,Myotis ruber,Myotis riparius,Myotis petax,Myotis oxyotus,Myotis nigricans,Myotis myotis,Myotis muricola,Myotis melanorhinus,Myotis martiniquensis,Myotis macrodactylus,Myotis lucifugus,Myotis leibii,Myotis keaysi,Myotis ikonnikovi,Myotis horsfieldii,Myotis formosus,Myotis evotis,Myotis davidii,Myotis dominicensis,Myotis brandtii,Myotis bombinus,Myotis bechsteinii,Myotis auriculus,Myotis atacamensis,Myotis albescens
+Download
+# wget ftp://ftp.ncbi.nih.gov/genomes/Myotis_lucifugus/ARCHIVE/*
+# wget ftp://ftp.ncbi.nih.gov/genomes//Myotis_lucifugus/*
+
+mkdir Genomes/Myotis_lucifugus
+cd Genomes/Myotis_lucifugus/
+# wget ftp://ftp.ncbi.nih.gov/genomes/Myotis_lucifugus/ARCHIVE/*
+Download a reference genome from Ensembl:
+wget ftp://ftp.ensembl.org/pub/release-95/fasta/myotis_lucifugus/dna/*
+
+
+Add adapters to CDNA: and add adapters and directionality. Happens debofre amplification process. So you ligate to adapters and then amplify using PCR.
+
+# Make some directories:
+cd project/
+mkdir Classes
+Classes/
+mkdir Comparative_genomics
+cd Comparative_genomics/
+mkdir Bat_project
+cd Bat_project/
+ mkdir Code
+
+
+
+ssh -X -Y [email protected]
+newgrp eeb723
+# interactive session with 2 cores and default 10GiB of RAM
+srun -A eeb723 --pty -p interactive -c 2 bash
+mkdir -p /gpfs/ysm/project/eeb723/${USER}/eeb723-seqaln
+singularity shell --shell /bin/bash -B /gpfs/ysm/project/eeb723/${USER}/eeb723-seqaln:/data/eeb723-seqaln docker://eeb723/course_docker
+# Mak a copy of your github alignment:
+https://github.com/diego-ellis-soto/container-based-alignment.git
+cd container-based-alignment
+open . # On macOS, this command opens the current folder in the Finder
diff --git a/Class_notes/28032019 b/Class_notes/28032019
@@ -0,0 +1,81 @@
+# Phylogenetic course:
+
+Maximum likelihood estimation:
+
+Note and edges!
+
+Node any point collected by a branch. Nodes are allmost allways implicit.
+
+Bi-partition: Split tips into 2 groups. -> Split -> A division that occurs along the edge.
+1:1 relation between split and edges.
+
+Often many trees zB bootstrap sample.
+
+Make a list of all the splits you see in all the trees -> Count how often splits are found.
+
+Newick format: Text string to represent a tree.
+Things that are sisters to each other are placed with a coma and in a parentesis.
+
+AND rule: multiply probability
+OR rule: Add probabilities
+Combining AND and OR:
+Whats the probability of all of these things hapenning in combination.
+Independence: Dices are independent.
+
+Likelihood:
+Surprise = what happens and what our expectation was what would have happened.
+Expectation = our model.
+Our implicit model was that these are fair dice.
+
+Likelihood: The probability of the data given the model. How probable was the result we got, given our model understanding of the system
+the likelihood of all mdoels for a particular set of data are NOT expected to sum to one. These likelihood are super small normally.
+
+Maximum likelihood: Whats the probability of the data under all different models -> The model explaining the highest likelihood is the max likelihood.
+Actual probability of any specific outcome aproaches zero with increasing number.
+
+Relationships are depicted by the edges. -> Likelihood of what we see at A given what the ancestral sequence was. Now this is
+conditional likelihood.
+
+Check the legend to tell you what the edge length is. -> edge length could mean anything
+phylogram: Expected number of substitutions per site
+
+Maximum likelihood estimation of a phylogenetic trees:
+
+# Transition probability:
+Likelihood: Probability
+Max likelihood: The hypothesis with the highest likelihood given the data of the model.
+Likelihood for any given hypothesis is allways gonna be very small. ->
+Likelihood function is NOT a probability distribution function -> Does not encessarily sum up to 1.
+
+For likeliood: We need data, a model of evolution, a hypothesis and a mechanism to calculate the likelihood given the above.
+
+MU: gas pedal how fast the matrix is operating. Mean instantenaous rate.
+Relative rate: Tuning term; difference in rate from going to one state to another state.
+Assumptiion: Tme is reversible.
+
+# Sequence:
+
+How certain are you that your model is correct? Im absolutely certain that this is wrong! -> There is no way that these are the actual estiamtes!
+For what I want to know -> Are the vlaues close enough to the actual values to obtain a reasonable estimate of what i wanna predict.
+all models are wrong but some models are useful
+How much variation in the data can the model explain? -> This tells me whether I have a very inaequate model -> I have parameters that do a poor job in fitting my model to the data.
+
+
+Character evolution:
+DNA is a discrete trait.
+
+MK model: Assumes all the relative rates are the same.
+Sturdyng the morphology of the genome. Which orders are genes in, where do the introns go -> Quesitons on spatial organization
+-> Where is this limb, what colour is the limb in that spot?
+
+Cuz phylogenetics methods came out of systematics which came out of museum work which came out of morphology ->
+Analyze morphologica data is more advanced than methods sued to study evolution of genome morphology = comaprative genomics
+Genomics is new and fancy and tehcnologically crazy and descriptive morphology is measuring stuff with a caliper.
+Comparative genomics: scott roy san francisco state
+Intron gain and loss:
+Continious trait models:
+DNA and protein sequence evolution drove discrete trait models forward.
+
+
+Brownian walk: Is a independent proces! Markov process -> Only thing that is determining the next step is where I was on the current step. Nothing where I was before matters.
+The distribution is approximating a normal distributuon. 
diff --git a/Class_notes/4_phylotree b/Class_notes/4_phylotree
@@ -0,0 +1,30 @@
+# --- --- --- --- --- --- --- --- --- --- --- ---
+# [4] Building TLR gene tree
+# --- --- --- --- --- --- --- --- --- --- --- ---
+
+iqtree -s trimed_inout_TLR.fasta -spp All_TLR_gene_tree.nex -bb 1000 -nt AUTO # Could also be: trimed_inout_TLR.fasta.splits.nex Split support values:
+# Bellow are just comments mostly:
+# Now that we have aligned and trimmed our sequences its time to build our gene tree
+# echo 'Lets try IQtree' | conda install -c bioconda iqtree
+# Say bioconda is super useful instead of going to github pages or such to download
+# Construct a maximum likelihood tree
+iqtree -b 100 -nt AUTO -s trimed_inout_TLR.fasta
+This simple command will perform three important steps:
+Model selection (with ModelFinder) to select the best-fit model to the data.
+Reconstruct the ML tree with the selected model.
+Assess branch supports with the ultrafast bootstrap.
+# Once the run is done, IQ-TREE will write several output files including:
+# .iqtree: the main report file that is self-readable. You should look at this file to see the computational results. It also contains a textual representation of the final tree.
+# .treefile: the ML tree in NEWICK format, which can be visualized in FigTree.
+# .log: log file of the entire run (also printed on the screen).
+# .ckp.gz: checkpoint file used to resume an interrupted analysis.
+And a few other files.
+# In R transform the tree to a nexus format:
+require(phytools)
+iqtree -s trimed_inout_TLR.fasta -spp All_TLR_gene_tree.nex -bb 1000 -nt AUTO # Could also be: trimed_inout_TLR.fasta.splits.nex Split support values:
+tree <- read.tree('/Users/diegoellis/projects/development/Comparative_genomics/finalproject/Data/All_TLR_2/trimed_inout_TLR.fasta.contree')
+writeNexus(tree, file = '/Users/diegoellis/projects/development/Comparative_genomics/finalproject/Data/All_TLR_2/All_TLR_gene_tree.nex')
+# We now perform a partition model analysis, where one allows each partition to have its own model:
+# I transformed the concensus tree to nexus:
+iqtree -s trimed_inout_TLR.fasta -spp All_TLR_gene_tree.nex -bb 1000 -nt AUTO # Could also be: trimed_inout_TLR.fasta.splits.nex Split support values:
+iqtree -b 100 -nt -spp
diff --git a/Class_notes/Blast_21022019 b/Class_notes/Blast_21022019
@@ -0,0 +1,21 @@
+# Local alignment tool. Can be subfragment of the query. How do we match types?
+
+
+
+Similarity of diferent aminoacids and nucleotides: Matrices called PAM 250 and Blosum-62 (aminoacid on y and x axis)
+BLAST: Find homologous sequences in NCBI ->
+Find homologous molecules that are more dissimilar is hard
+
+
+BLAST good on reference genome ->
+EST = expressed sequence tag: When ppl did shotgun seq of transcriptome -> make cDNA, then put in plasmid, clone it and put it out in pret dish -> get a tag for each of those transcript -> very early form of shotgun sequencing using sanger sequence.
+EMBI database of protein families: PFAM.
+Interpro is a search engine analogous to blast.
+
+
+Annotation is hard for one species. So for an entire clade (10 species) is really hard ->
+BUT can borrow information across species and co-anothate.
+Maker: Is a portable genome assembly.
+
+Mapping: Have query sequences and have a reference genome -> mapping is when i expect the sequences to be nearly identical.
+# 
diff --git a/Class_notes/Building_a_blast_db_documentation b/Class_notes/Building_a_blast_db_documentation
@@ -0,0 +1,56 @@
+# Go to NCBI Protein:
+[1] Type in TLR7 Myotis
+[2] Download them all together as .FASTA
+[3] Finding a outgroup to the TLR gen:
+The tree is rooted by the outgroup TIR domain of Homo sapiens TLR5
+outgroup TIR domain Homo sapiens interleukin 1 receptor
+interleukin 1 Homo sapiens from the paper # Phylogeny of Toll-Like Receptor Signaling: Adapting the Innate Response Plos One
+https://www.ncbi.nlm.nih.gov/protein
+In NCBI Protein type in "interleukin 1 Homo sapiens"
+Mistake! We need the receptor not just the interleukin 1 -> This is probably just the signaling molecule
+In NCBI Protein type in  "Interleukin-1 Receptor Homo sapiens"
+[4] Blast the downloaded sequenes against the entire NCBI databases ->
+To confirm the predicted and low quality TLR protein sequences
+BlastP
+
+[5] Now we can try to align our sequences on TLR 7 using mafft
+# We Download MAFFT on https://anaconda.org/bioconda/mafft
+# I pasted all my sequences TLR gene and outgroup into a signle new fasta file that I will be aligning with MAST.
+# MAFFT is a multiple sequence alignment program for unix-like operating systems. It offers a range of multiple alignment methods.
+For this we have to put the ingroup (TLR genes) and the outgroup .fasta sequences into a single file.
+
+[6] Lets run mafft
+Now we have aligned our TLR genes with the outgroup TIR gene of humans! Yayyyy
+mafft --auto ingroup_w_outgroup.fasta > aligned_ingroup_w_outgroup.fasta
+# Input output
+When we align things not every thing aligns, could be due to low quality sequences and this  affects my phylogeny when I make it
+# phylogeny needs a degree im similarity in order to move.
+[7] Now we try to install gBlock for trimming
+conda install -c bioconda gblocks
+INDIR=/Users/diegoellis/projects/development/Comparative_genomics/finalproject/Data/TLR_7/
+gblocks and trimmomatic are failing; lets try trimAL
+trimal -in aligned_ingroup_w_outgroup.fasta -out trimmed_al_in_outgroup.fasta  -htmlout trimmed_al_in_outgroup.html -gt 1
+# will remove all columns with any gap (equivalent to -nogaps option)
+
+
+[8] Now that we have aligned and trimmed our sequences its time to build our gene tree
+# Lets try Iqtree
+conda install -c bioconda iqtree
+# Say bioconda is super useful instead of going to github pages or such to download
+iqtree -nt AUTO -s trimmed_al_in_outgroup.fasta # Makes the outgroup # trimmed_al_in_outgroup.fasta.treefile
+
+[9] The tree is in Newack format; i have to download figtree for this:
+ iqtree -nt AUTO -b 100 -s trimmed_al_in_outgroup.fasta  # Make a tree with 100 bootstrap.
+# For command line visualizaiton:
+conda install -c bioconda ete3
+ete3 view --text MyTreeFile.nw
+
+[10] Ape? Try this out.
+ete3 view --text MyTreeFile.nw
+Load a tree and link it to the alignment:
+from ete3 import PhyloTree
+# http://etetoolkit.org/docs/latest/tutorial/tutorial_phylogeny.html
+# Make a pipeline for everything else.
+was there diversification of TLR between bats or did it arose efore that?
+# Get the protein name and accesion number -> Show in my presentation for all my TLR genes.
+#
diff --git a/Class_notes/Casey_Dunn.docx b/Class_notes/Casey_Dunn.docx
diff --git a/Class_notes/Class_notes_presentaitons b/Class_notes/Class_notes_presentaitons
@@ -0,0 +1,45 @@
+# Computational genomics: Finding Ghost loci
+Gene loss
+Sponghe has nervous system
+Ctenophore has nervous system
+Establish loss -> Do a synteny analysis and look for gene loss
+Cnidaria genome: 2 sponghe genomes
+
+# Sampling of animals:
+# Iqtree
+
+# Distribution of scaffold length -> Most genomes/assemblies are pretty fragmented
+# Identify homologous genes -> find homologs with Agalma
+# Does all vs all blast.
+
+
+# generate ML Tree in IQTree -> Fix the topology using ML Tree and estimate divergence time chronograms in BEAST
+# Use FastML infer ancestral states given tree topology and branches
+
+
+
+
+
+
+
+# Nate
+# Nate ask a phylogeny for all the animals of Myotis and the outgroup!
+
+
+# SUMAC -g not really working or recognizing the FASTA file that I gave it.
+# IQtree reading on it.
+# Cant install Uclust !
+
+
+# BUSCO on HPC -> Look at busco scores and assembly length.
+# Assembly length not related to busco completeness
+
+
+# FASTA file of guide sequences not found. Please re-try.
+TLR 3 and 9 failed. TLR 7.
+MAKER gives us GFF files tells us genomic features and places in genome -> takes 5 day -> Use this info for ab initio gene prediciton software.
+
+
+Jim Nunan. 
+#
+Human gneomics -> 3-4 pset +
diff --git a/Class_notes/Comparative_genomics_lecture_notes.docx b/Class_notes/Comparative_genomics_lecture_notes.docx
diff --git a/Class_notes/Genome Assembly_07022019.docx b/Class_notes/Genome Assembly_07022019.docx