Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Final project -Pull request #13

Open
wants to merge 43 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
43 commits
Select commit Hold shift + click to select a range
8d5886e
first change during class
Feb 14, 2019
b19f037
Added class notes
Feb 14, 2019
27a488e
Changed
Feb 21, 2019
15fb274
Added backround
Feb 21, 2019
ca5e5fb
Added goals
Feb 21, 2019
0831506
Changed font size
Feb 21, 2019
2b7feed
Added methods
Feb 21, 2019
b39f114
Adding methods
Feb 21, 2019
5631f8b
Changed gene names
Feb 21, 2019
6a86abc
Changed MS
Feb 21, 2019
8500445
tba
Feb 21, 2019
2945f30
Added references
Feb 21, 2019
645b24c
Added
Feb 21, 2019
19178c9
Added
Feb 21, 2019
8d76b09
Added method question
Feb 21, 2019
014dedc
Added UCSC genome browser
Feb 21, 2019
d6a2f03
Add
Feb 21, 2019
eebcdea
Added
Feb 21, 2019
00f335b
Added quesitons and methods
Feb 28, 2019
9c9587b
Added bat sp list
Feb 28, 2019
c140f1b
A
Feb 28, 2019
e52fb37
less ambitious, more ralistic
Mar 6, 2019
954bab0
Added code
Mar 6, 2019
5ad12ad
Ad
Mar 6, 2019
cde0a92
Updated
Apr 4, 2019
d121a4e
Added Gene bank sequences
Apr 10, 2019
cbcc870
AA
Apr 11, 2019
95eb63c
SUMAC installed finally
Apr 11, 2019
4d01a05
Changes a lot of things
Apr 11, 2019
0318402
Trying SUMAC on HPC
Apr 11, 2019
39e4bb8
Added hpc sumac
Apr 18, 2019
d0718b5
Made gene tree
Apr 20, 2019
3b42049
Added tree!
Apr 21, 2019
8b4dc08
Plotted tree and saved as newick
Apr 21, 2019
58005d1
df
Apr 21, 2019
66082b6
Trying out annotating. Need too add species name to tip
Apr 22, 2019
4e28e2c
Lots of now changes
Apr 25, 2019
3faebf1
changes
Apr 25, 2019
1c5ac8c
Added cleaned code
Apr 25, 2019
ce42c21
Push
Apr 25, 2019
7429314
pushed sumac
Apr 25, 2019
781934c
Final commit
May 1, 2019
0e0f3ff
Last one
May 1, 2019
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
64 changes: 64 additions & 0 deletions Class_notes/20190307_notes
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
# --- --- --- --- --- --- --- --- --- --- --- ---
Genome architecture:
# --- --- --- --- --- --- --- --- --- --- --- ---

Pulldown in molecualr ibology -> have pop of moecules and interested in the subset -> in the puldown i have something that binds the molecules that I want -> Can do a pulldown of the anti body that I want (bound to beets isolated with gravity or magnet).
Chip-seq: Immuno prepitation methods.


Chip-seq: Association protein thats sitting on the genome (DNA), thats one aspect of spatial structure -> Which proteins are close to which sequences.
RNA seq measures transcript abundance -> But often we wanna know how much of a transcript is being produced by a given gene -> GRO-Seq can do this -> Transcript that were recently synthetized.
GRO-seq is strong in conjunciton with other methods.

A&B: Compartments defined by 2 PCA.
A is first order effect -> derived as first 2 eigenvectors of Hi-C matrix.
whenever i get mapping between2 reads i call that association between 2 regions.

Hi-C have multiple cell types but im sampling a sample of organism -> but cells will be in different stages of the cell cycle -> I average out among that -> There is huge value of actually isolating the cells (what happens in RNA-seq).
Now RNA-seq is single cell -> instead of aagregaitng with instrument over many cells -> I can study single cell .

DNA -> RNA -> Prot
DNA is upstrean from RNA and upstream from transcriptome
Maybe transcription as an active process is essentially deforming the genome ->
Maybe the very active transcriptome makes some areas of the DNA more accesible than others.
some drugs disrupt transcirption to apply treatments -> Do Hi-C with drug and Hi-C without drug.
Hi-C there is a single genome in there. when im mapping you observe physical conformation of single molecules.
Hi-C: To build nice scaffold on the genome ->

Hi-C interaction map:
Which part of DNA are close to each other.

Soon de novo assemblies people will come up with fasta file that describes sequence of genome AND add genome structure -> genome diagonal of Hi-C helps you build your genome.
Hi-C: If you tune it, it can be cheap to do ->
restriction enzyme, labeling things AND needs little DNA.
High-C is done in the original tissue -> Add formaldehyde to tissue -> then your getting long ranged data.


DNA is not linear in nucleus. but its precisely organised because of specific mechansism that put it into certain patterns.
CTCF is like a stop sign that stops the extrusion process. Its like a large beet.

Hi-C gives you data on all
4C- focuses on one particular region !

Map the reads to the genome then you take information where your sequencing reeds marks to the genome and use that to gain biological insight.
Chip-seq: Great to ideintify where my favorite transcirption factor .

We need a good annotated genome ->
These tools will be deployed at scaled ->


Genomes are so much work -> by the time you put it together ->
you never just wanted a fasta file thats a genome ->
as method mature -> what will happen -> assembly and annotation becomes more streamlined and ppl go directly to what they wanted to asnwer -> which genes are impacted by XYZ.
-> becomes more biological.

Being an intron of a gene doesnt mean that you have anything to do with this gene.

Query genome with small set of tissue sample willtell us alot -> For gene speciic experiment.
Exciting now: We can apply this of the shelf tols to quickly get down to a set of 10 genes that ar emost interested for my phenoype and my organism _> I can pound or crispr all of these!
-> Tools give me wider build of platform.
-> Being able to deplo y these tools depends on high quality genome to map these too.

NOTE to self:
write a script that pushes to github itself with the input:
basically what goes into git commit -am 'XXX_input'
86 changes: 86 additions & 0 deletions Class_notes/28022019
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
# --- --- --- --- --- --- --- --- --- ---
# Casey Dunn 28.02.2019
# --- --- --- --- --- --- --- --- --- ---
Genomic technologies: Genome is 3D dynamic structure, has a shape in time and space.
Progress understanding genome shape at multiple lvl (not just syntony, order) but also what parts of chromosome are next to other parts.
# --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
Predecessors of RNA-seq: Know your genome before hand -> put probes of genome on wells ->
microArray: often identify dieseases -> compare normal condition with cancer -> Isolate RNA -> generate cDNA -> Look at sequence ends -> get your reads -> Map to genome -> identify the transcriptome and other downstream analysis.
RNA seq: What genes are being experssed and map it to the genome -> Trapnell et al. 2012 Nature Protocols. -> Need reference genome.
When you dont have an annotated genome -> sequence transcriptome with long reads -> Assemble transcriptome and use that as reference genome where you map your RNA back.- >You dont use tophat for that, you use straight bowtie

You have reads, map them with tophat, assemble and merge with cuffmerge, then get a transcriptome and then ....XXXX
RNA often first normalized: Control for different sequencing effort in different sequence samples.
RNA normalization, statistical analysis of ratio of expression between those libraries, we dont look at abolsute expression, its neither expression lvl of diferent genes, its expresion lvl of diferent genes at diferent samples.


RNAseq: big interest in expression -> assess between diferent treatments cancers -> what are different genes doing.
Is RNA telling you the absolute abundance of molecules in a cell? -> NO.
Is gene A more active than gene B? -> Count the transcripts.

Gene A Count 1000 basepair 1500
Gene B Count 2000 basepair 4500

Library prep bias and amplification: Many library prep protocol biased towards certain sequences
Sequencing effort is also important
e
Count = f(n, length of transcript, S, Effort, e)
# RNA seq is not expression its transcript abundance!!! ->
Function = series of knobs that are in serie (is that gene present YES/NO) ->
is it in genome -> gene gain/loss is 1 level of activity.

Gene gain/loss
chromatin access
transcribed
transcript deg
translational rate
prot degreed
prot modification # kinase activity on sets of genes and how it changes the activity of genes -> protein abundance -> i dont know what fraction of protein is active.
# the abundance of mRNA doesnt tell you exactly about function
-> Activity
# --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
# Question: I have genbank accession number: How do I align the reference genome?
# How do I download NCBI data based on alignments?
# Email Ben Evans and ask. For conceptual advise.





# --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- --- ---
List of bat genomes: Myotis yumanensis,Myotis volans,Myotis thysanodes,Myotis ruber,Myotis riparius,Myotis petax,Myotis oxyotus,Myotis nigricans,Myotis myotis,Myotis muricola,Myotis melanorhinus,Myotis martiniquensis,Myotis macrodactylus,Myotis lucifugus,Myotis leibii,Myotis keaysi,Myotis ikonnikovi,Myotis horsfieldii,Myotis formosus,Myotis evotis,Myotis davidii,Myotis dominicensis,Myotis brandtii,Myotis bombinus,Myotis bechsteinii,Myotis auriculus,Myotis atacamensis,Myotis albescens
Download
# wget ftp://ftp.ncbi.nih.gov/genomes/Myotis_lucifugus/ARCHIVE/*
# wget ftp://ftp.ncbi.nih.gov/genomes//Myotis_lucifugus/*

mkdir Genomes/Myotis_lucifugus
cd Genomes/Myotis_lucifugus/
# wget ftp://ftp.ncbi.nih.gov/genomes/Myotis_lucifugus/ARCHIVE/*
Download a reference genome from Ensembl:
wget ftp://ftp.ensembl.org/pub/release-95/fasta/myotis_lucifugus/dna/*


Add adapters to CDNA: and add adapters and directionality. Happens debofre amplification process. So you ligate to adapters and then amplify using PCR.

# Make some directories:
cd project/
mkdir Classes
Classes/
mkdir Comparative_genomics
cd Comparative_genomics/
mkdir Bat_project
cd Bat_project/
mkdir Code



ssh -X -Y [email protected]
newgrp eeb723
# interactive session with 2 cores and default 10GiB of RAM
srun -A eeb723 --pty -p interactive -c 2 bash
mkdir -p /gpfs/ysm/project/eeb723/${USER}/eeb723-seqaln
singularity shell --shell /bin/bash -B /gpfs/ysm/project/eeb723/${USER}/eeb723-seqaln:/data/eeb723-seqaln docker://eeb723/course_docker
# Mak a copy of your github alignment:
https://github.com/diego-ellis-soto/container-based-alignment.git
cd container-based-alignment
open . # On macOS, this command opens the current folder in the Finder
81 changes: 81 additions & 0 deletions Class_notes/28032019
Original file line number Diff line number Diff line change
@@ -0,0 +1,81 @@
# Phylogenetic course:

Maximum likelihood estimation:

Note and edges!

Node any point collected by a branch. Nodes are allmost allways implicit.

Bi-partition: Split tips into 2 groups. -> Split -> A division that occurs along the edge.
1:1 relation between split and edges.

Often many trees zB bootstrap sample.

Make a list of all the splits you see in all the trees -> Count how often splits are found.

Newick format: Text string to represent a tree.
Things that are sisters to each other are placed with a coma and in a parentesis.

AND rule: multiply probability
OR rule: Add probabilities
Combining AND and OR:
Whats the probability of all of these things hapenning in combination.
Independence: Dices are independent.

Likelihood:
Surprise = what happens and what our expectation was what would have happened.
Expectation = our model.
Our implicit model was that these are fair dice.

Likelihood: The probability of the data given the model. How probable was the result we got, given our model understanding of the system
the likelihood of all mdoels for a particular set of data are NOT expected to sum to one. These likelihood are super small normally.

Maximum likelihood: Whats the probability of the data under all different models -> The model explaining the highest likelihood is the max likelihood.
Actual probability of any specific outcome aproaches zero with increasing number.

Relationships are depicted by the edges. -> Likelihood of what we see at A given what the ancestral sequence was. Now this is
conditional likelihood.

Check the legend to tell you what the edge length is. -> edge length could mean anything
phylogram: Expected number of substitutions per site

Maximum likelihood estimation of a phylogenetic trees:

# Transition probability:
Likelihood: Probability
Max likelihood: The hypothesis with the highest likelihood given the data of the model.
Likelihood for any given hypothesis is allways gonna be very small. ->
Likelihood function is NOT a probability distribution function -> Does not encessarily sum up to 1.

For likeliood: We need data, a model of evolution, a hypothesis and a mechanism to calculate the likelihood given the above.

MU: gas pedal how fast the matrix is operating. Mean instantenaous rate.
Relative rate: Tuning term; difference in rate from going to one state to another state.
Assumptiion: Tme is reversible.

# Sequence:

How certain are you that your model is correct? Im absolutely certain that this is wrong! -> There is no way that these are the actual estiamtes!
For what I want to know -> Are the vlaues close enough to the actual values to obtain a reasonable estimate of what i wanna predict.
all models are wrong but some models are useful
How much variation in the data can the model explain? -> This tells me whether I have a very inaequate model -> I have parameters that do a poor job in fitting my model to the data.


Character evolution:
DNA is a discrete trait.

MK model: Assumes all the relative rates are the same.
Sturdyng the morphology of the genome. Which orders are genes in, where do the introns go -> Quesitons on spatial organization
-> Where is this limb, what colour is the limb in that spot?

Cuz phylogenetics methods came out of systematics which came out of museum work which came out of morphology ->
Analyze morphologica data is more advanced than methods sued to study evolution of genome morphology = comaprative genomics
Genomics is new and fancy and tehcnologically crazy and descriptive morphology is measuring stuff with a caliper.
Comparative genomics: scott roy san francisco state
Intron gain and loss:
Continious trait models:
DNA and protein sequence evolution drove discrete trait models forward.


Brownian walk: Is a independent proces! Markov process -> Only thing that is determining the next step is where I was on the current step. Nothing where I was before matters.
The distribution is approximating a normal distributuon.
30 changes: 30 additions & 0 deletions Class_notes/4_phylotree
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# --- --- --- --- --- --- --- --- --- --- --- ---
# [4] Building TLR gene tree
# --- --- --- --- --- --- --- --- --- --- --- ---

iqtree -s trimed_inout_TLR.fasta -spp All_TLR_gene_tree.nex -bb 1000 -nt AUTO # Could also be: trimed_inout_TLR.fasta.splits.nex Split support values:
# Bellow are just comments mostly:
# Now that we have aligned and trimmed our sequences its time to build our gene tree
# echo 'Lets try IQtree' | conda install -c bioconda iqtree
# Say bioconda is super useful instead of going to github pages or such to download
# Construct a maximum likelihood tree
iqtree -b 100 -nt AUTO -s trimed_inout_TLR.fasta
This simple command will perform three important steps:
Model selection (with ModelFinder) to select the best-fit model to the data.
Reconstruct the ML tree with the selected model.
Assess branch supports with the ultrafast bootstrap.
# Once the run is done, IQ-TREE will write several output files including:
# .iqtree: the main report file that is self-readable. You should look at this file to see the computational results. It also contains a textual representation of the final tree.
# .treefile: the ML tree in NEWICK format, which can be visualized in FigTree.
# .log: log file of the entire run (also printed on the screen).
# .ckp.gz: checkpoint file used to resume an interrupted analysis.
And a few other files.
# In R transform the tree to a nexus format:
require(phytools)
iqtree -s trimed_inout_TLR.fasta -spp All_TLR_gene_tree.nex -bb 1000 -nt AUTO # Could also be: trimed_inout_TLR.fasta.splits.nex Split support values:
tree <- read.tree('/Users/diegoellis/projects/development/Comparative_genomics/finalproject/Data/All_TLR_2/trimed_inout_TLR.fasta.contree')
writeNexus(tree, file = '/Users/diegoellis/projects/development/Comparative_genomics/finalproject/Data/All_TLR_2/All_TLR_gene_tree.nex')
# We now perform a partition model analysis, where one allows each partition to have its own model:
# I transformed the concensus tree to nexus:
iqtree -s trimed_inout_TLR.fasta -spp All_TLR_gene_tree.nex -bb 1000 -nt AUTO # Could also be: trimed_inout_TLR.fasta.splits.nex Split support values:
iqtree -b 100 -nt -spp
21 changes: 21 additions & 0 deletions Class_notes/Blast_21022019
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
# Local alignment tool. Can be subfragment of the query. How do we match types?



Similarity of diferent aminoacids and nucleotides: Matrices called PAM 250 and Blosum-62 (aminoacid on y and x axis)
BLAST: Find homologous sequences in NCBI ->
Find homologous molecules that are more dissimilar is hard


BLAST good on reference genome ->
EST = expressed sequence tag: When ppl did shotgun seq of transcriptome -> make cDNA, then put in plasmid, clone it and put it out in pret dish -> get a tag for each of those transcript -> very early form of shotgun sequencing using sanger sequence.
EMBI database of protein families: PFAM.
Interpro is a search engine analogous to blast.


Annotation is hard for one species. So for an entire clade (10 species) is really hard ->
BUT can borrow information across species and co-anothate.
Maker: Is a portable genome assembly.

Mapping: Have query sequences and have a reference genome -> mapping is when i expect the sequences to be nearly identical.
#
56 changes: 56 additions & 0 deletions Class_notes/Building_a_blast_db_documentation
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
# Go to NCBI Protein:
[1] Type in TLR7 Myotis
[2] Download them all together as .FASTA
[3] Finding a outgroup to the TLR gen:
The tree is rooted by the outgroup TIR domain of Homo sapiens TLR5
outgroup TIR domain Homo sapiens interleukin 1 receptor
interleukin 1 Homo sapiens from the paper # Phylogeny of Toll-Like Receptor Signaling: Adapting the Innate Response Plos One
https://www.ncbi.nlm.nih.gov/protein
In NCBI Protein type in "interleukin 1 Homo sapiens"
Mistake! We need the receptor not just the interleukin 1 -> This is probably just the signaling molecule
In NCBI Protein type in "Interleukin-1 Receptor Homo sapiens"
[4] Blast the downloaded sequenes against the entire NCBI databases ->
To confirm the predicted and low quality TLR protein sequences
BlastP

[5] Now we can try to align our sequences on TLR 7 using mafft
# We Download MAFFT on https://anaconda.org/bioconda/mafft
# I pasted all my sequences TLR gene and outgroup into a signle new fasta file that I will be aligning with MAST.
# MAFFT is a multiple sequence alignment program for unix-like operating systems. It offers a range of multiple alignment methods.
For this we have to put the ingroup (TLR genes) and the outgroup .fasta sequences into a single file.

[6] Lets run mafft
Now we have aligned our TLR genes with the outgroup TIR gene of humans! Yayyyy
mafft --auto ingroup_w_outgroup.fasta > aligned_ingroup_w_outgroup.fasta
# Input output
When we align things not every thing aligns, could be due to low quality sequences and this affects my phylogeny when I make it
# phylogeny needs a degree im similarity in order to move.
[7] Now we try to install gBlock for trimming
conda install -c bioconda gblocks
INDIR=/Users/diegoellis/projects/development/Comparative_genomics/finalproject/Data/TLR_7/
gblocks and trimmomatic are failing; lets try trimAL
trimal -in aligned_ingroup_w_outgroup.fasta -out trimmed_al_in_outgroup.fasta -htmlout trimmed_al_in_outgroup.html -gt 1
# will remove all columns with any gap (equivalent to -nogaps option)


[8] Now that we have aligned and trimmed our sequences its time to build our gene tree
# Lets try Iqtree
conda install -c bioconda iqtree
# Say bioconda is super useful instead of going to github pages or such to download
iqtree -nt AUTO -s trimmed_al_in_outgroup.fasta # Makes the outgroup # trimmed_al_in_outgroup.fasta.treefile

[9] The tree is in Newack format; i have to download figtree for this:
iqtree -nt AUTO -b 100 -s trimmed_al_in_outgroup.fasta # Make a tree with 100 bootstrap.
# For command line visualizaiton:
conda install -c bioconda ete3
ete3 view --text MyTreeFile.nw

[10] Ape? Try this out.
ete3 view --text MyTreeFile.nw
Load a tree and link it to the alignment:
from ete3 import PhyloTree
# http://etetoolkit.org/docs/latest/tutorial/tutorial_phylogeny.html
# Make a pipeline for everything else.
was there diversification of TLR between bats or did it arose efore that?
# Get the protein name and accesion number -> Show in my presentation for all my TLR genes.
#
Binary file added Class_notes/Casey_Dunn.docx
Binary file not shown.
45 changes: 45 additions & 0 deletions Class_notes/Class_notes_presentaitons
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
# Computational genomics: Finding Ghost loci
Gene loss
Sponghe has nervous system
Ctenophore has nervous system
Establish loss -> Do a synteny analysis and look for gene loss
Cnidaria genome: 2 sponghe genomes

# Sampling of animals:
# Iqtree

# Distribution of scaffold length -> Most genomes/assemblies are pretty fragmented
# Identify homologous genes -> find homologs with Agalma
# Does all vs all blast.


# generate ML Tree in IQTree -> Fix the topology using ML Tree and estimate divergence time chronograms in BEAST
# Use FastML infer ancestral states given tree topology and branches







# Nate
# Nate ask a phylogeny for all the animals of Myotis and the outgroup!


# SUMAC -g not really working or recognizing the FASTA file that I gave it.
# IQtree reading on it.
# Cant install Uclust !


# BUSCO on HPC -> Look at busco scores and assembly length.
# Assembly length not related to busco completeness


# FASTA file of guide sequences not found. Please re-try.
TLR 3 and 9 failed. TLR 7.
MAKER gives us GFF files tells us genomic features and places in genome -> takes 5 day -> Use this info for ab initio gene prediciton software.


Jim Nunan.
#
Human gneomics -> 3-4 pset +
Binary file not shown.
Binary file added Class_notes/Genome Assembly_07022019.docx
Binary file not shown.
Loading