Bioinformatics-cheatsheet

A cheat sheet for Bioinformaticians. @Github Pages

General Elements
Interactions/Regulations/Associations
Epigenetics
- DNA Methylation
  - DNA Methylation Detection Methods
- Histone Modification
Biological Processes
- Pathways
Drug/Chemicals
- Drug/Small Molecule Database
Mutations and Diseases
Next Generation Sequencing
File Formats
- Formats
- Tools
Math/Statistics

General Elements

DNA/Gene/Genome

Related Terms

DNA: Deoxyribonucleic acid is a molecule that carries the genetic instructions used in the growth, development, functioning and reproduction of all known living organisms and many viruses. @Wiki, @NIH
Gene: A gene is a locus (or region) of DNA which is made up of nucleotides and is the molecular unit of heredity. @Wiki, @NIH
Promoter: In genetics, a promoter is a region of DNA that initiates transcription of a particular gene. Promoters are located near the transcription start sites of genes, on the same strand and upstream on the DNA (towards the 5' region of the sense strand). @Wiki
TSS - Transcription Start Site: The transcription start site is the location where transcription starts at the 5'-end of a gene sequence. @Wiki
Expression (Gene expression): Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins, but in non-protein coding genes such as transfer RNA (tRNA) or small nuclear RNA (snRNA) genes, the product is a functional RNA. @Wiki, @Scitable
Exon: An exon is any part of a gene that will become a part of the final mature RNA produced by that gene after introns have been removed by RNA splicing. @Wiki
Intron: An intron is any nucleotide sequence within a gene that is removed by RNA splicing during maturation of the final RNA product. @Wiki
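
The exon and intron definitions above can be illustrated with a minimal sketch. The toy gene sequence, the exon coordinates, and the function names `splice` and `introns_between` are hypothetical, chosen only for illustration; real annotations (for example GTF/GFF files) use their own coordinate conventions.

```python
def splice(gene_seq, exons):
    """Join exon intervals (0-based, end-exclusive) in order, dropping introns."""
    return "".join(gene_seq[start:end] for start, end in sorted(exons))


def introns_between(exons):
    """Return intron intervals as the gaps between consecutive exons."""
    ordered = sorted(exons)
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if next_start > prev_end
    ]


# Hypothetical toy gene: uppercase marks exons, lowercase marks introns.
gene = "ATGGCC" + "gtaagtttag" + "GATTAC" + "gtcccag" + "TGA"
exons = [(0, 6), (16, 22), (29, 32)]

mature = splice(gene, exons)        # exon sequence only
introns = introns_between(exons)    # intervals removed by splicing
```

Here `mature` contains only the uppercase exon sequence, and `introns` lists the intervals that splicing removes.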

Genome/Sequence Databases

SO - Sequence Ontology: SO is a collaborative ontology project for the definition of sequence features used in biological sequence annotation.
Ensembl genome browser: Ensembl is a joint project between EMBL - EBI and the Sanger Institute to develop a software system which produces and maintains automatic annotation on selected eukaryotic genomes.
Candida Genome Database: Resource for genomic sequence data and gene and protein information for Candida albicans.
WormBase: WormBase is a database of the genetics, genomics and biology of Caenorhabditis elegans and related nematodes.
FlyBase: A database of Drosophila genes & genomes.
MGI - Mouse Genome Informatics: MGI is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease.
RGD - Rat Genome Database: The Rat Genome Database (RGD) is the premier site for genetic, genomic, phenotype, and disease data generated from rat research.
Saccharomyces Genome Database: The Saccharomyces Genome Database (SGD) provides comprehensive integrated biological information for the budding yeast Saccharomyces cerevisiae.

General Gene Databases

H-InvDB: H-Invitational Database (H-InvDB) is an integrated database of human genes and transcripts.
KEGG GENES: KEGG GENES is a collection of gene catalogs for all complete genomes generated from publicly available resources, mostly NCBI RefSeq and GenBank.
HGNC: HGNC is responsible for approving unique symbols and names for human loci, including protein coding genes, ncRNA genes and pseudogenes, to allow unambiguous scientific communication.
GeneCards: GeneCards is a searchable, integrated, database of human genes that provides concise genomic related information, on all known and predicted human genes.
NCBI Gene: A portal to gene-specific content based on NCBI's RefSeq project, information from model organism databases, and links to other resources.
WikiGenes: WikiGenes is a non-profit initiative to provide a global collaborative knowledge base for the life sciences, where authorship matters.
GENCODE: Encyclopedia of genes and gene variants.
Harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. @Ref

Specialized/Disease-associated Gene Databases

CADgene: Coronary Artery Disease Gene Database.
GenAge: The Ageing Gene Database.

Gene Prediction

BGF: BGF is a hidden Markov model (HMM) and dynamic programming based ab initio gene prediction program.

Promoter/TSS Prediction

PePPER: Prediction of prokaryote promoters.
Promoter2.0: Promoter2.0 predicts transcription start sites of vertebrate PolII promoters in DNA sequences.

RNA

Related Terms

RNA: Ribonucleic acid (RNA) is a polymeric molecule implicated in various biological roles in coding, decoding, regulation, and expression of genes. @Wiki
3'-UTR: The 3' untranslated region (3' UTR) is the section of messenger RNA (mRNA) that immediately follows the translation termination codon. @Wiki
5'-UTR: The 5' untranslated region (5' UTR), also known as a leader sequence or leader RNA, is the region of an mRNA that is directly upstream from the initiation codon. @Wiki
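
As a rough illustration of the two UTR definitions above, the following sketch splits an mRNA into 5' UTR, coding sequence, and 3' UTR. It assumes, as a simplification, that translation starts at the first AUG and ends at the first in-frame stop codon; real start-site selection is more complex. The function name `split_mrna` and the example sequence are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def split_mrna(mrna):
    """Return (5' UTR, CDS, 3' UTR), or None if no complete ORF is found.

    Assumption: the first AUG is the initiation codon.
    """
    mrna = mrna.upper().replace("T", "U")
    start = mrna.find("AUG")
    if start == -1:
        return None
    # Walk codon by codon from the start codon to the first in-frame stop.
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3  # the CDS includes the stop codon
            return mrna[:start], mrna[start:end], mrna[end:]
    return None


# Hypothetical toy transcript.
five_utr, cds, three_utr = split_mrna("GGAGCCAUGGCUUAACUUAAA")
```

The 5' UTR is everything before the initiation codon, and the 3' UTR is everything after the termination codon.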

Protein

Related Terms

Protein: Proteins are large biomolecules, or macromolecules, consisting of one or more long chains of amino acid residues. @Wiki
Translation: In molecular biology and genetics, translation is the process in which cellular ribosomes create proteins. In translation, messenger RNA (mRNA)—produced by transcription from DNA—is decoded by a ribosome to produce a specific amino acid chain, or polypeptide. @Wiki
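
The decoding step described above can be sketched in a few lines. This sketch assumes the standard genetic code (NCBI translation table 1) and, as a simplification, starts at the first AUG. The function name `translate` and the example sequences are hypothetical; libraries such as Biopython provide tested implementations.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order U, C, A, G.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG to the first in-frame stop codon."""
    mrna = mrna.upper().replace("T", "U")
    start = mrna.find("AUG")
    if start == -1:
        return ""
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE.get(mrna[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":  # stop codon: release the polypeptide
            break
        peptide.append(aa)
    return "".join(peptide)


peptide = translate("AUGUUUGGCAAAUGA")
```

Each codon (three nucleotides) maps to one amino acid, and a stop codon ends the chain.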

Protein/Protein Domain Databases

iPfam: Protein families database of alignments and HMMs.
iProClass: The iProClass database provides value-added information reports for UniProtKB and unique NCBI Entrez protein sequences in UniParc, with links to over 160 biological databases, including databases for protein families, functions and pathways, interactions, structures and structural classifications, genes and genomes, ontologies, literature, and taxonomy.
MiST: The Microbial Signal Transduction database contains the signal transduction proteins for bacterial and archaeal genomes.
ModBase: ModBase is a database of comparative protein structure models, calculated by modeling pipeline ModPipe.
RCSB PDB: The PDB archive contains information about experimentally-determined structures of proteins, nucleic acids, and complex assemblies.
PepBank: PepBank is a database of peptides based on sequence text mining and public peptide data sources.
PROFESS: PROFESS is a biology database system that integrates databases describing PROtein Functions, Evolution, Structures and Sequences.
ProtCID: PROTein Common Interfaces Database.
SUBA3: The SUBcellular localization database for Arabidopsis proteins.
SynSysNet: Synaptic Proteins Database.
ASD: Allosteric Database.

Enhancer

Related Terms

Enhancer: In genetics, an enhancer is a short (50-1500 bp) region of DNA that can be bound by proteins (activators) to increase the likelihood that transcription of a particular gene will occur. These proteins are usually referred to as transcription factors. @Wiki
Super Enhancer: In genetics, a super-enhancer is a region of the mammalian genome comprising multiple enhancers that is collectively bound by an array of transcription factor proteins to drive transcription of genes involved in cell identity. @Wiki, @Nature Genetics
MPRA: MPRA is a high-throughput technology that enables the analysis of transcriptional activities of thousands of regulatory elements in a single experiment. @Ref

Enhancer Databases

VISTA Enhancer Browser: The VISTA Enhancer Browser is a central resource for experimentally validated human and mouse noncoding fragments with gene enhancer activity as assessed in transgenic mice.
DENdb: DENdb is a centralized on-line repository of predicted enhancers derived from multiple human cell-lines.
dbSUPER: dbSUPER is the first integrated and interactive database of super-enhancers.
SEA: a super-enhancer archive.
EI: Database of EI candidate tissue-specific enhancers: Predicting Tissue-Specific Enhancers in the Human Genome. @Ref
EnhancerAtlas: a resource for enhancer annotation and analysis in 105 human cell/tissue types. @Ref

Enhancer Prediction

DEEP: a general computational framework for predicting enhancers.

Interactions/Regulations/Associations

Transcription Factor - Target

Related Terms

TF - Transcription Factor: Transcription factors are proteins that control which genes are turned on or off in the genome. They do so by binding to DNA and other proteins. @Wiki, @BroadInstitute, @Scitable
PWM - Position Weight Matrix/PSWM/PSSM: A position weight matrix (PWM), also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences. @Wiki
TFBS - Transcription Factor Binding Site/DNA Binding Site: DNA binding sites are a type of binding site found in DNA where other molecules may bind. @Wiki
DNA Sequence Motif: Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function. @Nature Biotechnology, @Wiki
Transcription: Transcription is the first step of gene expression, in which a particular segment of DNA is copied into RNA (mRNA) by the enzyme RNA polymerase. @Wiki
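
To make the PWM and motif definitions above more concrete, here is a minimal sketch that builds a log-odds PWM from aligned binding sites and scans a sequence for the best-scoring window. The uniform background (0.25 per base), the pseudocount of 1, the example sites, and the function names are assumptions for illustration; this is one common formulation, not the method of any particular database listed here.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM (one dict per position) from equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [site[pos].upper() for site in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def score(pwm, window):
    """Sum the per-position log-odds scores of a window (A/C/G/T only)."""
    return sum(column[base] for column, base in zip(pwm, window.upper()))


def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max(
        (score(pwm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


# Hypothetical aligned binding sites and a sequence to scan.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATACT"]
pwm = build_pwm(sites)
top_score, top_start = best_hit(pwm, "GGCGTATAATGCC")
```

The pseudocount keeps unobserved bases from producing a log of zero; windows containing characters other than A, C, G, or T are not handled in this sketch.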

Transcription Factor Databases

AnimalTFDB: AnimalTFDB is a comprehensive database including classification and annotation of genome-wide transcription factors (TFs), transcription co-factors and chromatin remodeling factors in 65 animal genomes.
DBD: DBD is a database of predicted transcription factors in completely sequenced genomes.
PlantTFDB: Plant transcription factor database, a portal for the functional and evolutionary study of plant transcription factors.
TFCat: The curated catalog of mouse and human transcription factors.
TFdb: The Mouse transcription factor database (TFdb) is a database containing mouse transcription factor genes and their related genes.

TFBS/TF Binding Motif/TF Target Databases

Cistrome DB: Cistrome DB is a comprehensive collection of human (hg38) and mouse (mm10) ChIP-seq data.
CollecTF: CollecTF is a database of transcription factor binding sites (TFBS) in the Bacteria domain.
CTCFBSDB: A database for CTCF binding sites and genome organization.
FactorBook: This website organizes the analysis results of ENCODE TF ChIP-seq data, integrated with other ENCODE data such as ChIP-seq of histone marks and nucleosome occupancy.
footprintDB: footprintDB is a web server for assigning putative cis DNA motifs to input transcription factors (TFs) and, conversely, for predicting which TFs might recognize input DNA motifs.
hmChIP: hmChIP is a database of genome-wide chromatin immunoprecipitation (ChIP) data in human and mouse.
HOCOMOCO: HOmo sapiens COmprehensive MOdel COllection (HOCOMOCO) contains transcription factor (TF) binding models represented as classic Position Weight Matrices (PWMs, also known as Position-Specific Scoring Matrices, PSSMs) and precalculated score thresholds.
HOMER Motif Database: This database is maintained as part of HOMER and is mostly based on the analysis of public ChIP-Seq data sets.
hPDI: The hPDI database holds experimental protein-DNA interaction data for humans identified by protein microarray assays.
HTRIdb: Human Transcriptional Regulation Interaction Database.
JASPAR: The high-quality transcription factor binding profile database.
MAPPER: MAPPER is a platform for the computational identification of transcription factor binding sites (TFBSs) in multiple genomes, that combines TRANSFAC® and JASPAR data with the search power of profile hidden Markov models (HMMs).
MotifMap: The MotifMap system provides comprehensive maps of candidate regulatory elements encoded in the genomes of model species using databases of transcription factor binding motifs, refined genome alignments, and a comparative genomic statistical approach - Bayesian Branch Length Score.
oPOSSUM: oPOSSUM is a web-based system for the detection of over-represented conserved transcription factor binding sites and binding site combinations in sets of genes or sequences.
SwissRegulon: Swissregulon Database contains genome-wide annotations of regulatory sites.
TFBSshape: TFBSshape provides DNA shape features for transcription factor binding sites (TFBSs) that, in addition to sequence features (usually in the form of position weight matrices, PWMs), characterize the DNA binding specificities of transcription factors (TFs) from 23 different species.
TRANSFAC: TRANSFAC® is a unique knowledge-base containing published data on eukaryotic transcription factors and miRNAs, their experimentally-proven binding sites, and regulated genes.
UniPROBE: The UniPROBE (Universal PBM Resource for Oligonucleotide Binding Evaluation) database hosts data generated by universal protein binding microarray (PBM) technology on the in vitro DNA binding specificities of proteins.

TFBS Prediction

DeepSEA: DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factors binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. @Ref

Protein-DNA Interaction Detection Methods

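PBM - Protein Binding Microarray: A protein binding microarray (PBM) is an in vitro technology for measuring the DNA binding specificities of proteins.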
ChIP: Chromatin Immunoprecipitation (ChIP) is a type of immunoprecipitation experimental technique used to investigate the interaction between proteins and DNA in the cell. @Wiki
ChIP-seq: ChIP-sequencing, also known as ChIP-seq, is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. @Wiki
ChIP-chip: ChIP-chip (also known as ChIP-on-chip) is a technology that combines chromatin immunoprecipitation ("ChIP") with DNA microarray ("chip"). Like regular ChIP, ChIP-on-chip is used to investigate interactions between proteins and DNA in vivo. @Wiki

Protein-Protein/Chemical Interaction

Related Terms

PPI - Protein-Protein Interaction: Protein–protein interactions (PPIs) refer to lasting or ephemeral physical contacts of high specificity established between two or more protein molecules as a result of biochemical events steered by electrostatic forces including the hydrophobic effect. @Wiki

Protein-Protein Interaction Databases

2P2Idb: 2P2I_DB is a hand-curated database dedicated to the structure of protein-protein complexes with known small molecule inhibitors.
3D-Interologs: 3D-Interologs is a cross-species interaction database inferred from three-dimensional (3D) protein structure complexes, with a novel scoring function that uses 3D-domain interologs.
3DID: The database of three-dimensional interacting domains (3did) is a collection of high-resolution three-dimensional structural templates for domain-domain interactions.
ANAP: Arabidopsis Network Analysis Pipeline.
AntiJen: AntiJen v2.0 is a database containing quantitative binding data for peptides binding to MHC Ligand, TCR-MHC Complexes, T Cell Epitope, TAP, B Cell Epitope molecules and immunological Protein-Protein interactions.
APID: APID (Agile Protein Interactomes DataServer) provides a comprehensive collection of protein interactomes for more than 400 organisms based on the integration of known experimentally validated protein-protein physical interactions (PPIs).
ASPD: ASPD (Artificial Selected Proteins/Peptides Database) is a curated database of proteins and peptides selected from randomized pools.
ATDB: ATDB mainly focuses on constructing a global-scale animal toxin-channel interaction network based on literature and database annotations.
AtPID: Arabidopsis thaliana Protein Interactome Database.
Bacteriome.org: Bacterial Protein Interaction Database for Escherichia coli.
BIANA: Biologic Interaction and Network Analysis.
BID: Binding Interface Database.
BioGRID: BioGRID is an online interaction repository with data compiled through comprehensive curation efforts.
BISC: BISC (BInary SubComplex Database) is a protein-protein interaction (PPI) database intended to bridge the two communities most active in their characterisation: structural biology and functional genomics researchers.
CCSB Interactome Database: Center for Cancer Systems Biology Interactome Database.
ComSim: Database of protein structures in bound (Complex) and unbound (Single) states.
CORUM: Comprehensive resource of mammalian protein complexes.
CTDB: Calmodulin Target Database.
CutDB: Proteolytic Event Database.
DeathDomain: A manually curated database of protein-protein interactions for Death Domain Superfamily.
DIMA: DIMA is a Domain Interaction MAp and aims at becoming a comprehensive resource for functional and physical interactions among conserved protein-domains.
DIP: The DIP™ database catalogs experimentally determined interactions between proteins.
DOMINE: DOMINE is a database of known and predicted protein domain (domain-domain) interactions.
DOMINO: DOMINO is an open-access database comprising more than 3900 annotated experiments describing interactions mediated by protein-interaction domains.
DOMMINO: Database of MacroMolecular Interactions.
DroID: DroID is a comprehensive gene and protein interactions (interactome) database designed specifically for the model organism Drosophila.
DroPNet: Drosophila Protein Network.
EciD: E. coli Interaction Database.
FunCoup: FunCoup is a framework to infer genome-wide functional couplings in 11 model organisms.
Gene3D: Gene3D takes CATH domain families (from PDB structures) and assigns them to the millions of protein sequences with no PDB structures (using hidden Markov models generated from HMMER).
gpDB: a database of GPCRs, G-proteins, Effectors and their interactions.
GWIDD: Genome-WIde protein Docking Database.
HCPIN: Human Cancer Protein Interaction Network.
HCVpro: Hepatitis C Virus Protein Interaction Database.
HINT: HINT (High-quality INTeractomes) is a database of high-quality protein-protein interactions in different organisms.
HitPredict: HitPredict is a resource of experimentally determined protein-protein interactions with reliability scores.
HIV-1 Human Interaction Database: The HIV-1, human interactions project collates published reports of two types of interactions - protein interactions, and human gene knock-downs that affect virus replication and infectivity (reported as 'replication interactions').
HIVMID: HIV Molecular Immunology Database.
HotRegion: A Database of Cooperative Hotspots.
HP-DPI: Helicobacter pylori Database of Protein Interactomes.
HPID: Human Protein Interaction Database.
HPIDB: Host-Pathogen Interaction Database.
HPRD: The Human Protein Reference Database represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome.
Human-gpDB: A database of human GPCRs, G-proteins, Effectors and their interactions.
HumanPSD: Human Proteome Survey Database.
HuPI: database of the Human Proteotheque Initiative.
I2D: Interologous Interaction Database.
IBIS: Inferred Biomolecular Interactions (protein-protein, protein-small molecule, protein nucleic acids and protein-ion interactions) Server.
ICBS: A database of protein-protein interactions mediated by interchain β-sheet formation.
IMEx: The International Molecular Exchange Consortium.
iMOTdb: Interacting motifs in proteins database.
InnateDB: A Knowledge Resource for Innate Immunity Interactions and Pathways.
INstruct: a database of 3D protein interactome networks with structural resolution.
IntAct: IntAct provides a freely available, open source database system and analysis tools for molecular interaction data.
Interactome: Krogan Lab Interactome Database.
InterDom: InterDom is a database of putative interacting protein domains derived from multiple sources, ranging from domain fusions (Rosetta Stone), protein interactions (DIP and BIND), protein complexes (PDB), to scientific literature (MEDLINE).
InterEvol: InterEvol database is designed for the analysis of co-evolution events at the interface of known structures of hetero- and homo-oligomers.
Interfaces: Dataset of protein-protein interfaces.
Interolog: Interolog/Regulog Database.
InteroPorc: InteroPorc is an automatic prediction tool to infer protein-protein interaction networks.
iRefIndex: iRefIndex provides an index of protein interactions available in a number of primary interaction databases including BIND, BioGRID, CORUM, DIP, HPRD, InnateDB, IntAct, MatrixDB, MINT, MPact, MPIDB and MPPI.
iRefWeb: Interaction Reference Index Web Interface.
IRView: a database and viewer of interacting regions (IRs) in protein sequences.
MatrixDB: MatrixDB stores experimental data established by full-length proteins, matricryptins, glycosaminoglycans, lipids and cations.
MINT: MINT focuses on experimentally verified protein-protein interactions mined from the scientific literature by expert curators.
MIPS-MPPI: MIPS Mammalian Protein-Protein Interaction Database.
MPI-LIT: the microbial protein interaction database.
MPID: Magnaporthe grisea Protein-protein Interaction Database.
MPID-T: MHC-Peptide Interaction Database.
MPIDB: Microbial Protein Interaction Database.
NCG: NCG collects information on duplicability, orthology, evolutionary appearance and protein interactions network (PIN) properties of 736 cancer genes.
NCPI: Neurospora Crassa Protein Interactome Database.
Negatome: The Negatome is a collection of protein and domain pairs which are unlikely to be engaged in direct physical interactions.
PRISM: Protein Interactions by Structural Matching.
PCRPi-DB: PCRPi-DB is a database of computationally annotated hot spots in protein interfaces.
PDZBase: PDZBase is a manually curated protein-protein interaction database developed specifically for interactions involving PDZ domains.
PICCOLO: PICCOLO is a comprehensive database of structurally-characterized protein-protein interactions described at atomic level.
PIPs: PIPs is a database of predicted human protein-protein interactions.
PiSITE: PiSITE is a web-based database of protein interaction sites.
PPIRA: Protein-Protein Interactions between Ralstonia solanacearum and Arabidopsis thaliana.
PDBePISA: PDBePISA is an interactive tool for the exploration of macromolecular interfaces.
PRIN: Predicted Rice Interactome Database.
RKD: Rice Kinase Database.
SCOPPI: Structural classification of protein-protein interfaces.
SCOWLP: structural classification of protein binding regions for atomic comparative analysis of protein interactions.
SNAPPI-DB: Structures, iNterfaces and Alignments for Protein-Protein Interactions.
STRING: functional protein association networks.
Struct2Net: Structure-based Computational Predictions of Protein-Protein Interactions.
SYFPEITHI: Database of MHC Ligands and Peptide Motifs.
TissueNet: The Database of Human Tissue Protein-Protein Interactions.
TRIP: a manually curated database of protein-protein interactions for mammalian TRP channels.
Wiki-Pi: a wiki resource centred on human protein-protein interactions.
XooNET: Integrated Protein-Protein Interaction database of Xanthomonas oryzae pathovar oryzae KACC1031.

Protein-Chemical Interaction Databases

ChemProt: The ChemProt 3.0 server is a resource of annotated and predicted chemical-protein interactions.

PPI Detection Methods

CoIP - Co-immunoprecipitation: Co-immunoprecipitation is considered to be the gold standard assay for protein–protein interactions, especially when it is performed with endogenous (not overexpressed and not tagged) proteins. The protein of interest is isolated with a specific antibody. Interaction partners that stick to this protein are subsequently identified by Western blotting. Interactions detected by this approach are considered to be real.
Bimolecular fluorescence complementation: Bimolecular fluorescence complementation (BiFC) is a technique for observing the interactions of proteins. Combined with other techniques, it can be used to screen protein–protein interactions and their modulators.
Affinity electrophoresis: Affinity electrophoresis is used for the estimation of binding constants, for instance in lectin affinity electrophoresis, or for the characterization of molecules with specific features such as glycan content or ligand binding.
Pull-down assays: Pull-down assays are a common variation of immunoprecipitation and immunoelectrophoresis and are used identically, although this approach is more amenable to an initial screen for interacting proteins.
Label transfer: Label transfer can be used for screening or confirmation of protein interactions and can provide information about the interface where the interaction takes place. Label transfer can also detect weak or transient interactions that are difficult to capture using other in vitro detection strategies. In a label transfer reaction, a known protein is tagged with a detectable label. The label is then passed to an interacting protein, which can then be identified by the presence of the label.
Y2H - Yeast Two-Hybrid: A Y2H screen investigates the interaction between artificial fusion proteins inside the nucleus of yeast. This approach can identify binding partners of a protein in an unbiased manner.
Phage display: Phage display is used for the high-throughput screening of protein interactions.
TAP - Tandem Affinity Purification: The TAP method allows high-throughput identification of protein interactions. In contrast to the yeast two-hybrid approach, the accuracy of the method can be compared to that of small-scale experiments, and the interactions are detected within the correct cellular environment, as with co-immunoprecipitation. However, the TAP tag method requires two successive steps of protein purification and consequently cannot readily detect transient protein–protein interactions.
Cross-link/Chemical cross-linking: Chemical cross-linking is often used to "fix" protein interactions in place before trying to isolate/identify interacting proteins. Common crosslinkers for this application include the non-cleavable NHS-ester cross-linker, bissulfosuccinimidyl suberate (BS3); a cleavable version of BS3, dithiobis(sulfosuccinimidyl propionate) (DTSSP); and the imidoester cross-linker dimethyl dithiobispropionimidate (DTBP) that is popular for fixing interactions in ChIP assays.
SPINE: SPINE (Strep-protein interaction experiment) uses a combination of reversible crosslinking with formaldehyde and incorporation of an affinity tag to detect interaction partners in vivo.
Quantitative immunoprecipitation combined with knock-down: QUICK relies on co-immunoprecipitation, quantitative mass spectrometry (SILAC) and RNA interference (RNAi). This method detects interactions among endogenous non-tagged proteins. Thus, it has the same high confidence as co-immunoprecipitation. However, this method also depends on the availability of suitable antibodies.
Proximity ligation assay: The in situ proximity ligation assay (PLA) is an immunohistochemical method utilizing so-called PLA probes for detection of proteins, protein interactions and modifications.

Correlation Databases

CORNET: CORrelation NETworks.
GeneMANIA: GeneMANIA finds other genes that are related to a set of input genes, using a very large set of functional association data.

Epigenetics

"The study of mitotically and/or meiotically heritable changes in gene function that cannot be explained by changes in DNA sequence." @Ref

DNA Methylation

DNA Methylation Detection Methods

MeDIP/mDIP - Methylated DNA immunoprecipitation: Methylated DNA immunoprecipitation (MeDIP or mDIP) is a large-scale (chromosome- or genome-wide) purification technique in molecular biology that is used to enrich for methylated DNA sequences. @Wiki
MeDIP-seq: The MeDIP-seq approach, i.e. the coupling of MeDIP with next-generation, short-read sequencing technologies such as 454 or Illumina (Solexa), was first described by Down et al. in 2008. The high-throughput sequencing of the methylated DNA fragments produces a large number of short reads (36-50 bp or 400 bp, depending on the technology).

Histone Modification

Biological Processes

Pathways

Related Terms

Pathway (Biological pathway): A biological pathway is a series of actions among molecules in a cell that leads to a certain product or a change in a cell. Such a pathway can trigger the assembly of new molecules, such as a fat or protein. Pathways can also turn genes on and off, or spur a cell to move. @Wiki, @NIH

Pathway Databases

ConsensusPathDB: ConsensusPathDB is a database that integrates different types of functional interactions between physical entities in the cell like genes, RNA, proteins, protein complexes and metabolites in order to assemble a more complete and a less biased picture of cellular biology.
KEGG PATHWAY: KEGG PATHWAY is a collection of manually drawn pathway maps representing our knowledge on the molecular interaction and reaction networks for different aspects.
MetaCyc: MetaCyc is a curated database of experimentally elucidated metabolic pathways from all domains of life.
MouseCyc: MouseCyc is a database of curated biochemical pathways data for the laboratory mouse that can be integrated with functional and phenotypic data from MGI.
PANTHER Pathway: PANTHER Pathway consists of over 177, primarily signaling, pathways, each with subfamilies and protein sequences mapped to individual pathway components.
Pathway Commons: Pathway Commons aims to store and disseminate knowledge about biological pathways. Information is sourced from public pathway databases and is readily searched, visualized, and downloaded.
Reactome: Reactome is a free, open-source, curated and peer reviewed pathway database.
PLANTCYC: PlantCyc is a metabolic pathway reference database containing more than 800 pathways and their catalytic enzymes and genes, as well as compounds from over 350 plant species. It includes: AraCyc (Arabidopsis thaliana col), BarleyCyc (Hordeum vulgare), BrachypodiumCyc (Brachypodium distachyon), CassavaCyc (Manihot esculenta esculenta), ChineseCabbageCyc (Brassica rapa ssp. pekinensis), ChlamyCyc (Chlamydomonas reinhardtii), CornCyc (Zea mays mays), GrapeCyc (Vitis vinifera), MossCyc (Physcomitrella patens), OryzaCyc (Oryza sativa japonica group), PapayaCyc (Carica papaya), PoplarCyc (Populus trichocarpa, other Populus species and hybrids), PotatoCyc (Solanum tuberosum), SelaginellaCyc (Selaginella moellendorffii), SetariaCyc (Setaria italica), SorghumBicolorCyc (Sorghum bicolor), SoyCyc (Glycine max), SpirodelaCyc (Spirodela polyrhiza), SwitchgrassCyc (Panicum virgatum), TomatoCyc (Solanum lycopersicum), WheatACyc (Triticum urartu), WheatDCyc (Aegilops tauschii).
SignaLink: An integrated resource to analyze signaling pathway cross-talks, transcription factors, miRNAs and regulatory enzymes.
SMPDB: SMPDB (The Small Molecule Pathway Database) is an interactive, visual database containing more than 618 small molecule pathways found in humans. More than 70% of these pathways (>433) are not found in any other pathway database.
Yeast Pathways Database: The Yeast Pathways Database is a collection of manually curated metabolic pathways and enzymes of Saccharomyces cerevisiae.

Pathway Predictions

PIUMet: Inferring Disease-Modifying Pathways and Hidden Components via Integrative Analysis of Metabolite Features with Various Omic Data. @Ref

Pathway/Network analysis/visualizers

HotNet2: HotNet2 is an algorithm for finding significantly altered subnetworks in a large gene interaction network. While originally developed for use with cancer mutation data, the current release also supports any application in which meaningful scores can be assigned to genes in the network. @Ref
CellMaps: CellMaps is an open source HTML5 web-based application that allows researchers to easily model, visualize, integrate data and analyse biological networks inside a web browser.

Drug/Chemicals

Drug/Small Molecule Database

AHD2.0: The aim of the Arabidopsis hormone database is to provide a systematic and comprehensive view of morphological phenotypes regulated by plant hormones, as well as regulatory genes participating in numerous plant hormone responses.
DrugBank: The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information.
TTD - Therapeutic Target Database: A database to provide information about the known and explored therapeutic protein and nucleic acid targets, the targeted disease, pathway information and the corresponding drugs directed at each of these targets.
STITCH: STITCH is a resource to explore known and predicted interactions of chemicals and proteins.

Mutations and Diseases

GWAS

Related Terms

GWAS: In genetics, a genome-wide association study (GWA study, or GWAS), also known as whole genome association study (WGA study, or WGAS), is an examination of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait. @Wiki
Meta-analysis: Meta-analysis is routinely used for pooling the results from genome-wide association studies (GWAS). @Review
TWAS: transcriptome-wide association study through expression imputation. @Ref
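
As an illustration of how GWAS results can be pooled, the sketch below implements an inverse-variance-weighted fixed-effect meta-analysis for a single variant. This is one common pooling scheme, not necessarily the one used by any tool listed here; the function names `fixed_effect_meta` and `two_sided_p` are hypothetical, and the inputs are assumed to be per-study effect estimates with their standard errors.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effect meta-analysis (illustrative).

    betas: per-study effect estimates for one variant
    ses:   matching standard errors (all assumed > 0)
    Returns (pooled_beta, pooled_se, z_score).
    """
    if len(betas) != len(ses) or not betas:
        raise ValueError("betas and ses must be non-empty and the same length")
    # Each study is weighted by the inverse of its variance.
    weights = [1.0 / (se * se) for se in ses]
    total = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled_beta, pooled_se, pooled_beta / pooled_se


def two_sided_p(z):
    """Two-sided p-value for a z-score under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    # Made-up numbers, for demonstration only.
    beta, se, z = fixed_effect_meta([0.10, 0.20], [0.05, 0.05])
    print(beta, se, z, two_sided_p(z))
```

Studies with smaller standard errors dominate the pooled estimate, which is why sample size differences between cohorts matter when results are combined.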

Meta-Analysis Tools

RAREMETAL: RAREMETAL is a program that facilitates the meta-analysis of rare variants from genotype arrays or sequencing.

Downstream GWAS Analysis

PrediXcan: PrediXcan is a gene-based association test that prioritizes genes that are likely to be causal for the phenotype. @Ref
MetaXcan: MetaXcan is an extension of PrediXcan method, that infers the results of PrediXcan using only summary statistics.
PredictDB: This PredictDB Data Repository hosts genetic prediction models of transcriptome levels to be used with PrediXcan and MetaXcan.

Disease/Phenotype Databases/Ontologies

HPO - Human Phenotype Ontology: The Human Phenotype Ontology (HPO) aims to provide a standardized vocabulary of phenotypic abnormalities encountered in human disease.
DO - Disease Ontology: The Disease Ontology semantically integrates disease and medical vocabularies through extensive cross mapping of DO terms to MeSH, ICD, NCI’s thesaurus, SNOMED and OMIM.
MeSH - Medical Subject Headings: MeSH is the National Library of Medicine's controlled vocabulary thesaurus. It consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity.

Germline Mutations and Genetic Diseases

Related Terms

CNV - Copy-Number Variation: Copy-number variations (CNVs) are a form of structural variation that manifest as deletions or duplications in the genome. @Wiki(https://www.genome.gov/25520880/deoxyribonucleic-acid-dna-fact-sheet/)
SNP: A single nucleotide polymorphism, often abbreviated to SNP (pronounced snip; plural snips), is a variation in a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population (e.g. >1%). @Wiki, @NIH
Genotype: The genotype is the part (DNA sequence) of the genetic makeup of a cell, and therefore of an organism or individual, which determines a specific characteristic (phenotype) of that cell/organism/individual. @Wiki
PheWAS: Phenome-wide association studies (PheWAS) analyze many phenotypes compared to a single genetic variant (or other attribute). @Tool
Haplotype: A haplotype (haploid genotype) is a group of genes in an organism that are inherited together from a single parent. A haplogroup is a group of similar haplotypes that share a common ancestor with a single-nucleotide polymorphism mutation. @Wiki, @Scitable
IBD - Identity By Descent: A DNA segment is identical by state (IBS) in two or more individuals if they have identical nucleotide sequences in this segment. An IBS segment is identical by descent (IBD) in two or more individuals if they have inherited it from a common ancestor without recombination, that is, the segment has the same ancestral origin in these individuals. @Wiki
Phasing/Haplotype Estimation: In genetics, haplotype estimation (also known as "phasing") refers to the process of statistical estimation of haplotypes from genotype data.
ICD: The International Statistical Classification of Diseases and Related Health Problems, usually called by the short-form name International Classification of Diseases (ICD), is the international "standard diagnostic tool for epidemiology, health management and clinical purposes". @Wiki
MAF - Minor Allele Frequency: MAF refers to the frequency at which the second most common allele occurs in a given population. SNPs with a minor allele frequency of 5% or greater were targeted by the HapMap project.
LOH - Loss Of Heterozygosity: Loss of heterozygosity (LOH) is a gross chromosomal event that results in loss of the entire gene and the surrounding chromosomal region. @Wiki
Missense Mutation: A missense mutation is a point mutation in which a single nucleotide change results in a codon that codes for a different amino acid. @Wiki
Nonsense Mutation: A nonsense mutation is a point mutation in a sequence of DNA that results in a premature stop codon, or a nonsense codon in the transcribed mRNA, and in a truncated, incomplete, and usually nonfunctional protein product.
Rare Variant: A rare functional variant is a genetic variant which alters gene function, and which occurs at low frequency in a population. @Wiki
Allele Frequency: Allele frequency, or gene frequency, is the relative frequency of an allele (variant of a gene) at a particular locus in a population, expressed as a fraction or percentage. @Wiki
SV - Structural Variation: Structural variation (also genomic structural variation) is the variation in structure of an organism's chromosome. It consists of many kinds of variation in the genome of one species, and usually includes microscopic and submicroscopic types, such as deletions, duplications, copy-number variants, insertions, inversions and translocations.
Transition vs Transversion: DNA substitution mutations are of two types. Transitions are interchanges of two-ring purines (A <-> G) or of one-ring pyrimidines (C <-> T): they therefore involve bases of similar shape. Transversions are interchanges of purine for pyrimidine bases, which therefore involve exchange of one-ring and two-ring structures (A <-> C, A <-> T, G <-> C, G <-> T).
[IUPAC codes](http://www.bioinformatics.org/sms/iupac.html): The International Union of Pure and Applied Chemistry (IUPAC) has defined a standard representation of DNA bases by single characters that specify either a single base (e.g. G for guanine, A for adenine) or a set of bases (e.g. R for either G or A). UCSC uses these single character codes to represent multiple observed alleles of single-base polymorphisms. @UCSC
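
Several of the terms above are easy to make concrete. The following is a minimal sketch, with hypothetical helper names, that classifies a substitution as a transition or transversion, expands an IUPAC nucleotide code into the bases it stands for, and computes a minor allele frequency from diploid genotype strings. The genotype encoding (two-letter strings such as "AG") is an assumption made for the example.

```python
from collections import Counter

PURINES = {"A", "G"}      # two-ring bases
PYRIMIDINES = {"C", "T"}  # one-ring bases

# IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError("ref and alt must be two different bases from ACGT")
    # Same chemical class (purine<->purine or pyrimidine<->pyrimidine)
    # means transition; a class change means transversion.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


def expand_iupac(code):
    """Return the set of bases represented by one IUPAC code."""
    return set(IUPAC[code.upper()])


def minor_allele_frequency(genotypes):
    """Frequency of the second most common allele.

    genotypes: iterable of diploid genotype strings such as "AG"
    (an assumed encoding for this sketch).
    """
    counts = Counter(allele for g in genotypes for allele in g)
    if len(counts) < 2:
        return 0.0  # monomorphic site: no minor allele observed
    total = sum(counts.values())
    return counts.most_common(2)[1][1] / total


if __name__ == "__main__":
    print(classify_substitution("A", "G"))
    print(sorted(expand_iupac("R")))
    print(minor_allele_frequency(["AA", "AG", "GG", "AA"]))
```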

Genetic Variant/Disease Databases

dbSNP: The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI).
GWASdb: GWASdb is an online bioinformatics database that combines collections of genetic variants (GVs) from GWAS with their comprehensive functional annotations and disease classifications.
GWAS Central: GWAS Central provides a centralized compilation of summary level findings from genetic association studies, both large and small.
OMIM - Online Mendelian Inheritance in Man®: An Online Catalog of Human Genes and Genetic Disorders
eMERGE: eMERGE is a national network that combines DNA biorepositories with electronic medical record (EMR) systems for large scale, high-throughput genetic research in support of implementing genomic medicine.
International HapMap Project: The International HapMap Project was an organization that aimed to develop a haplotype map (HapMap) of the human genome, to describe the common patterns of human genetic variation. HapMap is used to find genetic variants affecting health, disease and responses to drugs and environmental factors.
MutDB: A database for assessing the impact of genetic variants. @Ref
The Genome of the Netherlands: 250 trios (father, mother and child) of Dutch descent.
ClinVar: ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. ClinVar thus facilitates access to and communication about the relationships asserted between human variation and observed health status, and the history of that interpretation. @Ref
DGAP: The Developmental Genome Anatomy Project (DGAP) aims to identify apparently balanced chromosomal rearrangements in patients with multiple congenital anomalies and then to use these chromosomal rearrangements to map and identify genes that are disrupted or dysregulated in critical stages of human development.
DECIPHER: DECIPHER (DatabasE of genomiC varIation and Phenotype in Humans using Ensembl Resources) is an interactive web-based database which incorporates a suite of tools designed to aid the interpretation of genomic variants.
ClinGen: ClinGen is a National Institutes of Health (NIH)-funded resource dedicated to building an authoritative central resource that defines the clinical relevance of genes and variants for use in precision medicine and research.

Tools

SNPRelate: SNPRelate is a high-performance computing R package for multi-core symmetric multiprocessing computer architectures. It accelerates two key computations in GWAS: principal component analysis (PCA) and relatedness analysis using identity-by-descent (IBD) measures.
EIGENSOFT: The EIGENSOFT package combines functionality from our population genetics methods @Ref, EIGENSTRAT stratification correction method @Ref, and FastPCA and PC-based selection statistic @Ref.
PLATO: The PLatform for the Analysis, Translation, and Organization of large-scale data (PLATO) is a standalone program written in C++ that is designed to be a flexible and extensible analysis tool for a wide variety of genetic data.

Mutation-Protein Structure Studies

HotSpot3D: This 3D proximity tool can be used to identify the mutation hotspots in the linear 1D sequence and correlates these hotspots with known or potential interacting domains based on both known intermolecular interactions and calculated proximity for potential intramolecular interactions. @Ref
MuPIT_Interactive: webserver for mapping variant positions to annotated, interactive 3D structures. @Ref
Interactome3D: Interactome3D is a web service for the structural annotation of protein-protein interaction networks. @Ref
CLUMP: CLUMP (CLustering by Mutation Position) is an unsupervised clustering of amino acid residue positions where variants occur, without any prior knowledge of their functional importance. @Ref

Somatic Mutations and Cancers

Related Terms

Cancer Predisposition Genes: Genes in which germline mutations confer highly or moderately increased risks of cancer. @Nature

Cancer Data Repositories

cBioPortal: The cBioPortal for Cancer Genomics provides visualization, analysis and download of large-scale cancer genomics data sets. @Ref, @Ref
TCGA: The Cancer Genome Atlas (TCGA) is a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) that has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer.
COSMIC: COSMIC is an online database of somatically acquired mutations found in human cancer.
ProteinPaint: Explorer for genomic alteration in pediatric cancer. @Ref

Next Generation Sequencing

Techniques

DNA-seq: DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. @Wiki
RNA-seq: RNA-seq (RNA sequencing), also called whole transcriptome shotgun sequencing (WTSS), uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment in time. @Wiki
CLIP-Seq: @Wiki
FAIRE-seq: @Wiki
DNase-Seq: @Wiki
CAGE: @Wiki
ChIA-PET: @Wiki
5C/Hi-C: @Wiki
Promoter Capture Hi-C: Promoter capture Hi-C (PCHi-C) allows the genome-wide interrogation of physical interactions between distal DNA regulatory elements and gene promoters in multiple tissue contexts.

NGS Data Repositories

1000 Genomes Project: The 1000 Genomes Project ran between 2008 and 2015, creating the largest public catalogue of human variation and genotype data.
Array Express: an NIH-funded database at the European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL - EBI) that collects and disseminates microarray-based gene-expression data.
DDBJ: DNA Data Bank of Japan (DDBJ) is a data bank organized by the National Institute of Genetics in Japan that collects sequence data.
ENCODE: The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.
GEO: GEO is a public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted.
GermOnline: The GermOnline 4.0 gateway is a cross-species microarray expression database focusing on germline development, meiosis and gametogenesis as well as the mitotic cell cycle.
Roadmap Epigenomics Project: The NIH Roadmap Epigenomics Mapping Consortium was launched with the goal of producing a public resource of human epigenomic data to catalyze basic biology and disease-oriented research.
Expression Atlas: The Expression Atlas provides information on gene expression patterns under different biological conditions such as a gene knock out, a plant treated with a compound, or in a particular organism part or cell. @Ref
ExAC: The Exome Aggregation Consortium (ExAC) is a coalition of investigators seeking to aggregate and harmonize exome sequencing data from a wide variety of large-scale sequencing projects, and to make summary data available for the wider scientific community.
WTCCC: The Wellcome Trust Case Control Consortium (WTCCC) was established with an aim to harness the power of newly-available genotyping technologies to improve our understanding of the aetiological basis of several major causes of global disease.
CommonMind Consortium: The CMC is generating data across multiple regions from >1000 postmortem brain samples from donors with Schizophrenia, Bipolar disease and individuals with no neuropsychiatric disorders - originating from tissue collections at four brain banks. Data consists of DNA and RNA sequencing, genotyping and epigenetics.
TOPMed: The NHLBI Trans-Omics for Precision Medicine (TOPMed) program will support the Institute’s larger precision medicine activities by collecting and coupling whole-genome sequencing (WGS) and other -omics data (e.g., DNA methylation signature, RNA expression profiles, metabolite profiles) with molecular, behavioral, imaging, environmental, and clinical data from studies focused on heart, lung, blood and sleep (HLBS) disorders.

NGS Data Analysis

Read Simulation

ReadSim: ReadSim is a fast and simple reads simulator to target long reads such as PacBio or Nanopore.
simNGS: simNGS is software for simulating observations from Illumina sequencing machines using the statistical models behind the AYB base-calling software.

Read Trimming

Trimmomatic: A flexible read trimming tool for Illumina NGS data.
Sickle: A windowed adaptive trimming tool for FASTQ files using quality.
famas: Yet another program for FastQ massaging with features: Quality- and length-based trimming, Random sampling, Splitting into multiple files, Order checking for paired-end files, Native gzip support.
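
To give a feel for what quality-based trimming does, here is a minimal sliding-window sketch. It is not the exact algorithm of Trimmomatic, Sickle, or famas; the function names, the default window of 4 bases, and the mean-quality threshold of 20 are assumptions chosen for illustration. Quality strings are assumed to use the Phred+33 encoding.

```python
def phred33_scores(quality_string):
    """Decode a Phred+33 FASTQ quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def sliding_window_trim(seq, qual, window=4, min_mean_q=20):
    """Cut the read at the first window whose mean quality is too low.

    Scans from the 5' end. Reads shorter than the window are returned
    unchanged (a simplification made for this sketch).
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    scores = phred33_scores(qual)
    cut = len(seq)
    for start in range(0, max(len(scores) - window + 1, 0)):
        window_scores = scores[start:start + window]
        if sum(window_scores) / window < min_mean_q:
            cut = start  # drop this window and everything after it
            break
    return seq[:cut], qual[:cut]


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
    print(sliding_window_trim("ACGTACGT", "IIII####"))
```

Production trimmers typically also handle adapter removal, paired-end synchronization, and minimum-length filtering, none of which is shown here.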

De-Duplication

Picard MarkDuplicates: This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA.
sambamba-markdup: Find duplicate reads in BAM file.

Alignment

bwa: BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome.
bowtie2: Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.
ABRA: ABRA is a realigner for next generation sequencing data. It uses localized assembly and global realignment to align reads more accurately, thus improving downstream analysis (detection of indels and complex variants in particular). @Ref
NextGenMap: NextGenMap (NGM) is a flexible and fast read mapping program that is more than twice as fast as BWA, while achieving a mapping sensitivity similar to Stampy or Bowtie2. @Ref

Quality Control

ClinQC: ClinQC is an integrated and user-friendly pipeline for quality control, filtering and trimming of Sanger and NGS sequencing data for hundred to thousands of samples/patients in a single run in clinical research.
NGS QC Toolkit: A toolkit for the quality control (QC) of next generation sequencing (NGS) data.

Peak Calling/Differential Peak Calling

Peak calling is a computational method used to identify areas in a genome that have been enriched with aligned reads as a consequence of performing a ChIP-sequencing or MeDIP-seq experiment. These areas are those where a protein interacts with DNA.
Differential peak calling is about identifying significant differences in two ChIP-seq signals.
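
The idea of "enrichment" can be sketched with a simple Poisson test: count reads in fixed-size windows and flag windows whose count is unlikely under a uniform background rate. This is a deliberately simplified illustration, not the model used by MACS or any other tool listed below; the function names, the single genome-wide background rate, and the threshold `alpha` are all assumptions.

```python
import math


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    # Accumulate P(X <= k - 1) term by term.
    term = math.exp(-lam)
    cdf = term
    for i in range(1, k):
        term *= lam / i
        cdf += term
    return max(0.0, 1.0 - cdf)


def call_enriched_windows(counts, background_rate, alpha=1e-3):
    """Return indices of windows with more reads than background explains.

    counts:          read counts per fixed-size window
    background_rate: expected reads per window under no enrichment
                     (assumed constant across the genome in this sketch)
    """
    return [
        i for i, count in enumerate(counts)
        if poisson_sf(count, background_rate) < alpha
    ]


if __name__ == "__main__":
    # Made-up window counts, for demonstration only.
    print(call_enriched_windows([2, 3, 25, 1], background_rate=2.0))
```

Real peak callers additionally model local background variation, use control samples, and correct for multiple testing.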

MACS: Model-based analysis of ChIP-seq (MACS) is a computational algorithm that identifies genome-wide locations of transcription/chromatin factor binding or histone modification from ChIP-seq data.
DBChIP: detects differentially bound sharp binding sites across multiple conditions, with or without matching control samples.
MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets.
THOR: Differential peak calling of ChIP-seq signals with replicates. @Ref
ODIN: ODIN is an HMM-based approach to detect and analyse differential peaks in pairs of ChIP-seq data. ODIN performs genomic signal processing, peak calling and p-value calculation in an integrated framework. @Ref
MMDiff: This package detects statistically significant difference between read enrichment profiles in different ChIP-Seq samples. @Ref

RNA-seq data analyses

RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. @Ref
Limma: Linear Models for Microarray and RNA-Seq Data.
edgeR: Empirical Analysis of Digital Gene Expression Data in R.
DESeq: Differential gene expression analysis based on the negative binomial distribution.
Cufflinks: Transcriptome assembly and differential expression analysis for RNA-Seq.
MISO: MISO (Mixture-of-Isoforms) is a probabilistic framework that quantitates the expression level of alternatively spliced genes from RNA-Seq data, and identifies differentially regulated isoforms or exons across samples. @Ref

(Capture) Hi-C data analyses

CHiCAGO: CHiCAGO is a set of tools for calling significant interactions in Capture HiC data, such as Promoter Capture HiC. @Ref

Chromatin status data analysis

CENTIPEDE: CENTIPEDE applies a hierarchical Bayesian mixture model to infer regions of the genome that are bound by particular transcription factors. @Ref

Variant Calling

FaSD: a fast and accurate single-nucleotide polymorphism detection program that uses a binomial distribution-based algorithm and a mutation probability.
SOAPsnp: SOAPsnp uses a method based on Bayes’ theorem (the reverse probability model) to call consensus genotype by carefully considering the data quality, alignment, and recurring experimental errors.
SNVmix: SNVMix is designed to detect single nucleotide variants from next generation sequencing data.
CNVnator: a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads.
bcftools: utilities for variant calling and manipulating VCFs and BCFs.
GATK: Genome Analysis Toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping.
Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format.
CONSERTING: integrating copy-number analysis with structural-variation detection. @Ref
CREST: CREST (Clipping Reveals Structure) is a new algorithm for detecting genomic structural variations at base-pair resolution using next-generation sequencing data. @Ref
Control-FREEC: Control-FREEC is a tool for detection of copy-number changes and allelic imbalances (including LOH) using deep-sequencing data.
HMMcopy: Copy number prediction with correction for GC and mappability bias for HTS data.
SegSeq: an algorithm to identify chromosomal breakpoints using massively parallel sequence data.
CNV-seq: a new method to detect copy number variation using high-throughput sequencing.
BICseq2: BICseq2 is an algorithm developed for the normalization of high-throughput sequencing (HTS) data and detection of copy number variations (CNV) in the genome. BICseq2 can be used for detecting CNVs with or without a control genome.
MuSE: a novel approach to mutation calling based on the F81 Markov substitution model for molecular evolution, which models the evolution of the reference allele to the allelic composition of the matched tumor and normal tissue at each genomic locus. @Ref
VarScan: a platform-independent software tool developed at the Genome Institute at Washington University to detect variants in NGS data.
Pindel: Pindel can detect breakpoints of large deletions, medium sized insertions, inversions, tandem duplications and other structural variants at single-base resolution from next-gen sequence data.
COPS: A Sensitive and Accurate Tool for Detecting Somatic Copy Number Alterations Using Short-Read Sequence Data from Paired Samples. COPS is available at ftp://115.119.160.213 with username “cops” and password “cops”. @Ref
multiSNV: multiSNV is a tool for calling somatic single-nucleotide variants (SNVs) using NGS data from a normal and multiple tumour samples of the same patient. @Ref
SomaticSeq: SomaticSeq is a flexible post-somatic-mutation-calling workflow for improved accuracy. @Ref
Cosmos: COSMOS can detect somatic structural variations from whole genome short-read sequences. @Ref
Platypus: Platypus is a tool designed for efficient and accurate variant-detection in high-throughput sequencing data.

Variant Filtering

SnpSift: SnpSift is a toolbox that allows you to filter and manipulate annotated files.
Varapp: Varapp is an open-source web application to filter variants from large sets of exome data stored in a relational database.

Variant Annotators

ANNOVAR: ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes (including human genome hg18, hg19, hg38, as well as mouse, worm, fly, yeast and many others).
SnpEff: Genetic variant annotation and effect prediction toolbox. @Ref
GEMINI: GEMINI (GEnome MINIng) is a flexible framework for exploring genetic variation in the context of the wealth of genome annotations available for the human genome. @Ref
Variant Effect Predictor: Analyse your own variants and predict the functional consequences of known and unknown variants via our Variant Effect Predictor (VEP) tool. @Ref
VAT - Variant Annotation Tool: A computational framework to functionally annotate variants in personal genomes using a cloud-computing environment. @Ref
SeattleSeq Variation Annotation: The SeattleSeq Annotation server provides annotation of SNVs (single-nucleotide variations) and small indels, both known and novel. @Ref
Jannovar: A Java Library for Exome Annotation.
Cellbase: CellBase is a scalable and high-performance NoSQL database that integrates relevant biological information from well-known data sources such as Ensembl, Uniprot, IntAct or ClinVar among others. All this data can be queried through a comprehensive RESTful web services API or using the command line interface. Also, a built-in variant annotator has been developed and can be used to annotate files containing variants in Variant Call Format (VCF). @Ref
GenomeD3Plot: GenomeD3Plot (formerly Islandplot) is an SVG based genome viewer written in javascript using D3.
Variant Tools: variant tools is a software tool for the manipulation, annotation, selection, simulation, and analysis of variants in the context of next-gen sequencing analysis. @Ref
GWAVA: GWAVA is a tool which aims to predict the functional impact of non-coding genetic variants based on a wide range of annotations of non-coding elements (largely from ENCODE/GENCODE), along with genome-wide properties such as evolutionary conservation and GC-content.

Variant prioritization

Mutsig: MutSig analyzes lists of mutations discovered in DNA sequencing, to identify genes that were mutated more often than expected by chance given background mutation processes. @Ref

Variant Simulation

SCNVSim: somatic copy number variation and structure variation simulator. @Ref

Haplotype Estimation Tools

PHASE: A program for reconstructing haplotypes from population data. PHASE was limited by its speed and was not applicable to datasets from genome-wide association studies. @Ref
fastPHASE: fastPHASE is software that implements methods for estimating missing genotypes and reconstructing haplotypes from unphased SNP genotype data of unrelated individuals. @Ref
IMPUTE2: IMPUTE version 2 (also known as IMPUTE2) is a genotype imputation and haplotype phasing program based on ideas from Howie et al. 2009.

NGS Data/Variant/Genome Visualizers/Browsers/Diagrams

UCSC Genome Browser: The UCSC Genome Browser is an on-line genome browser hosted by the University of California, Santa Cruz (UCSC).
JBrowse: a JavaScript genome browser by the open-source Generic Model Organism Database project. @Ref
Synthesis-View: A SNP visualization tool. @Ref
IGV - Integrative Genomics Viewer: A high-performance visualization tool for interactive exploration of large, integrated genomic datasets. @Ref
pileup.js: a Browser-based Genome Viewer.
Biodalliance: Biodalliance is a fast, interactive, genome visualization tool that's easy to embed in web pages and applications.
TADkit: TADkit is a HTML5 and JavaScript-based 3D genome browser. It makes use of D3.js for rendering the 1D and 2D tracks and WebGl by Three.js for rendering the 3D track.
ngs.plot: ngs.plot is a program that allows you to easily visualize your next-generation sequencing (NGS) samples at functional genomic regions. @Ref
CHiCP: a web-based tool for the integrative and interactive visualization of promoter capture Hi-C datasets. @Ref
WashU Epigenome Browser: An Epigenome browser.

NGS Data Analysis Pipeline/framework

nextflow: Nextflow is a fluent DSL modelled around the UNIX pipe concept, that simplifies writing parallel and scalable pipelines in a portable manner.
RUbioSeq+: RUbioSeq+ is a stand-alone and multiplatform application for the integrated analysis of NGS data. More specifically, our software implements pipelines for the analysis of single nucleotide and copy-number variation, bisulfite-seq and ChIP-seq experiments using well-established tools to perform these common tasks.
SpeedSeq: A flexible framework for rapid genome analysis and interpretation. @Ref
HICUP: pipeline for mapping and processing Hi-C data. @Ref
OpEx: Provides a fixed implementation of alignment, calling and annotation tools optimized for individual or multiple exome sequencing analysis in the research or clinical setting. @Ref

File Formats

Formats

BAM: BAM is the compressed binary version of the Sequence Alignment/Map (SAM) format, a compact and index-able representation of nucleotide sequence alignments.
BED: The BED format consists of one line per feature, each containing 3-12 columns of data, plus optional track definition lines. @UCSC, @bedtools, @Ensembl
BigBed: The bigBed format stores annotation items that can either be simple, or a linked collection of exons, much as BED files do. BigBed files are created initially from BED type files, using the program bedToBigBed. The resulting bigBed files are in an indexed binary format. @UCSC
BigWig: The BigWig format is designed for dense, continuous data that is intended to be displayed as a graph. Files can be created from WIG or BedGraph files using the appropriate utility program. @UCSC
GFF: The GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines. @UCSC, @Ensembl, GTF(GFFv2)@GMOD, @Wiki
WIG: The WIG (wiggle) format is designed for display of dense continuous data such as probability scores. Wiggle data elements must be equally sized; if you need to display continuous data that is sparse or contains elements of varying size, use the BedGraph format instead. @UCSC, @Ensembl
SAM: The SAM Format is a text format for storing sequence data in a series of tab delimited ASCII columns. @Wiki
VCF: The Variant Call Format (VCF) specifies the format of a text file used in bioinformatics for storing gene sequence variations. v4.0@1000genomes, @Wiki
MAF - Mutation Annotation Format: A Mutation Annotation Format (MAF) file (.maf) is a tab-delimited text file that lists mutations. Tutorial@Biostars
pileup: Pileup format is a text-based format for summarizing the base calls of aligned reads to a reference sequence. @Wiki
fasta: FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. @Wiki
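
A short sketch of how two of these text formats fit together: parsing FASTA records and pulling out the subsequence named by a BED line. BED coordinates are 0-based with an exclusive end, which maps directly onto Python slicing. The helper names are hypothetical, and only the first three BED columns (chrom, start, end) are used.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {record name: sequence}.

    The record name is the first whitespace-delimited token after '>'.
    """
    records = {}
    name = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            header = line[1:].split()
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def parse_bed_line(line):
    """Return (chrom, start, end) from the first three BED columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least 3 tab-separated columns")
    return fields[0], int(fields[1]), int(fields[2])


def extract_interval(records, bed_line):
    """Return the sequence covered by one BED interval."""
    chrom, start, end = parse_bed_line(bed_line)
    # BED is 0-based, half-open, so it maps directly to a slice.
    return records[chrom][start:end]


if __name__ == "__main__":
    fasta = ">chr1 example\nACGT\nTTGA\n>chr2\nGGCC\n"
    records = parse_fasta(fasta)
    print(extract_interval(records, "chr1\t2\t6"))
```

Note that VCF and SAM positions are 1-based, so converting between them and BED requires an offset of one on the start coordinate.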

Tools

BEDOPS: the fast, highly scalable and easily-parallelizable genome analysis toolkit.
bedtools: Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks.
bwtool: bwtool is a command-line utility for bigWig files.
sambamba: samtools functions with multi-threading support.
samtools: SAM Tools provide various utilities for manipulating alignments in the SAM/BAM format, including sorting, merging, indexing and generating alignments in a per-position format.
VCFtools: A set of tools written in Perl and C++ for working with VCF files.
Picard: Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
SVTools: Tools for processing and analyzing structural variants.

Math/Statistics

Tests

T Test: A t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis. It can be used to determine if two sets of data are significantly different from each other. @Wiki
Chi-Squared Test: also referred to as a χ² test, is any statistical hypothesis test wherein the sampling distribution of the test statistic is a chi-square distribution when the null hypothesis is true. @Wiki
GSEA - Gene Set Enrichment Analysis: Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).
fgsea: An R-package for fast preranked gene set enrichment analysis (GSEA).
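
The two classical tests above can be computed by hand with the standard library. The sketch below shows Welch's unequal-variance form of the two-sample t statistic (one member of the t-test family) and Pearson's chi-squared statistic for a 2x2 contingency table. Function names are hypothetical; the chi-squared p-value uses the closed form for one degree of freedom, and all expected cell counts are assumed to be non-zero.

```python
import math
from statistics import mean, variance


def welch_t(x, y):
    """Welch's two-sample t statistic and its approximate degrees of freedom."""
    nx, ny = len(x), len(y)
    # Squared standard errors of the two sample means.
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n  # assumed non-zero
            stat += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the upper tail is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


if __name__ == "__main__":
    # Made-up data, for demonstration only.
    print(welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6]))
    print(chi2_2x2(10, 20, 20, 10))
```

A 2x2 table of this kind is what a basic allelic association test compares: allele counts in cases versus controls.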

Machine Learning

SVM - Support Vector Machine: support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. @Wiki

Clustering

Biclustering: Biclustering, block clustering, co-clustering, or two-mode clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. @Wiki
