A cheat sheet for Bioinformaticians. @Github Pages
- General Elements
- Interactions/Regulations/Associations
- Epigenetics
- Biological Processes
- Drug/Chemicals
- Mutations and Diseases
- Next Generation Sequencing
- Techniques
- NGS Data Repositories
- NGS Data Analysis
- Read Simulation
- Read Trimming
- De-Duplication
- Alignment
- Quality Control
- Peak Calling/Differential Peak Calling
- RNA-seq data analyses
- (Capture) Hi-C data analyses
- Chromatin status data analysis
- Variant Calling
- Variant Filtering
- Variant Annotators
- Variant prioritization
- Variant Simulation
- Haplotype Estimation Tools
- NGS Data/Variant/Genome Visualizers/Browsers/Diagrams
- NGS Data Analysis Pipeline/framework
- File Formats
- Math/Statistics
- DNA: Deoxyribonucleic acid is a molecule that carries the genetic instructions used in the growth, development, functioning and reproduction of all known living organisms and many viruses. @Wiki, @NIH
- Gene: A gene is a locus (or region) of DNA which is made up of nucleotides and is the molecular unit of heredity. @Wiki, @NIH
- Promoter: In genetics, a promoter is a region of DNA that initiates transcription of a particular gene. Promoters are located near the transcription start sites of genes, on the same strand and upstream on the DNA (towards the 5' region of the sense strand). @Wiki
- TSS - Transcription Start Site: The transcription start site is the location where transcription starts at the 5'-end of a gene sequence. @Wiki
- Expression (Gene expression): Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins, but in non-protein coding genes such as transfer RNA (tRNA) or small nuclear RNA (snRNA) genes, the product is a functional RNA. @Wiki, @Scitable
- Exon: An exon is any part of a gene that will become a part of the final mature RNA produced by that gene after introns have been removed by RNA splicing. @Wiki
- Intron: An intron is any nucleotide sequence within a gene that is removed by RNA splicing during maturation of the final RNA product. @Wiki
- SO - Sequence Ontology: SO is a collaborative ontology project for the definition of sequence features used in biological sequence annotation.
- Ensembl genome browser: Ensembl is a joint project between EMBL-EBI and the Sanger Institute to develop a software system which produces and maintains automatic annotation on selected eukaryotic genomes.
- Candida Genome Database: Resource for genomic sequence data and gene and protein information for Candida albicans.
- WormBase: WormBase is a database for the genetics and genomics of Caenorhabditis elegans and related nematodes.
- FlyBase: FlyBase: a database of Drosophila Genes & Genomes.
- MGI - Mouse Genome Informatics: MGI is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease.
- RGD - Rat Genome Database: The Rat Genome Database (RGD) is the premier site for genetic, genomic, phenotype, and disease data generated from rat research.
- Saccharomyces Genome Database: The Saccharomyces Genome Database (SGD) provides comprehensive integrated biological information for the budding yeast Saccharomyces cerevisiae.
- H-InvDB: H-Invitational Database (H-InvDB) is an integrated database of human genes and transcripts.
- KEGG GENES: KEGG GENES is a collection of gene catalogs for all complete genomes generated from publicly available resources, mostly NCBI RefSeq and GenBank.
- HGNC: HGNC is responsible for approving unique symbols and names for human loci, including protein coding genes, ncRNA genes and pseudogenes, to allow unambiguous scientific communication.
- GeneCards: GeneCards is a searchable, integrated, database of human genes that provides concise genomic related information, on all known and predicted human genes.
- NCBI Gene: A portal to gene-specific content based on NCBI's RefSeq project, information from model organism databases, and links to other resources.
- WikiGenes: WikiGenes is a non-profit initiative to provide a global collaborative knowledge base for the life sciences, where authorship matters.
- GENCODE: Encyclopedia of genes and gene variants.
- Harmonizome: a collection of processed datasets gathered to serve and mine knowledge about genes and proteins. @Ref
- BGF: It is a hidden Markov model (HMM) and dynamic programming based ab initio gene prediction program.
- PePPER: Prediction of prokaryote promoters.
- Promoter2.0: Promoter2.0 predicts transcription start sites of vertebrate PolII promoters in DNA sequences.
- RNA: Ribonucleic acid (RNA) is a polymeric molecule implicated in various biological roles in coding, decoding, regulation, and expression of genes. @Wiki
- 3'-UTR: The 3' untranslated region (3' UTR) is the section of messenger RNA (mRNA) that immediately follows the translation termination codon. @Wiki
- 5'-UTR: The 5' untranslated region (5′ UTR) (also known as a Leader Sequence or Leader RNA) is the region of an mRNA that is directly upstream from the initiation codon. @Wiki
- Protein: Proteins are large biomolecules, or macromolecules, consisting of one or more long chains of amino acid residues. @Wiki
- Translation: In molecular biology and genetics, translation is the process in which cellular ribosomes create proteins. In translation, messenger RNA (mRNA)—produced by transcription from DNA—is decoded by a ribosome to produce a specific amino acid chain, or polypeptide. @Wiki
- iPfam: Protein families database of alignments and HMMs.
- iProClass: The iProClass database provides value-added information reports for UniProtKB and unique NCBI Entrez protein sequences in UniParc, with links to over 160 biological databases, including databases for protein families, functions and pathways, interactions, structures and structural classifications, genes and genomes, ontologies, literature, and taxonomy.
- MiST: The Microbial Signal Transduction database contains the signal transduction proteins for bacterial and archaeal genomes.
- ModBase: ModBase is a database of comparative protein structure models, calculated by modeling pipeline ModPipe.
- RCSB PDB: The PDB archive contains information about experimentally-determined structures of proteins, nucleic acids, and complex assemblies.
- PepBank: PepBank is a database of peptides based on sequence text mining and public peptide data sources.
- PROFESS: PROFESS is a biology database system that integrates databases describing PROtein Functions, Evolution, Structures and Sequences.
- ProtCID: PROTein Common Interfaces Database.
- SUBA3: The SUBcellular localization database for Arabidopsis proteins.
- SynSysNet: Synaptic Proteins Database.
- ASD: Allosteric Database.
- Enhancer: In genetics, an enhancer is a short (50-1500 bp) region of DNA that can be bound by proteins (activators) to increase the likelihood that transcription of a particular gene will occur. These proteins are usually referred to as transcription factors. @Wiki
- Super Enhancer: In genetics, a super-enhancer is a region of the mammalian genome comprising multiple enhancers that is collectively bound by an array of transcription factor proteins to drive transcription of genes involved in cell identity. @Wiki, @Nature Genetics
- MPRA: MPRA is a high-throughput technology that enables the analysis of transcriptional activities of thousands of regulatory elements in a single experiment. @Ref
- VISTA Enhancer Browser: The VISTA Enhancer Browser is a central resource for experimentally validated human and mouse noncoding fragments with gene enhancer activity as assessed in transgenic mice.
- DENdb: DENdb is a centralized on-line repository of predicted enhancers derived from multiple human cell-lines.
- dbSUPER: dbSUPER is the first integrated and interactive database of super-enhancers.
- SEA: a super-enhancer archive.
- EI: Database of EI candidate tissue-specific enhancers: Predicting Tissue-Specific Enhancers in the Human Genome. @Ref
- EnhancerAtlas: a resource for enhancer annotation and analysis in 105 human cell/tissue types. @Ref
- DEEP: a general computational framework for predicting enhancers
- TF - Transcription Factor: Transcription factors are proteins that control which genes are turned on or off in the genome. They do so by binding to DNA and other proteins. @Wiki, @BroadInstitute, @Scitable
- PWM - Position Weight Matrix/PSWM/PSSM: A position weight matrix (PWM), also known as a position-specific weight matrix (PSWM) or position-specific scoring matrix (PSSM), is a commonly used representation of motifs (patterns) in biological sequences. @Wiki
- TFBS - Transcription Factor Binding Site/DNA Binding Site: DNA binding sites are a type of binding site found in DNA where other molecules may bind. @Wiki
- DNA Sequence Motif: Sequence motifs are short, recurring patterns in DNA that are presumed to have a biological function. @Nature Biotechnology, @Wiki
- Transcription: Transcription is the first step of gene expression, in which a particular segment of DNA is copied into RNA (mRNA) by the enzyme RNA polymerase. @Wiki
- AnimalTFDB: AnimalTFDB is a comprehensive database including classification and annotation of genome-wide transcription factors (TFs), transcription co-factors and chromatin remodeling factors in 65 animal genomes.
- DBD: DBD is a database of predicted transcription factors in completely sequenced genomes.
- PlantTFDB: Plant transcription factor database, a portal for the functional and evolutionary study of plant transcription factors.
- TFCat: TFCat: The curated catalog of mouse and human transcription factors.
- TFdb: The Mouse transcription factor database (TFdb) is a database containing mouse transcription factor genes and their related genes.
- Cistrome DB: Cistrome DB is a comprehensive collection of hg38 and mm10 ChIP-seq data.
- CollecTF: CollecTF is a database of transcription factor binding sites (TFBS) in the Bacteria domain.
- CTCFBSDB: A database for CTCF binding sites and genome organization.
- FactorBook: This website organizes the analysis results of ENCODE TF ChIP-seq data, integrated with other ENCODE data such as ChIP-seq of histone marks and nucleosome occupancy.
- footprintDB: footprintDB is a web server for assigning putative cis DNA motifs to input transcription factors (TFs) and, conversely, for predicting which TFs might recognize input DNA motifs.
- hmChIP: hmChIP is a database of genome-wide chromatin immunoprecipitation (ChIP) data in human and mouse.
- HOCOMOCO: HOmo sapiens COmprehensive MOdel COllection (HOCOMOCO) contains transcription factor (TF) binding models represented as classic Position Weight Matrices (PWMs, also known as Position-Specific Scoring Matrices, PSSMs) and precalculated score thresholds.
- HOMER Motif Database: This database is maintained as part of HOMER and is mostly based on the analysis of public ChIP-Seq data sets.
- hPDI: The hPDI database holds experimental protein-DNA interaction data for humans identified by protein microarray assays.
- HTRIdb: Human Transcriptional Regulation Interaction Database.
- JASPAR: The high-quality transcription factor binding profile database.
- MAPPER: MAPPER is a platform for the computational identification of transcription factor binding sites (TFBSs) in multiple genomes, that combines TRANSFAC® and JASPAR data with the search power of profile hidden Markov models (HMMs).
- MotifMap: The MotifMap system provides comprehensive maps of candidate regulatory elements encoded in the genomes of model species using databases of transcription factor binding motifs, refined genome alignments, and a comparative genomic statistical approach - Bayesian Branch Length Score.
- oPOSSUM: oPOSSUM is a web-based system for the detection of over-represented conserved transcription factor binding sites and binding site combinations in sets of genes or sequences.
- SwissRegulon: Swissregulon Database contains genome-wide annotations of regulatory sites.
- TFBSshape: TFBSshape provides DNA shape features for transcription factor binding sites (TFBSs) that, in addition to sequence features (usually in the form of position weight matrices, PWMs), characterize DNA binding specificities of transcription factors (TFs) from 23 different species.
- TRANSFAC: TRANSFAC® is a unique knowledge-base containing published data on eukaryotic transcription factors and miRNAs, their experimentally-proven binding sites, and regulated genes.
- UniPROBE: The UniPROBE (Universal PBM Resource for Oligonucleotide Binding Evaluation) database hosts data generated by universal protein binding microarray (PBM) technology on the in vitro DNA binding specificities of proteins.
- DeepSEA: DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factor binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. @Ref
- PBM - Protein Binding Microarray: //TODO
- ChIP: Chromatin Immunoprecipitation (ChIP) is a type of immunoprecipitation experimental technique used to investigate the interaction between proteins and DNA in the cell. @Wiki
- ChIP-seq: ChIP-sequencing, also known as ChIP-seq, is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. @Wiki
- ChIP-chip: ChIP-chip (also known as ChIP-on-chip) is a technology that combines chromatin immunoprecipitation ('ChIP') with DNA microarray ("chip"). Like regular ChIP, ChIP-on-chip is used to investigate interactions between proteins and DNA in vivo. @Wiki
- PPI - Protein-Protein Interaction: Protein–protein interactions (PPIs) refer to lasting or ephemeral physical contacts of high specificity established between two or more protein molecules as a result of biochemical events steered by electrostatic forces including the hydrophobic effect. @Wiki
- 2P2Idb: 2P2IDB is a hand-curated database dedicated to the structure of protein-protein complexes with known small molecule inhibitors.
- 3D-Interologs: The 3D-Interologs is a cross-species interacting database inferring from three-dimensional (3D) protein structure complexes and a novel scoring function by using 3D-domain interologs.
- 3DID: The database of three-dimensional interacting domains (3did) is a collection of high-resolution three-dimensional structural templates for domain-domain interactions.
- ANAP: Arabidopsis Network Analysis Pipeline.
- AntiJen: AntiJen v2.0 is a database containing quantitative binding data for peptides binding to MHC Ligand, TCR-MHC Complexes, T Cell Epitope, TAP, B Cell Epitope molecules and immunological Protein-Protein interactions.
- APID: APID (Agile Protein Interactomes DataServer) provides a comprehensive collection of protein interactomes for more than 400 organisms based on the integration of known experimentally validated protein-protein physical interactions (PPIs).
- ASPD: ASPD (Artificial Selected Proteins/Peptides Database) is a curated database of proteins and peptides selected from randomized pools.
- ATDB: ATDB mainly focuses on constructing a global-scale animal toxin-channel interaction network based on literature and database annotations.
- AtPID: Arabidopsis thaliana Protein Interactome Database.
- Bacteriome.org: Bacterial Protein Interaction Database for Escherichia coli.
- BIANA: Biologic Interaction and Network Analysis.
- BID: Binding Interface Database.
- BioGRID: BioGRID is an online interaction repository with data compiled through comprehensive curation efforts.
- BISC: BISC (BInary SubComplex Database) is a new protein-protein interaction (PPI) database intending to bridge between the two communities most active in their characterisation: structural biology and functional genomics researchers.
- CCSB Interactome Database: Center for Cancer Systems Biology Interactome Database.
- ComSim: Database of protein structures in bound (Complex) and unbound (Single) states.
- CORUM: Comprehensive resource of mammalian protein complexes.
- CTDB: Calmodulin Target Database.
- CutDB: CutDB: Proteolytic Event Database.
- DeathDomain: A manually curated database of protein-protein interactions for Death Domain Superfamily.
- DIMA: DIMA is a Domain Interaction MAp and aims at becoming a comprehensive resource for functional and physical interactions among conserved protein-domains.
- DIP: The DIP database catalogs experimentally determined interactions between proteins.
- DOMINE: DOMINE is a database of known and predicted protein domain (domain-domain) interactions.
- DOMINO: DOMINO is an open-access database comprising more than 3900 annotated experiments describing interactions mediated by protein-interaction domains.
- DOMMINO: Database of MacroMolecular Interactions.
- DroID: DroID is a comprehensive gene and protein interactions (interactome) database designed specifically for the model organism Drosophila.
- DroPNet: Drosophila Protein Network.
- EciD: E. coli Interaction Database.
- FunCoup: FunCoup is a framework to infer genome-wide functional couplings in 11 model organisms.
- Gene3D: Gene3D takes CATH domain families (from PDB structures) and assigns them to the millions of protein sequences with no PDB structures (using Hidden Markov models generated from HMMER).
- gpDB: a database of GPCRs, G-proteins, Effectors and their interactions.
- GWIDD: Genome-WIde protein Docking Database.
- HCPIN: Human Cancer Protein Interaction Network.
- HCVpro: Hepatitis C Virus Protein Interaction Database.
- HINT: HINT (High-quality INTeractomes) is a database of high-quality protein-protein interactions in different organisms.
- HitPredict: HitPredict is a resource of experimentally determined protein-protein interactions with reliability scores.
- HIV-1 Human Interaction Database: The HIV-1, human interactions project collates published reports of two types of interactions - protein interactions, and human gene knock-downs that affect virus replication and infectivity (reported as 'replication interactions').
- HIVMID: HIV Molecular Immunology Database.
- HotRegion: A Database of Cooperative Hotspots.
- HP-DPI: Helicobacter pylori Database of Protein Interactomes.
- HPID: Human Protein Interaction Database.
- HPIDB: Host-Pathogen Interaction Database.
- HPRD: The Human Protein Reference Database represents a centralized platform to visually depict and integrate information pertaining to domain architecture, post-translational modifications, interaction networks and disease association for each protein in the human proteome.
- Human-gpDB: A database of human GPCRs, G-proteins, Effectors and their interactions.
- HumanPSD: Human Proteome Survey Database.
- HuPI: database of the Human Proteotheque Initiative.
- I2D: Interologous Interaction Database.
- IBIS: Inferred Biomolecular Interactions (protein-protein, protein-small molecule, protein nucleic acids and protein-ion interactions) Server.
- ICBS: A database of protein-protein interactions mediated by interchain β-sheet formation.
- IMEx: The International Molecular Exchange Consortium.
- iMOTdb: Interacting motifs in proteins database.
- InnateDB: A Knowledge Resource for Innate Immunity Interactions and Pathways.
- INstruct: a database of 3D protein interactome networks with structural resolution.
- IntAct: IntAct provides a freely available, open source database system and analysis tools for molecular interaction data.
- Interactome: Krogan Lab Interactome Database.
- InterDom: InterDom is a database of putative interacting protein domains derived from multiple sources, ranging from domain fusions (Rosetta Stone), protein interactions (DIP and BIND), protein complexes (PDB), to scientific literature (MEDLINE).
- InterEvol: InterEvol database is designed for the analysis of co-evolution events at the interface of known structures of hetero- and homo-oligomers.
- Interfaces: Dataset of protein-protein interfaces.
- Interolog: Interolog/Regulog Database.
- InteroPorc: InteroPorc is an automatic prediction tool to infer protein-protein interaction networks.
- iRefIndex: iRefIndex provides an index of protein interactions available in a number of primary interaction databases including BIND, BioGRID, CORUM, DIP, HPRD, InnateDB, IntAct, MatrixDB, MINT, MPact, MPIDB and MPPI.
- iRefWeb: Interaction Reference Index Web Interface.
- IRView: a database and viewer of interacting regions (IRs) in protein sequences.
- MatrixDB: MatrixDB stores experimental data established by full-length proteins, matricryptins, glycosaminoglycans, lipids and cations.
- MINT: MINT focuses on experimentally verified protein-protein interactions mined from the scientific literature by expert curators.
- MIPS-MPPI: MIPS Mammalian Protein-Protein Interaction Database.
- MPI-LIT: the microbial protein interaction database.
- MPID: Magnaporthe grisea Protein-protein Interaction Database.
- MPID-T: MHC-Peptide Interaction Database.
- MPIDB: Microbial Protein Interaction Database.
- NCG: NCG collects information on duplicability, orthology, evolutionary appearance and protein interactions network (PIN) properties of 736 cancer genes.
- NCPI: Neurospora Crassa Protein Interactome Database.
- Negatome: The Negatome is a collection of protein and domain pairs which are unlikely to be engaged in direct physical interactions.
- PRISM: Protein Interactions by Structural Matching.
- PCRPi-DB: PCRPi-DB is a database of computationally annotated hot spots in protein interfaces.
- PDZBase: PDZBase is a manually curated protein-protein interaction database developed specifically for interactions involving PDZ domains.
- PICCOLO: PICCOLO is a comprehensive database of structurally-characterized protein-protein interactions described at atomic level.
- PIPs: PIPs is a database of predicted human protein-protein interactions.
- PiSITE: PiSITE is a web-based database of protein interaction sites.
- PPIRA: Protein-Protein Interactions between Ralstonia solanacearum and Arabidopsis thaliana.
- PDBePISA: PDBePISA is an interactive tool for the exploration of macromolecular interfaces.
- PRIN: Predicted Rice Interactome Database.
- RKD: Rice Kinase Database.
- SCOPPI: Structural classification of protein-protein interfaces.
- SCOWLP: structural classification of protein binding regions for atomic comparative analysis of protein interactions.
- SNAPPI-DB: Structures, iNterfaces and Alignments for Protein-Protein Interactions.
- STRING: functional protein association networks.
- Struct2Net: Structure-based Computational Predictions of Protein-Protein Interactions.
- SYFPEITHI: Database of MHC Ligands and Peptide Motifs.
- TissueNet: The Database of Human Tissue Protein-Protein Interactions.
- TRIP: a manually curated database of protein-protein interactions for mammalian TRP channels.
- Wiki-Pi: Wiki-Pi: a wiki resource centred on human protein-protein interactions.
- XooNET: Integrated Protein-Protein Interaction database of Xanthomonas oryzae pathovar oryzae KACC1031.
- ChemProt: The ChemProt 3.0 server is a resource of annotated and predicted chemical-protein interactions.
- CoIP - Co-immunoprecipitation: Co-immunoprecipitation is considered to be the gold standard assay for protein–protein interactions, especially when it is performed with endogenous (not overexpressed and not tagged) proteins. The protein of interest is isolated with a specific antibody. Interaction partners which stick to this protein are subsequently identified by Western blotting. Interactions detected by this approach are considered to be real.
- Bimolecular fluorescence complementation: Bimolecular fluorescence complementation (BiFC) is a technique for observing the interactions of proteins. Combined with other new techniques, such as DERB, this method can be used to screen protein–protein interactions and their modulators.
- Affinity electrophoresis: Affinity electrophoresis is used for the estimation of binding constants, for instance in lectin affinity electrophoresis, or for the characterization of molecules with specific features such as glycan content or ligand binding.
- Pull-down assays: Pull-down assays are a common variation of immunoprecipitation and immunoelectrophoresis and are used identically, although this approach is more amenable to an initial screen for interacting proteins.
- Label transfer: Label transfer can be used for screening or confirmation of protein interactions and can provide information about the interface where the interaction takes place. Label transfer can also detect weak or transient interactions that are difficult to capture using other in vitro detection strategies. In a label transfer reaction, a known protein is tagged with a detectable label. The label is then passed to an interacting protein, which can then be identified by the presence of the label.
- Y2H - Yeast Two-Hybrid: Y2H screen investigates the interaction between artificial fusion proteins inside the nucleus of yeast. This approach can identify binding partners of a protein in an unbiased manner.
- Phage display: Phage display is used for the high-throughput screening of protein interactions.
- TAP - Tandem Affinity Purification: The TAP method allows high-throughput identification of protein interactions. In contrast to the yeast two-hybrid approach, the accuracy of the method is comparable to that of small-scale experiments, and the interactions are detected within the correct cellular environment, as with co-immunoprecipitation. However, the TAP tag method requires two successive steps of protein purification and consequently cannot readily detect transient protein–protein interactions.
- Cross-link/Chemical cross-linking: Chemical cross-linking is often used to "fix" protein interactions in place before trying to isolate/identify interacting proteins. Common crosslinkers for this application include the non-cleavable NHS-ester cross-linker, bissulfosuccinimidyl suberate (BS3); a cleavable version of BS3, dithiobis(sulfosuccinimidyl propionate) (DTSSP); and the imidoester cross-linker dimethyl dithiobispropionimidate (DTBP) that is popular for fixing interactions in ChIP assays.
- SPINE: The Strep-protein interaction experiment (SPINE) uses a combination of reversible crosslinking with formaldehyde and an incorporation of an affinity tag to detect interaction partners in vivo.
- Quantitative immunoprecipitation combined with knock-down: Quantitative immunoprecipitation combined with knock-down (QUICK) relies on co-immunoprecipitation, quantitative mass spectrometry (SILAC) and RNA interference (RNAi). This method detects interactions among endogenous non-tagged proteins. Thus, it has the same high confidence as co-immunoprecipitation. However, this method also depends on the availability of suitable antibodies.
- Proximity ligation assay: The in situ proximity ligation assay (PLA) is an immunohistochemical method utilizing so-called PLA probes for detection of proteins, protein interactions and modifications.
- CORNET: CORrelation NETworks.
- GeneMANIA: GeneMANIA finds other genes that are related to a set of input genes, using a very large set of functional association data.
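As a small illustration of the position weight matrix (PWM) entry listed earlier, the sketch below builds a log-odds PWM from a handful of aligned binding sites and scans a sequence for its best-scoring window. It is a minimal example under stated assumptions, not the method of any database above: the function names (`build_pwm`, `score`, `best_hit`), the pseudocount of 1, the uniform background of 0.25 per base, and the toy TATA-like sites are all illustrative choices.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length aligned DNA sites (A/C/G/T only)."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount  # column count plus pseudocounts
    pwm = []
    for pos in range(length):
        column = [s[pos].upper() for s in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def score(pwm, seq):
    """Sum the per-position log-odds of a sequence as long as the PWM."""
    return sum(col[base] for col, base in zip(pwm, seq.upper()))

def best_hit(pwm, seq):
    """Return (score, start) of the highest-scoring window in seq."""
    width = len(pwm)
    return max((score(pwm, seq[i:i + width]), i)
               for i in range(len(seq) - width + 1))

# Toy, made-up aligned sites (assumption for illustration only).
sites = ["TATAAA", "TATAAT", "TATATA", "TATAAA"]
pwm = build_pwm(sites)
hit_score, hit_start = best_hit(pwm, "GGGTATAAAGGG")
print(hit_start, round(hit_score, 2))
```

Real motif collections distribute matrices in their own formats and usually apply calibrated score thresholds, so this sketch only shows the underlying arithmetic.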
"The study of mitotically and/or meiotically heritable changes in gene function that cannot be explained by changes in DNA sequence." @Ref
- MeDIP/mDIP - Methylated DNA immunoprecipitation: Methylated DNA immunoprecipitation (MeDIP or mDIP) is a large-scale (chromosome- or genome-wide) purification technique in molecular biology that is used to enrich for methylated DNA sequences. @Wiki
- MeDIP-seq: The MeDIP-seq approach, i.e. the coupling of MeDIP with next-generation, short-read sequencing technologies such as 454 and Illumina (Solexa), was first described by Down et al. in 2008. The high-throughput sequencing of the methylated DNA fragments produces a large number of short reads (36-50 bp or 400 bp, depending on the technology).
- Pathway (Biological pathway): A biological pathway is a series of actions among molecules in a cell that leads to a certain product or a change in a cell. Such a pathway can trigger the assembly of new molecules, such as a fat or protein. Pathways can also turn genes on and off, or spur a cell to move. @Wiki, @NIH
- ConsensusPathDB: ConsensusPathDB is a database that integrates different types of functional interactions between physical entities in the cell like genes, RNA, proteins, protein complexes and metabolites in order to assemble a more complete and a less biased picture of cellular biology.
- KEGG PATHWAY: KEGG PATHWAY is a collection of manually drawn pathway maps representing our knowledge on the molecular interaction and reaction networks for different aspects.
- MetaCyc: MetaCyc is a curated database of experimentally elucidated metabolic pathways from all domains of life.
- MouseCyc: MouseCyc is a database of curated biochemical pathways data for the laboratory mouse that can be integrated with functional and phenotypic data from MGI.
- PANTHER Pathway: PANTHER Pathway consists of over 177, primarily signaling, pathways, each with subfamilies and protein sequences mapped to individual pathway components.
- Pathway Commons: Pathway Commons aims to store and disseminate knowledge about biological pathways. Information is sourced from public pathway databases and is readily searched, visualized, and downloaded.
- Reactome: Reactome is a free, open-source, curated and peer reviewed pathway database.
- PLANTCYC: PlantCyc is a metabolic pathway reference database containing more than 800 pathways and their catalytic enzymes and genes, as well as compounds from over 350 plant species. It includes: AraCyc (Arabidopsis thaliana col), BarleyCyc (Hordeum vulgare), BrachypodiumCyc (Brachypodium distachyon), CassavaCyc (Manihot esculenta esculenta), ChineseCabbageCyc (Brassica rapa ssp. pekinensis), ChlamyCyc (Chlamydomonas reinhardtii), CornCyc (Zea mays mays), GrapeCyc (Vitis vinifera), MossCyc (Physcomitrella patens), OryzaCyc (Oryza sativa japonica group), PapayaCyc (Carica papaya), PoplarCyc (Populus trichocarpa, other Populus species and hybrids), PotatoCyc (Solanum tuberosum), SelaginellaCyc (Selaginella moellendorffii), SetariaCyc (Setaria italica), SorghumBicolorCyc (Sorghum bicolor), SoyCyc (Glycine max), SpirodelaCyc (Spirodela polyrhiza), SwitchgrassCyc (Panicum virgatum), TomatoCyc (Solanum lycopersicum), WheatACyc (Triticum urartu), WheatDCyc (Aegilops tauschii).
- SignaLink: An integrated resource to analyze signaling pathway cross-talks, transcription factors, miRNAs and regulatory enzymes.
- SMPDB: SMPDB (The Small Molecule Pathway Database) is an interactive, visual database containing more than 618 small molecule pathways found in humans. More than 70% of these pathways (>433) are not found in any other pathway database.
- Yeast Pathways Database: The Yeast Pathways Database is a collection of manually curated metabolic pathways and enzymes of Saccharomyces cerevisiae.
- PIUMet: Inferring Disease-Modifying Pathways and Hidden Components via Integrative Analysis of Metabolite Features with Various Omic Data. @Ref
- HotNet2: HotNet2 is an algorithm for finding significantly altered subnetworks in a large gene interaction network. While originally developed for use with cancer mutation data, the current release also supports any application in which meaningful scores can be assigned to genes in the network. @Ref
- CellMaps: CellMaps is an open source HTML5 web-based application that allows researchers to easily model, visualize, integrate data and analyse biological networks inside a web browser.
- AHD2.0: The aim of the Arabidopsis hormone database is to provide a systematic and comprehensive view of morphological phenotypes regulated by plant hormones, as well as regulatory genes participating in numerous plant hormone responses.
- DrugBank: The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information.
- TTD - Therapeutic Target Database: A database to provide information about the known and explored therapeutic protein and nucleic acid targets, the targeted disease, pathway information and the corresponding drugs directed at each of these targets.
- STITCH: STITCH is a resource to explore known and predicted interactions of chemicals and proteins.
- GWAS: In genetics, a genome-wide association study (GWA study, or GWAS), also known as whole genome association study (WGA study, or WGAS), is an examination of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait. @Wiki
- Meta-analysis: Meta-analysis is routinely used for pooling the results from genome-wide association studies (GWAS). @Review
- TWAS: transcriptome-wide association study through expression imputation. @Ref
- RAREMETAL: RAREMETAL is a program that facilitates the meta-analysis of rare variants from genotype arrays or sequencing.
- PrediXcan: PrediXcan is a gene-based association test that prioritizes genes that are likely to be causal for the phenotype. @Ref
- MetaXcan: MetaXcan is an extension of the PrediXcan method that infers the results of PrediXcan using only summary statistics.
- PredictDB: This PredictDB Data Repository hosts genetic prediction models of transcriptome levels to be used with PrediXcan and MetaXcan.
- HPO - Human Phenotype Ontology: The Human Phenotype Ontology (HPO) aims to provide a standardized vocabulary of phenotypic abnormalities encountered in human disease.
- DO - Disease Ontology: The Disease Ontology semantically integrates disease and medical vocabularies through extensive cross mapping of DO terms to MeSH, ICD, NCI’s thesaurus, SNOMED and OMIM.
- MeSH - Medical Subject Headings: MeSH is the National Library of Medicine's controlled vocabulary thesaurus. It consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity.
- CNV - Copy-Number Variation: Copy-number variations (CNVs) are a form of structural variation that manifest as deletions or duplications in the genome. @Wiki
- SNP: A single nucleotide polymorphism, often abbreviated to SNP (pronounced snip; plural snips), is a variation in a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population (e.g. >1%). @Wiki, @NIH
- Genotype: The genotype is the part (DNA sequence) of the genetic makeup of a cell, and therefore of an organism or individual, which determines a specific characteristic (phenotype) of that cell/organism/individual. @Wiki
- PheWAS: Phenome-wide association studies (PheWAS) test many phenotypes for association with a single genetic variant (or other attribute). @Tool
- Haplotype: A haplotype (haploid genotype) is a group of alleles in an organism that are inherited together from a single parent. A haplogroup is a group of similar haplotypes that share a common ancestor carrying the same single-nucleotide polymorphism mutation. @Wiki, @Scitable
- IBD - Identity By Descent: A DNA segment is identical by state (IBS) in two or more individuals if they have identical nucleotide sequences in this segment. An IBS segment is identical by descent (IBD) in two or more individuals if they have inherited it from a common ancestor without recombination, that is, the segment has the same ancestral origin in these individuals. @Wiki
- Phasing/Haplotype Estimation: In genetics, haplotype estimation (also known as "phasing") refers to the process of statistical estimation of haplotypes from genotype data.
- ICD: The International Statistical Classification of Diseases and Related Health Problems, usually called by the short-form name International Classification of Diseases (ICD), is the international "standard diagnostic tool for epidemiology, health management and clinical purposes". @Wiki
- MAF - Minor Allele Frequency: MAF refers to the frequency at which the second most common allele occurs in a given population. SNPs with a minor allele frequency of 5% or greater were targeted by the HapMap project.
- LOH - Loss Of Heterozygosity: Loss of heterozygosity (LOH) is a gross chromosomal event that results in loss of one parental copy of a gene and the surrounding chromosomal region, leaving only one parental allele at that locus. @Wiki
- Missense Mutation: A missense mutation is a point mutation in which a single nucleotide change results in a codon that codes for a different amino acid. @Wiki
- Nonsense Mutation: A nonsense mutation is a point mutation in a sequence of DNA that results in a premature stop codon, or a nonsense codon in the transcribed mRNA, and in a truncated, incomplete, and usually nonfunctional protein product.
- Rare Variant: A rare functional variant is a genetic variant which alters gene function, and which occurs at low frequency in a population. @Wiki
- Allele Frequency: Allele frequency, or gene frequency, is the relative frequency of an allele (variant of a gene) at a particular locus in a population, expressed as a fraction or percentage. @Wiki
- SV - Structural Variation: Structural variation (also genomic structural variation) is the variation in structure of an organism's chromosome. It consists of many kinds of variation in the genome of one species, and usually includes microscopic and submicroscopic types, such as deletions, duplications, copy-number variants, insertions, inversions and translocations.
- Transition vs Transversion: DNA substitution mutations are of two types. Transitions are interchanges of two-ring purines (A <-> G) or of one-ring pyrimidines (C <-> T): they therefore involve bases of similar shape. Transversions are interchanges of purine for pyrimidine bases, which therefore involve exchange of one-ring and two-ring structures (A <-> C, A <-> T, G <-> C, G <-> T).
- [IUPAC codes](http://www.bioinformatics.org/sms/iupac.html): The International Union of Pure and Applied Chemistry (IUPAC) has defined a standard representation of DNA bases by single characters that specify either a single base (e.g. G for guanine, A for adenine) or a set of bases (e.g. R for either G or A). UCSC uses these single character codes to represent multiple observed alleles of single-base polymorphisms. @UCSC
- dbSNP: The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI).
- GWASdb: GWASdb is an online bioinformatics database that combines collections of genetic variants (GVs) from GWAS with their comprehensive functional annotations and disease classifications.
- GWAS Central: GWAS Central provides a centralized compilation of summary level findings from genetic association studies, both large and small.
- OMIM - Online Mendelian Inheritance in Man®: An Online Catalog of Human Genes and Genetic Disorders
- eMERGE: eMERGE is a national network that combines DNA biorepositories with electronic medical record (EMR) systems for large scale, high-throughput genetic research in support of implementing genomic medicine.
- International HapMap Project: The International HapMap Project was an organization that aimed to develop a haplotype map (HapMap) of the human genome, to describe the common patterns of human genetic variation. HapMap is used to find genetic variants affecting health, disease and responses to drugs and environmental factors.
- MutDB: A database for assessing the impact of genetic variants. @Ref
- The Genome of the Netherlands: A whole-genome sequencing project of 250 trios (father, mother and child) of Dutch descent.
- ClinVar: ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. ClinVar thus facilitates access to and communication about the relationships asserted between human variation and observed health status, and the history of that interpretation. @Ref
- DGAP: The Developmental Genome Anatomy Project (DGAP) aims to identify apparently balanced chromosomal rearrangements in patients with multiple congenital anomalies and then use these rearrangements to map and identify genes that are disrupted or dysregulated in critical stages of human development.
- DECIPHER: DECIPHER (DatabasE of genomiC varIation and Phenotype in Humans using Ensembl Resources) is an interactive web-based database which incorporates a suite of tools designed to aid the interpretation of genomic variants.
- ClinGen: ClinGen is a National Institutes of Health (NIH)-funded resource dedicated to building an authoritative central resource that defines the clinical relevance of genes and variants for use in precision medicine and research.
- SNPRelate: SNPRelate is a high-performance R package for multi-core symmetric multiprocessing computer architectures that accelerates two key computations in GWAS: principal component analysis (PCA) and relatedness analysis using identity-by-descent (IBD) measures.
- EIGENSOFT: The EIGENSOFT package combines functionality from the authors' population genetics methods @Ref, the EIGENSTRAT stratification correction method @Ref, and the FastPCA and PC-based selection statistic @Ref.
- PLATO: The PLatform for the Analysis, Translation, and Organization of large-scale data (PLATO) is a standalone program written in C++ that is designed to be a flexible and extensible analysis tool for a wide variety of genetic data.
- HotSpot3D: This 3D proximity tool can be used to identify the mutation hotspots in the linear 1D sequence and correlates these hotspots with known or potential interacting domains based on both known intermolecular interactions and calculated proximity for potential intramolecular interactions. @Ref
- MuPIT_Interactive: webserver for mapping variant positions to annotated, interactive 3D structures. @Ref
- Interactome3D: Interactome3D is a web service for the structural annotation of protein-protein interaction networks. @Ref
- CLUMP: CLUMP (CLustering by Mutation Position) is an unsupervised clustering of amino acid residue positions where variants occur, without any prior knowledge of their functional importance. @Ref
- Cancer Predisposition Genes: Genes in which germline mutations confer highly or moderately increased risks of cancer. @Nature
- cBioPortal: The cBioPortal for Cancer Genomics provides visualization, analysis and download of large-scale cancer genomics data sets. @Ref, @Ref
- TCGA: The Cancer Genome Atlas (TCGA) is a collaboration between the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI) that has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer.
- COSMIC: COSMIC is an online database of somatically acquired mutations found in human cancer.
- ProteinPaint: Explorer for genomic alteration in pediatric cancer. @Ref
- DNA-seq: DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. @Wiki
- RNA-seq: RNA-seq (RNA sequencing), also called whole transcriptome shotgun sequencing (WTSS), uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment in time. @Wiki
- CLIP-Seq: @Wiki
- FAIRE-seq: @Wiki
- DNase-Seq: @Wiki
- CAGE: @Wiki
- ChIA-PET: @Wiki
- 5C/Hi-C: @Wiki
- Promoter Capture Hi-C: Promoter capture Hi-C (PCHi-C) allows the genome-wide interrogation of physical interactions between distal DNA regulatory elements and gene promoters in multiple tissue contexts.
- 1000 Genomes Project: The 1000 Genomes Project ran between 2008 and 2015, creating the largest public catalogue of human variation and genotype data.
- Array Express: an NIH-funded database at the European Molecular Biology Laboratory - European Bioinformatics Institute (EMBL-EBI) that collects and disseminates microarray-based gene-expression data.
- DDBJ: DNA Data Bank of Japan (DDBJ) is a data bank organized by the National Institute of Genetics in Japan that collects sequence data.
- ENCODE: The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.
- GEO: GEO is a public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted.
- GermOnline: The GermOnline 4.0 gateway is a cross-species microarray expression database focusing on germline development, meiosis and gametogenesis as well as the mitotic cell cycle.
- Roadmap Epigenomics Project: The NIH Roadmap Epigenomics Mapping Consortium was launched with the goal of producing a public resource of human epigenomic data to catalyze basic biology and disease-oriented research.
- Expression Atlas: The Expression Atlas provides information on gene expression patterns under different biological conditions such as a gene knock out, a plant treated with a compound, or in a particular organism part or cell. @Ref
- ExAC: The Exome Aggregation Consortium (ExAC) is a coalition of investigators seeking to aggregate and harmonize exome sequencing data from a wide variety of large-scale sequencing projects, and to make summary data available for the wider scientific community.
- WTCCC: The Wellcome Trust Case Control Consortium (WTCCC) was established with an aim to harness the power of newly-available genotyping technologies to improve our understanding of the aetiological basis of several major causes of global disease.
- CommonMind Consortium: The CMC is generating data across multiple regions from >1000 postmortem brain samples from donors with Schizophrenia, Bipolar disease and individuals with no neuropsychiatric disorders - originating from tissue collections at four brain banks. Data consists of DNA and RNA sequencing, genotyping and epigenetics.
- TOPMed: The NHLBI Trans-Omics for Precision Medicine (TOPMed) program will support the Institute’s larger precision medicine activities by collecting and coupling whole-genome sequencing (WGS) and other -omics data (e.g., DNA methylation signature, RNA expression profiles, metabolite profiles) with molecular, behavioral, imaging, environmental, and clinical data from studies focused on heart, lung, blood and sleep (HLBS) disorders.
- ReadSim: ReadSim is a fast and simple reads simulator to target long reads such as PacBio or Nanopore.
- simNGS: simNGS is software for simulating observations from Illumina sequencing machines using the statistical models behind the AYB base-calling software.
- Trimmomatic: A flexible read trimming tool for Illumina NGS data.
- Sickle: A windowed adaptive trimming tool for FASTQ files using quality.
- famas: Yet another program for FastQ massaging with features: Quality- and length-based trimming, Random sampling, Splitting into multiple files, Order checking for paired-end files, Native gzip support.
- Picard MarkDuplicates: This tool locates and tags duplicate reads in a BAM or SAM file, where duplicate reads are defined as originating from a single fragment of DNA.
- sambamba-markdup: Find duplicate reads in BAM file.
- bwa: BWA is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome.
- bowtie2: Bowtie 2 is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences.
- ABRA: ABRA is a realigner for next generation sequencing data. It uses localized assembly and global realignment to align reads more accurately, thus improving downstream analysis (detection of indels and complex variants in particular). @Ref
- NextGenMap: NextGenMap (NGM) is a flexible and fast read mapping program that is more than twice as fast as BWA, while achieving a mapping sensitivity similar to Stampy or Bowtie2. @Ref
- ClinQC: ClinQC is an integrated and user-friendly pipeline for quality control, filtering and trimming of Sanger and NGS sequencing data for hundreds to thousands of samples/patients in a single run in clinical research.
- NGS QC Toolkit: A toolkit for the quality control (QC) of next generation sequencing (NGS) data.
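The "Transition vs Transversion" entry above can be illustrated with a short sketch. This is a minimal example, not taken from any tool in this list: the function names `classify_substitution` and `ts_tv_ratio` are hypothetical, and the input is assumed to be plain (ref, alt) pairs of single DNA bases rather than a parsed VCF.

```python
# Purines have two rings, pyrimidines have one.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base substitution."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases:
        raise ValueError("ref and alt must each be one of A, C, G, T")
    if ref == alt:
        raise ValueError("ref and alt are identical, so there is no substitution")
    # Both bases in the same chemical class -> transition, otherwise transversion.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(substitutions):
    """Compute the transition/transversion ratio over (ref, alt) pairs."""
    counts = {"transition": 0, "transversion": 0}
    for ref, alt in substitutions:
        counts[classify_substitution(ref, alt)] += 1
    if counts["transversion"] == 0:
        raise ValueError("no transversions observed, so the ratio is undefined")
    return counts["transition"] / counts["transversion"]


if __name__ == "__main__":
    calls = [("A", "G"), ("C", "T"), ("A", "T")]
    print(ts_tv_ratio(calls))
```

The Ts/Tv ratio is commonly used as a sanity check on SNP call sets; the expected value depends on the organism and the region sequenced, so none is hard-coded here.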
Peak calling is a computational method used to identify areas in a genome that have been enriched with aligned reads as a consequence of performing a ChIP-sequencing or MeDIP-seq experiment. These areas are those where a protein interacts with DNA.
Differential peak calling is about identifying significant differences in two ChIP-seq signals.
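As a rough illustration of the idea of "areas enriched with aligned reads", the toy sketch below reports contiguous runs of positions whose read depth meets a fixed threshold. It is an assumption-laden simplification: real peak callers such as those listed below use statistical models and control samples, and the function name `call_peaks` and its parameters are hypothetical.

```python
def call_peaks(coverage, threshold, min_len=1):
    """Return half-open (start, end) intervals where depth >= threshold.

    coverage  : list of per-base read depths along one chromosome
    threshold : minimum depth for a position to count as enriched
    min_len   : minimum run length (in bases) for a run to be reported
    """
    peaks = []
    start = None  # start index of the run currently being extended, if any
    for pos, depth in enumerate(coverage):
        if depth >= threshold:
            if start is None:
                start = pos
        elif start is not None:
            # The run ended at the previous position; keep it if long enough.
            if pos - start >= min_len:
                peaks.append((start, pos))
            start = None
    # Close a run that extends to the end of the coverage vector.
    if start is not None and len(coverage) - start >= min_len:
        peaks.append((start, len(coverage)))
    return peaks


if __name__ == "__main__":
    depth = [0, 1, 5, 6, 7, 1, 0, 8, 9]
    print(call_peaks(depth, threshold=5, min_len=2))
```

Intervals are half-open and zero-based, matching the BED coordinate convention.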
- MACS: Model-based analysis of ChIP-seq (MACS) is a computational algorithm that identifies genome-wide locations of transcription/chromatin factor binding or histone modification from ChIP-seq data.
- DBChIP: detects differentially bound sharp binding sites across multiple conditions, with or without matching control samples.
- MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets.
- THOR: Differential peak calling of ChIP-seq signals with replicates. @Ref
- ODIN: ODIN is an HMM-based approach to detect and analyse differential peaks in pairs of ChIP-seq data. ODIN performs genomic signal processing, peak calling and p-value calculation in an integrated framework. @Ref
- MMDiff: This package detects statistically significant difference between read enrichment profiles in different ChIP-Seq samples. @Ref
- RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. @Ref
- Limma: Linear Models for Microarray and RNA-Seq Data.
- edgeR: Empirical Analysis of Digital Gene Expression Data in R.
- DESeq: Differential gene expression analysis based on the negative binomial distribution.
- Cufflinks: Transcriptome assembly and differential expression analysis for RNA-Seq.
- MISO: MISO (Mixture-of-Isoforms) is a probabilistic framework that quantitates the expression level of alternatively spliced genes from RNA-Seq data, and identifies differentially regulated isoforms or exons across samples. @Ref
- CHiCAGO: CHiCAGO is a set of tools for calling significant interactions in Capture HiC data, such as Promoter Capture HiC. @Ref
- CENTIPEDE: CENTIPEDE applies a hierarchical Bayesian mixture model to infer regions of the genome that are bound by particular transcription factors. @Ref
- FaSD: a fast and accurate single-nucleotide polymorphism detection program that uses a binomial distribution-based algorithm and a mutation probability.
- SOAPsnp: SOAPsnp uses a method based on Bayes’ theorem (the reverse probability model) to call consensus genotype by carefully considering the data quality, alignment, and recurring experimental errors.
- SNVMix: SNVMix is designed to detect single nucleotide variants from next generation sequencing data.
- CNVnator: a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads.
- bcftools: utilities for variant calling and manipulating VCFs and BCFs.
- GATK: Genome Analysis Toolkit offers a wide variety of tools with a primary focus on variant discovery and genotyping.
- Bambino: a variant detector and alignment viewer for next-generation sequencing data in the SAM/BAM format.
- CONSERTING: integrating copy-number analysis with structural-variation detection. @Ref
- CREST: CREST (Clipping Reveals Structure) is a new algorithm for detecting genomic structural variations at base-pair resolution using next-generation sequencing data. @Ref
- Control-FREEC: Control-FREEC is a tool for detection of copy-number changes and allelic imbalances (including LOH) using deep-sequencing data.
- HMMcopy: Copy number prediction with correction for GC and mappability bias for HTS data
- SegSeq: an algorithm to identify chromosomal breakpoints using massively parallel sequence data.
- CNV-seq: a new method to detect copy number variation using high-throughput sequencing.
- BICseq2: BICseq2 is an algorithm developed for the normalization of high-throughput sequencing (HTS) data and detection of copy number variations (CNV) in the genome. BICseq2 can be used for detecting CNVs with or without a control genome.
- MuSE: a novel approach to mutation calling based on the F81 Markov substitution model for molecular evolution, which models the evolution of the reference allele to the allelic composition of the matched tumor and normal tissue at each genomic locus. @Ref
- VarScan: a platform-independent software tool developed at the Genome Institute at Washington University to detect variants in NGS data.
- Pindel: Pindel can detect breakpoints of large deletions, medium sized insertions, inversions, tandem duplications and other structural variants at single-base resolution from next-gen sequence data.
- COPS: A Sensitive and Accurate Tool for Detecting Somatic Copy Number Alterations Using Short-Read Sequence Data from Paired Samples. COPS is available at ftp://115.119.160.213 with username “cops” and password “cops”. @Ref
- multiSNV: multiSNV is a tool for calling somatic single-nucleotide variants (SNVs) using NGS data from a normal and multiple tumour samples of the same patient. @Ref
- SomaticSeq: SomaticSeq is a flexible post-somatic-mutation-calling workflow for improved accuracy. @Ref
- Cosmos: COSMOS can detect somatic structural variations from whole genome short-read sequences. @Ref
- Platypus: Platypus is a tool designed for efficient and accurate variant-detection in high-throughput sequencing data.
- SnpSift: SnpSift is a toolbox that allows you to filter and manipulate annotated files.
- Varapp: Varapp is an open-source web application to filter variants from large sets of exome data stored in a relational database.
- ANNOVAR: ANNOVAR is an efficient software tool that uses up-to-date information to functionally annotate genetic variants detected from diverse genomes (including human genome hg18, hg19, hg38, as well as mouse, worm, fly, yeast and many others).
- SnpEff: Genetic variant annotation and effect prediction toolbox. @Ref
- GEMINI: GEMINI (GEnome MINIng) is a flexible framework for exploring genetic variation in the context of the wealth of genome annotations available for the human genome. @Ref
- Variant Effect Predictor: The Ensembl Variant Effect Predictor (VEP) analyses user-supplied variants and predicts the functional consequences of known and unknown variants. @Ref
- VAT - Variant Annotation Tool: A computational framework to functionally annotate variants in personal genomes using a cloud-computing environment. @Ref
- SeattleSeq Variation Annotation: The SeattleSeq Annotation server provides annotation of SNVs (single-nucleotide variations) and small indels, both known and novel. @Ref
- Jannovar: A Java Library for Exome Annotation.
- Cellbase: CellBase is a scalable and high-performance NoSQL database that integrates relevant biological information from well-known data sources such as Ensembl, Uniprot, IntAct or ClinVar among others. All this data can be queried through a comprehensive RESTful web services API or using the command line interface. Also, a built-in variant annotator has been developed and can be used to annotate files containing variants in Variant Call Format (VCF). @Ref
- GenomeD3Plot: GenomeD3Plot (formerly Islandplot) is an SVG based genome viewer written in javascript using D3.
- Variant Tools: variant tools is a software tool for the manipulation, annotation, selection, simulation, and analysis of variants in the context of next-gen sequencing analysis. @Ref
- GWAVA: GWAVA is a tool which aims to predict the functional impact of non-coding genetic variants based on a wide range of annotations of non-coding elements (largely from ENCODE/GENCODE), along with genome-wide properties such as evolutionary conservation and GC-content.
- Mutsig: MutSig analyzes lists of mutations discovered in DNA sequencing, to identify genes that were mutated more often than expected by chance given background mutation processes. @Ref
- PHASE: A program for reconstructing haplotypes from population data. PHASE was limited by its speed and was not applicable to datasets from genome-wide association studies. @Ref
- fastPHASE: fastPHASE is software that implements methods for estimating missing genotypes and reconstructing haplotypes from unphased SNP genotype data of unrelated individuals. @Ref
- IMPUTE2: IMPUTE version 2 (also known as IMPUTE2) is a genotype imputation and haplotype phasing program based on ideas from Howie et al. 2009.
- UCSC Genome Browser: The UCSC Genome Browser is an on-line genome browser hosted by the University of California, Santa Cruz (UCSC).
- JBrowse: a JavaScript genome browser by the open-source Generic Model Organism Database project. @Ref
- Synthesis-View: A SNP visualization tool. @Ref
- IGV - Integrative Genomics Viewer: A high-performance visualization tool for interactive exploration of large, integrated genomic datasets. @Ref
- pileup.js: a Browser-based Genome Viewer.
- Biodalliance: Biodalliance is a fast, interactive, genome visualization tool that's easy to embed in web pages and applications.
- TADkit: TADkit is an HTML5 and JavaScript-based 3D genome browser. It makes use of D3.js for rendering the 1D and 2D tracks and WebGL via Three.js for rendering the 3D track.
- ngs.plot: ngs.plot is a program that allows you to easily visualize your next-generation sequencing (NGS) samples at functional genomic regions. @Ref
- CHiCP: a web-based tool for the integrative and interactive visualization of promoter capture Hi-C datasets. @Ref
- WashU Epigenome Browser: An Epigenome browser.
- nextflow: Nextflow is a fluent DSL modelled around the UNIX pipe concept, that simplifies writing parallel and scalable pipelines in a portable manner.
- RUbioSeq+: RUbioSeq+ is a stand-alone and multiplatform application for the integrated analysis of NGS data. More specifically, our software implements pipelines for the analysis of single nucleotide and copy-number variation, bisulfite-seq and ChIP-seq experiments using well-established tools to perform these common tasks.
- SpeedSeq: A flexible framework for rapid genome analysis and interpretation. @Ref
- HICUP: pipeline for mapping and processing Hi-C data. @Ref
- OpEx: Provides a fixed implementation of alignment, calling and annotation tools optimized for individual or multiple exome sequencing analysis in the research or clinical setting. @Ref
- BAM: BAM is the compressed binary version of the Sequence Alignment/Map (SAM) format, a compact and index-able representation of nucleotide sequence alignments.
- BED: The BED format consists of one line per feature, each containing 3-12 columns of data, plus optional track definition lines. @UCSC, @bedtools, @Ensembl
- BigBed: The bigBed format stores annotation items that can either be simple, or a linked collection of exons, much as BED files do. BigBed files are created initially from BED type files, using the program bedToBigBed. The resulting bigBed files are in an indexed binary format. @UCSC
- BigWig: The BigWig format is designed for dense, continuous data that is intended to be displayed as a graph. Files can be created from WIG or BedGraph files using the appropriate utility program. @UCSC
- GFF: The GFF (General Feature Format) format consists of one line per feature, each containing 9 columns of data, plus optional track definition lines. @UCSC, @Ensembl, GTF(GFFv2)@GMOD, @Wiki
- WIG: The WIG (wiggle) format is designed for display of dense continuous data such as probability scores. Wiggle data elements must be equally sized; if you need to display continuous data that is sparse or contains elements of varying size, use the BedGraph format instead. @UCSC, @Ensembl
- SAM: The SAM Format is a text format for storing sequence data in a series of tab delimited ASCII columns. @Wiki
- VCF: The Variant Call Format (VCF) specifies the format of a text file used in bioinformatics for storing gene sequence variations. v4.0@1000genomes, @Wiki
- MAF - Mutation Annotation Format: A Mutation Annotation Format (MAF) file (.maf) is a tab-delimited text file that lists mutations. Tutorial@Biostars
- pileup: Pileup format is a text-based format for summarizing the base calls of aligned reads to a reference sequence. @Wiki
- fasta: FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes. @Wiki
- BEDOPS: the fast, highly scalable and easily-parallelizable genome analysis toolkit.
- bedtools: Collectively, the bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks.
- bwtool: bwtool is a command-line utility for bigWig files.
- sambamba: samtools functions with multi-threading support.
- samtools: SAM Tools provide various utilities for manipulating alignments in the SAM/BAM format, including sorting, merging, indexing and generating alignments in a per-position format.
- VCFtools: A set of tools written in Perl and C++ for working with VCF files.
- Picard: Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
- SVTools: Tools for processing and analyzing structural variants.
- T Test: A t-test is any statistical hypothesis test in which the test statistic follows a Student's t-distribution under the null hypothesis. It can be used to determine if two sets of data are significantly different from each other. @Wiki
- Chi-Squared Test: A chi-squared test, also referred to as a χ² test, is any statistical hypothesis test wherein the sampling distribution of the test statistic is a chi-square distribution when the null hypothesis is true. @Wiki
- GSEA - Gene Set Enrichment Analysis: Gene Set Enrichment Analysis (GSEA) is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).
- fgsea: An R-package for fast preranked gene set enrichment analysis (GSEA).
- SVM - Support Vector Machine: support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. @Wiki
- Biclustering: Biclustering, block clustering, co-clustering, or two-mode clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. @Wiki
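For the T Test entry above, the sketch below computes the test statistic for a two-sample comparison using only the Python standard library. It assumes Welch's unequal-variance variant, which is one of several t-tests, and the function name `welch_t` is hypothetical. Turning the statistic into a p-value requires the Student's t-distribution CDF, which the standard library does not provide, so that step is left out.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's two-sample t-test.

    t  : difference of sample means divided by its estimated standard error
    df : Welch-Satterthwaite approximation of the degrees of freedom
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # Squared standard error contributed by each sample (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    if se2_a + se2_b == 0:
        raise ValueError("both samples have zero variance, so t is undefined")
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


if __name__ == "__main__":
    t, df = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
    print(t, df)
```

The pair `(t, df)` can then be looked up in a t-table or passed to a statistics library to obtain a p-value.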