add initial BLAST introduction page
widdowquinn committed Feb 27, 2024
1 parent e6d4408 commit d628e84
Showing 16 changed files with 262 additions and 0 deletions.
1 change: 1 addition & 0 deletions _quarto.yml
@@ -25,6 +25,7 @@ book:
chapters:
- index.qmd
- intro.qmd
- blast.qmd
- summary.qmd
- glossary.qmd
- references.qmd
Binary file added assets/images/blast-specialised.png
Binary file added assets/images/fig-ncbi-organism.png
Binary file added assets/images/ncbi-blast-button.png
Binary file added assets/images/ncbi-blast-landing.png
Binary file added assets/images/ncbi-blast-progress.png
Binary file added assets/images/ncbi-blast-results.png
Binary file added assets/images/ncbi-blastn-query.png
Binary file added assets/images/ncbi-blastn.png
Binary file added assets/images/ncbi-homo-sapiens.png
Binary file added assets/images/ncbi-specialised-db.png
Binary file added assets/papers/1-s2.0-S0022283605803602-main.pdf
Binary file not shown.
Binary file added assets/papers/1471-2105-10-421.pdf
Binary file not shown.
Binary file added assets/papers/25-17-3389.pdf
Binary file not shown.
121 changes: 121 additions & 0 deletions blast.qmd
@@ -0,0 +1,121 @@
# BLAST

> The `BLAST` (Basic Local Alignment Search Tool) software suite provides a set of tools for comparing a biological [query sequence](glossary.qmd#query-sequence) to the sequences in a database, and returning all those sequences from the database that resemble the query sequence above a defined threshold similarity level.

::: { .callout-note }
The original `BLAST` paper is one of the most highly cited publications of all time, and has [over 110,000 citations in the literature](https://scholar.google.co.uk/scholar?cites=13256359684734893683&as_sdt=2005&sciodt=0,5&hl=en).
:::

::: { .callout-tip collapse="true"}
## A brief history of the `BLAST` software suite

- 1990: `BLAST` is first described (@Altschul1990-ar).
- 1997: A refined version of `BLAST` is published, introducing a new way of managing gapped alignments and `PSI-BLAST` - a method of compiling a _profile_ of similar sequences to make searching more sensitive (@Altschul1997-vt).
- 2000: The `MegaBLAST` algorithm for fast alignment-based searching of large nucleotide sequences is proposed, and incorporated into the `BLAST` suite (@Zhang2000-wn).
- 2009: A completely rewritten version of the software suite, `BLAST+`, is released. The rewrite improved performance and changed many features of the algorithms used for searching and building databases (@Camacho2009-za).

The latest updates to the `BLAST` software are described on the [BLAST news page](https://blast.ncbi.nlm.nih.gov/doc/blast-news/2023-BLAST-News.html).
:::

::: { .callout-important }
**In the exercises that follow, you will carry out BLAST searches using the provided sequences, and use the results to answer the formative assessment on MyPlace.**
:::

## Performing a `BLAST` search

This section describes a general `BLAST` query using the [NCBI BLAST server](https://blast.ncbi.nlm.nih.gov/Blast.cgi). It is intended as a reference guide for you to return to as you get used to querying `BLAST` through this interface.

::: { .callout-caution }
Many other databases provide their own implementations of `BLAST` or other sequence search methods. These may present a different interface and choice of options, or even use a totally different search method. For example, the `RCSB-PDB` protein structure database offers a [sequence search page](https://www.rcsb.org/search/advanced/sequence) which uses the `MMseqs2` search algorithm (@Steinegger2017-sy).
:::

### Navigate to the NCBI `BLAST` webserver

Open a web browser and navigate to the [NCBI BLAST webserver](https://blast.ncbi.nlm.nih.gov/Blast.cgi). You should see a landing page that resembles @fig-ncbi-blast-landing.

![The landing page of the NCBI `BLAST` webserver.](assets/images/ncbi-blast-landing.png){#fig-ncbi-blast-landing}

::: { .callout-tip collapse="true" }
## You can use an NCBI account to save searches

Note that there is a `Log in` button at the top right of the landing page. If you log in with an NCBI/NIH account, NCBI keeps a record of your searches so that you can retrieve them later.

- [Sign up](https://account.ncbi.nlm.nih.gov/signup/) for a free NCBI account
:::

### Select the `BLAST` tool you want to use

The `BLAST` suite provides search tools for finding matches to a query in a database. The query can be either a nucleotide or a protein sequence, and the database being searched can contain either protein sequences or nucleotide sequences. `BLAST` provides four different programs to carry out these combinations of search.

| Query type | Nucleotide database | Protein database |
|---|:-:|:-:|
| nucleotide | `blastn` | `blastx` |
| protein | `tblastn` | `blastp` |

: The four main `BLAST` programs, and the combination of query/database sequence type they are used for {#tbl-blast-programs .striped .hover}
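
The same choice can be written as a small lookup, which may help when deciding which program fits a given search. This is only an illustrative sketch: the function name `choose_blast_program` and the type labels `"nucleotide"` and `"protein"` are assumptions made for this example and are not part of `BLAST` itself.

```python
# Lookup keyed on (query type, database type), mirroring the table above.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
    ("protein", "protein"): "blastp",
}


def choose_blast_program(query_type: str, database_type: str) -> str:
    """Return the BLAST program for a query/database type combination."""
    key = (query_type.lower(), database_type.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError(f"Unsupported combination: {key}")
    return BLAST_PROGRAMS[key]


# A nucleotide query against a nucleotide database uses blastn.
print(choose_blast_program("nucleotide", "nucleotide"))
```

For the exercise in this section, the query is DNA and the database holds nucleotide sequences, so the lookup gives `blastn`.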

::: { .callout-note collapse="true" }
## Specialised `BLAST` tools

The NCBI `BLAST` webserver provides specialised search options with specific combinations of parameters and databases pre-selected to support particular kinds of search (@fig-blast-specialised).

![Specialised `BLAST` search options are available at the NCBI `BLAST` webserver](assets/images/blast-specialised.png){#fig-blast-specialised}

:::

Select `Nucleotide BLAST` from the NCBI landing page, to get to the `blastn` search page (@fig-ncbi-blastn).

![The NCBI `blastn` webservice search page](assets/images/ncbi-blastn.png){#fig-ncbi-blastn}

### Enter the query sequence

Copy the DNA sequence below, and paste it into the box marked **Enter accession number(s), gi(s), or FASTA sequence(s)** at the NCBI search page (@fig-ncbi-blastn-query).

```text
ATGCGTCGAGGGCGTCTGCTGGAGATCGCCCTGGGATTTACCGTGCT
TTTAGCGTCCTACACGAGCCATGGGGCGGACGCCAATTTGGAGGC
TGGGAACGTGAAGGAAACCAGAGCCAGTCGGGCC
```

![The NCBI `blastn` search page with a query sequence pasted into the query sequence field.](assets/images/ncbi-blastn-query.png){#fig-ncbi-blastn-query}
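
The query box also accepts FASTA-formatted input. As a rough sketch, the snippet below checks that the example sequence contains only DNA letters and wraps it as a FASTA record before pasting. The helper name `to_fasta`, the header `example_query`, and the 60-character line width are assumptions chosen for illustration.

```python
import textwrap

# The example query from this section, joined into one string.
QUERY = (
    "ATGCGTCGAGGGCGTCTGCTGGAGATCGCCCTGGGATTTACCGTGCT"
    "TTTAGCGTCCTACACGAGCCATGGGGCGGACGCCAATTTGGAGGC"
    "TGGGAACGTGAAGGAAACCAGAGCCAGTCGGGCC"
)


def to_fasta(sequence: str, header: str, width: int = 60) -> str:
    """Check a DNA sequence uses only A, C, G, T or N, then wrap it as FASTA."""
    # Remove any whitespace or line breaks picked up when copying.
    cleaned = "".join(sequence.split()).upper()
    unexpected = set(cleaned) - set("ACGTN")
    if unexpected:
        raise ValueError(f"Unexpected characters: {sorted(unexpected)}")
    # textwrap.wrap splits the unbroken sequence into fixed-width lines.
    return f">{header}\n" + "\n".join(textwrap.wrap(cleaned, width))


print(to_fasta(QUERY, "example_query"))
```

A check like this can catch stray characters introduced by copy and paste before the search is submitted.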

### Set appropriate parameter choices

::: { .callout-caution }
If you make no more changes to the parameter settings for your search, the default options will be used. Your query will be made against the `nr/nt` complete nucleotide collection, a very large database. Due to the size of the database, the search may take a relatively long time.
:::

::: { .callout-tip collapse="true"}
## Improving your searches by changing parameters

You can make your `BLAST` searches quicker, and more relevant to your biological question, if you can use information about your sequence and the type of organism you want to search.

NCBI `BLAST` offers a number of smaller specialised databases with particular sequence types (e.g. RNA databases, sequences of protein structures, etc.) (@fig-ncbi-specialised-db).

![A list of specialised sequence databases offered by the NCBI `BLAST` webserver](assets/images/ncbi-specialised-db.png){#fig-ncbi-specialised-db}

You can also narrow down the search by specifying an organism, or other taxonomic rank, using the `Organism` field (@fig-ncbi-organism).

![Taxonomic options offered by the NCBI `BLAST` search organism field, for "Pseudomonas"](assets/images/fig-ncbi-organism.png){#fig-ncbi-organism}
:::

Restrict the sequences being searched by typing "Homo sapiens" in the `Organism` field and selecting the appropriate option from the drop-down list (@fig-ncbi-homo-sapiens).

![Taxonomic options offered by the NCBI `BLAST` search organism field, for "Homo sapiens"](assets/images/ncbi-homo-sapiens.png){#fig-ncbi-homo-sapiens}
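
For readers curious how the same restriction might look outside the web form, the sketch below encodes a `blastn` submission limited to one organism. It assumes the parameter names `CMD`, `PROGRAM`, `DATABASE`, `QUERY` and `ENTREZ_QUERY` from NCBI's BLAST URL API, and the database name `nt`; check these against the current NCBI documentation before relying on them. The function name `build_submission` is hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# The NCBI BLAST server address used throughout this section.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_submission(query: str, organism: str) -> str:
    """Encode a blastn submission restricted to one organism (nothing is sent)."""
    params = {
        "CMD": "Put",  # assumed: submit a new search
        "PROGRAM": "blastn",
        "DATABASE": "nt",  # assumed database name
        "QUERY": query,
        # Assumed: an Entrez-style filter, playing the role of the Organism field.
        "ENTREZ_QUERY": f"{organism}[Organism]",
    }
    return urlencode(params)


body = build_submission("ATGCGTCGAGGGCGTCTGCT", "Homo sapiens")
print(BLAST_URL)
print(body)
```

The web interface remains the intended route for the exercises. The sketch only makes explicit that the `Organism` field acts as a filter on the database being searched.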

### Run the `BLAST` search

Click on the `BLAST` button (@fig-ncbi-blast-button).

![The NCBI `BLAST` webserver `BLAST` button. Click this to start the search.](assets/images/ncbi-blast-button.png){#fig-ncbi-blast-button}

### Wait for the search to complete

While the search runs, you will see a holding page that updates you with progress (@fig-ncbi-blast-progress).

![An NCBI `BLAST` webserver progress page.](assets/images/ncbi-blast-progress.png){#fig-ncbi-blast-progress}
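
The holding page is in effect polling the server until the search finishes. The sketch below shows the idea of reading a status value from a response. The `SAMPLE_RESPONSE` text is an illustrative assumption about what a status response might contain, not captured server output, and the names `parse_status` and `is_finished` are hypothetical.

```python
import re

# Illustrative response text only; a real response contains more than this.
SAMPLE_RESPONSE = """
<!--
QBlastInfoBegin
    Status=WAITING
QBlastInfoEnd
-->
"""


def parse_status(response_text: str) -> str:
    """Extract the status word from a response, or 'UNKNOWN' if none is found."""
    match = re.search(r"Status=(\w+)", response_text)
    return match.group(1) if match else "UNKNOWN"


def is_finished(status: str) -> bool:
    """Treat any status other than WAITING as the end of polling."""
    return status != "WAITING"


status = parse_status(SAMPLE_RESPONSE)
print(status, is_finished(status))
```

In practice a script would wait between checks rather than polling continuously, in the same way the web page refreshes at intervals.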

When the search is complete, you will see the `blastn` results page (@fig-ncbi-blast-results).

![An NCBI `BLAST` results page, for a `blastn` query.](assets/images/ncbi-blast-results.png){#fig-ncbi-blast-results}
140 changes: 140 additions & 0 deletions references.bib
@@ -211,3 +211,143 @@ @uvic.cat
year = 2013,
language = "en"
}

@ARTICLE{Altschul1990-ar,
title = "Basic local alignment search tool",
author = "Altschul, S F and Gish, W and Miller, W and Myers, E W and
Lipman, D J",
abstract = "A new approach to rapid sequence comparison, basic local
alignment search tool (BLAST), directly approximates alignments
that optimize a measure of local similarity, the maximal segment
pair (MSP) score. Recent mathematical results on the stochastic
properties of MSP scores allow an analysis of the performance of
this method as well as the statistical significance of
alignments it generates. The basic algorithm is simple and
robust; it can be implemented in a number of ways and applied in
a variety of contexts including straightforward DNA and protein
sequence database searches, motif searches, gene identification
searches, and in the analysis of multiple regions of similarity
in long DNA sequences. In addition to its flexibility and
tractability to mathematical analysis, BLAST is an order of
magnitude faster than existing sequence comparison tools of
comparable sensitivity.",
journal = "J. Mol. Biol.",
publisher = "Elsevier BV",
volume = 215,
number = 3,
pages = "403--410",
month = oct,
year = 1990,
language = "en"
}

@ARTICLE{Altschul1997-vt,
title = "Gapped {BLAST} and {PSI-BLAST}: a new generation of protein
database search programs",
author = "Altschul, S F and Madden, T L and Sch{\"a}ffer, A A and Zhang, J
and Zhang, Z and Miller, W and Lipman, D J",
abstract = "The BLAST programs are widely used tools for searching protein
and DNA databases for sequence similarities. For protein
comparisons, a variety of definitional, algorithmic and
statistical refinements described here permits the execution
time of the BLAST programs to be decreased substantially while
enhancing their sensitivity to weak similarities. A new
criterion for triggering the extension of word hits, combined
with a new heuristic for generating gapped alignments, yields a
gapped BLAST program that runs at approximately three times the
speed of the original. In addition, a method is introduced for
automatically combining statistically significant alignments
produced by BLAST into a position-specific score matrix, and
searching the database using this matrix. The resulting
Position-Specific Iterated BLAST (PSI-BLAST) program runs at
approximately the same speed per iteration as gapped BLAST, but
in many cases is much more sensitive to weak but biologically
relevant sequence similarities. PSI-BLAST is used to uncover
several new and interesting members of the BRCT superfamily.",
journal = "Nucleic Acids Res.",
publisher = "Oxford University Press (OUP)",
volume = 25,
number = 17,
pages = "3389--3402",
month = sep,
year = 1997,
language = "en"
}

@ARTICLE{Zhang2000-wn,
title = "A greedy algorithm for aligning {DNA} sequences",
author = "Zhang, Z and Schwartz, S and Wagner, L and Miller, W",
abstract = "For aligning DNA sequences that differ only by sequencing
errors, or by equivalent errors from other sources, a greedy
algorithm can be much faster than traditional dynamic
programming approaches and yet produce an alignment that is
guaranteed to be theoretically optimal. We introduce a new
greedy alignment algorithm with particularly good performance
and show that it computes the same alignment as does a certain
dynamic programming algorithm, while executing over 10 times
faster on appropriate data. An implementation of this algorithm
is currently used in a program that assembles the UniGene
database at the National Center for Biotechnology Information.",
journal = "J. Comput. Biol.",
publisher = "Mary Ann Liebert Inc",
volume = 7,
number = "1-2",
pages = "203--214",
month = feb,
year = 2000,
language = "en"
}

@ARTICLE{Camacho2009-za,
title = "{BLAST+}: architecture and applications",
author = "Camacho, Christiam and Coulouris, George and Avagyan, Vahram and
Ma, Ning and Papadopoulos, Jason and Bealer, Kevin and Madden,
Thomas L",
abstract = "BACKGROUND: Sequence similarity searching is a very important
bioinformatics task. While Basic Local Alignment Search Tool
(BLAST) outperforms exact methods through its use of heuristics,
the speed of the current BLAST software is suboptimal for very
long queries or database sequences. There are also some
shortcomings in the user-interface of the current command-line
applications. RESULTS: We describe features and improvements of
rewritten BLAST software and introduce new command-line
applications. Long query sequences are broken into chunks for
processing, in some cases leading to dramatically shorter run
times. For long database sequences, it is possible to retrieve
only the relevant parts of the sequence, reducing CPU time and
memory usage for searches of short queries against databases of
contigs or chromosomes. The program can now retrieve masking
information for database sequences from the BLAST databases. A
new modular software library can now access subject sequence
data from arbitrary data sources. We introduce several new
features, including strategy files that allow a user to save and
reuse their favorite set of options. The strategy files can be
uploaded to and downloaded from the NCBI BLAST web site.
CONCLUSION: The new BLAST command-line applications, compared to
the current BLAST tools, demonstrate substantial speed
improvements for long queries as well as chromosome length
database sequences. We have also improved the user interface of
the command-line applications.",
journal = "BMC Bioinformatics",
publisher = "Springer Science and Business Media LLC",
volume = 10,
number = 1,
pages = "421",
month = dec,
year = 2009,
language = "en"
}

@ARTICLE{Steinegger2017-sy,
title = "{MMseqs2} enables sensitive protein sequence searching for the
analysis of massive data sets",
author = "Steinegger, Martin and S{\"o}ding, Johannes",
journal = "Nat. Biotechnol.",
publisher = "Springer Science and Business Media LLC",
volume = 35,
number = 11,
pages = "1026--1028",
month = nov,
year = 2017,
language = "en"
}
