Lab 5: Computational Motif Discovery

Objective

This lab is to guide you through the implemention of Naive consensus algorithm and Gibbs sampling algorithm for computational motif detection.

Introduction

Transcription factors (TFs) regulate the gene expression by activating or inhibiting the transcription machinery. Understanding the mechanisms underlying the gene expression is a major challenge in molecular biology. Therefore, identifying regulatory elements, especially the TFBSs becomes an essential task in transcripteomics research.

In essence, motif discovery is a pattern discovery problem. This problem can be simplified as follows:

Given a set of DNA sequences, find an unknown pattern that occurs frequently.

A DNA motif is defined as a nucleic acid sequence pattern that has some biological significance. Normally, the pattern is fairly short (5 to 20 bp long) and is known to recur in different genes or several times within a gene.

Algorithms to find regulatory elements can be categorized into two groups:

(1) methods based on word counting and
(2) methods based on probabilistic sequence models.

The word counting methods analyze the frequency of oligonucleotides in the upstream regions of the genes of interest and use intelligent strategies to speed up counting and to detect significantly over-represented motifs. These methods then compile a common motif by grouping similar words. Word counting methods lead to a global solution as compared to the probabilistic methods and are appropriate for short motifs and are therefore useful for motif finding in eukaryotic genome where motifs are generally shorter than prokaryotes. These algorithms are a good choice for finding totally constrained motifs, i.e., all instances are identical. However, for typical motifs that often have several weakly conserved positions, word-based methods can be problematic and results often has to be post-processed. In addition, word-based methods also suffer from too many spurious discoveries.

In probabilistic approach, the model parameters are often estimated using Maximum-Likelihood principle or Bayesian inference. Many of the algorithms developed from the probabilistic approach are designed to find longer or more general motifs than are required for transcription factor binding sites. Therefore, they are more suitable for prokaryotic motif discovery. Unfortunately, these algorithms do not guarantee to finding global optima, since some local search methods (e.g. Gibbs sampling, Expectation-Maximization, or greedy algorithms) are employed, which may converge to local optima.

Position Weighted Matrix

For the $n$ $k$-mers, we can convert it into a PWM matrix: $$ \mathbf{W} = \begin{bmatrix} w_{11} & \cdots & w_{1j} & \cdots & w_{1k}\\ \vdots & \ddots & \vdots & \ddots & \vdots\\ w_{i1} & \cdots & w_{ij} & \cdots & w_{ik}\\ \vdots & \ddots & \vdots & \ddots & \vdots\\ w_{m1} & \cdots & w_{mj} & \cdots & w_{mk} \end{bmatrix} $$ where $m$ is the size of possible alphabets, while $w_{ij}$ is the probability of the alphabet $i$ at position $j$. Hence, the information content of the PWM is: $$ I(\mathbf{W}) = -\sum_{j=1}^k \sum_{i=1}^m w_{ij} \log_2 \frac{w_{ij}}{q_i} $$ where $q_i$ is the background probability of alphabet $i$.

Algorithm 1: Naive consensus

This algorithm finds a position weight matrix (PWM) motif in a set of input DNA sequences.

Input:
- a set of sequences $S={s_1, s_2, \dots, s_n}$ stored in a FASTA file;
- the desired motif length, $k$
- number of motifs to report, $T$
Output:
- a set of motifs $M={m_1, m_2, \dots, m_T}$, where $m_i$ is $n$ $k$-mers, each is the subsequence of $s$.
Procedure:
- For each pair of $k$-mers $x_1$ and $x_2$ from $s_1$ and $s_2$, compute the information content of the corresponding PWM. Store the $T$ highest PWMs.
- For $i=3,\dots,n$:
  - : For every $k$-mer $x_i$ (substring of $s_i$), add $x_i$ to each PWMs to obtain a new motif, and compute the information content of the corresponding PWM. Store the T highest scoring PWMs.
- Output the T highest scoring PWMs as the final motif candidates.

Algorithm 2: Gibbs sampling

Procedure:

Set $(x_1, \dots, x_n)$ as the random start positions for the motif in the input sequences. That is, $x_i \in [1, |s_i|-k+1]$
Repeat until $(x_1, \dots, x_n)$ not change
- For $i=1,\dots,n$
  - build the profile $\mathbf{Q}$ with $k$-mers starting at $(x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$
  - update $x_i$ according to the profile probability distribution of $\mathbf{Q}$ in $s_i$
Output the start sites $(x_1, \dots, x_n)$

$\S$Exercises

Implement the above algorithms in C or Python.
Run your programs on the following toy data, to identify the possible motifs in the set of sequences using the parameters of $k=9$ and the background base distribution $q_A=q_T=0.35, q_C=q_G=0.15$:

(1) course/bi462/lab5/bicoid.fa
(2) course/bi462/lab5/hunchback.fa
(3) course/bi462/lab5/kruppel.fa

Use the weblogo to obtain the graphical representation of the above motifs.
What is the difference between the above two algorithms, based on your results?
Are there any drawbacks for the above approaches? If yes, can you figure out how to improve the algorithm?
Web-based MEME search of the motifs on the above datasets

Navigate to MEME-Suite and select MEME from the Motif Discovery section on the left
Upload your masked sequence file
Set the number of motifs to be found to 5
Run the search
View the HTML output

Extra exercise

Download MEME-Suite
Unpack and Intall the suite, figure out how to use the tool for motif discovery.
Read the accompanied reference article and also the source code, then comment on how expectation-maximization (EM) algorithm is used to find the motifs in a set of sequences.

On the way to the garden of bioinformatics.

A bioinformatics wiki for the course BI462.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly