intro.qmd

---
title: "Introduction"
date: "August 21, 2023"
date-modified: "`r Sys.Date()`"
format: 
  html:
    page-layout: full
toc: true
toc-location: right
toc-depth: 2
number-sections: true
number-depth: 2
link-external-icon: true
link-external-newwindow: true 
bibliography: references.bib
editor: 
  markdown: 
    wrap: 80
---

```{r echo=FALSE, output=FALSE}

library(webexercises)
```

# Goals {data-link="Preface"}

[Structural
Bioinformatics](https://en.wikipedia.org/wiki/Structural_bioinformatics "Wikipedia")
(SB) is a broad discipline that encompasses data resources, algorithms, and
tools for investigating, analyzing, predicting, and interpreting
biomacromolecular structures [@paiva2022]. In this course, we will specifically
focus on protein structural bioinformatics, including visualization and analysis
of the structure of biomacromolecules as well as the prediction of protein
structures and complexes. The premise of SB is that high-resolution structural
information about biological systems enables precise reasoning about their
functions and the effects of modifications and perturbations.

The goals of SB require at least four different research lines (see Chapter 1 in
@structur):

1.  *Visualization*: Dealing with one or many complex structures and integrating
    various sources of information such as sequences, structural data,
    electrostatic fields, locations of functional sites, and areas of
    variability.

2.  *Classification*: Grouping similar structures hierarchically to identify
    common origins and diversification paths. Similar to other fields of biology
    classification is tedious but required to understand the structural space.

3.  *Prediction* of structures remains an area of keen interest and a field of
    research itself. As we will see below, the number of different sequences is
    much higher than the availability of structures, which make prediction an
    essential and useful tool.

4.  *Simulation.* Experimentally obtained structures are primarily static
    structural models (see warning below). However, the properties of these
    molecules are often the results of their dynamic motions. The definition of
    energy functions that govern the folding of proteins and their subsequent
    stable dynamics can be analyzed by molecular dynamics simulations, although
    computation capacities may be limiting to reach a biologically relevant
    timescales.

Powered by vast amounts of data and significant technical advances, the field
has undergone a substantial transformation over the past twenty years The
enhancement of experimental capabilities to analyze the structure of proteins
and other biological molecules and structures (see @callaway2020) and the
advancement of Artificial Intelligence (AI)-assisted structure prediction have
substantially increased the ability of life-science researchers to address
various questions concerning protein diversity, evolution, and function. This
transformation has boosted in the last 5 years, and its implications for
biology, biotechnology, and biomedicine remain largely unpredictable.

# Before going forward: Protein Structure 101 {#sec-str}

While it is possible to perform some protein modeling without expertise in
structural biology, having a basic understanding of protein structure is highly
recommended. In this course, there are often students who do not have a strong
background in biology. Additionally, it has been observed that graduate students
in biology, biomedicine, and related fields possess varying levels of knowledge
about protein structure. For those looking to review or update their
understanding of protein structure, it is suggested to read Chapter 2 of
@structur, the great review by @stollar2020 and the
[Wikipedia](https://en.wikipedia.org/wiki/Protein_structure) and
[Proteopedia](https://proteopedia.org/wiki/index.php/Introduction_to_protein_structure)
articles on protein structures, which were the primary sources for this brief
section (follow picture links).

[![Protein structure levels, using human PCNA (PDB 1AXC) as an
example.](pics/Protein_structure.png "Protein Structure"){#fig-str
.figure}](https://en.wikipedia.org/wiki/Protein_structure)

Proteins are essential components of life, involved in various vital functions
as structural elements, scaffolding elements, or active enzymes that catalyze
metabolic reactions. Proteins are composed of polymers of amino acids, and the
sequence of amino acids of a particular protein is referred to as the **Primary
structure** of the protein. Amino acid chains can spontaneously fold into
three-dimensional structures, mainly stabilized by hydrogen bonds between amino
acids. The amino acid sequence determines different layers of 3D structure. Each
of the 20 natural amino acids has specific physicochemical properties that
influence its preferred conformation. Therefore, the initial level of folding is
known as the **Secondary structure**, which forms common patterns as will be
discussed further. These segments of secondary structure patterns are capable of
folding into three-dimensional forms due to interactions between the side chains
of amino acids, known as protein **Tertiary structure**. Additionally, two or
more individual peptide chains can aggregate to form multisubunit proteins,
referred to as having **Quaternary structure**.

[![Amino acids classification by
type](pics/aa.png "Amino acids clasification by type"){#fig-aa
.figure}](https://www.reddit.com/r/chemistry/comments/acyald/venn_diagram_showing_the_properties_of_the_20/)

It is important to note that the peptide bond itself does not permit rotation as
it possesses partial double bond characteristics. Hence, rotation is restricted
to the bonds between the `Cα` and the `C = O` group (the phi (φ) angle) and the
`Cα` and the `NH` group (the psi (ψ) angle). The polypeptide backbone chain thus
consists of a repeating sequence of two rotatable bonds followed by one
non-rotatable (peptide) bond. However, not all 360º of the φ and ψ angles are
feasible due to potential steric clashes between neighboring sidechains. For
specific angles and amino acid combinations, spatial constraints prevent atoms
from occupying the same physical location, which partially accounts for the
varying propensities of certain amino acids to adopt different types of
secondary structures.

[![Scheme of a generic polypeptide chain. Residue side chains are denoted as R.
Coloured rectangles indicate sets of six atoms that are coplanar due to the
double-bond character of the peptide bond. Arrows indicate the bonds that are
free to rotate with the angle of rotation about the N--Cα known as phi (φ) and
about the Cα--C known as psi (ψ). Note that only peptide backbone bonds are
labeled and in most cases the R group bond is free to
rotate.](pics/peptide_bond.png "Peptide bond"){#fig-bond
.figure}](https://portlandpress.com/essaysbiochem/article/64/4/649/226515/Uncovering-protein-structure)

Furthermore, the side chains of amino acids possess their own torsion angles,
known as χ1, χ2, χ3, and so forth (@fig-chi). These torsion angles significantly
influence the secondary and, particularly, the tertiary structures of proteins.
The various combinations of side-chain torsions defined by χ angles are referred
to as *rotamers*.

Within these constraints, the two primary local conformations that avoid steric
hindrance and maximize backbone-backbone hydrogen bonding are the α-helix and
the β-sheet secondary structures. [Linus
Pauling](https://en.wikipedia.org/wiki/Linus_Pauling) initially proposed the
**α-helix** as left-handed in 1951, but the myoglobin crystal structure in 1958
revealed that the right-handed form is more common. In the typical right-handed
helices, the backbone NH group hydrogen bonds to the backbone C=O group of the
amino acid located four residues earlier along the protein sequence. This
regular coil shape has the R-groups pointing outward, away from the peptide
backbone, and requires about 3.6 residues to complete a full turn of the helix
(@fig-alpha).

::: {layout-ncol="1"}
[![Alpha helix.](pics/alpha.png "Alpha helix"){#fig-alpha
.figure}](https://en.wikipedia.org/wiki/Alpha_helix)

[![Detailed description of a beta sheet made up of three beta
strands.](pics/beta.png "Beta sheet"){#fig-beta
.figure}](https://en.wikipedia.org/wiki/Beta_sheet)
:::

Different amino-acid sequences have varying tendencies to form α-helical
structures. [Methionine](https://en.wikipedia.org/wiki/Methionine "Methionine"),
[alanine](https://en.wikipedia.org/wiki/Alanine "Alanine"),
[leucine](https://en.wikipedia.org/wiki/Leucine "Leucine"),
[glutamate](https://en.wikipedia.org/wiki/Glutamate "Glutamate"), and
[lysine](https://en.wikipedia.org/wiki/Lysine "Lysine") have especially high
helix-forming propensities, whereas
[proline](https://en.wikipedia.org/wiki/Proline "Proline") and
[glycine](https://en.wikipedia.org/wiki/Glycine "Glycine") have poor
helix-forming propensities.
[Proline](https://en.wikipedia.org/wiki/Proline "Proline") often disrupts or
kinks a helix because it lacks an amide hydrogen to form [hydrogen
bonds](https://en.wikipedia.org/wiki/Hydrogen_bond "Hydrogen bond") and its
bulky side chain interferes with the preceding turn's backbone.
[Glycine](https://en.wikipedia.org/wiki/Glycine "Glycine") with only a hydrogen
as its R-group, is too flexible and entropically costly to maintain the
constrained α-helical structure, making it an α-helix breaker.

**β-Sheets** (@fig-beta) consist of two or more extended polypeptide chains
called β-strands running alongside each other in either a parallel or
antiparallel arrangement. In a β-sheet, residues arrange themselves in a zigzag
pattern with adjacent peptide bonds pointing in opposite directions. The NH
group and the C=O group of each amino acid form hydrogen bonds with the C=O
group and NH group, respectively, on adjacent strands. The chains can run in
opposite directions (*antiparallel β-sheet*) or the same direction (*parallel
β-sheet*). Sidechains from each residue alternate in opposite directions, giving
β-sheets hydrophilic and hydrophobic faces, often forming a pattern of
alternating hydrophilic and hydrophobic residues in the primary structure.

Large aromatic residues
([tyrosine](https://en.wikipedia.org/wiki/Tyrosine "Tyrosine"),
[phenylalanine](https://en.wikipedia.org/wiki/Phenylalanine "Phenylalanine"),
[tryptophan](https://en.wikipedia.org/wiki/Tryptophan "Tryptophan")) and
β-branched amino acids
([threonine](https://en.wikipedia.org/wiki/Threonine "Threonine"),
[valine](https://en.wikipedia.org/wiki/Valine "Valine"),
[isoleucine](https://en.wikipedia.org/wiki/Isoleucine "Isoleucine")) are
commonly found in β-strands. As in the case of α-helixes, β-strands are often
ended by [glycines](https://en.wikipedia.org/wiki/Glycine "Glycine"), which are
especially common in β-turns (the most common connector between strands), as
amino acids with positive φ angles.

[![Dihedral angles in glutamate: Dihedral angles are the main degrees of freedom
for the backbone (ϕ and ψ angles) and the side chain (χ angles) of an amino
acid. The number of χ angles varies between zero and four for the 20 standard
amino acids. The figure shows a ball-and-stick representation of glutamate,
which has three χ angles.](images/paste-6EE800AE.png){#fig-chi .figure
width="301"}](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-306)

Protein secondary, tertiary, and quaternary structures are maintained by
interactions between amino acids (@fig-interaction). These interactions are
typically classified into four types:

1.  *Ionic bonding*: Ionic bonds arise from electrostatic attractions between
    positively and negatively charged amino acid side chains. For example, the
    attraction between an aspartic acid carboxylate ion and a lysine ammonium
    ion helps stabilize a specific folded region of a protein.
2.  *Hydrogen bonding*: Hydrogen bonds form between a highly electronegative
    oxygen or nitrogen atom and a hydrogen atom attached to another oxygen or
    nitrogen atom, such as those in polar amino acid side chains. Hydrogen bonds
    are crucial for both intra- and intermolecular interactions in proteins,
    like in alpha-helixes.
3.  *Disulfide linkages*. When two cysteine amino acids come close together
    during protein folding under appropriate redox conditions, oxidation can
    link their sulfur atoms, forming a disulfide bond. Unlike ionic or hydrogen
    bonds, these are covalent links, and are thus a classic example of a
    spontaneus reaction, occurring as a post-translational modification.
    Although they are sensitive to reducing agents, greatly stabilize the
    tertiary structure and are vital for the quaternary structure of many
    proteins, like antibodies.
4.  *Hydrophobic interactions.* Dispersion forces arise when a normally nonpolar
    atom becomes momentarily polar due to an uneven distribution of electrons,
    leading to an instantaneous dipole that induces a shift of electrons in a
    neighboring nonpolar atom. Dispersion forces are weak but can be important
    when other types of interactions are either missing or minimal. The term
    *hydrophobic interaction* is often misused as a synonym for dispersion
    forces. Hydrophobic interactions arise because water molecules engage in
    hydrogen bonding with other water molecules (or groups in proteins capable
    of hydrogen bonding). Because nonpolar groups cannot engage in hydrogen
    bonding, the protein folds in such a way that these groups are buried in the
    interior part of the protein structure, minimizing their contact with water.

![Four interactions stabilize the tertiary structure of a protein: (a) ionic
bonding, (b) hydrogen bonding, (c) disulfide linkages, and (d) dispersion
forces](images/interactions-01.png){#fig-interaction .figure width="574"}

Other less frequent intramolecular interactions might be relevant in some
proteins, like the so-called **Isopeptide bonds**, formed between two protein
groups, at least one of which is not an α-amino or α-carboxy group. Examples
include ubiquitylation, sumoylation, transglutamination, sortase-mediated cell
surface protein anchoring and pilus formation [@kang2011]. As detailed in
@fig-iso, all these processes share several feature:

1.  All involve the reaction of a lysine ε-amino group on one protein with the
    main α-carboxy group on another protein, except for transglutamination,
    where the lysine targets a glutamine side chain carboxyamide group.
2.  All processes are enzyme-mediated and involve a transient thioester
    intermediate formed by the active site's cysteine. This intermediate is
    resolved through nucleophilic attack by the lysine ε-amino group, which
    completes the bond formation.

[![Enzyme-mediated formation of intermolecular isopeptide bonds. Examples are
shown for three different biological processes: ubiquitylation,
transglutamination and sortase-mediated pilus assembly in Gram-positive
bacteria. Proteins linked by isopeptide bonds are coloured blue and purple and
the isopeptide bond-forming enzymes are coloured green. Reproduced from
@kang2011.](images/iso.jpg){#fig-iso
.figure}](https://www.sciencedirect.com/science/article/pii/S0968000410001842)

In contrast to these enzyme-dependent processes, cross-linking isopeptide bonds
form autocatalytically in *S. pyogenes* major pilin Spy0128 and other Gram+ cell
surface proteins, as well as in HK97 phage capsid. In this case, the
bond-forming reaction is a proximity-induced reaction that happens when
participating amino acids are positioned together in a hydrophobic environment,
either through protein folding concurrent with peptide bond formation at the
ribosome or by capsid rearrangement (in HK97).

::: callout-tip
# Unnatural reactive amino acids

Manipulating inter- and intramolecular amino acid interactions is common for
altering protein properties. A classic example is adding nearby cysteines to
form disulfide bonds, enhancing stability. Recently, Spy0128 motifs and
unnatural reactive amino acids have enabled applications like creating protein
staples, thermostable antibodies, and surfaces/nanoparticles that capture
proteins [@cao2022].
:::

## Ramachandran Plot {#sec-rama}

As you likely deduced already, many combinations of φ and ψ angles are
prohibited due to the principle of steric exclusion, which dictates that two
atoms cannot occupy the same space simultaneously. This concept was initially
demonstrated by [Gopalasamudram
Ramachandran](https://en.wikipedia.org/wiki/G._N._Ramachandran), who developed a
plot to visualize the permissible angle values, known as the Ramachandran plot.
This plot can display the angles of a specific amino acid, all amino acids in a
single protein, or even across many proteins. Analysis of φ and ψ angles in
known proteins reveals that approximately three-quarters of all possible φ, ψ
combinations are not allowed (@fig-rama0) that correspond with common secondary
structure motifs (@fig-ram).

[![General Ramachandran plot. The density of points reflects how likely is each
angle combination, defining the core (red line) and tolerance (orange)
regions.](pics/Ramachandran_plot_general_100K.jpg "General Ramachandran plot"){#fig-rama0
.figure}](https://proteopedia.org/wiki/index.php/Ramachandran_Plot)

![Definition of secondary structure alternatives by their combination of phi,
psi angles.](pics/rama2.png "Textbook Rama plot"){#fig-ram .figure}

Functionally relevant residues are more likely than others to have torsion
angles that plot to the allowed but disfavored regions of a Ramachandran plot.
The specific geometry of these functionally relevant residues, while somewhat
energetically unfavorable, may be important for the protein's function,
catalytic or otherwise. Such conformations need to be stabilized by the protein
using H-bonds, steric packing, or other means, and should very seldom occur for
highly solvent-exposed residues.

[![Ramachandran plots for glycine (left) and proline (right) Inner contour
encloses 98% and 99.9% of Top structures data, indicating the favored and
allowed regions, respectively
.](pics/Ramachandran_Gly_Pro_data_and_contours_T8000_small.jpg){#fig-rama2
.figure}](https://proteopedia.org/wiki/index.php/Ramachandran_Plot)

## Protein folds, domains and motifs

The global three dimensional tertiary structure of a protein is commonly
referred as its **fold**. Within the overall protein fold, we can recognize
distinct **domains** and **motifs.** Domains are compact sections of the protein
that represent structurally and (usually) functionally independent regions. That
means that a domain maintain its main features, even if separated from the
overall protein. On the other hand, motifs are small substructures that are not
necessarily independent and consist of only a few secondary structure stretches.
Indeed, motifs can be also referred as *super-secondary* structure.

The diversity of protein folds, domains and motifs, and combination of those,
can be used for classification of protein structures hierarchically, as in many
other fields of biology. The first classification was proposed in the 70's and
consisted of four groups of folds, as shown in the figure below. *All α*
proteins are based almost entirely on an α-helical structure, and *all
β*-structure are based on β-sheets. *α/β* structure is based on as mixture of
α-helices and β-sheet, often organized as parallel β-strands connected by
α-helices. On the other hand, *α+β* structures consist of discrete α-helix and
β-sheet motifs that are not interwoven (as they are in α/β proteins). Finally,
*small proteins* span polypeptides with no or little secondary structures.

![The four structural protein classes in the classification by Chlothia &
Levitt. Modified from @structur using 1I2T, 1K76, 1H75 and 1EM7
structures.](pics/clasif.png){#fig-chlothia .figure}

## [**Quick exercise**]{style="color:green"}

Explore the structure below and try to understand its structure. Then answer the
question.

{{< mol-rcsb 3MWD viewportShowAnimation=false >}}

```{r wd, echo = FALSE, results = 'asis'}
opts_p <- c("1",answer = "2",
    "3",
   ">3")

cat("**How many unique protein chains are there is this protein?**",longmcq(opts_p))

opts_p <- c("All-α", "All β",
   answer = "α/β",
   "α+β",
   "Small")

cat("**Which structural class is the protein above?**",longmcq(opts_p))
```

As you can see, as the protein is larger, classification gets more difficult.
Moreover, as fold space has become more and more complex, these types of
classifications have been adjusted and extended such that a complete hierarchy
is created. The most commonly referred approaches to this sort of classification
are those used by SCOP and CATH databases, as we will see in the [Structural
Databases](ddbb.html#strDDBB) section.

## [**Practice 1A: Playing with secondary structures**]{style="color:green"}

![](pics/handson.png "Hands-on")

::: callout-important
# Groups

Remember to work on groups as assigned. Groups should be the same for all the
course exercises.
:::

There are a few online alternatives to model any peptide sequence and quickly
see the effect of amino acid composition in the secondary structure. One of the
best-known is Foldit ([www.fold.it](http://www.fold.it), @miller2020), a gaming
platform for biochemistry and structural biology teaching. It is a highly
recommended alternative for most courses related to protein structure.

In this course we are going to try a more recent proposal, twitted by Sergey
Ovchinnikov:


{{< share-post https://twitter.com/sokrypton/status/1535857255647690753 >}}

[![Basic protein amino acids stats for protein design with ColabFold
Single](pics/cheatsheet.png "Protein hallucination"){#fig-single .figure
width="500"}](https://twitter.com/sokrypton/status/1535857255647690753)

This is based on ColabFold (see <https://github.com/sokrypton/ColabFold> and
@mirdita2022), an Alphafold2 (see @jumper2021) free notebook in [Google Colab
notebook](https://colab.research.google.com/?hl=en). All you need is a Google
account and the *cheatsheet* in @fig-single.

Now go to ColabFold Single:
<https://colab.research.google.com/github/sokrypton/af_backprop/blob/beta/examples/AlphaFold_single.ipynb>

Construct some small proteins and compare the output. Use 6 recycle steps to
obtain diverse structures and then click in the "Animate trajectories" chunk.
Note that the first model will take 2-4 min, but the others will be faster. I
provide here some interesting examples (IUPAC one-letter amino acid code):

1.  AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

2.  KKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKKK

3.  PVAVEARENGRLAVRVEGAIAVLIRENGRLVVRVEGG

4.  PELEKHREELGEFLKKETGIAVEIRENGRLEVRVEGYTDVKIEGGTERLKRFLEEL

5.  ACTWEGNKLTCA

**1. Answer the following questions:**\
**- Why is a poly-K more stable (dark blue) than a poly-A?**\
\
**- Could you predict the structure of a poly-V or a poly-G?**\
\
**- What would happen if you introduce a K5W in the structure number 2? and in
the 4?**\
\
\
**2. Now, try to create peptides with a custom motif, such as:**\
\
**- Two helices.**\
**- A four-strands beta-sheet.**\
**- Alpha-beta-beta-alpha.**\