diff --git a/cli/build/index.html b/cli/build/index.html index 2a693046..81d9bc03 100644 --- a/cli/build/index.html +++ b/cli/build/index.html @@ -1,2 +1,2 @@ -Build · PanGraph.jl

Build

Description

Build a multiple sequence alignment pangraph.

Options

| Name | Type | Short Flag | Long Flag | Description |
|---|---|---|---|---|
| minimum length | Integer | l | len | minimum block size for alignment graph (in nucleotides) |
| block junction cost | Float | a | alpha | energy cost for introducing block partitions due to alignment merger |
| block diversity cost | Float | b | beta | energy cost for interblock diversity due to alignment merger |
| circular genomes | Boolean | c | circular | toggle if input genomes are circular |
| pairwise sensitivity | String | s | sensitivity | controls the pairwise genome alignment sensitivity of minimap2. Currently only accepts "5", "10" or "20" |
| maximum self-maps | Integer | x | max-self-map | maximum number of iterations to perform block self maps per pairwise graph merger |
| enforce uppercase | Boolean | u | upper-case | toggle to force genomes to uppercase characters |
| distance calculator | String | d | distance-backend | only accepts "native" or "mash" |
| alignment kernel | String | k | alignment-kernel | only accepts "minimap2" or "mmseqs" |
| kmer length (mmseqs) | Integer | K | kmer-length | kmer length, only used for mmseqs2 alignment kernel. If not specified will use mmseqs default. |
| consistency check | Boolean | t | test | toggle to activate consistency check: verifies that input genomes can be exactly reconstructed from the graph |
| random seed | Int | r | random-seed | random seed for pangraph construction. |

Arguments

Expects one or more fasta files. Multiple records within one file are treated as separate genomes. Fasta files can be optionally gzipped.

Output

Prints the constructed pangraph as a JSON to stdout.

+Build · PanGraph.jl

Build

Description

Build a multiple sequence alignment pangraph.

Options

| Name | Type | Short Flag | Long Flag | Description |
|---|---|---|---|---|
| minimum length | Integer | l | len | minimum block size for alignment graph (in nucleotides) |
| block junction cost | Float | a | alpha | energy cost for introducing block partitions due to alignment merger |
| block diversity cost | Float | b | beta | energy cost for interblock diversity due to alignment merger |
| circular genomes | Boolean | c | circular | toggle if input genomes are circular |
| pairwise sensitivity | String | s | sensitivity | controls the pairwise genome alignment sensitivity of minimap2. Currently only accepts "5", "10" or "20" |
| maximum self-maps | Integer | x | max-self-map | maximum number of iterations to perform block self maps per pairwise graph merger |
| enforce uppercase | Boolean | u | upper-case | toggle to force genomes to uppercase characters |
| distance calculator | String | d | distance-backend | only accepts "native" or "mash" |
| alignment kernel | String | k | alignment-kernel | only accepts "minimap2" or "mmseqs" |
| kmer length (mmseqs) | Integer | K | kmer-length | kmer length, only used for mmseqs2 alignment kernel. If not specified will use mmseqs default. |
| consistency check | Boolean | t | test | toggle to activate consistency check: verifies that input genomes can be exactly reconstructed from the graph |
| random seed | Int | r | random-seed | random seed for pangraph construction. |

Arguments

Expects one or more fasta files. Multiple records within one file are treated as separate genomes. Fasta files can be optionally gzipped.

Output

Prints the constructed pangraph as a JSON to stdout.
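
As a rough usage sketch, the options above can be combined as follows. The file names genomes.fa.gz and graph.json are placeholders, the numeric values are arbitrary, and the flags are assumed to be passed in the same `--long-flag value` form as in the project's own example (pangraph build --circular --alpha 0 --beta 0). The guard only skips the call when pangraph or the input file is missing.

```shell
# Placeholder file names (not part of the PanGraph documentation).
input="genomes.fa.gz"
output="graph.json"

if command -v pangraph >/dev/null 2>&1 && [ -f "$input" ]; then
    # Circular genomes, 100 nt minimum block size, minimap2 kernel,
    # with the consistency check (--test) switched on.
    # The graph is printed to stdout, so it is redirected to a file.
    pangraph build --circular --len 100 --alpha 0 --beta 0 \
        --alignment-kernel minimap2 --test "$input" > "$output"
else
    echo "skipping: pangraph or $input not found" >&2
fi
```

Because the graph is written to stdout, the redirection is what decides where the JSON ends up.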

diff --git a/cli/export/index.html b/cli/export/index.html index 5db5bbbe..2f6ed1cc 100644 --- a/cli/export/index.html +++ b/cli/export/index.html @@ -1,2 +1,2 @@ -Export · PanGraph.jl

Export

Description

Export a pangraph to one or more chosen file formats.

Options

| Name | Type | Short Flag | Long Flag | Description |
|---|---|---|---|---|
| Edge minimum length | Integer | ell | edge-minimum-length | blocks below this length cutoff will be ignored for edges in graph |
| Edge maximum length | Integer | elu | edge-maximum-length | blocks above this length cutoff will be ignored for edges in graph |
| Edge minimum depth | Integer | edl | edge-minimum-depth | blocks below this depth cutoff will be ignored for edges in graph |
| Edge maximum depth | Integer | edu | edge-maximum-depth | blocks above this depth cutoff will be ignored for edges in graph |
| Minimum length | Integer | ll | minimum-length | blocks below this length cutoff will be ignored for export |
| Maximum length | Integer | lu | maximum-length | blocks above this length cutoff will be ignored for export |
| Minimum depth | Integer | dl | minimum-depth | blocks below this depth cutoff will be ignored for export |
| Maximum depth | Integer | du | maximum-depth | blocks above this depth cutoff will be ignored for export |
| No duplications | Boolean | nd | no-duplications | do not export any block that contains at least one strain repeated more than once |
| Output directory | String | o | output-directory | path to directory where output will be stored (default: export) |
| Prefix | String | p | prefix | basename of exported files (default: pangraph) |
| GFA | Boolean | ng | no-export-gfa | toggles whether pangraph is exported as GFA. |
| PanX | Boolean | pX | export-panX | toggles whether pangraph is exported to panX visualization compatible format. (requires fasttree) |

Arguments

Zero or one pangraph file which must be formatted as a JSON. If no file path is given, reads from stdin. In either case, the stream can be optionally gzipped.

Output

Outputs the constructed pangraph to the selected formats at the user-supplied paths.

+Export · PanGraph.jl

Export

Description

Export a pangraph to one or more chosen file formats.

Options

| Name | Type | Short Flag | Long Flag | Description |
|---|---|---|---|---|
| Edge minimum length | Integer | ell | edge-minimum-length | blocks below this length cutoff will be ignored for edges in graph |
| Edge maximum length | Integer | elu | edge-maximum-length | blocks above this length cutoff will be ignored for edges in graph |
| Edge minimum depth | Integer | edl | edge-minimum-depth | blocks below this depth cutoff will be ignored for edges in graph |
| Edge maximum depth | Integer | edu | edge-maximum-depth | blocks above this depth cutoff will be ignored for edges in graph |
| Minimum length | Integer | ll | minimum-length | blocks below this length cutoff will be ignored for export |
| Maximum length | Integer | lu | maximum-length | blocks above this length cutoff will be ignored for export |
| Minimum depth | Integer | dl | minimum-depth | blocks below this depth cutoff will be ignored for export |
| Maximum depth | Integer | du | maximum-depth | blocks above this depth cutoff will be ignored for export |
| No duplications | Boolean | nd | no-duplications | do not export any block that contains at least one strain repeated more than once |
| Output directory | String | o | output-directory | path to directory where output will be stored (default: export) |
| Prefix | String | p | prefix | basename of exported files (default: pangraph) |
| GFA | Boolean | ng | no-export-gfa | toggles whether pangraph is exported as GFA. |
| PanX | Boolean | pX | export-panX | toggles whether pangraph is exported to panX visualization compatible format. (requires fasttree) |

Arguments

Zero or one pangraph file which must be formatted as a JSON. If no file path is given, reads from stdin. In either case, the stream can be optionally gzipped.

Output

Outputs the constructed pangraph to the selected formats at the user-supplied paths.
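
A possible invocation, sketched under assumptions: graph.json is a placeholder for a graph produced by the build command, the edge length cutoff of 200 is an arbitrary illustrative value, and the output directory and prefix simply repeat the documented defaults.

```shell
graph="graph.json"   # placeholder input file
outdir="export"      # documented default output directory
prefix="pangraph"    # documented default basename

if command -v pangraph >/dev/null 2>&1 && [ -f "$graph" ]; then
    # Skip blocks with duplicated strains and ignore short blocks
    # when drawing edges (200 is an arbitrary cutoff).
    pangraph export --no-duplications --edge-minimum-length 200 \
        --output-directory "$outdir" --prefix "$prefix" "$graph"
else
    echo "skipping: pangraph or $graph not found" >&2
fi
```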

diff --git a/cli/generate/index.html b/cli/generate/index.html index e60425bd..9ddf0dab 100644 --- a/cli/generate/index.html +++ b/cli/generate/index.html @@ -1,2 +1,2 @@ -Generate · PanGraph.jl

Generate

Description

Generate a simulated multiple sequence alignment pangraph.

Options

| Name | Type | Short Flag | Long Flag | Description |
|---|---|---|---|---|
| Mutation rate | Float | m | snp-rate | Rate of mutations per site per genome per generation |
| HGT rate | Float | r | hgt-rate | Rate of horizontal transfer events per genome per generation |
| Deletion rate | Float | d | delete-rate | Rate of deletion events per genome per generation |
| Inversion rate | Float | i | invert-rate | Rate of inversion events per genome per generation |
| Graph output | String | o | output-path | Path to location to store simulated pangraph |
| Time | Integer | t | time | Number of generations to simulate before computing sequences and graph |

Arguments

Zero or one fasta file to treat as ancestral sequences. If no file path is given, reads from stdin. In either case, the stream can be optionally gzipped. The number and length of sequences determine the population size.

Output

Outputs all resultant sequences to standard out. Optionally output the resultant pangraph if a path is supplied by the user.

+Generate · PanGraph.jl

Generate

Description

Generate a simulated multiple sequence alignment pangraph.

Options

| Name | Type | Short Flag | Long Flag | Description |
|---|---|---|---|---|
| Mutation rate | Float | m | snp-rate | Rate of mutations per site per genome per generation |
| HGT rate | Float | r | hgt-rate | Rate of horizontal transfer events per genome per generation |
| Deletion rate | Float | d | delete-rate | Rate of deletion events per genome per generation |
| Inversion rate | Float | i | invert-rate | Rate of inversion events per genome per generation |
| Graph output | String | o | output-path | Path to location to store simulated pangraph |
| Time | Integer | t | time | Number of generations to simulate before computing sequences and graph |

Arguments

Zero or one fasta file to treat as ancestral sequences. If no file path is given, reads from stdin. In either case, the stream can be optionally gzipped. The number and length of sequences determine the population size.

Output

Outputs all resultant sequences to standard out. Optionally output the resultant pangraph if a path is supplied by the user.
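
The following sketch shows one way the two outputs could be separated. All file names are placeholders and the rates and generation count are arbitrary illustrative values, not recommended settings.

```shell
ancestors="ancestors.fa"   # placeholder ancestral sequences
sequences="simulated.fa"   # placeholder: simulated sequences (stdout)
graph="simulated.json"     # placeholder: simulated pangraph (--output-path)

if command -v pangraph >/dev/null 2>&1 && [ -f "$ancestors" ]; then
    # Sequences go to stdout; the graph is only written because
    # --output-path is supplied.
    pangraph generate --snp-rate 0.001 --hgt-rate 0.01 \
        --delete-rate 0.01 --invert-rate 0.01 --time 50 \
        --output-path "$graph" "$ancestors" > "$sequences"
else
    echo "skipping: pangraph or $ancestors not found" >&2
fi
```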

diff --git a/cli/marginalize/index.html b/cli/marginalize/index.html index 3fdc9217..bff7e526 100644 --- a/cli/marginalize/index.html +++ b/cli/marginalize/index.html @@ -1,2 +1,2 @@ -Marginalize · PanGraph.jl

Marginalize

Description

Compute all pairwise marginalizations of a multiple sequence alignment pangraph.

Options

| Name | Type | Short Flag | Long Flag | Description |
|---|---|---|---|---|
| Output path | String | o | output-path | Path to directory where the output of all pairwise marginalizations will be stored if supplied |
| Reduce paralogs | Boolean | r | reduce-paralog | Collapses coparallel paths through duplicated blocks. |
| Projection strains | String | s | Strains | Collapses the graph structure to only blocks and edges contained by the paths of the supplied strain names. Comma separated, no spaces |
| Consistency check | Boolean | t | test | toggle to activate consistency check: verifies that output genomes are exactly equal to input genomes |

Arguments

Zero or one pangraph file which must be formatted as a JSON. If no file path is given, reads from stdin. In either case, the stream can be optionally gzipped.

Output

Outputs all pairwise graphs to the directory at the user-supplied path.

+Marginalize · PanGraph.jl

Marginalize

Description

Compute all pairwise marginalizations of a multiple sequence alignment pangraph.

Options

| Name | Type | Short Flag | Long Flag | Description |
|---|---|---|---|---|
| Output path | String | o | output-path | Path to directory where the output of all pairwise marginalizations will be stored if supplied |
| Reduce paralogs | Boolean | r | reduce-paralog | Collapses coparallel paths through duplicated blocks. |
| Projection strains | String | s | Strains | Collapses the graph structure to only blocks and edges contained by the paths of the supplied strain names. Comma separated, no spaces |
| Consistency check | Boolean | t | test | toggle to activate consistency check: verifies that output genomes are exactly equal to input genomes |

Arguments

Zero or one pangraph file which must be formatted as a JSON. If no file path is given, reads from stdin. In either case, the stream can be optionally gzipped.

Output

Outputs all pairwise graphs to the directory at the user-supplied path.
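
A minimal sketch of a pairwise marginalization run, assuming a placeholder input graph.json and a placeholder output directory named pairs. Creating the directory beforehand is a precaution of this sketch; whether pangraph creates it on its own is not stated here.

```shell
graph="graph.json"   # placeholder input file
outdir="pairs"       # placeholder output directory

if command -v pangraph >/dev/null 2>&1 && [ -f "$graph" ]; then
    mkdir -p "$outdir"
    # Write every pairwise graph into $outdir, collapsing
    # coparallel paths through duplicated blocks.
    pangraph marginalize --output-path "$outdir" --reduce-paralog "$graph"
else
    echo "skipping: pangraph or $graph not found" >&2
fi
```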

diff --git a/cli/polish/index.html b/cli/polish/index.html index af12c88a..ef0f8dcb 100644 --- a/cli/polish/index.html +++ b/cli/polish/index.html @@ -1,2 +1,2 @@ -Polish · PanGraph.jl

Polish

Description

Realigns blocks of a multiple sequence alignment pangraph with an external multiple sequence alignment tool. Requires MAFFT command to be available in PATH.

Options

| Name | Type | Short Flag | Long Flag | Description |
|---|---|---|---|---|
| Maximum Length | Integer | l | length | cutoff above which the block is not realigned externally |
| Preserve Case | Bool | c | preserve-case | ensure case (upper/lower) is preserved after realignment |

Arguments

Zero or one pangraph file which must be formatted as a JSON. If no file path is given, reads from stdin. In either case, the stream can be optionally gzipped.

Output

Outputs the polished pangraph to stdout.

+Polish · PanGraph.jl

Polish

Description

Realigns blocks of a multiple sequence alignment pangraph with an external multiple sequence alignment tool. Requires MAFFT command to be available in PATH.

Options

| Name | Type | Short Flag | Long Flag | Description |
|---|---|---|---|---|
| Maximum Length | Integer | l | length | cutoff above which the block is not realigned externally |
| Preserve Case | Bool | c | preserve-case | ensure case (upper/lower) is preserved after realignment |

Arguments

Zero or one pangraph file which must be formatted as a JSON. If no file path is given, reads from stdin. In either case, the stream can be optionally gzipped.

Output

Outputs the polished pangraph to stdout.
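
For illustration, a polishing run might look like the sketch below. The file names are placeholders and the length cutoff of 10000 is an arbitrary value. The check for mafft mirrors the requirement stated in the description.

```shell
graph="graph.json"        # placeholder input file
polished="polished.json"  # placeholder output file

if command -v pangraph >/dev/null 2>&1 \
    && command -v mafft >/dev/null 2>&1 \
    && [ -f "$graph" ]; then
    # Blocks longer than the cutoff are left untouched; case is preserved.
    # The polished graph is printed to stdout, hence the redirection.
    pangraph polish --length 10000 --preserve-case "$graph" > "$polished"
else
    echo "skipping: pangraph, mafft or $graph not found" >&2
fi
```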

diff --git a/cli/version/index.html b/cli/version/index.html index ca3cf8c0..68453d99 100644 --- a/cli/version/index.html +++ b/cli/version/index.html @@ -1,2 +1,2 @@ -Version · PanGraph.jl
+Version · PanGraph.jl
diff --git a/dev/building-docker/index.html b/dev/building-docker/index.html index 73b303cb..1781c8db 100644 --- a/dev/building-docker/index.html +++ b/dev/building-docker/index.html @@ -4,4 +4,4 @@ --workdir="/workdir" \ --user="$(id -u):$(id -g)" \ neherlab/pangraph \ - bash tests/run-cli-tests.sh

Or more simply using instructions in the Makefile:

make docker-test

This will test all the available commands, see tests/run-cli-tests.sh script.

+ bash tests/run-cli-tests.sh

Or more simply using instructions in the Makefile:

make docker-test

This will test all the available commands, see tests/run-cli-tests.sh script.

diff --git a/dev/building-documentation/index.html b/dev/building-documentation/index.html index 6a072f03..83564251 100644 --- a/dev/building-documentation/index.html +++ b/dev/building-documentation/index.html @@ -1,2 +1,2 @@ -Updating the documentation · PanGraph.jl

Updating the documentation

The documentation is hosted on https://neherlab.github.io/pangraph/.

Automated documentation releases

The Continuous integration (CI) will trigger a build and deployment of the docs website to GitHub Pages on every git tag (along with the main build). You can track the build process on GitHub Actions:

https://github.com/neherlab/pangraph/actions

The docs CI is configured in

.github/workflows/docs.yml

Build manually

To build a new version of the documentation manually, run make documentation. This will create the directory docs/build.

Preview locally

You can preview the resulting docs website locally by serving the generated docs/build directory with any static file server. For example, if you have Node.js and npx installed, you could run the serve package:

npx serve --listen=tcp://localhost:8888 docs/build

and then open http://localhost:8888 in a browser.

Release manually

To release a new version of the documentation manually (bypassing the CI workflow):

  1. build the documentation
  2. switch to the gh-pages branch of the repository
  3. Substitute the content of the previously-created docs/build into the repo main directory and commit the changes.
+Updating the documentation · PanGraph.jl

Updating the documentation

The documentation is hosted on https://neherlab.github.io/pangraph/.

Automated documentation releases

The Continuous integration (CI) will trigger a build and deployment of the docs website to GitHub Pages on every git tag (along with the main build). You can track the build process on GitHub Actions:

https://github.com/neherlab/pangraph/actions

The docs CI is configured in

.github/workflows/docs.yml

Build manually

To build a new version of the documentation manually, run make documentation. This will create the directory docs/build.

Preview locally

You can preview the resulting docs website locally by serving the generated docs/build directory with any static file server. For example, if you have Node.js and npx installed, you could run the serve package:

npx serve --listen=tcp://localhost:8888 docs/build

and then open http://localhost:8888 in a browser.

Release manually

To release a new version of the documentation manually (bypassing the CI workflow):

  1. build the documentation
  2. switch to the gh-pages branch of the repository
  3. Substitute the content of the previously-created docs/build into the repo main directory and commit the changes.
diff --git a/dev/releasing/index.html b/dev/releasing/index.html index be30fd40..aaca73b9 100644 --- a/dev/releasing/index.html +++ b/dev/releasing/index.html @@ -6,4 +6,4 @@ --volume="$(pwd)/path-to-fasta:/workdir" \ --user="$(id -u):$(id -g)" \ --workdir=/workdir neherlab/pangraph:$RELEASE_VERSION \ - bash -c "pangraph build --circular --alpha 0 --beta 0 /workdir/test.fa"

Here we mount the local directory path-to-fasta as /workdir so that pangraph can read the /workdir/test.fa file.

👷 TODO: implement automated tests

Modifying continuous integration workflow

See .github/workflows/build.yml

Modifying Docker image

See Dockerfile

+ bash -c "pangraph build --circular --alpha 0 --beta 0 /workdir/test.fa"

Here we mount the local directory path-to-fasta as /workdir so that pangraph can read the /workdir/test.fa file.

👷 TODO: implement automated tests

Modifying continuous integration workflow

See .github/workflows/build.yml

Modifying Docker image

See Dockerfile

diff --git a/index.html b/index.html index 2a2cd1d6..66d18db4 100644 --- a/index.html +++ b/index.html @@ -10,4 +10,4 @@ --workdir=/workdir \ neherlab/pangraph:latest \ bash -c "pangraph build --circular --alpha 0 --beta 0 /workdir/example_datasets/ecoli.fa.gz > graph.json"

Replace the :latest tag with either an explicit version, e.g. :1.2.3 or :master, depending on which version you pulled in the previous section. If you haven't run docker pull, the docker run command should pull the corresponding version for you.

Here we mount the current directory . (expressed as an absolute path, using the pwd shell command) as /workdir into the container so that pangraph can read the local file ./example_datasets/ecoli.fa.gz as /workdir/example_datasets/ecoli.fa.gz:

                                 . -> /workdir
-    ./example_datasets/ecoli.fa.gz -> /workdir/example_datasets/ecoli.fa.gz

The --name flag sets the name of the container and the date command there ensures that a unique name is created on every run. This is optional. The --rm flag deletes the container (but not the image) after run.

Replace :latest with a specific version if desired. The :latest tag can also be omitted, as it is the default.

Building binaries locally

PanGraph can be built locally on your machine by running (inside the cloned repo)

    export jc="path/to/julia/executable"
    make pangraph && make install

This will build the executable and place a symlink into bin/.

Importantly, if jc is not explicitly set, it will default to vendor/julia-$VERSION/bin/julia. If this file does not exist, Julia will be downloaded automatically, provided the host system is Linux or macOS. Moreover, for the compilation to work, MAFFT and mmseqs2 must be available in your $PATH (see optional dependencies).

Note that the PackageCompiler.jl documentation recommends using the officially distributed Julia binaries rather than those packaged by your Linux distribution. Compilation may not work with distribution-provided binaries.

Optional dependencies

There are a few optional external programs that PanGraph can utilize[1]:

  1. Mash can be used to construct a guide tree in place of our internal algorithm (see build command options).
  2. MAFFT can be optionally used to polish block alignments (see polish command). Only recommended for short alignments.
  3. mmseqs2 can be used as an alternative alignment kernel to the default minimap2 (see build command options). It allows merging of more diverged sequences, at the cost of higher computational time.
  4. fasttree is used to build phylogenetic trees for export in PanX-compatible format (see export command options and the tutorial section).

In order to invoke all functionalities from PanGraph, these tools must be installed and available on $PATH.

If conda is available, one can run the following command to install all of these dependencies in a new environment named pangraph:

conda create -n pangraph -c conda-forge -c bioconda mmseqs2=13.45111 mash=2.2.2 mafft=7.475 FastTree=2.1.11

Alternatively, a script bin/setup-pangraph is provided within the repository to install both dependencies for a Linux machine without access to root. It assumes GNU coreutils are available.

These dependencies are already available within the Docker container.

User's Guide

Basic functionality of PanGraph is provided by a command line interface. This includes multiple genome alignment, the export of a genome alignment to various visualization formats, alignment polishing, and a genome comparison tool. For more details please refer to the Tutorials section of the documentation.

Multithreading support is baked into the provided binary. Due to limitations in Julia, the number of threads is set by the environment variable JULIA_NUM_THREADS.

For uncovered use cases, functionality can be added by utilizing the underlying library functions. Please see the high-level overview for definitions of library terminology.

Citing PanGraph

PanGraph: scalable bacterial pan-genome graph construction. Nicholas Noll, Marco Molari, Liam P. Shaw, Richard Neher bioRxiv 2022.02.24.481757; doi: https://doi.org/10.1101/2022.02.24.481757

+ ./example_datasets/ecoli.fa.gz -> /workdir/example_datasets/ecoli.fa.gz

The --name flag sets the name of the container and the date command there ensures that a unique name is created on every run. This is optional. The --rm flag deletes the container (but not the image) after run.

Replace :latest with a specific version if desired. The :latest tag can also be omitted, as it is the default.

Building binaries locally

PanGraph can be built locally on your machine by running (inside the cloned repo)

    export jc="path/to/julia/executable"
    make pangraph && make install

This will build the executable and place a symlink into bin/.

Importantly, if jc is not explicitly set, it will default to vendor/julia-$VERSION/bin/julia. If this file does not exist, Julia will be downloaded automatically, provided the host system is Linux or macOS. Moreover, for the compilation to work, MAFFT and mmseqs2 must be available in your $PATH (see optional dependencies).

Note that the PackageCompiler.jl documentation recommends using the officially distributed Julia binaries rather than those packaged by your Linux distribution. Compilation may not work with distribution-provided binaries.

Optional dependencies

There are a few optional external programs that PanGraph can utilize[1]:

  1. Mash can be used to construct a guide tree in place of our internal algorithm (see build command options).
  2. MAFFT can be optionally used to polish block alignments (see polish command). Only recommended for short alignments.
  3. mmseqs2 can be used as an alternative alignment kernel to the default minimap2 (see build command options). It allows merging of more diverged sequences, at the cost of higher computational time.
  4. fasttree is used to build phylogenetic trees for export in PanX-compatible format (see export command options and the tutorial section).

In order to invoke all functionalities from PanGraph, these tools must be installed and available on $PATH.

If conda is available, one can run the following command to install all of these dependencies in a new environment named pangraph:

conda create -n pangraph -c conda-forge -c bioconda mmseqs2=13.45111 mash=2.2.2 mafft=7.475 FastTree=2.1.11

Alternatively, a script bin/setup-pangraph is provided within the repository to install both dependencies for a Linux machine without access to root. It assumes GNU coreutils are available.

These dependencies are already available within the Docker container.

User's Guide

Basic functionality of PanGraph is provided by a command line interface. This includes multiple genome alignment, the export of a genome alignment to various visualization formats, alignment polishing, and a genome comparison tool. For more details please refer to the Tutorials section of the documentation.

Multithreading support is baked into the provided binary. Due to limitations in Julia, the number of threads is set by the environment variable JULIA_NUM_THREADS.
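
As a small sketch, the variable can be exported in the shell before pangraph is launched. The thread count of 4 and the input file name are placeholders chosen for illustration.

```shell
# Julia reads this variable at startup, so it has to be set
# before pangraph is invoked. 4 is an arbitrary example value.
export JULIA_NUM_THREADS=4

input="genomes.fa.gz"   # placeholder input file

if command -v pangraph >/dev/null 2>&1 && [ -f "$input" ]; then
    pangraph build --circular "$input" > graph.json
else
    echo "skipping: pangraph or $input not found" >&2
fi
```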

For uncovered use cases, functionality can be added by utilizing the underlying library functions. Please see the high-level overview for definitions of library terminology.

Citing PanGraph

PanGraph: scalable bacterial pan-genome graph construction. Nicholas Noll, Marco Molari, Liam P. Shaw, Richard Neher bioRxiv 2022.02.24.481757; doi: https://doi.org/10.1101/2022.02.24.481757

diff --git a/lib/align/index.html b/lib/align/index.html index 31549955..5d0a92f8 100644 --- a/lib/align/index.html +++ b/lib/align/index.html @@ -1,8 +1,8 @@ -Alignment · PanGraph.jl

Alignment

Types

PanGraph.Graphs.Align.CladeMethod
Clade(distance, names; algo=:nj)

Generate a tree from a matrix of pairwise distances distance. The names of leaves are given by an array of strings names. algo dictates the algorithm used to transform the distance matrix into a tree. Currently only neighbor joining (:nj) is supported.

source
PanGraph.Graphs.Align.MessageType
mutable struct Clade
+Alignment · PanGraph.jl

Alignment

Types

PanGraph.Graphs.Align.CladeMethod
Clade(distance, names; algo=:nj)

Generate a tree from a matrix of pairwise distances distance. The names of leaves are given by an array of strings names. algo dictates the algorithm used to transform the distance matrix into a tree. Currently only neighbor joining (:nj) is supported.

source
PanGraph.Graphs.Align.MessageType
mutable struct Clade
 	name   :: String
 	parent :: Union{Clade,Nothing}
 	left   :: Union{Clade,Nothing}
 	right  :: Union{Clade,Nothing}
 	graph  :: Channel{Tuple{Graph,Int}}
-end

Clade is a node (internal or leaf) of a binary guide tree used to order pairwise alignments associated to a multiple genome alignment in progress. name is only non-empty for leaf nodes. parent is nothing for the root node. graph is a 0-sized channel that is used as a message passing primitive in alignment. It contains the graph and an index used to decide the order of items in a pair in pairwise graph merge.

source

Functions

PanGraph.Graphs.Align.alignMethod
align(aligner::Function, Gs::Graph...; compare=Mash.distance, energy=(hit)->(-Inf), minblock=100, reference=nothing, maxiter=100)

Aligns a collection of graphs Gs using the specified aligner function to recover hits. Graphs are aligned following an internal guide tree, generated using kmer distance.

energy is to be a function that takes an alignment between two blocks and produces a score. The lower the score, the better the alignment. Only negative energies are considered.

minblock is the minimum size block that will be produced from the algorithm. maxiter is maximum number of duplications that will be considered during this alignment.

compare is the function to be used to generate pairwise distances that generate the internal guide tree.

source
PanGraph.Graphs.Align.align_pairMethod
align_pair(G₁::Graph, G₂::Graph, energy::Function, minblock::Int, verify::Function, verbose::Bool; maxiter=100)

Align graph G₁ to graph G₂ by looking for homology between consensus sequences of blocks. This is a low-level function.

energy is to be a function that takes an alignment between two blocks and produces a score. The lower the score, the better the alignment. Only negative energies are considered.

minblock is the minimum size block that will be produced from the algorithm. maxiter is maximum number of duplications that will be considered during this alignment.

source
PanGraph.Graphs.Align.align_selfMethod
align_self(G₁::Graph, energy::Function, minblock::Int, verify::Function, verbose::Bool; maxiter=100)

Align graph G₁ to itself by looking for homology between blocks. This is a low-level function.

energy is to be a function that takes an alignment between two blocks and produces a score. The lower the score, the better the alignment. Only negative energies are considered.

minblock is the minimum size block that will be produced from the algorithm. maxiter is maximum number of duplications that will be considered during this alignment.

source
PanGraph.Graphs.Align.njMethod
nj(distance, names)

Lower-level function. Generate a tree from a matrix of pairwise distances distance. The names of leaves are given by an array of strings names. Uses neighbor joining.

source
PanGraph.Graphs.Align.orderingMethod
ordering(compare, Gs...)

Return a guide tree based upon distances computed from a collection of graphs Gs, using method compare. The signature of compare is expected to be compare(G::Graphs....) -> distance, names. Graphs Gs... are expected to be singleton graphs.

source
+end

Clade is a node (internal or leaf) of a binary guide tree used to order pairwise alignments associated to a multiple genome alignment in progress. name is only non-empty for leaf nodes. parent is nothing for the root node. graph is a 0-sized channel that is used as a message passing primitive in alignment. It contains the graph and an index used to decide the order of items in a pair in pairwise graph merge.

source

Functions

PanGraph.Graphs.Align.alignMethod
align(aligner::Function, Gs::Graph...; compare=Mash.distance, energy=(hit)->(-Inf), minblock=100, reference=nothing, maxiter=100)

Aligns a collection of graphs Gs using the specified aligner function to recover hits. Graphs are aligned following an internal guide tree, generated using kmer distance.

energy is to be a function that takes an alignment between two blocks and produces a score. The lower the score, the better the alignment. Only negative energies are considered.

minblock is the minimum size block that will be produced from the algorithm. maxiter is maximum number of duplications that will be considered during this alignment.

compare is the function to be used to generate pairwise distances that generate the internal guide tree.

source
PanGraph.Graphs.Align.align_pairMethod
align_pair(G₁::Graph, G₂::Graph, energy::Function, minblock::Int, verify::Function, verbose::Bool; maxiter=100)

Align graph G₁ to graph G₂ by looking for homology between consensus sequences of blocks. This is a low-level function.

energy is to be a function that takes an alignment between two blocks and produces a score. The lower the score, the better the alignment. Only negative energies are considered.

minblock is the minimum size block that will be produced from the algorithm. maxiter is maximum number of duplications that will be considered during this alignment.

source
PanGraph.Graphs.Align.align_selfMethod
align_self(G₁::Graph, energy::Function, minblock::Int, verify::Function, verbose::Bool; maxiter=100)

Align graph G₁ to itself by looking for homology between blocks. This is a low-level function.

energy is to be a function that takes an alignment between two blocks and produces a score. The lower the score, the better the alignment. Only negative energies are considered.

minblock is the minimum size block that will be produced from the algorithm. maxiter is maximum number of duplications that will be considered during this alignment.

source
PanGraph.Graphs.Align.njMethod
nj(distance, names)

Lower-level function. Generate a tree from a matrix of pairwise distances distance. The names of leaves are given by an array of strings names. Uses neighbor joining.

source
PanGraph.Graphs.Align.orderingMethod
ordering(compare, Gs...)

Return a guide tree based upon distances computed from a collection of graphs Gs, using method compare. The signature of compare is expected to be compare(G::Graphs....) -> distance, names. Graphs Gs... are expected to be singleton graphs.

source
diff --git a/lib/block/index.html b/lib/block/index.html index db515135..66d0acce 100644 --- a/lib/block/index.html +++ b/lib/block/index.html @@ -6,10 +6,10 @@ mutate :: OrderedDict{Node{Block},SNPMap} insert :: OrderedDict{Node{Block},InsMap} delete :: OrderedDict{Node{Block},DelMap} -end

Store a multiple sequence alignment of contiguous DNA related by homology. Use as a component of a larger, branching multiple genome alignment. uuid is a string identifier unique to each block. sequence is the consensus (majority-rule) sequence. gaps recapitulate all locations of insertions for generating the full sequence alignment. mutate, insert, and delete store polymorphisms of each genome contained within the block.

source
PanGraph.Graphs.Blocks.BlockMethod
Block(sequence,gaps,mutate,insert,delete)

Construct a block with a unique uuid.

source
PanGraph.Graphs.Blocks.BlockMethod
Block(sequence,gaps)

Construct a block with a unique uuid with fixed sequence and gaps. No individuals, and thus no polymorphisms, are initialized.

source
PanGraph.Graphs.Blocks.BlockMethod
Block(sequence,gaps)

Construct a block with a unique uuid with fixed sequence.

source
PanGraph.Graphs.Blocks.BlockMethod
Block(b::Block, slice)

Return a subsequence associated to block b at interval slice. The returned block has a newly generated uuid.

source
PanGraph.Graphs.Blocks.BlockMethod
Block(bs::Block...)

Concatenate a variable number of blocks into one larger block. The returned block has a newly generated uuid.

source
PanGraph.Graphs.Blocks.BlockMethod
Block(sequence,gaps)

Construct a block with a unique uuid. All fields are empty.

source
PanGraph.Graphs.Blocks.PairPosType
mutable struct PairPos
+end

Store a multiple sequence alignment of contiguous DNA related by homology. Use as a component of a larger, branching multiple genome alignment. uuid is a string identifier unique to each block. sequence is the consensus (majority-rule) sequence. gaps recapitulate all locations of insertions for generating the full sequence alignment. mutate, insert, and delete store polymorphisms of each genome contained within the block.

source
PanGraph.Graphs.Blocks.BlockMethod
Block(sequence,gaps,mutate,insert,delete)

Construct a block with a unique uuid.

source
PanGraph.Graphs.Blocks.BlockMethod
Block(sequence,gaps)

Construct a block with a unique uuid with fixed sequence and gaps. No individuals, and thus no polymorphisms, are initialized.

source
PanGraph.Graphs.Blocks.BlockMethod
Block(sequence,gaps)

Construct a block with a unique uuid with fixed sequence.

source
PanGraph.Graphs.Blocks.BlockMethod
Block(b::Block, slice)

Return a subsequence associated to block b at interval slice. The returned block has a newly generated uuid.

source
PanGraph.Graphs.Blocks.BlockMethod
Block(bs::Block...)

Concatenate a variable number of blocks into one larger block. The returned block has a newly generated uuid.

source
PanGraph.Graphs.Blocks.BlockMethod
Block(sequence,gaps)

Construct a block with a unique uuid. All fields are empty.

source
PanGraph.Graphs.Blocks.PairPosType
mutable struct PairPos
     qry :: Maybe{Pos}
     ref :: Maybe{Pos}
-end

Representation of a matched pair of intervals within a pairwise alignment. qry can be of type Pos or Nothing. ref can be of type Pos or Nothing. If either ref or qry is nothing, the PairPos corresponds to an insertion or deletion respectively.

source
PanGraph.Graphs.Blocks.PosType
mutable struct Pos
+end

Representation of a matched pair of intervals within a pairwise alignment. qry can be of type Pos or Nothing. ref can be of type Pos or Nothing. If either ref or qry is nothing, the PairPos corresponds to an insertion or deletion respectively.

source
PanGraph.Graphs.Blocks.PosType
mutable struct Pos
     start :: Int
     stop  :: Int
-end

Representation of a single interval within a pairwise alignment. Inclusive on both ends, i.e. includes start and stop. Used internally to unpack cigar strings.

source

Functions

Base.append!Method
append!(b::Block, node::Node{Block}, snp::Maybe{SNPMap}, ins::Maybe{InsMap}, del::Maybe{DelMap})

Adds a new genome at node to multiple sequence alignment block b. Polymorphisms are optional. If nothing is passed instead, an empty dictionary will be used.

source
Base.lengthMethod
length(b::Block, n::Node)

Return the length of the sequence of node n within the multiple alignment of block b.

source
Base.lengthMethod
length(b::Block)

Return the length of consensus sequence of the multiple alignment of block b.

source
Base.pop!Method
pop!(b::Block, n::Node)

Remove genome of Node n from Block b.

source
PanGraph.Graphs.Blocks.allele_positionsMethod
allele_positions(snp::SNPMap, ins::InsMap, del::DelMap)

Return an iterator over polymorphic loci, i.e. SNPs and Indels. The iterator will be sorted by position in ascending order.

source
PanGraph.Graphs.Blocks.allele_positionsMethod
allele_positions(b::Block, n::Node)

Return an iterator over polymorphic loci for node n contained within block b. The iterator will be sorted by position in ascending order.

source
PanGraph.Graphs.Blocks.applyallelesMethod
applyalleles(seq::Array{UInt8}, mutate::SNPMap, insert::InsMap, delete::DelMap)

Take a sequence and apply polymorphisms, as given by mutate, insert, and delete. Return the newly allocated sequence.

source
PanGraph.Graphs.Blocks.assert_equalsMethod
assert_equal(b₁::Block, b₂::Block)

Throw an error if block b₁ is not equivalent to block b₂. Useful for internal debugging.

source
PanGraph.Graphs.Blocks.checkMethod
check(b::Block)

Check whether block b is internally self-consistent. Useful for debugging internals.

source
PanGraph.Graphs.Blocks.combineMethod
combine(qry::Block, ref::Block, aln::Alignment; minblock=500)

Take a pairwise alignment aln from the consensus of qry to ref and merge both. The resultant new block, with a novel uuid is returned. Alignment aln is a segmented set of intervals mapping homologous regions of one block into the other. Parameter minblock is the cutoff length of an indel, above which a new block will be created.

source
PanGraph.Graphs.Blocks.depthMethod
depth(b::Block)

Return the number of genomes contained within the alignment of block b.

source
PanGraph.Graphs.Blocks.diversityMethod
diversity(b::Block)

Return the averaged fraction of loci that are mutated within the multiple sequence alignment of block b.

source
PanGraph.Graphs.Blocks.partitionMethod
partition(alignment; minblock=500)

Parse the alignment into matched intervals of a pairwise alignment. If any insertion or deletion is larger than minblock, a new block is created to hold the homologous interval. This ensures that all blocks are at least minblock long and no block contains an insertion or deletion longer than itself.

alignment is assumed to be a data structure from the Utility module.

source
PanGraph.Graphs.Blocks.reconsensus!Method
reconsensus!(b::Block)

Update the consensus sequence of block b by majority-rule over the multiple sequence alignment.

source
PanGraph.Graphs.Blocks.regap!Method
regap!(b::Block)

Recompute the positions of gaps within the multiple sequence alignment of block b.

source
PanGraph.Graphs.Blocks.rereferenceMethod
rereference(qry::Block, ref::Block, aligment)

Take the pairwise alignment segments from the consensus of qry to ref and rereference all polymorphisms of qry to the consensus sequence of ref. Low-level function used by higher-level API.

source
PanGraph.Graphs.Blocks.swap!Method
swap!(b::Block, oldkey::Array{Node{Block}}, newkey::Node{Block})

Remove all polymorphisms associated to all keys within oldkey. Concatenate and reassociate them to newkey.

source
PanGraph.Graphs.Blocks.swap!Method
swap!(b::Block, oldkey::Node{Block}, newkey::Node{Block})

Remove all polymorphisms associated to oldkey and reassociate them to newkey.

source
PanGraph.Graphs.marshal_fastaMethod
marshal_fasta(io::IO, b::Block; opt=nothing)

Serialize the multiple sequence alignment of block b to a fasta format to IO stream io. Each sequence will be serialized as-is, i.e. with no gaps.

If opt is not nothing, the output will be an aligned fasta file. Furthermore, opt is interpreted as a function to be called per internal node that gives a unique name for each fasta record that is generated per node.

source
PanGraph.Graphs.reverse_complementMethod
reverse_complement(b::Block; keepid=false)

Return the reverse complement of the multiple sequence alignment within Block b. By default, will return a block with a new uuid, unless keepid is set to true.

source
PanGraph.Graphs.sequence!Method
sequence!(seq, b::Block, node::Node{Block}; gaps=false)

Mutate the sequence buffer seq in place to hold the sequence associated to genome node within sequence alignment of block b. By default, gaps (character '-') will not be returned, unless gaps is set to true. Returning the sequence with gap characters is useful for generating the full sequence alignment.

source
PanGraph.Graphs.sequenceMethod
sequence(seq, b::Block, node::Node{Block}; gaps=false, forward=false)

Return the sequence associated to genome node within sequence alignment of block b. By default, gaps (character '-') will not be returned, unless gaps is set to true. Returning the sequence with gap characters is useful for generating the full sequence alignment. If forward is true, the true orientation of the genome is ignored and the sequence is returned in the orientation that aligns to the forward consensus.

source
PanGraph.Graphs.sequenceMethod
sequence(b::Block; gaps=false)

Return the consensus of the multiple sequence alignment within block b. By default, gaps (character '-') will not be returned, unless gaps is set to true. Returning the consensus alignment with gaps is useful for generating the full sequence alignment.

source
+end

Representation of a single interval within a pairwise alignment. Inclusive on both ends, i.e. includes start and stop. Used internally to unpack cigar strings.
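To illustrate how inclusive intervals arise from a cigar string, here is a minimal sketch. The names Interval, MaybeInterval, and unpack_cigar are hypothetical stand-ins for Pos, Maybe{Pos}, and the internal unpacking routine; the handling of operations (M, =, X as matches, I as query-only, D as reference-only) follows the usual cigar convention and is an assumption about the library's behavior.

```julia
# Hypothetical stand-in for Pos: an interval inclusive on both ends.
struct Interval
    start::Int
    stop::Int
end

const MaybeInterval = Union{Interval,Nothing}

# Unpack a cigar string into (qry, ref) interval pairs.
# `nothing` on the ref side marks an insertion, on the qry side a deletion.
function unpack_cigar(cigar::AbstractString)
    out = Tuple{MaybeInterval,MaybeInterval}[]
    qpos, rpos = 1, 1
    for m in eachmatch(r"(\d+)([MIDX=])", cigar)
        n  = parse(Int, m.captures[1])
        op = m.captures[2]
        if op == "I"            # bases present only in the query
            push!(out, (Interval(qpos, qpos + n - 1), nothing))
            qpos += n
        elseif op == "D"        # bases present only in the reference
            push!(out, (nothing, Interval(rpos, rpos + n - 1)))
            rpos += n
        else                    # M, = or X: both sequences advance
            push!(out, (Interval(qpos, qpos + n - 1), Interval(rpos, rpos + n - 1)))
            qpos += n
            rpos += n
        end
    end
    return out
end
```

For example, unpack_cigar("3M2I4M1D") produces four pairs, the second with nothing on the reference side and the last with nothing on the query side.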

source

Functions

Base.append!Method
append!(b::Block, node::Node{Block}, snp::Maybe{SNPMap}, ins::Maybe{InsMap}, del::Maybe{DelMap})

Adds a new genome at node to multiple sequence alignment block b. Polymorphisms are optional. If nothing is passed instead, an empty dictionary will be used.

source
Base.lengthMethod
length(b::Block, n::Node)

Return the length of the sequence of node n within the multiple alignment of block b.

source
Base.lengthMethod
length(b::Block)

Return the length of consensus sequence of the multiple alignment of block b.

source
Base.pop!Method
pop!(b::Block, n::Node)

Remove genome of Node n from Block b.

source
PanGraph.Graphs.Blocks.allele_positionsMethod
allele_positions(snp::SNPMap, ins::InsMap, del::DelMap)

Return an iterator over polymorphic loci, i.e. SNPs and Indels. The iterator will be sorted by position in ascending order.

source
PanGraph.Graphs.Blocks.allele_positionsMethod
allele_positions(b::Block, n::Node)

Return an iterator over polymorphic loci for node n contained within block b. The iterator will be sorted by position in ascending order.

source
PanGraph.Graphs.Blocks.applyallelesMethod
applyalleles(seq::Array{UInt8}, mutate::SNPMap, insert::InsMap, delete::DelMap)

Take a sequence and apply polymorphisms, as given by mutate, insert, and delete. Return the newly allocated sequence.
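The idea can be sketched on plain byte vectors. The function name apply_alleles_sketch is hypothetical, and the semantics are assumptions rather than the library's confirmed behavior: insertions keyed (locus, offset) are emitted after consensus position locus (0 meaning before the first base) in key order, and a deletion keyed at locus removes that many consensus positions starting there.

```julia
# Type aliases as documented for PanGraph.Graphs.
const SNPMap = Dict{Int,UInt8}
const InsMap = Dict{Tuple{Int,Int},Vector{UInt8}}
const DelMap = Dict{Int,Int}

# Hypothetical sketch: rebuild one genome's sequence from the consensus.
function apply_alleles_sketch(seq::Vector{UInt8}, mutate::SNPMap, insert::InsMap, delete::DelMap)
    out = UInt8[]
    inskeys = sort!(collect(keys(insert)))      # ordered by (locus, offset)
    function emit_insertions!(locus)
        for key in inskeys
            key[1] == locus && append!(out, insert[key])
        end
    end
    emit_insertions!(0)                         # insertions before the first base
    deleted_until = 0
    for (i, c) in enumerate(seq)
        if haskey(delete, i)
            deleted_until = max(deleted_until, i + delete[i] - 1)
        end
        # keep the base unless deleted, substituting the SNP if present
        i > deleted_until && push!(out, get(mutate, i, c))
        emit_insertions!(i)
    end
    return out                                  # freshly allocated; seq is untouched
end
```

The input buffer is never modified, which mirrors the documented contract that a new sequence is returned.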

source
PanGraph.Graphs.Blocks.assert_equalsMethod
assert_equal(b₁::Block, b₂::Block)

Throw an error if block b₁ is not equivalent to block b₂. Useful for internal debugging.

source
PanGraph.Graphs.Blocks.checkMethod
check(b::Block)

Check whether block b is internally self-consistent. Useful for debugging internals.

source
PanGraph.Graphs.Blocks.combineMethod
combine(qry::Block, ref::Block, aln::Alignment; minblock=500)

Take a pairwise alignment aln from the consensus of qry to ref and merge both. The resultant new block, with a novel uuid is returned. Alignment aln is a segmented set of intervals mapping homologous regions of one block into the other. Parameter minblock is the cutoff length of an indel, above which a new block will be created.

source
PanGraph.Graphs.Blocks.depthMethod
depth(b::Block)

Return the number of genomes contained within the alignment of block b.

source
PanGraph.Graphs.Blocks.diversityMethod
diversity(b::Block)

Return the averaged fraction of loci that are mutated within the multiple sequence alignment of block b.

source
PanGraph.Graphs.Blocks.partitionMethod
partition(alignment; minblock=500)

Parse the alignment into matched intervals of a pairwise alignment. If any insertion or deletion is larger than minblock, a new block is created to hold the homologous interval. This ensures that all blocks are at least minblock long and no block contains an insertion or deletion longer than itself.

alignment is assumed to be a data structure from the Utility module.

source
PanGraph.Graphs.Blocks.reconsensus!Method
reconsensus!(b::Block)

Update the consensus sequence of block b by majority-rule over the multiple sequence alignment.
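Majority rule itself is simple to illustrate on a dense alignment. The sketch below is an assumption-laden simplification: the name majority_consensus is hypothetical, rows are taken to be equal-length aligned byte vectors, and ties go to the smallest byte value. The library works on sparse polymorphism maps instead of dense rows.

```julia
# Hypothetical sketch: column-wise majority vote over equal-length aligned rows.
function majority_consensus(rows::Vector{Vector{UInt8}})
    ncol = length(first(rows))
    consensus = Vector{UInt8}(undef, ncol)
    for j in 1:ncol
        counts = Dict{UInt8,Int}()
        for row in rows
            counts[row[j]] = get(counts, row[j], 0) + 1
        end
        best, bestcount = 0x00, 0
        for c in sort!(collect(keys(counts)))   # sorted so ties resolve deterministically
            if counts[c] > bestcount
                best, bestcount = c, counts[c]
            end
        end
        consensus[j] = best
    end
    return consensus
end
```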

source
PanGraph.Graphs.Blocks.regap!Method
regap!(b::Block)

Recompute the positions of gaps within the multiple sequence alignment of block b.

source
PanGraph.Graphs.Blocks.rereferenceMethod
rereference(qry::Block, ref::Block, aligment)

Take the pairwise alignment segments from the consensus of qry to ref and rereference all polymorphisms of qry to the consensus sequence of ref. Low-level function used by higher-level API.

source
PanGraph.Graphs.Blocks.swap!Method
swap!(b::Block, oldkey::Array{Node{Block}}, newkey::Node{Block})

Remove all polymorphisms associated to all keys within oldkey. Concatenate and reassociate them to newkey.

source
PanGraph.Graphs.Blocks.swap!Method
swap!(b::Block, oldkey::Node{Block}, newkey::Node{Block})

Remove all polymorphisms associated to oldkey and reassociate them to newkey.

source
PanGraph.Graphs.marshal_fastaMethod
marshal_fasta(io::IO, b::Block; opt=nothing)

Serialize the multiple sequence alignment of block b to a fasta format to IO stream io. Each sequence will be serialized as-is, i.e. with no gaps.

If opt is not nothing, the output will be an aligned fasta file. Furthermore, opt is interpreted as a function to be called per internal node that gives a unique name for each fasta record that is generated per node.

source
PanGraph.Graphs.reverse_complementMethod
reverse_complement(b::Block; keepid=false)

Return the reverse complement of the multiple sequence alignment within Block b. By default, will return a block with a new uuid, unless keepid is set to true.

source
PanGraph.Graphs.sequence!Method
sequence!(seq, b::Block, node::Node{Block}; gaps=false)

Mutate the sequence buffer seq in place to hold the sequence associated to genome node within sequence alignment of block b. By default, gaps (character '-') will not be returned, unless gaps is set to true. Returning the sequence with gap characters is useful for generating the full sequence alignment.

source
PanGraph.Graphs.sequenceMethod
sequence(seq, b::Block, node::Node{Block}; gaps=false, forward=false)

Return the sequence associated to genome node within sequence alignment of block b. By default, gaps (character '-') will not be returned, unless gaps is set to true. Returning the sequence with gap characters is useful for generating the full sequence alignment. If forward is true, the true orientation of the genome is ignored and the sequence is returned in the orientation that aligns to the forward consensus.

source
PanGraph.Graphs.sequenceMethod
sequence(b::Block; gaps=false)

Return the consensus of the multiple sequence alignment within block b. By default, gaps (character '-') will not be returned, unless gaps is set to true. Returning the consensus alignment with gaps is useful for generating the full sequence alignment.

source
diff --git a/lib/edge/index.html b/lib/edge/index.html index a495d7d0..e7d2315c 100644 --- a/lib/edge/index.html +++ b/lib/edge/index.html @@ -3,9 +3,9 @@ block :: Tuple{Block, Block} invert :: Bool # changes strand nodes :: Array{Position} -end

Store a unique edge within a pangraph. An edge is undirected and is defined by the two juxtaposed blocks, as well as a relative orientation. Contain all positions of all genomes that contain the edge.

source
PanGraph.Graphs.Edges.PositionType
struct Position
+end

Store a unique edge within a pangraph. An edge is undirected and is defined by the two juxtaposed blocks, as well as a relative orientation. Contain all positions of all genomes that contain the edge.

source
PanGraph.Graphs.Edges.PositionType
struct Position
     path  :: Path
     node  :: Tuple{Node{Block},Node{Block}}
     index :: Tuple{Int,Int} # positions on path
     locus :: Int # breakpoint on sequence
-end

Store a single position of an edge/breakpoint between homologous pancontigs in an individual genome. path is the containing Path object. node stores the junction of nodes that represent the position of the breakpoint. index is the indices of node within path. locus is the physical location on the genome of the breakpoint.

source

Functions

PanGraph.Graphs.Edges.deparalog!Method
deparalog!(G)

Split duplicated blocks that have non-intersecting, separable paths that run in parallel and transitively connect associated genomes. Use to simplify high copy number blocks found in all individuals in equivalent contexts within pangraph G.

source
PanGraph.Graphs.Edges.edgesMethod
edges(G)

Compute all edges associated with pangraph G.

source
PanGraph.Graphs.Edges.isolatesMethod
isolates(positions::Array{Position,1})

Compute the array of Position values for each isolate.

source
PanGraph.Graphs.Edges.nextMethod
next(x::Position, blk::Block)

Compute the next position from x that is connected through block blk. Use to traverse the path of an individual genome through a pangraph.

source
+end

Store a single position of an edge/breakpoint between homologous pancontigs in an individual genome. path is the containing Path object. node stores the junction of nodes that represent the position of the breakpoint. index is the indices of node within path. locus is the physical location on the genome of the breakpoint.

source

Functions

PanGraph.Graphs.Edges.deparalog!Method
deparalog!(G)

Split duplicated blocks that have non-intersecting, separable paths that run in parallel and transitively connect associated genomes. Use to simplify high copy number blocks found in all individuals in equivalent contexts within pangraph G.

source
PanGraph.Graphs.Edges.edgesMethod
edges(G)

Compute all edges associated with pangraph G.

source
PanGraph.Graphs.Edges.isolatesMethod
isolates(positions::Array{Position,1})

Compute the array of Position values for each isolate.

source
PanGraph.Graphs.Edges.nextMethod
next(x::Position, blk::Block)

Compute the next position from x that is connected through block blk. Use to traverse the path of an individual genome through a pangraph.

source
diff --git a/lib/gfa/index.html b/lib/gfa/index.html index f0fe3b5d..43c01974 100644 --- a/lib/gfa/index.html +++ b/lib/gfa/index.html @@ -3,8 +3,8 @@ name :: String segments :: Array{Node,1} circular :: Bool -end

Store a GFA path, i.e. a sequence of segments that represents an observed genome.

source
PanGraph.Graphs.GFA.SegmentType
struct Segment
+end

Store a GFA path, i.e. a sequence of segments that represents an observed genome.

source
PanGraph.Graphs.GFA.SegmentType
struct Segment
     name     :: String
     sequence :: Array{UInt8}
     depth    :: Int
-end

Store a GFA segment, i.e. an edge of an alignment graph that holds a contiguous sequence. Depth is the number of genomes, including duplications, that contain the sequence.

source

Functions

PanGraph.Graphs.marshal_gfaMethod
marshal_gfa(io::IO, G::Graph; opt=nothing)

Output pangraph G to IO stream io. opt can include two functions, to be accessed in fields connect and output. connect is a function that takes a node and returns true or false if it should be connected in the GFA output. output is an equivalent function signature, but controls whether the node is output at all.

source
+end

Store a GFA segment, i.e. an edge of an alignment graph that holds a contiguous sequence. Depth is the number of genomes, including duplications, that contain the sequence.

source

Functions

PanGraph.Graphs.marshal_gfaMethod
marshal_gfa(io::IO, G::Graph; opt=nothing)

Output pangraph G to IO stream io. opt can include two functions, to be accessed in fields connect and output. connect is a function that takes a node and returns true or false if it should be connected in the GFA output. output is an equivalent function signature, but controls whether the node is output at all.

source
diff --git a/lib/graph/index.html b/lib/graph/index.html index 4d940dd7..50be11f1 100644 --- a/lib/graph/index.html +++ b/lib/graph/index.html @@ -1,5 +1,5 @@ -Graphs · PanGraph.jl

Graphs

Types

PanGraph.Graphs.DelMapType
DelMap = Dict{Int,Int}

A sparse array of deletion events relative to a consensus. The key is the locus (inclusive) of the deletion; the value is the length.

source
PanGraph.Graphs.GraphType
struct Graph
+Graphs · PanGraph.jl

Graphs

Types

PanGraph.Graphs.DelMapType
DelMap = Dict{Int,Int}

A sparse array of deletion events relative to a consensus. The key is the locus (inclusive) of the deletion; the value is the length.

source
PanGraph.Graphs.GraphType
struct Graph
     block    :: Dict{String, Block}
     sequence :: Dict{String, Path}
-end

Representation of a multiple sequence alignment. Alignments of homologous sequences are stored as blocks. A genome is stored as a path, i.e. a list of blocks.

source
PanGraph.Graphs.GraphMethod
Graph(name::String, sequence::Array{UInt8}; circular=false)

Creates a singleton graph from sequence. name is assumed to be a unique identifier. If circular is unspecified, the sequence is assumed to be linear.

source
PanGraph.Graphs.InsMapType
InsMap = Dict{Tuple{Int,Int},Array{UInt8,1}}

A sparse array of insertion sequences relative to a consensus. The key is the (locus(after),offset) of the insertion; the value is the sequence.

source
PanGraph.Graphs.SNPMapType
SNPMap = Dict{Int,UInt8}

A sparse array of single nucleotide polymorphisms relative to a consensus. The key is the locus of the mutation; the value is the modified nucleotide.

source

Functions

PanGraph.Graphs.consistency_checkMethod
consistency_check(G::Graph)

Performs final consistency checks on the graph. Currently implemented checks:

  • check 1-1 correspondence between gaps and insertion positions in block alignments.
source
PanGraph.Graphs.detransitive!Method
detransitive!(G::Graph)

Find and remove all transitive edges within the given graph. A transitive chain of edges is defined to be unambiguous: all sequences must enter on one edge and leave on another. Thus, this will not perform paralog splitting.

source
PanGraph.Graphs.finalize!Method
finalize!(G::Graph)

Compute the position of the breakpoints for each homologous alignment across all sequences within Graph G. Intended to be run after multiple sequence alignment is complete.

source
PanGraph.Graphs.graphsMethod
graphs(io::IO; circular=false)

Parse a fasta file from stream io and return an array of singleton graphs. If circular is unspecified, all genomes are assumed to be linear.

source
PanGraph.Graphs.keeponly!Method
keeponly!(G::Graph, names::String...)

Remove all sequences from graph G except those passed as variadic parameters names. This will marginalize a graph, i.e. return the subgraph that contains only isolates contained in names.

source
PanGraph.Graphs.marshal_fastaMethod
marshal_fasta(io::IO, G::Graph; opt=nothing)

Serialize graph G in fasta format to output stream io. Importantly, this will only serialize the consensus sequences for each block and not the full multiple sequence alignment.

opt is currently ignored. It is kept for signature uniformity with other marshal functions.

source
PanGraph.Graphs.marshal_jsonMethod
marshal_json(io::IO, G::Graph; opt=nothing)

Serialize graph G in json format to output stream io. This is the main storage/exported format for PanGraph. Currently it is the only format that can reconstruct an in-memory pangraph.

opt is currently ignored. It is kept for signature uniformity with other marshal functions.

source
PanGraph.Graphs.prune!Method
prune!(G::Graph)

Remove all blocks from graph G that are not currently used by any extant sequence. Internal function used during guide tree alignment.

source
PanGraph.Graphs.purge!Method
purge!(G::Graph)

Remove all blocks from paths found in graph G that have zero length. Internal function used during guide tree alignment.

source
PanGraph.Graphs.realign!Method
realign!(G::Graph; accept)

Realign blocks contained within graph G. Usage of this function requires MAFFT to be on the system PATH. accept should be a function that returns true on blocks you wish to realign. By default, all blocks are realigned.

source
PanGraph.Graphs.testFunction
test(path)

Align all sequences found in the fasta file at path into a pangraph. Verifies that after the alignment is complete, all sequences are correctly reconstructed.

source
PanGraph.Graphs.unmarshalMethod
unmarshal(io::IO)

Deserialize the json formatted input stream io into a Graph data structure. Return a Graph type.

source
+end

Representation of a multiple sequence alignment. Alignments of homologous sequences are stored as blocks. A genome is stored as a path, i.e. a list of blocks.

source
PanGraph.Graphs.GraphMethod
Graph(name::String, sequence::Array{UInt8}; circular=false)

Creates a singleton graph from sequence. name is assumed to be a unique identifier. If circular is unspecified, the sequence is assumed to be linear.

source
PanGraph.Graphs.InsMapType
InsMap = Dict{Tuple{Int,Int},Array{UInt8,1}}

A sparse array of insertion sequences relative to a consensus. The key is the (locus(after),offset) of the insertion; the value is the sequence.

source
PanGraph.Graphs.SNPMapType
SNPMap = Dict{Int,UInt8}

A sparse array of single nucleotide polymorphisms relative to a consensus. The key is the locus of the mutation; the value is the modified nucleotide.

source

Functions

PanGraph.Graphs.consistency_checkMethod
consistency_check(G::Graph)

Performs final consistency checks on the graph. Currently implemented checks:

  • check 1-1 correspondence between gaps and insertion positions in block alignments.
source
PanGraph.Graphs.detransitive!Method
detransitive!(G::Graph)

Find and remove all transitive edges within the given graph. A transitive chain of edges is defined to be unambiguous: all sequences must enter on one edge and leave on another. Thus, this will not perform paralog splitting.

source
PanGraph.Graphs.finalize!Method
finalize!(G::Graph)

Compute the position of the breakpoints for each homologous alignment across all sequences within Graph G. Intended to be run after multiple sequence alignment is complete.

source
PanGraph.Graphs.graphsMethod
graphs(io::IO; circular=false)

Parse a fasta file from stream io and return an array of singleton graphs. If circular is unspecified, all genomes are assumed to be linear.

source
PanGraph.Graphs.keeponly!Method
keeponly!(G::Graph, names::String...)

Remove all sequences from graph G except those passed as variadic parameters names. This will marginalize a graph, i.e. return the subgraph that contains only isolates contained in names.

source
PanGraph.Graphs.marshal_fastaMethod
marshal_fasta(io::IO, G::Graph; opt=nothing)

Serialize graph G in fasta format to output stream io. Importantly, this will only serialize the consensus sequences for each block and not the full multiple sequence alignment.

opt is currently ignored. It is kept for signature uniformity with other marshal functions.

source
PanGraph.Graphs.marshal_jsonMethod
marshal_json(io::IO, G::Graph; opt=nothing)

Serialize graph G in json format to output stream io. This is the main storage/exported format for PanGraph. Currently it is the only format that can reconstruct an in-memory pangraph.

opt is currently ignored. It is kept for signature uniformity with other marshal functions.

source
PanGraph.Graphs.prune!Method
prune!(G::Graph)

Remove all blocks from graph G that are not currently used by any extant sequence. Internal function used during guide tree alignment.

source
PanGraph.Graphs.purge!Method
purge!(G::Graph)

Remove all blocks from paths found in graph G that have zero length. Internal function used during guide tree alignment.

source
PanGraph.Graphs.realign!Method
realign!(G::Graph; accept)

Realign blocks contained within graph G. Usage of this function requires MAFFT to be on the system PATH. accept should be a function that returns true on blocks you wish to realign. By default, all blocks are realigned.

source
PanGraph.Graphs.testFunction
test(path)

Align all sequences found in the fasta file at path into a pangraph. Verifies that after the alignment is complete, all sequences are correctly reconstructed.

source
PanGraph.Graphs.unmarshalMethod
unmarshal(io::IO)

Deserialize the json formatted input stream io into a Graph data structure. Return a Graph type.

source
diff --git a/lib/mash/index.html b/lib/mash/index.html index 331932c9..2c01ff5e 100644 --- a/lib/mash/index.html +++ b/lib/mash/index.html @@ -2,4 +2,4 @@ Mash Implementation · PanGraph.jl

Mash Implementation

Types

PanGraph.Graphs.Mash.MinimizerType
struct Minimizer
     value    :: UInt64
     position :: UInt64
-end

A minimizer is a kmer that, given a hash function that maps kmers to integers, is the minimum kmer within a given set of kmers. The value is the result of applying the hash function to the kmer. The position is a bitpacked integer that includes reference ID, locus, and strand.

source

Functions

PanGraph.Graphs.Mash.distanceMethod
distance(graphs...; k=15, w=100)

Compute the pairwise distance between all input graphs. Distance is the set distance between minimizers. Linear-time algorithm using hash collisions.

source
PanGraph.Graphs.Mash.hashMethod
hash(x::UInt64, mask::UInt64)

A transliteration of Jenkins' invertible hash function for 64-bit integers. Bijectively maps any kmer to an integer.

source
PanGraph.Graphs.Mash.sketchMethod
sketch(seq::Array{UInt8}, k::Int, w::Int, id::Int)

Sketch a linear sequence into a vector of minimizers. k sets the kmer size. w sets the number of contiguous kmers that will be used in the window minimizer comparison. id is a unique integer that corresponds to the sequence. It will be bitpacked into the minimizer position.

source
+end

A minimizer is a kmer that, given a hash function that maps kmers to integers, is the minimum kmer within a given set of kmers. The value is the result of applying the hash function to the kmer. The position is a bitpacked integer that includes reference ID, locus, and strand.

source

Functions

PanGraph.Graphs.Mash.distanceMethod
distance(graphs...; k=15, w=100)

Compute the pairwise distance between all input graphs. Distance is the set distance between minimizers. Linear-time algorithm using hash collisions.
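One common choice of set distance is the Jaccard distance, sketched below. Whether the library uses exactly this formula is an assumption, and the name jaccard_distances is hypothetical. The sketch also compares every pair with explicit set operations, so it does not reproduce the linear-time hash-collision strategy described above.

```julia
# Hypothetical sketch: pairwise Jaccard distance between sets of minimizer values.
function jaccard_distances(sketches::Vector{Set{UInt64}})
    n = length(sketches)
    D = zeros(n, n)
    for i in 1:n-1, j in i+1:n
        shared = length(intersect(sketches[i], sketches[j]))
        total  = length(union(sketches[i], sketches[j]))
        # two empty sketches are treated as identical
        D[i, j] = D[j, i] = total == 0 ? 0.0 : 1 - shared / total
    end
    return D
end
```

The resulting symmetric matrix has the shape that a neighbor-joining routine such as nj(distance, names) expects.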

source
PanGraph.Graphs.Mash.hashMethod
hash(x::UInt64, mask::UInt64)

A transliteration of Jenkins' invertible hash function for 64-bit integers. Bijectively maps any kmer to an integer.
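A hash of this kind can be sketched as follows. The body is modeled on the hash64 routine in minimap2's sketch.c; that this is the routine being transliterated is an assumption, and the name hash64_sketch is hypothetical. Each step is either a multiplication by an odd constant modulo a power of two or an xor with a right shift, which is why the map is a bijection on 0..mask when mask has the form 2^n - 1.

```julia
# Hypothetical sketch of an invertible integer hash, modeled on minimap2's hash64.
function hash64_sketch(key::UInt64, mask::UInt64)
    key = (~key + (key << 21)) & mask               # key * (2^21 - 1) - 1
    key = key ⊻ (key >> 24)
    key = (key + (key << 3) + (key << 8)) & mask    # key * 265
    key = key ⊻ (key >> 14)
    key = (key + (key << 2) + (key << 4)) & mask    # key * 21
    key = key ⊻ (key >> 28)
    key = (key + (key << 31)) & mask
    return key
end
```

For kmers of length k packed two bits per base, the natural mask is 2^(2k) - 1, so distinct kmers always receive distinct hash values.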

source
PanGraph.Graphs.Mash.sketchMethod
sketch(seq::Array{UInt8}, k::Int, w::Int, id::Int)

Sketch a linear sequence into a vector of minimizers. k sets the kmer size. w sets the number of contiguous kmers that will be used in the window minimizer comparison. id is a unique integer that corresponds to the sequence. It will be bitpacked into the minimizer position.
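The window-minimizer idea can be sketched as below. This is a simplified illustration under stated assumptions: the names CODE, encode, and minimizers_sketch are hypothetical, the plain 2-bit encoding of each kmer stands in for the hash value, only the forward strand is considered, and the position is returned as a plain index rather than a bitpacked integer.

```julia
# Hypothetical 2-bit nucleotide encoding used in place of a real hash function.
const CODE = Dict(UInt8('A') => 0, UInt8('C') => 1, UInt8('G') => 2, UInt8('T') => 3)
encode(kmer) = foldl((acc, c) -> 4 * acc + CODE[c], kmer; init=0)

# Return (value, position) for the minimum kmer of every window of w consecutive kmers.
function minimizers_sketch(seq::Vector{UInt8}, k::Int, w::Int)
    codes = [encode(view(seq, i:i+k-1)) for i in 1:length(seq)-k+1]
    picked = Tuple{Int,Int}[]
    for s in 1:length(codes)-w+1
        window = s:s+w-1
        pos = window[argmin(codes[window])]     # argmin keeps the leftmost minimum on ties
        # consecutive windows often share a minimizer; record it only once
        if isempty(picked) || last(picked)[2] != pos
            push!(picked, (codes[pos], pos))
        end
    end
    return picked
end
```

Larger w gives sparser sketches, since each minimizer then covers more overlapping windows.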

source
diff --git a/lib/minimap/index.html b/lib/minimap/index.html index aeff15f8..6cc9fc59 100644 --- a/lib/minimap/index.html +++ b/lib/minimap/index.html @@ -1,7 +1,7 @@ Minimap2 Wrapper · PanGraph.jl

Minimap2 Wrapper

Types

PanGraph.Minimap.BufferType
struct Buffer
 	handle :: Ptr{Cvoid}
-end

Store the untyped address to a minimap2 sequence buffer (working space).

source
PanGraph.Minimap.ExtraType
struct Extra
 	capacity :: UInt32
 	dp_score :: Int32
 	dp_max   :: Int32
@@ -9,9 +9,9 @@
 	packed   :: UInt32 # n_ambi (30 bits) / strand (2 bits)
 	n_cigar  :: UInt32
 	cigar UInt32[] (variable length array)
-end

Copied from minimap2.h. See mm_extra_t.

source
PanGraph.Minimap.IndexType
struct Index
 	handle :: Ptr{Cvoid}
-end

Store the untyped address to a minimap2 sequence index (set of minimizers).

source
PanGraph.Minimap.IndexOptionsType
struct IndexOptions
 	k    :: Cshort
 	w    :: Cshort
 	flag :: Cshort
@@ -20,7 +20,7 @@
 	batch_size :: UInt64
 
 	IndexOptions() = new()
-end

Copied from minimap2.h. See mm_idxopt_t.

source
PanGraph.Minimap.MapOptionsType
mutable struct MapOptions
 	flag :: Int64
 	seed :: Cint
 	sdust_thres :: Cint
@@ -70,7 +70,7 @@
 	cap_kalloc :: Int64
 
 	split_prefix :: Cstring
-end

Copied from minimap2.h. See mm_mapopt_t.

source
PanGraph.Minimap.RecordType
struct Record
 	id :: Int32; cnt :: Int32; rid :: Int32; score :: Int32
 	qs :: Int32; qe  :: Int32; rs  :: Int32; re :: Int32
 
@@ -85,4 +85,4 @@
 	hash :: UInt32
 	div  :: Cfloat
 	p    :: Ptr{Extra}
-end

Copied from minimap2.h. See mm_reg1_t.

source

Functions

PanGraph.Minimap.alignMethod
align(ref::PanContigs, qry::PanContigs, minblock::Int, preset::String)

Call into minimap to align the set of blocks qry to blocks ref. Preset should be a string ∈ ["asm5","asm10","asm20"]. See minimap2 manual for details. This is probably the function you want. If you call into the function specifically, all memory management is taken care of for you.

source
PanGraph.Minimap.makeindexMethod
makeindex(w, k, names, sequence; bucketbits::Int=14)

Given a window size w and kmer length k, return a handle to a minimizer index for sequences sequence.

source
+end

Copied from minimap2.h. See mm_reg1_t.

source

Functions

PanGraph.Minimap.alignMethod
align(ref::PanContigs, qry::PanContigs, minblock::Int, preset::String)

Call into minimap to align the set of blocks qry to blocks ref. Preset should be a string ∈ ["asm5","asm10","asm20"]. See minimap2 manual for details. This is probably the function you want. If you call into the function specifically, all memory management is taken care of for you.

source
PanGraph.Minimap.freebufferMethod
freebuffer()

Free the memory associated with an opaque handle to a thread buffer.

source
PanGraph.Minimap.freeindexMethod
freeindex()

Free memory associated to an opaque handle to a sequence index.

source
PanGraph.Minimap.makebufferMethod
makebuffer()

Return an opaque handle to a thread buffer.

source
PanGraph.Minimap.makeindexMethod
makeindex(w, k, names, sequence; bucketbits::Int=14)

Given a window size w and kmer length k, return a handle to a minimizer index for sequences sequence.

source
diff --git a/lib/mmseqs/index.html b/lib/mmseqs/index.html index 36a46e01..551729ba 100644 --- a/lib/mmseqs/index.html +++ b/lib/mmseqs/index.html @@ -1,2 +1,2 @@ -MMseqs Wrapper · PanGraph.jl

MMseqs Wrapper

Types

Functions

PanGraph.MMseqs.alignMethod
align(ref::PanContigs, qry::PanContigs, klen::Int64)

Align homologous regions of qry and ref using mmseqs easy-search. klen tunes the kmer length. If klen=0 then mmseqs default is used. Returns the list of hits.

source
+MMseqs Wrapper · PanGraph.jl

MMseqs Wrapper

Types

Functions

PanGraph.MMseqs.alignMethod
align(ref::PanContigs, qry::PanContigs, klen::Int64)

Align homologous regions of qry and ref using mmseqs easy-search. klen tunes the kmer length. If klen=0 then mmseqs default is used. Returns the list of hits.

source
diff --git a/lib/node/index.html b/lib/node/index.html index 0d44009b..cfa0e1a8 100644 --- a/lib/node/index.html +++ b/lib/node/index.html @@ -2,4 +2,4 @@ Nodes · PanGraph.jl

Nodes

Types

PanGraph.Graphs.Nodes.NodeType
mutable struct Node{T}
 	block  :: T
 	strand :: Bool
-end

Node represents a portion of a sequence path that passes through a single block. strand stores whether we pass along the forward strand of block (if true) or reverse (if false).

source

Functions

Base.lengthMethod
length(n::Node) = length(n.block, n)

Return the length of sequence stored within node n

source
+end

Node represents a portion of a sequence path that passes through a single block. strand stores whether we pass along the forward strand of block (if true) or reverse (if false).

source
PanGraph.Graphs.Nodes.NodeMethod
Node{T}(b::T; strand=true)

Create a Node that passes through block b. Defaults to forward strand orientation.

source

Functions

Base.lengthMethod
length(n::Node) = length(n.block, n)

Return the length of sequence stored within node n

source
diff --git a/lib/pangraph/index.html b/lib/pangraph/index.html index 09cae996..5abbcabc 100644 --- a/lib/pangraph/index.html +++ b/lib/pangraph/index.html @@ -9,13 +9,13 @@ cigar::T divergence::Union{Float64,Nothing} align::Union{Float64,Nothing} -end

Alignment is a pairwise homologous alignment between two sequences.

source
PanGraph.HitType
mutable struct Hit
+end

Alignment is a pairwise homologous alignment between two sequences.

source
PanGraph.HitType
mutable struct Hit
 	name::String
 	length::Int
 	start::Int
 	stop::Int
 	seq::Maybe{Array{UInt8,1}}
-end

Hit is one side of a pairwise alignment between homologous sequences.

source
PanGraph.PanContigsType
struct PanContigs
+end

Hit is one side of a pairwise alignment between homologous sequences.

source
PanGraph.PanContigsType
struct PanContigs
 	name     :: T
 	sequence :: T
-end

A synonym for a consensus sequence of Block.

source

Functions

+end

A synonym for a consensus sequence of Block.

source

Functions

diff --git a/lib/path/index.html b/lib/path/index.html index 47f4f4b5..9938e75d 100644 --- a/lib/path/index.html +++ b/lib/path/index.html @@ -1,8 +1,8 @@ -Paths · PanGraph.jl

Paths

Types

PanGraph.Graphs.Paths.PathType
mutable struct Path
+Paths · PanGraph.jl

Paths

Types

PanGraph.Graphs.Paths.PathType
mutable struct Path
 	name     :: String
 	node     :: Array{Node{Block}}
 	offset   :: Union{Int,Nothing}
 	circular :: Bool
 	position :: Array{Int}
-end

Path is a single genome entry within the pangraph. name stores the unique identifier of the genome. node is an array of Nodes. The concatenation of all Nodes recapitulates the original sequence. offset is the circular shift that must be applied to the concatenation to retain the original starting position. It is nothing if the Path is linear. circular is true only if the path should be considered circular, i.e. the last node is implicitly connected to the first node. position represents the array of breakpoints each node corresponds to.

source
PanGraph.Graphs.Paths.PathMethod
Path(name::String,node::Node{Block};circular::Bool=false)

Return a new Path structure obtained from a single node and name name. By default will be interpreted as a linear path.

source

Functions

Base.lengthMethod
length(p::Path)

Return the number of nodes associated to Path p.

source
Base.replace!Method
replace!(p::Path, old::Array{Link}, new::Block)

Replace all instances of oriented Block list old with the single block new. Operates on Path p in place.

source
Base.replace!Method
replace!(p::Path, old::Block, new::Array{Block}, orientation::Bool)

Replace all instances of Block old with the array of blocks new. Operates on Path p in place. orientation is the relative orientation assumed between old and new, i.e. if it is false, new is assumed to be the reverse complement of old.

source
PanGraph.Graphs.sequenceMethod
sequence(p::Path; shift=true)

Return the reconstructed sequence of Path p. If shift is false, the circular offset will be ignored.

source
+end

Path is a single genome entry within the pangraph. name stores the unique identifier of the genome. node is an array of Nodes. The concatenation of all Nodes recapitulates the original sequence. offset is the circular shift that must be applied to the concatenation to retain the original starting position. It is nothing if the Path is linear. circular is true only if the path should be considered circular, i.e. the last node is implicitly connected to the first node. position represents the array of breakpoints each node corresponds to.

source
PanGraph.Graphs.Paths.PathMethod
Path(name::String,node::Node{Block};circular::Bool=false)

Return a new Path structure obtained from a single node and name name. By default will be interpreted as a linear path.

source

Functions

Base.lengthMethod
length(p::Path)

Return the number of nodes associated to Path p.

source
Base.replace!Method
replace!(p::Path, old::Array{Link}, new::Block)

Replace all instances of oriented Block list old with the single block new. Operates on Path p in place.

source
Base.replace!Method
replace!(p::Path, old::Block, new::Array{Block}, orientation::Bool)

Replace all instances of Block old with the array of blocks new. Operates on Path p in place. orientation is the relative orientation assumed between old and new, i.e. if it is false, new is assumed to be the reverse complement of old.

source
PanGraph.Graphs.sequenceMethod
sequence(p::Path; shift=true)

Return the reconstructed sequence of Path p. If shift is false, the circular offset will be ignored.

source
diff --git a/lib/simulate/index.html b/lib/simulate/index.html index 3e9457e0..b0eaad48 100644 --- a/lib/simulate/index.html +++ b/lib/simulate/index.html @@ -4,9 +4,9 @@ L :: Int σ :: Int rate :: Rates -end

Store all parameters of a single recombinative Wright-Fisher model. N is the population size. L is the expected genome size of all descendants. σ is the variance of genome size of all descendants. rate stores the rates of the various evolutionary processes.

source
PanGraph.Simulation.RatesType
struct Rates
+end

Store all parameters of a single recombinative Wright-Fisher model. N is the population size. L is the expected genome size of all descendants. σ is the variance of genome size of all descendants. rate stores the rates of the various evolutionary processes.

source
PanGraph.Simulation.RatesType
struct Rates
 	snp :: Float64
 	hgt :: Float64
 	del :: Float64
 	inv :: Float64
-end

Store the rates of evolution of mutation snp, recombination hgt, deletion del, and inversion inv.

source
PanGraph.Simulation.SequenceType
Sequence = Array{UInt64,1}

A bitpacked array of sequence state. The bits of each UInt64 are interpreted as

30 bits (ancestor) | 30 bits (location) | 3 bits (mutation) | 1 bit (strand)

source

Functions

PanGraph.Simulation.delete!Method
delete!(s::Sequence, from::Int, to::Int)

Delete the interval from:to from sequence s.

source
PanGraph.Simulation.insert!Method
insert!(acceptor::Sequence, donor::Sequence, at::Int)

Insert sequence donor into acceptor at locus at.

source
PanGraph.Simulation.invert!Method
invert!(s::Sequence, from::Int, to::Int)

Replace the interval from:to of sequence s with its reverse complement.

source
PanGraph.Simulation.modelMethod
model(param::Params)

Return an evolution function based upon parameters param.

source
PanGraph.Simulation.mutate!Method
mutate!(s::Sequence, at::Int)

Apply a random mutation to sequence s at locus at.

source
PanGraph.Simulation.nucleotideMethod
nucleotide(sequence::Array{Sequence}, ancestor::Array{Array{UInt8,1},1})

Generate the set of extant sequences from the ancestral mosaics sequence and the original sequences ancestor.

source
PanGraph.Simulation.pancontig!Method
pancontig!(s::Sequence, ancestor::Dict{Int,Array{Interval}})

Return the ancestral tiling imprinted upon Sequence s. Modifies ancestor in place.

source
PanGraph.Simulation.pancontigsMethod
pancontigs(s::Sequence)

Return the ancestral tiling imprinted upon a set of Sequences isolates.

source
PanGraph.Simulation.runMethod
run(evolve!::Function, time::Int, initial::Array{Array{UInt8,1},1}; graph=false)

The high-level API of the module. Evolves a set of initial sequences initial for time generations using the one-step evolution function evolve!. If graph is true, the function will return the pangraph associated to the ancestral tiling.

source
+end

Store the rates of evolution of mutation snp, recombination hgt, deletion del, and inversion inv.

source
PanGraph.Simulation.SequenceType
Sequence = Array{UInt64,1}

A bitpacked array of sequence state. The bits of each UInt64 are interpreted as

30 bits (ancestor) | 30 bits (location) | 3 bits (mutation) | 1 bit (strand)

source
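
The four fields of the layout above sum to 64, so each width is read here as a number of bits within one 64-bit word. The Python sketch below packs and unpacks such a word. It assumes the fields are laid out from most significant (ancestor) to least significant (strand), which the description suggests but does not state outright; `pack` and `unpack` are hypothetical helper names.

```python
ANCESTOR_BITS, LOCATION_BITS, MUTATION_BITS, STRAND_BITS = 30, 30, 3, 1

# Shift of each field, counted from the least significant bit (assumed order).
STRAND_SHIFT = 0
MUTATION_SHIFT = STRAND_SHIFT + STRAND_BITS        # 1
LOCATION_SHIFT = MUTATION_SHIFT + MUTATION_BITS    # 4
ANCESTOR_SHIFT = LOCATION_SHIFT + LOCATION_BITS    # 34


def pack(ancestor, location, mutation, strand):
    """Pack the four fields into a single 64-bit integer."""
    assert 0 <= ancestor < (1 << ANCESTOR_BITS)
    assert 0 <= location < (1 << LOCATION_BITS)
    assert 0 <= mutation < (1 << MUTATION_BITS)
    assert strand in (0, 1)
    return (
        (ancestor << ANCESTOR_SHIFT)
        | (location << LOCATION_SHIFT)
        | (mutation << MUTATION_SHIFT)
        | (strand << STRAND_SHIFT)
    )


def unpack(word):
    """Recover (ancestor, location, mutation, strand) from a packed word."""
    return (
        (word >> ANCESTOR_SHIFT) & ((1 << ANCESTOR_BITS) - 1),
        (word >> LOCATION_SHIFT) & ((1 << LOCATION_BITS) - 1),
        (word >> MUTATION_SHIFT) & ((1 << MUTATION_BITS) - 1),
        (word >> STRAND_SHIFT) & 1,
    )
```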

Functions

PanGraph.Simulation.delete!Method
delete!(s::Sequence, from::Int, to::Int)

Delete the interval from:to from sequence s.

source
PanGraph.Simulation.insert!Method
insert!(acceptor::Sequence, donor::Sequence, at::Int)

Insert sequence donor into acceptor at locus at.

source
PanGraph.Simulation.invert!Method
invert!(s::Sequence, from::Int, to::Int)

Replace the interval from:to of sequence s with its reverse complement.

source
PanGraph.Simulation.modelMethod
model(param::Params)

Return an evolution function based upon parameters param.

source
PanGraph.Simulation.mutate!Method
mutate!(s::Sequence, at::Int)

Apply a random mutation to sequence s at locus at.

source
PanGraph.Simulation.nucleotideMethod
nucleotide(sequence::Array{Sequence}, ancestor::Array{Array{UInt8,1},1})

Generate the set of extant sequences from the ancestral mosaics sequence and the original sequences ancestor.

source
PanGraph.Simulation.pancontig!Method
pancontig!(s::Sequence, ancestor::Dict{Int,Array{Interval}})

Return the ancestral tiling imprinted upon Sequence s. Modifies ancestor in place.

source
PanGraph.Simulation.pancontigsMethod
pancontigs(s::Sequence)

Return the ancestral tiling imprinted upon a set of Sequences isolates.

source
PanGraph.Simulation.runMethod
run(evolve!::Function, time::Int, initial::Array{Array{UInt8,1},1}; graph=false)

The high-level API of the module. Evolves a set of initial sequences initial for time generations using the one-step evolution function evolve!. If graph is true, the function will return the pangraph associated to the ancestral tiling.

source
diff --git a/lib/utility/index.html b/lib/utility/index.html index 9eb7d148..78680449 100644 --- a/lib/utility/index.html +++ b/lib/utility/index.html @@ -3,7 +3,7 @@ seq::Array{UInt8} name::String meta::String -end

A record obtained when parsing a single entry of a FASTA file.

source
PanGraph.Graphs.Utility.ScoreType
struct Score <: AbstractArray{Float64,2}
+end

A record obtained when parsing a single entry of a FASTA file.

source
PanGraph.Graphs.Utility.ScoreType
struct Score <: AbstractArray{Float64,2}
 	rows::Int
 	cols::Int
 	band::NamedTuple{(:lower, :upper)}
@@ -11,7 +11,7 @@
 	offset::Array{Int}
 	starts::Array{Int}
 	stops::Array{Int}
-end

Store information about a banded pairwise alignment.

source
PanGraph.Graphs.Utility.costConstant
cost = (
+end

Store information about a banded pairwise alignment.

source
PanGraph.Graphs.Utility.costConstant
cost = (
 	open   = -6.0,
 	extend = -1.0,
 	band   = (
@@ -20,4 +20,4 @@
 	),
 	gap    = k -> k == 0 ? 0 : cost.open + cost.extend*(k-1),
 	match  = (c₁, c₂) -> 6.0*(c₁ == c₂) - 3.0,
-)

cost are the default dynamic alignment parameters used.

source

Functions

PanGraph.Graphs.Utility.alignMethod
align(seq₁::Array{UInt8}, seq₂::Array{UInt8}, cost::Score)

Perform a pairwise alignment using Needleman-Wunsch style dynamic programming between seq₁ and seq₂ given cost. The cost is defined by the Score structure.

source
PanGraph.Graphs.Utility.cigarMethod
cigar(seq₁::Array{UInt8}, seq₂::Array{UInt8})

Given two sequences, seq₁ and seq₂, perform a pairwise banded alignment and return the cigar string of alignment.

source
PanGraph.Graphs.Utility.columnsMethod
columns(s; nc=80)

Partition string s into an array of strings such that no string is longer than nc characters.

source
PanGraph.Graphs.Utility.enforce_cutoff!Method
enforce_cutoff!(a::Alignment, χ)

Ensure that the alignment a does not have insertion or deletion segments larger than χ. Return the list of segments created by parsing the alignment such that all segments are larger than χ.

source
PanGraph.Graphs.Utility.hamming_alignMethod
hamming_align(qry::Array{UInt8,1}, ref::Array{UInt8,1})

Perform a simple alignment of qry to ref by minimizing hamming distance. Useful for fast, approximate alignments of small sequences.

source
PanGraph.Graphs.Utility.random_idMethod
random_id(;len=10, alphabet=UInt8[])

Generate a random string of length len drawn from letters in alphabet.

source
PanGraph.Graphs.Utility.read_fastaMethod
read_fasta(io::IO)

Parse a FASTA file from IO stream io. Return an iterator over all records.

source
PanGraph.Graphs.Utility.read_mmseqs2Method
read_mmseqs2(io::IO)

Parse a PAF-like file produced by mmseqs2 from IO stream io. Return an iterator over all pairwise alignments.

source
PanGraph.Graphs.Utility.read_pafMethod
read_paf(io::IO)

Parse a PAF file from IO stream io. Return an iterator over all pairwise alignments.

source
PanGraph.Graphs.Utility.uncigarMethod
uncigar(cg::String)

Return an iterator over intervals of alignment defined by cigar string cg.

source
PanGraph.Graphs.Utility.write_fastaMethod
write_fasta(io::IO, name, seq)

Output a single FASTA record with sequence seq and name name to IO stream io.

source
PanGraph.Graphs.reverse_complement!Method
reverse_complement!(hit::Hit)

Reverse complement the qry of Hit in place.

source
PanGraph.Graphs.reverse_complementMethod
reverse_complement(seq::Array{UInt8})

Return a newly allocated sequence array that is the reverse complement of seq.

source
+)

cost are the default dynamic alignment parameters used.

source
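
To make the default scores concrete, the following Python sketch mirrors the `open`, `extend`, `gap`, and `match` entries shown above and evaluates them for a few inputs. The band parameters are omitted because their values are elided in the listing.

```python
OPEN = -6.0     # cost of opening a gap (first gapped position)
EXTEND = -1.0   # cost of each further gapped position


def gap(k):
    """Affine gap score for a gap of length k."""
    return 0.0 if k == 0 else OPEN + EXTEND * (k - 1)


def match(c1, c2):
    """Score +3 for identical characters and -3 for a mismatch."""
    return 6.0 * (c1 == c2) - 3.0


# A gap of length 3 costs one opening plus two extensions: -6 - 1 - 1 = -8.
print(gap(1), gap(3), match("A", "A"), match("A", "C"))
```

With these values a single mismatch (-3) is cheaper than opening a gap (-6), so isolated substitutions are preferred over indels of length one.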

Functions

PanGraph.Graphs.Utility.alignMethod
align(seq₁::Array{UInt8}, seq₂::Array{UInt8}, cost::Score)

Perform a pairwise alignment using Needleman-Wunsch style dynamic programming between seq₁ and seq₂ given cost. The cost is defined by the Score structure.

source
PanGraph.Graphs.Utility.cigarMethod
cigar(seq₁::Array{UInt8}, seq₂::Array{UInt8})

Given two sequences, seq₁ and seq₂, perform a pairwise banded alignment and return the cigar string of alignment.

source
PanGraph.Graphs.Utility.columnsMethod
columns(s; nc=80)

Partition string s into an array of strings such that no string is longer than nc characters.

source
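
The behaviour described for `columns` amounts to fixed-width chunking, for example when wrapping FASTA sequence lines. A minimal Python equivalent, written here only for illustration:

```python
def columns(s, nc=80):
    """Split s into consecutive pieces of at most nc characters."""
    return [s[i:i + nc] for i in range(0, len(s), nc)]
```

Only the final piece can be shorter than `nc`; an empty string yields an empty list.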
PanGraph.Graphs.Utility.enforce_cutoff!Method
enforce_cutoff!(a::Alignment, χ)

Ensure that the alignment a does not have insertion or deletion segments larger than χ. Return the list of segments created by parsing the alignment such that all segments are larger than χ.

source
PanGraph.Graphs.Utility.hamming_alignMethod
hamming_align(qry::Array{UInt8,1}, ref::Array{UInt8,1})

Perform a simple alignment of qry to ref by minimizing hamming distance. Useful for fast, approximate alignments of small sequences.

source
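
One way to picture an alignment that minimizes Hamming distance is to slide the query along the reference without gaps and keep the offset with the fewest mismatches. The Python sketch below does exactly that. It is an assumption about the general approach, not a description of how the Julia function is implemented, and its return value `(offset, distance)` is chosen for illustration.

```python
def hamming_align(qry, ref):
    """Return (offset, distance) of the best gapless placement of qry on ref."""
    if len(qry) > len(ref):
        raise ValueError("query must not be longer than reference")

    best_offset, best_dist = 0, len(qry) + 1
    for offset in range(len(ref) - len(qry) + 1):
        window = ref[offset:offset + len(qry)]
        dist = sum(q != r for q, r in zip(qry, window))
        if dist < best_dist:  # strict comparison keeps the leftmost optimum
            best_offset, best_dist = offset, dist
    return best_offset, best_dist
```

The cost is proportional to the product of the two lengths, which is why such a scheme is only attractive for small sequences.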
PanGraph.Graphs.Utility.random_idMethod
random_id(;len=10, alphabet=UInt8[])

Generate a random string of length len drawn from letters in alphabet.

source
PanGraph.Graphs.Utility.read_fastaMethod
read_fasta(io::IO)

Parse a FASTA file from IO stream io. Return an iterator over all records.

source
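
For reference, a FASTA record iterator can be sketched in Python as a generator. The field names `seq`, `name`, and `meta` follow the Record type documented above. Splitting the header at the first whitespace into `name` and `meta` is an assumption about how the two are separated.

```python
import io
from typing import NamedTuple


class Record(NamedTuple):
    seq: bytes
    name: str
    meta: str


def read_fasta(stream):
    """Yield one Record per FASTA entry found in a text stream."""
    name, meta, chunks = None, "", []
    for line in stream:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield Record("".join(chunks).encode(), name, meta)
            header = line[1:].split(maxsplit=1)
            name = header[0] if header else ""
            meta = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence may span several lines
    if name is not None:
        yield Record("".join(chunks).encode(), name, meta)


# Usage: multi-line sequences are concatenated into one record.
for rec in read_fasta(io.StringIO(">seq1 first record\nACGT\nAC\n>seq2\nGG\n")):
    print(rec.name, len(rec.seq))
```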
PanGraph.Graphs.Utility.read_mmseqs2Method
read_mmseqs2(io::IO)

Parse a PAF-like file produced by mmseqs2 from IO stream io. Return an iterator over all pairwise alignments.

source
PanGraph.Graphs.Utility.read_pafMethod
read_paf(io::IO)

Parse a PAF file from IO stream io. Return an iterator over all pairwise alignments.

source
PanGraph.Graphs.Utility.uncigarMethod
uncigar(cg::String)

Return an iterator over intervals of alignment defined by cigar string cg.

source
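
Decoding a cigar string means splitting it into runs of a length followed by an operation letter. The Python sketch below yields `(length, operation)` pairs using the standard SAM operation alphabet. The Julia function returns alignment intervals instead, so the return shape here is a simplification chosen for illustration.

```python
import re

# One run: a positive integer followed by a SAM CIGAR operation letter.
CIGAR_RUN = re.compile(r"(\d+)([MIDNSHP=X])")


def uncigar(cg):
    """Yield (length, operation) for each run of the cigar string cg."""
    pos = 0
    for m in CIGAR_RUN.finditer(cg):
        if m.start() != pos:
            raise ValueError(f"malformed cigar string near index {pos}")
        pos = m.end()
        yield int(m.group(1)), m.group(2)
    if pos != len(cg):
        raise ValueError(f"malformed cigar string near index {pos}")
```

Turning these runs into reference and query intervals only requires accumulating the lengths: M, =, and X advance both coordinates, I advances the query, and D advances the reference.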
PanGraph.Graphs.Utility.write_fastaMethod
write_fasta(io::IO, name, seq)

Output a single FASTA record with sequence seq and name name to IO stream io.

source
PanGraph.Graphs.reverse_complement!Method
reverse_complement!(hit::Hit)

Reverse complement the qry of Hit in place.

source
PanGraph.Graphs.reverse_complementMethod
reverse_complement(seq::Array{UInt8})

Return a newly allocated sequence array that is the reverse complement of seq.

source
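
As a point of comparison, reverse complementing a byte sequence takes two steps in Python: complement each base with a translation table, then reverse the result. Characters outside the table, such as N, are left unchanged in this sketch. How the Julia function treats ambiguous bases is not specified above.

```python
# Translation table for the four nucleotides in both cases.
_COMPLEMENT = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")


def reverse_complement(seq):
    """Return a new bytes object holding the reverse complement of seq."""
    return seq.translate(_COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence, which makes for a convenient sanity check.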
diff --git a/search/index.html b/search/index.html index 25341bf7..9bd140e2 100644 --- a/search/index.html +++ b/search/index.html @@ -1,2 +1,2 @@ -Search · PanGraph.jl

Loading search...

    +Search · PanGraph.jl

    Loading search...

      diff --git a/tutorials/tutorial_1/index.html b/tutorials/tutorial_1/index.html index 5b1fa6f3..be8fe41b 100644 --- a/tutorials/tutorial_1/index.html +++ b/tutorials/tutorial_1/index.html @@ -21,4 +21,4 @@ }

      Each entry in path has two main properties: the name, corresponding to the sequence identifier in the input fasta file, and the blocks list. The latter is a representation of the genome as a list of blocks, each one identified by its unique id.

      Each entry in the blocks list corresponds to a different block. Each block has a unique random id composed of 10 capital letters, together with the consensus sequence of the block.

      More details on the structure of this json file will be covered in the next tutorial section.

      Sequence diversity and alignment sensitivity

      As discussed in our paper, two variables control the maximum diversity of homologous sequences that are merged in the same block: the sensitivity of the alignment kernel and the values of the pseudo-energy hyperparameters $\alpha$ and $\beta$.

      Pangraph can be run with two options for the alignment kernel:

      Moreover, as explained in our paper only matches with negative pseudo-energy are performed. The value of the pseudo-energy depends on two parameters:

      Therefore, as a rule of thumb:

      Note that, depending on the kmer size (-K argument) mmseqs2 can require several Gb of available memory to run.

      Exporting the pangraph

      The pangraph object can also be exported in other more common formats using the command export (see Export).

      pangraph export \
           --no-duplications \
           --output-directory ecoli_export \
      -    ecoli_pangraph.json

      This will create a folder named ecoli_export that contains two files.

      The latter can be visualized using Bandage. The option --no-duplications causes the export function to avoid including duplicated blocks in the graph representation (they are instead exported as isolated blocks). In our experience this results in a less "tangled" visual representation. Below is what the Bandage visualization of this example pangraph looks like. Blocks are colored by frequency, with common blocks (appearing in many different chromosomes) in red and rare blocks (appearing in only a few chromosomes) in black.

      img

      + ecoli_pangraph.json

      This will create a folder named ecoli_export that contains two files.

      The latter can be visualized using Bandage. The option --no-duplications causes the export function to avoid including duplicated blocks in the graph representation (they are instead exported as isolated blocks). In our experience this results in a less "tangled" visual representation. Below is what the Bandage visualization of this example pangraph looks like. Blocks are colored by frequency, with common blocks (appearing in many different chromosomes) in red and rare blocks (appearing in only a few chromosomes) in black.

      img

      diff --git a/tutorials/tutorial_2/index.html b/tutorials/tutorial_2/index.html index b3745834..684c76cd 100644 --- a/tutorials/tutorial_2/index.html +++ b/tutorials/tutorial_2/index.html @@ -81,4 +81,4 @@ G = json.load(fh) blocks = [{'length':len(x['sequence']), 'depth':len(x['positions'])} for x in G['blocks']] -plt.hist([x['depth'] for x in blocks], weights=[x['length'] for x in blocks], bins=range(1,11))
      +plt.hist([x['depth'] for x in blocks], weights=[x['length'] for x in blocks], bins=range(1,11))
      diff --git a/tutorials/tutorial_3/index.html b/tutorials/tutorial_3/index.html index 514e41a2..63c092f5 100644 --- a/tutorials/tutorial_3/index.html +++ b/tutorials/tutorial_3/index.html @@ -35,4 +35,4 @@ cp -r ../ecoli_export/vis/* public/dataset/Ecoli/ # start the server -npm start

      Once the server has started, the visualization can be accessed at http://localhost:8000/Ecoli.

      img

      Here is a summary of the information displayed in different panels:

      This visualization makes it easy to quickly explore the data. For example, if we order the blocks by number of counts, we find that the most duplicated block has 86 counts and a consensus length of 768 bp. From panel D we can easily download the alignment, and if we run BLAST on the sequence we discover that it contains the genetic sequence of a transposase.

      +npm start

      Once the server has started, the visualization can be accessed at http://localhost:8000/Ecoli.

      img

      Here is a summary of the information displayed in different panels:

      This visualization makes it easy to quickly explore the data. For example, if we order the blocks by number of counts, we find that the most duplicated block has 86 counts and a consensus length of 768 bp. From panel D we can easily download the alignment, and if we run BLAST on the sequence we discover that it contains the genetic sequence of a transposase.

      diff --git a/tutorials/tutorial_4/index.html b/tutorials/tutorial_4/index.html index e85deece..8682cd20 100644 --- a/tutorials/tutorial_4/index.html +++ b/tutorials/tutorial_4/index.html @@ -9,4 +9,4 @@ --no-duplications \ --output-directory klebs_export \ --prefix klebs_marginal_pangraph \ - klebs_marginal_pangraph.json

      This will produce the file klebs_export/klebs_marginal_pangraph.gfa which can be visualized with Bandage.

      img

      As expected the marginalized pangraph contains fewer blocks than the original one (388 vs 1244), and blocks are on average longer (mean length: 14 kbp vs 6 kbp). Blocks that appear in red are shared by both strains, while black blocks are present in only one of the two strains. The pangraph is composed of two stretches of syntenic blocks, which are in contact in a central point. This structure can be understood by comparing the two chromosomes with a dotplot. Using D-Genies on the two sequences[2] we obtain the following:

      img

      The contact point between the two loops in the pangraph is caused by the fact that the two genomes are composed of two mostly syntenic subsequences (the two loops) but these loops are concatenated with two different strandedness in the two strains. If we were to draw the two paths (relative to the two chromosomes) with different colors on top of the pangraph we would observe something similar to this:

      img

      + klebs_marginal_pangraph.json

      This will produce the file klebs_export/klebs_marginal_pangraph.gfa which can be visualized with Bandage.

      img

      As expected the marginalized pangraph contains fewer blocks than the original one (388 vs 1244), and blocks are on average longer (mean length: 14 kbp vs 6 kbp). Blocks that appear in red are shared by both strains, while black blocks are present in only one of the two strains. The pangraph is composed of two stretches of syntenic blocks, which are in contact in a central point. This structure can be understood by comparing the two chromosomes with a dotplot. Using D-Genies on the two sequences[2] we obtain the following:

      img

      The contact point between the two loops in the pangraph is caused by the fact that the two genomes are composed of two mostly syntenic subsequences (the two loops) but these loops are concatenated with two different strandedness in the two strains. If we were to draw the two paths (relative to the two chromosomes) with different colors on top of the pangraph we would observe something similar to this:

      img

      diff --git a/tutorials/tutorial_5/index.html b/tutorials/tutorial_5/index.html index c02d9d2f..a771a946 100644 --- a/tutorials/tutorial_5/index.html +++ b/tutorials/tutorial_5/index.html @@ -1,4 +1,4 @@ Example application: plasmid rearrangements · PanGraph.jl

      Example application: plasmid rearrangements

      These next two tutorials cover other possible applications of pangraph and processing the downstream output.

      Although pangraph was developed with whole genomes in mind, it can be applied to other situations. In this tutorial, we will explore the structural diversity in some closely-related plasmids from a hospital outbreak. Pangraph can be applied to these 'as if' they were whole genomes.

      Preliminary steps

      This tutorial uses a dataset of five closely-related plasmids. They were analysed previously by Sheppard et al. (2016) in a paper studying an outbreak of carbapenem-resistant bacteria in a hospital in Virginia, USA. These plasmids are all similar to an index plasmid from the hospital, but have some structural changes. We will show how pangraph output can be used to visualize this structural diversity.

      You can download these sequences by running:

      wget https://github.com/liampshaw/pangraph-tutorials/raw/main/data/sheppard/UVA01_plasmids.fa.gz

      Building the pangraph and exporting it for visualization is done with these commands (should be very quick as we are using plasmids, which are much smaller than whole genomes):

      pangraph build --circular UVA01_plasmids.fa.gz > UVA01_plasmids_pangraph.json
       pangraph export --edge-minimum-length 0 UVA01_plasmids_pangraph.json -p UVA01_plasmids_pangraph -o ./

      We use --edge-minimum-length 0 because we want to see all blocks.

      Default visualization

      As before, we can visualize the output file UVA01_plasmids_pangraph.gfa with Bandage.

      img

      Here, the node colour represents the depth of the blocks. However, it is difficult from this visualization to understand individual paths through the graph.

      Improving the visualization

      We can use some custom scripts to look at representations of the plasmids alongside their pangraph. These scripts are not part of pangraph but are an example of how to process the output into visualizations. You can download them by running:

      wget https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/scripts/prepare-pangraph-gfa.py
      -wget https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/scripts/plot-blocks-UVA01.R

      First, we run a script to generate random colours for the blocks.

      python prepare-pangraph-gfa.py UVA01_plasmids_pangraph.gfa --all

      This produces three files:

      • ${input}.blocks.csv - dataset of genome and block start/end positions, with block colours (hex codes)
      • ${input}.colours.csv - blocks with colours (hex codes). The --all flag in the python script means we give all blocks a colour, even those which only appear once.
      • ${input}.coloured.gfa - a gfa with the same block colours added as an extra field

      Then we call Bandage programmatically with the custom colours option on the coloured gfa:

      Bandage image UVA01_plasmids_pangraph.gfa.coloured.gfa pangraph.png --height 2000 --width 2500 --colour custom --depth --lengths --fontsize 30 --nodewidth 8

      We can now look at a linear representation of the plasmids alongside this graph. The details are in the script, but we basically represent the plasmids linearly and choose a random block to 'anchor' them all to the same place.

      Rscript plot-blocks-UVA01.R UVA01_plasmids_pangraph.gfa.blocks.csv pangraph.png pangraph_linear_vs_graph.pdf

      This produces a version[1] where we can see the different 'walks' taken through the graph by each plasmid. For ease of understanding, on the left-hand side we have added simplified labels of the blocks (a-f) since some of the blocks are very small and hard to make out otherwise.

      img

      Note that the plasmids are circular: in the linear representation, in two cases block 'a' appears on the left rather than on the right, but the structure is identical.

      Sheppard et al. previously analysed these plasmids and made a table of the structural changes they identified with respect to the reference plasmid (Table 1 in their paper). We can see that the structural changes they identified are interpretable from the visualization we have made:

      Accession | Species | Date | Length (bp) | Structural differences (Sheppard) | Corresponding block(s) (pangraph)
      NZ_CP011575.1 | K. pneumoniae | Mar-2011 | 43,621 | |
      NZ_CP011582.1 | E. cloacae | Aug-2012 | 43,433 | 188-bp deletion | f
      NZ_CP011598.1 | K. intermedia | Sep-2009 | 43,621 | |
      NZ_CP011608.1 | C. freundii | Nov-2010 | 44,846 | 1,225-bp insertion | c
      NZ_CP011656.1 | C. freundii | Oct-2012 | 129,196 | 14,960-bp duplication and 70,615-bp insertion | e,f,a[2] and g
      • [1] Your version will probably look different because Bandage chooses a random layout. You can rerun the Bandage command until you get a layout you like.
      • [2] The 14,960-bp duplication is represented as a repetition of e,f,a in the bottom plasmid. The total length is 5,045+188+9,727=14,960, as expected.
      +wget https://raw.githubusercontent.com/liampshaw/pangraph-tutorials/main/scripts/plot-blocks-UVA01.R

      First, we run a script to generate random colours for the blocks.

      python prepare-pangraph-gfa.py UVA01_plasmids_pangraph.gfa --all

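      This produces three files:

      • ${input}.blocks.csv - dataset of genome and block start/end positions, with block colours (hex codes)
      • ${input}.colours.csv - blocks with colours (hex codes). The --all flag in the python script means we give all blocks a colour, even those which only appear once.
      • ${input}.coloured.gfa - a gfa with the same block colours added as an extra field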

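The colouring step can be pictured with a short Python sketch. This is not the code of prepare-pangraph-gfa.py: the function name, the `colour_all` parameter (standing in for `--all`), the grey default for blocks that appear only once, and the toy block counts are all assumptions made for illustration.

```python
import random


def assign_block_colours(block_counts, colour_all=False, seed=None, default="#BEBEBE"):
    """Map each block ID to a hex colour.

    block_counts: dict of block ID -> number of times the block occurs.
    colour_all:   if True, every block gets a random colour (mimics --all);
                  otherwise blocks seen only once get `default` (an assumption).
    """
    rng = random.Random(seed)  # seeded so the colours are reproducible
    colours = {}
    for block, count in block_counts.items():
        if colour_all or count > 1:
            # Random 24-bit value formatted as #RRGGBB
            colours[block] = "#{:06X}".format(rng.randrange(0x1000000))
        else:
            colours[block] = default
    return colours


# Hypothetical counts: block "g" occurs in a single plasmid only.
toy_counts = {"a": 5, "c": 2, "g": 1}
coloured_all = assign_block_colours(toy_counts, colour_all=True, seed=1)
coloured_shared = assign_block_colours(toy_counts, colour_all=False, seed=1)
```

Fixing the seed makes the colours reproducible between runs, which is convenient when the same colours must match across the graph image and the linear plot.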
      Then we call Bandage programmatically with the custom colours option on the coloured gfa:

      Bandage image UVA01_plasmids_pangraph.gfa.coloured.gfa pangraph.png --height 2000 --width 2500 --colour custom --depth --lengths --fontsize 30 --nodewidth 8

      We can now look at a linear representation of the plasmids alongside this graph. The details are in the script, but we basically represent the plasmids linearly and choose a random block to 'anchor' them all to the same place.

      Rscript plot-blocks-UVA01.R UVA01_plasmids_pangraph.gfa.blocks.csv pangraph.png pangraph_linear_vs_graph.pdf

      This produces a version[1] where we can see the different 'walks' taken through the graph by each plasmid. For ease of understanding, on the left-hand side we have added simplified labels of the blocks (a-f) since some of the blocks are very small and hard to make out otherwise.

      img

      Note that the plasmids are circular: in the linear representation, in two cases block 'a' appears on the left rather than on the right, but the structure is identical.
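The idea of anchoring circular plasmids to a common block can be illustrated with a minimal Python sketch. It is not taken from plot-blocks-UVA01.R: the function name and the toy block orders are hypothetical, and strand orientation is ignored for simplicity.

```python
def rotate_to_anchor(path, anchor):
    """Rotate a circular path (list of block labels) so that it starts at `anchor`.

    Because the plasmid is circular, any rotation describes the same structure.
    """
    i = path.index(anchor)  # first occurrence of the anchor block
    return path[i:] + path[:i]


# Hypothetical block orders for two plasmids that were linearised at different points.
toy_paths = {
    "plasmid_1": ["c", "d", "e", "f", "a", "b"],
    "plasmid_2": ["e", "f", "a", "b", "c", "d"],
}

anchored = {name: rotate_to_anchor(p, "b") for name, p in toy_paths.items()}
same_structure = anchored["plasmid_1"] == anchored["plasmid_2"]
```

After rotation, the two toy paths are identical, which is the sense in which a block appearing on the left rather than the right of a linear plot does not change the structure.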

      Sheppard et al. previously analysed these plasmids and made a table of the structural changes they identified with respect to the reference plasmid (Table 1 in their paper). We can see that the structural changes they identified are interpretable from the visualization we have made:

      Accession | Species | Date | Length (bp) | Structural differences (Sheppard) | Corresponding block(s) (pangraph)
      --- | --- | --- | --- | --- | ---
      NZ_CP011575.1 | K. pneumoniae | Mar-2011 | 43,621 | |
      NZ_CP011582.1 | E. cloacae | Aug-2012 | 43,433 | 188-bp deletion | f
      NZ_CP011598.1 | K. intermedia | Sep-2009 | 43,621 | |
      NZ_CP011608.1 | C. freundii | Nov-2010 | 44,846 | 1,225-bp insertion | c
      NZ_CP011656.1 | C. freundii | Oct-2012 | 129,196 | 14,960-bp duplication and 70,615-bp insertion | e,f,a[2] and g
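As a sanity check, the lengths in the table are consistent with the listed structural changes. The Python sketch below only redoes the arithmetic with numbers from the table and footnote 2. Treating 43,621 bp (the length of the two plasmids with no listed differences) as the reference length is an assumption, and the variable names are illustrative.

```python
# Assumed reference length: the plasmids with no structural differences listed.
REFERENCE_LENGTH = 43_621

# accession -> (observed length, net change implied by the listed differences)
observed = {
    "NZ_CP011582.1": (43_433, -188),              # 188-bp deletion
    "NZ_CP011608.1": (44_846, +1_225),            # 1,225-bp insertion
    "NZ_CP011656.1": (129_196, 14_960 + 70_615),  # duplication + insertion
}

# Difference between the expected and the observed length for each plasmid.
discrepancies = {
    accession: (REFERENCE_LENGTH + change) - length
    for accession, (length, change) in observed.items()
}

# Footnote 2: the duplicated region is the run of blocks e,f,a.
duplicated_block_lengths = [5_045, 188, 9_727]
duplication_total = sum(duplicated_block_lengths)

for accession, diff in discrepancies.items():
    print(accession, "discrepancy (bp):", diff)
print("duplication total (bp):", duplication_total)
```

A discrepancy of zero for every accession means the reported insertions, deletions, and duplications account for the full length difference from the assumed reference.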
      diff --git a/tutorials/tutorial_6/index.html b/tutorials/tutorial_6/index.html index 479fe127..b418e85a 100644 --- a/tutorials/tutorial_6/index.html +++ b/tutorials/tutorial_6/index.html @@ -12,4 +12,4 @@ Rscript plot-blocks.R \ pangraph_kpc_u10k_d5k.gfa.blocks.csv \ $geneBlock pangraph_kpc_u10k_d5k.gfa.png \ - pangraph_kpc_plot.pdf

      img

      If you pick a genome on the left of the plot, you should be able to follow its path through the graph representation on the right using the colours.[4] The block starting at position 0 is the KPC-block.

      + pangraph_kpc_plot.pdf

      img

      If you pick a genome on the left of the plot, you should be able to follow its path through the graph representation on the right using the colours.[4] The block starting at position 0 is the KPC-block.