Skip to content

MurrellGroup/SeqUMAP.jl

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SeqUMAP.jl

Alignment-free embedding of sequences via kmer-counting and UMAP.jl.

Setup

From the Julia REPL:

using Pkg
Pkg.add(url="https://github.com/MurrellGroup/SeqUMAP.jl")

Usage

Import sequences from a file by running

seqnames, seqs = read_fasta("my_sequences.fa");

Optionally remove gaps and convert 'U' => 'T' with

seqs_clean = clean_nt.(seqs; to_strip = ['-']);

Then, obtain a SeqUMAP projection, using the default settings, with

proj = sequmap(seqs_clean, 2)

The full list of options are:

proj = sequmap(seqs_clean, ndim; 
    k = 5, lookup_dic = NT_DICT, pca = true, pca_maxoutdim = 5, 
    n_neighbors = 12, min_dist = 0.7, repulsion_strength = 0.1, 
    metric = Distances.SqEuclidean(), umap_kwargs = Pair{Symbol,Any}[]
    )

where seqs is an array of strings, ndim is the number of projected dimensions and k is the kmer size. If using nucleotide sequences with an {'A', 'T', 'C', 'G'} alphabet, pass SeqUMAP.NT_DICT as the character lookup. If pca is specified, PCA will be run on kmer vectors prior to UMAP embedding with pca_maxoutdim components.

UMAP parameters n_neighbors, min_dist, repulsion_strength and metric can also be specified, with the default options shown here. If you want to pass UMAP additional keyword arguments, you can do so by providing them in umap_kwargs.

Both AA and NT dictionaries are included for sequence encoding. For AA, k=2 is usually sufficient due to the larger alphabet. For NT, start with k=5 or k=6.

AA_DICT = Dict(
    'A' => 1, 'C' => 2, 'D' => 3, 'E' => 4, 'F' => 5, 
    'G' => 6, 'H' => 7, 'I' => 8, 'K' => 9, 'L' => 10, 
    'M' => 11, 'N' => 12, 'P' => 13, 'Q' => 14, 'R' => 15, 
    'S' => 16, 'T' => 17, 'V' => 18, 'W' => 19, 'Y' => 20, 
    '*' => 21, 'X' => 22
    )

NT_DICT = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)

Embeddings can be easily visualized with PyPlot or other common Julia plotting libraries.

using PyPlot

fig, ax = subplots()
ax.scatter(proj[1, :], proj[2, :]; s = 2.0, linewidth = 0.0)

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Languages

  • Julia 100.0%