Skip to content

Instructions on using isolationWorkflow

schyen edited this page Nov 20, 2018 · 4 revisions

This outlines how to use the isolationWorkflow:

  1. Prepare sequences for Blastn
  2. Blast your fasta file
  3. Read in blast results, and compile into table
  4. Generate/update strain library and fasta file
    • Creates/updates strain library with unique isolates
    • Creates fasta file by species for ones that need sequence alignment (amalgamates strains from current blast result and strain library)
  1. Align redundant species and update strain library

Getting started

Once you have downloaded the isolationgWorkflow package, you need to load the package before using it.

library(isolationWorkflow)

Once loaded, you can use ? to find out more about each function in the package.

?abif_fasta()
?parse_blast()

1) Prepare sequences for Blast

This step includes:

  • Check if sequences were successful, and allows you to exlude failed sequences
  • Trim ends off of sequences, as they are usually strings of Ns
  • Make a fasta file containing all sequences from your Big Dye plate so can send in one Blastn query

All of this is done using the abif_fasta(). I recommend running this command twice. The first time I use to evaluate all of my sequences, and the second time I generate a "clean" fasta file.

This will give you:

  • a file containing sequences before and after trimming
  • a fasta file containing trimmed sequences (if output = FALSE) Note: all output files saved in same location as sanger files

When evaluating sequences, look for the following:

  • sequence length -- too short means the sequence failed (i.e. DNA not released from boiling, no PCR product or BigDye failed)
  • distribution of Ns -- Ns throughout the majority of the read means the isolate was not pure
  • satisfaction with the trimming algorithm -- compare the raw vs. trimmmed version of the sequence to ensure you agree with the trimming. If you don't, manually edit the final FASTA file before Blasting

# Check for sequence quality, and trimming
abif_fasta(folder = "/path/to/sanger/files/",
           exclude = NULL,
           trim=TRUE, trim.check=FALSE,
           export.check=TRUE, show.prog=TRUE, 
           output=FALSE)

# prepare sequences for Blast
## exclude parameter accepts multiple entries to exclude multiple samples
abif_fasta(folder = "/path/to/sanger/files/",
           exclude = c('sample1.ab1','sample2.ab1'),
           trim=TRUE, trim.check=FALSE,
           export.check=TRUE, show.prog=TRUE, 
           output='myIsolates-clean.FASTA')

2) Blast your sequences

Use Blastn to classify your sequences. Setting the database to 16S ribosomal RNA sequences (Bacteria and Archaea) helps with specificity of the search. Download your blast result as an XML file and save in same location as sanger files. Then using the parse_blast(), we can read our results and save as a csv table. Output is saved in same location as your downloaded xml file.


3) Read in blast results

# include path to blast result if necessary
parse_blast(filename='/path/to/downloaded/blast/result/######-Alignment.xml',
            ngroups=1, tophit=FALSE,
            output='blast_result.csv')

4) Building a strain library

You can use gen_strainlib() to both generate and update your strain library, depending on how the parameters are used. Regardless, when you use gen_strainlib() the following happens:

  • reads in blast result file from parse blast
  • if no strain library supplied, creates a new one
  • Species that are only found once and not already in strain library are added to library
  • Generates strain library as csv
  • Generates strain library as fasta
  • Generates fasta file for each species when more than one isolate is assigned that species

Regarding the last point, the blast result being examined will be cross referenced with the strain library. So you can compare and align your blast results with strains already in your library. These species-based fasta files allow you to align the sequences that are classified to the same species (see next step)

Note: a fasta file of your library strains is also generated because that is a useful thing to have.

# generating strain library
gen_strainlib(blast_file = "/include/path/if/necessary/blast_result.csv",
               lib_path = "/path/to/strain/library", 
               lib_file = NULL, lib_name = 'mystrainlibrary')

# updating strain library
gen_strainlib(blast_file = "/include/path/if/necessary/blast_result.csv",
               lib_path = "/path/to/strain/library", 
               lib_file = "mystrainlibrary.csv", lib_name = NULL)

5) Align redundant species and update strain library

This steps aligns the sequences of isolates that were classified as the same species, and allows you to pick which isolates you want to add to the library. Alignment is done using AlignSeqs() in the R package, DECIPHER.

The alignment will automatically pop up as an HTML file in your web browser, with the nucleotides colour-coded to help identify any mis-matches.The alignemnt file is automatically saved in the same location as the fasta file that was aligned. When choosing which strains to keep, consider the following:

  • is the difference due to an N? if so, then you don't have enough information to say the sequences are different at that base
  • a strain that is different from others, ONLY because of Ns may have poor sequence quality and you should consider redoing the sequencing for that isolate.
# supply more than one fasta to align to save time
choose_alignment(lib_path = "/path/to/strain/library", 
                 lib_file = 'mystrainlibrary.csv', 
                 blast_file = "/include/path/if/necessary/blast_result.csv",
                 fasta_path = "/path/to/species/fasta/files",
                 fastatoalign = c("species1.fasta",
                               "species2.fasta",
                               "species3.fasta"))

Concluding remarks

Since isolation results usually comes back in waves, it is helpful to be checking your results as you go, and freezing strains down as needed (rather than waiting for all results to come back, then deciding whichs strains to keep). This workflow is designed so that you are able to examine sanger sequences as they come in, and easily cross-reference them with the strains you have already decided to keep in your strain library.

Note, the final strain library file is in the same format as the blast result. This may have extra info you may not find useful in your final strain library. I recommend you don't modify the format, but rather, copy and paste into a seperate file to edit as you please.