Workflow

Gavin Douglas edited this page Apr 5, 2021 · 44 revisions

Below is an overview of the PICRUSt2 workflow, with example commands for processing 16S sequencing data to obtain E.C. number and KEGG ortholog (KO) abundances. The E.C. number abundances can then be used to calculate MetaCyc pathway abundances and coverages. Other supported gene family databases may be more informative, but they cannot be collapsed to pathways by default. See the sidebar for more details on individual commands.

Run any of the scripts below with the -h option to see a description of its options.

The entire pipeline can be run with this command (details):

picrust2_pipeline.py -s study_seqs.fna -i study_seqs.biom -o picrust2_out_pipeline -p 1

Each step can also be run individually using the commands below. This is useful when you run into problems with picrust2_pipeline.py and want to isolate the issue, or when you only want to re-run part of the PICRUSt2 pipeline.

Place amplicon sequence variants (or OTUs) into the reference phylogeny (details)

place_seqs.py -s study_seqs.fna -o placed_seqs.tre -p 1 \
              --intermediate placement_working

Run hidden-state prediction to get 16S copy numbers, E.C. number abundances, and KO abundances per predicted genome (details)

Note that nearest-sequenced taxon index (NSTI) values will be added to the 16S prediction table because the -n option is given in the first command.

hsp.py -i 16S -t placed_seqs.tre -o marker_nsti_predicted.tsv.gz -p 1 -n

hsp.py -i EC -t placed_seqs.tre -o EC_predicted.tsv.gz -p 1

hsp.py -i KO -t placed_seqs.tre -o KO_predicted.tsv.gz -p 1
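Because NSTI values end up in the 16S prediction table, it can be useful to list the ASVs that are distant from any reference genome before moving on. The following Python sketch is an illustration only and is not part of PICRUSt2: the function name flag_high_nsti, the column names in the example table (sequence, 16S_rRNA_Count, metadata_NSTI), and the cutoff of 2.0 are all assumptions that should be checked against your own marker_nsti_predicted.tsv.gz.

```python
import csv
import io


def flag_high_nsti(table_text, nsti_column="metadata_NSTI", cutoff=2.0):
    """Return the IDs (first column) of rows whose NSTI exceeds the cutoff.

    The column name and cutoff are assumptions made for this sketch;
    adjust them to match the header of your own table.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    id_column = reader.fieldnames[0]
    return [row[id_column] for row in reader if float(row[nsti_column]) > cutoff]


# Made-up example table in tab-separated format (not real PICRUSt2 output).
example = (
    "sequence\t16S_rRNA_Count\tmetadata_NSTI\n"
    "ASV_1\t2\t0.03\n"
    "ASV_2\t1\t2.41\n"
    "ASV_3\t5\t0.87\n"
)

print(flag_high_nsti(example))  # prints ['ASV_2']
```

To apply this to a real gzipped table, the file contents can be read with gzip.open(path, "rt").read() from the Python standard library and passed in as table_text.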

Predict E.C. and KO abundances in sequencing samples (adjusts gene family abundances by 16S sequence abundance) (details)

metagenome_pipeline.py -i study_seqs.biom \
                       -m marker_nsti_predicted.tsv.gz \
                       -f EC_predicted.tsv.gz \
                       -o EC_metagenome_out


metagenome_pipeline.py -i study_seqs.biom \
                       -m marker_nsti_predicted.tsv.gz \
                       -f KO_predicted.tsv.gz \
                       -o KO_metagenome_out
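The adjustment described for this step can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the metagenome_pipeline.py implementation: each ASV's abundance in a sample is divided by its predicted 16S copy number, multiplied by the predicted gene family copies per genome, and summed over ASVs. The function name predict_metagenome and all values are made up for the example.

```python
def predict_metagenome(asv_counts, marker_copies, gene_copies):
    """Sketch of per-sample gene family abundance prediction.

    asv_counts:    {sample: {asv: read count}}
    marker_copies: {asv: predicted 16S copy number}
    gene_copies:   {asv: {gene family: predicted copies per genome}}
    """
    result = {}
    for sample, counts in asv_counts.items():
        totals = {}
        for asv, count in counts.items():
            # Normalize the ASV abundance by its predicted 16S copy number.
            normalized = count / marker_copies[asv]
            # Each gene family contributes copies-per-genome times normalized abundance.
            for gene, copies in gene_copies[asv].items():
                totals[gene] = totals.get(gene, 0.0) + normalized * copies
        result[sample] = totals
    return result


# Made-up inputs for illustration only.
asv_counts = {"sample_A": {"ASV_1": 100, "ASV_2": 30}}
marker_copies = {"ASV_1": 4, "ASV_2": 1}
gene_copies = {
    "ASV_1": {"EC:1.1.1.1": 2, "EC:2.7.7.7": 1},
    "ASV_2": {"EC:1.1.1.1": 1},
}

print(predict_metagenome(asv_counts, marker_copies, gene_copies))
# prints {'sample_A': {'EC:1.1.1.1': 80.0, 'EC:2.7.7.7': 25.0}}
```

Here ASV_1 contributes 100 / 4 = 25 normalized units and ASV_2 contributes 30 / 1 = 30, so EC:1.1.1.1 totals 25 × 2 + 30 × 1 = 80.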

Infer MetaCyc pathway abundances and coverages based on predicted E.C. number abundances (details)

pathway_pipeline.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz \
                    -o pathways_out \
                    --intermediate pathways_working \
                    -p 1

Add descriptions as new column in gene family and pathway abundance tables (details)

add_descriptions.py -i EC_metagenome_out/pred_metagenome_unstrat.tsv.gz -m EC \
                    -o EC_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

add_descriptions.py -i KO_metagenome_out/pred_metagenome_unstrat.tsv.gz -m KO \
                    -o KO_metagenome_out/pred_metagenome_unstrat_descrip.tsv.gz

add_descriptions.py -i pathways_out/path_abun_unstrat.tsv.gz -m METACYC \
                    -o pathways_out/path_abun_unstrat_descrip.tsv.gz
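When isolating a problem, it can help to chain the individual steps above so that execution stops at the first failing command. The following Python sketch is one hypothetical way to do that; the helper names build_commands and run_steps are made up, only the E.C. route is included for brevity, and the command lines simply repeat the options shown above. By default nothing is executed (dry_run=True) and the commands are only printed.

```python
import shlex
import subprocess


def build_commands(seqs="study_seqs.fna", table="study_seqs.biom", threads=1):
    """Return the E.C.-route commands from this page as argument lists."""
    p = str(threads)
    tree = "placed_seqs.tre"
    marker = "marker_nsti_predicted.tsv.gz"
    return [
        ["place_seqs.py", "-s", seqs, "-o", tree, "-p", p,
         "--intermediate", "placement_working"],
        ["hsp.py", "-i", "16S", "-t", tree, "-o", marker, "-p", p, "-n"],
        ["hsp.py", "-i", "EC", "-t", tree, "-o", "EC_predicted.tsv.gz", "-p", p],
        ["metagenome_pipeline.py", "-i", table, "-m", marker,
         "-f", "EC_predicted.tsv.gz", "-o", "EC_metagenome_out"],
        ["pathway_pipeline.py",
         "-i", "EC_metagenome_out/pred_metagenome_unstrat.tsv.gz",
         "-o", "pathways_out", "--intermediate", "pathways_working", "-p", p],
    ]


def run_steps(commands, dry_run=True):
    """Print each command; when dry_run is False, also run it and stop on failure."""
    for command in commands:
        print(shlex.join(command))
        if not dry_run:
            # check=True raises CalledProcessError if the step exits non-zero.
            subprocess.run(command, check=True)


run_steps(build_commands())
```

Calling run_steps(build_commands(), dry_run=False) would actually execute the steps, which requires the PICRUSt2 scripts to be on the PATH.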

Shuffling predictions

An optional additional step is to shuffle the ASV labels in the genome prediction tables (i.e. the outputs of hsp.py). Analyses based on these shuffled tables can then be compared with analyses based on the actual data to check whether the unshuffled data contain more signal. See here for more details.
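The idea behind label shuffling can be sketched in a few lines of Python. This is a conceptual illustration only, not the PICRUSt2 shuffling script: the function name shuffle_labels and the table contents are made up. Each row of predicted values is kept intact, but the ASV labels are randomly reassigned to rows.

```python
import random


def shuffle_labels(table, seed=None):
    """Return a new table with ASV labels randomly reassigned to prediction rows.

    table: {asv_label: [predicted values]}
    """
    labels = list(table)
    rows = [table[label] for label in labels]
    shuffled = labels[:]
    # A seeded generator makes the shuffle reproducible.
    random.Random(seed).shuffle(shuffled)
    return dict(zip(shuffled, rows))


# Made-up prediction table for illustration only.
predictions = {
    "ASV_1": [2, 0, 1],
    "ASV_2": [0, 3, 0],
    "ASV_3": [1, 1, 4],
}

shuffled = shuffle_labels(predictions, seed=42)
print(shuffled)
```

The shuffled table has the same labels and the same rows as the original, so any signal that depends on which predictions belong to which ASV should be lost in downstream comparisons.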
