pangrowth
is an efficient tool designed for genomic researchers to predict the openness of a pangenome and
estimate the core genome size. This tool is capable of analyzing fasta
sequences using k-mers, as well as other genomic elements such as genes, CDS,
ORFs, provided as either a frequency histogram or a pan-matrix (with columns
representing genomes and rows representing items; see panmatrix_ecoli_n50.txt
for an example).
- k-mer counting: utilizes a modified version of yak to count k-mers
- growth/core calculation: computes the exact expected genomic growth/core size quadratically in the number of genomes
Parmigiani, L., Wittler, R., Stoye, J.,: Revisiting pangenome openness with k-mers. PCI Comp & Biol. (2024).
git clone https://gitlab.ub.uni-bielefeld.de/gi/pangrowth.git
cd pangrowth
make
To plot the results we need the following python libraries: numpy, pandas, matplotlib, scipy and searbon. You can install them with:
pip install -r dependencies.txt
./pangrowth hist -k 17 -t 12 data/fa/*.fna.gz > hist.txt
pangrowth
also accepts a file containing a list of fasta files (each one on a single line) passed with the paremeter-i fasta_list.txt
To visualize the histogram:
python scripts/plot_hist.py hist.txt hist.pdf
If you have multiple histograms that you want to compare with different number of genomes you can use:
python scripts/plot_hist.py --norm_x --norm_y=both hist.txt data/hist_ecoli_n50.txt data/hist_ecoli_n200.txt hist_multiple.pdf
- The flag
--norm_x
normalize the x-axis to be between (0,1]. - The flag
--norm_y
allows two types of normalization:multiplicity
which adjusts each histogram value h[i] multiplying it by its index i (i.e., h[i] * i, this means that values appearing once remain the same, values appearing twice are doubled, and so on)percentage
which divides the values of h[i] by the total sum of h (its total sum equals 1) The--norm_y=both
applies both in series.
./pangrowth growth -h data/hist_ecoli_n50.txt > growth.txt
#./pangrowth growth -p data/panmatrix_ecoli_n50.txt > growth.txt
To fit the openness and visualize the growth:
python scripts/plot_growth.py growth.txt growth.pdf
We can again pass multiple growth files to scripts/plot_growth.py
to
compare with other species.
python scripts/plot_growth.py growth.txt data/growth_ecoli_n200.txt growth_multiple.pdf
./pangrowth core -h data/hist_ecoli_n50.txt > core.txt
#./pangrowth core -p data/panmatrix_ecoli_n50.txt > core.txt
./pangrowth core -h data/hist_ecoli_n50.txt -q 0.9 > core_q90.txt
- The
-q
takes a quorum to considered the item in the core (default 1.0).
To fit the core genome and report the percentage of core item over the expected genome size:
python scripts/plot_core.py core_q90.txt data/core_q90_ecoli_n200.txt core.pdf
The expected genome size is calculated as the total sum of the histogram divided by the number of genomes.