Releases: tschuelia/PyPythia
v1.2.1
v1.2.0
This release includes the following changes:
New Features
Includes a new command line option --forceDuplicates that forces Pythia to predict the difficulty for an MSA that contains duplicate sequences (the default behavior is still to fail).
Updates
We retrained Pythia and optimized the hyperparameters using Optuna. This slightly improves the performance to an MAE of 0.06 (previously 0.07) and a MAPE of 1.6% (previously 1.7%) 🥳
The new predictor is available as predictor_lgb_v1.2.0.pckl and latest.pckl.
Bug Fixes
Fixes a bug that caused the --shap option to fail due to an update in the shap package API (fixes #15), thanks @computations for the fix!
v1.1.4
v1.1.3
v1.1.2
Fix issues with shap package
Importing the shap module takes some time, so it will now only be imported if --shap is passed.
Also, shap raises NumbaDeprecationWarnings that clutter the output of Pythia, so I suppressed them for now until shap fixes this (see shap/shap#2909)
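The lazy-import pattern described here can be sketched as follows. This is a minimal illustration, not Pythia's actual code: the helper name import_if_requested is hypothetical, and the sketch silences the built-in DeprecationWarning category as a stand-in for the Numba-specific warnings.

```python
import importlib
import warnings


def import_if_requested(module_name, requested):
    """Import module_name only when the feature that needs it is requested."""
    if not requested:
        # Skip the (potentially slow) import entirely.
        return None
    with warnings.catch_warnings():
        # Assumption: DeprecationWarning stands in for the Numba deprecation
        # warnings mentioned above. Only warnings raised while the module is
        # being imported are silenced here.
        warnings.simplefilter("ignore", category=DeprecationWarning)
        return importlib.import_module(module_name)


# Demonstrated with a standard-library module so the sketch runs anywhere;
# in Pythia's case the module would be shap and the switch would be --shap.
assert import_if_requested("json", requested=False) is None
json_module = import_if_requested("json", requested=True)
print(json_module.__name__)
```

The import cost is then only paid by users who actually ask for the plot.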
v1.1.1
v1.1.0
We trained Pythia on even more data! Our new, way larger set of training data consists of:
- 11 108 DNA MSAs
- 979 Protein MSAs
- 460 Morphological MSAs
= 12 547 MSAs, all empirical data of course :-)
The new predictor shows an improved accuracy 🥳
- Mean absolute error: 0.07 (previously 0.09)
- Mean absolute percentage error: 1.7% (previously 2.5%)
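For readers unfamiliar with the two metrics, the sketch below shows their textbook definitions. The release notes do not state how Pythia computes them, so treat this as an assumption; the difficulty values in the example are made up for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_percentage_error(y_true, y_pred):
    """Average absolute error relative to the true value, in percent.

    Textbook form: undefined if any true value is 0.
    """
    relative = (abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))
    return 100.0 * sum(relative) / len(y_true)


# Illustrative values only, not Pythia results.
true_difficulty = [0.2, 0.5, 0.8]
predicted_difficulty = [0.25, 0.45, 0.9]
print(mean_absolute_error(true_difficulty, predicted_difficulty))
print(mean_absolute_percentage_error(true_difficulty, predicted_difficulty))
```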
This new Pythia predictor 1.1.0 is available as predictors/predictor_lgb_v1.1.0.pckl and replaces the previous version in predictors/latest.pckl.
Changes
- The new retrained predictor will be the default predictor, so predictors/latest.pckl is identical to predictors/predictor_lgb_v1.1.0.pckl. The previous predictors of Pythia < 1.1.0 are still available and fully supported.
- Pythia is trained on two additional features: the patterns-over-site ratio and an entropy-like measurement based on the number and frequency of patterns in the MSA.
- Pythia now supports parallel inference of the parsimony trees with RAxML-NG. You can set the number of threads using the new command line parameter --threads. Note that you need RAxML-NG version ≥ 1.2.0 to use the --threads option.
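To give an intuition for the two pattern-based features, here is a minimal sketch. It assumes that a pattern is a distinct alignment column and uses the Shannon entropy of the pattern frequencies as the entropy-like measure. The exact formulas Pythia uses are not given in these notes, and the function name pattern_features is hypothetical.

```python
import math
from collections import Counter


def pattern_features(sequences):
    """Return (patterns-over-sites ratio, pattern entropy in bits).

    sequences: list of equally long aligned sequence strings.
    """
    n_sites = len(sequences[0])
    if any(len(seq) != n_sites for seq in sequences):
        raise ValueError("All sequences in an alignment must have the same length.")
    # A pattern is one alignment column read from the first to the last sequence.
    columns = ["".join(seq[i] for seq in sequences) for i in range(n_sites)]
    counts = Counter(columns)
    ratio = len(counts) / n_sites
    # Assumption: Shannon entropy over the relative pattern frequencies.
    entropy = -sum((c / n_sites) * math.log2(c / n_sites) for c in counts.values())
    return ratio, entropy


# Toy alignment: columns are AAA, AAA, CCT, CCC, so 3 patterns over 4 sites.
ratio, entropy = pattern_features(["AACC", "AACC", "AATC"])
print(ratio, entropy)  # 0.75 1.5
```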
Introducing Shapley Values (experimental feature)
To allow more detailed insights into the prediction of Pythia, we include Shapley values with this version. To get more information on what Shapley values are and how to interpret them, refer to the wiki. The new command line parameter --shap will create a so-called waterfall plot and save it as {msa_name}.shap.pdf. Please make sure you understand what Shapley values are and what you can infer based on this plot before drawing conclusions!
This new feature is fully backwards compatible with all previous predictors.
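As a rough intuition for what a waterfall plot displays, the sketch below computes exact Shapley values for a tiny toy model by averaging each feature's marginal contribution over all feature orderings. It is not how the shap package works internally (it uses far more efficient algorithms), and the toy model and feature names are hypothetical.

```python
from itertools import permutations


def shapley_values(features, value):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for ordering in orderings:
        present = frozenset()
        for feature in ordering:
            extended = present | {feature}
            totals[feature] += value(extended) - value(present)
            present = extended
    return {f: total / len(orderings) for f, total in totals.items()}


def toy_prediction(present):
    """Hypothetical model: baseline 0.5, two additive effects, one interaction."""
    score = 0.5
    if "a" in present:
        score += 0.2
    if "b" in present:
        score -= 0.1
    if "a" in present and "b" in present:
        score += 0.04
    return score


contributions = shapley_values(["a", "b", "c"], toy_prediction)
print(contributions)
# The contributions add up to the full prediction minus the baseline.
print(sum(contributions.values()), toy_prediction({"a", "b", "c"}) - toy_prediction(set()))
```

A waterfall plot draws exactly this decomposition: it starts at the baseline and stacks the per-feature contributions until it reaches the prediction.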
v1.0.1
New features:
- allow manual setting of MSA file format
- include difficulty prediction script that requires no installation
Minor Bug fixes:
- fix LightGBM issue when using Python multiprocessing
- use the user defined precision for printing features in verbose mode
- fix issues with logging when using PyPythia from code
v1.0.0
Release Summary
We retrained Pythia using additional data and now include full support of morphological data 🎉
Our new set of training data consists of:
- 3250 empirical DNA and Protein datasets obtained from TreeBase (same as in version 0.0.1)
- 538 additional empirical DNA and Protein datasets obtained via our RAxML-Grove
- 474 additional morphological datasets obtained from TreeBase
= 4262 datasets in total
The resulting predictor has about the same accuracy as the previous predictor, with a slight improvement of the mean absolute percentage error:
- Mean absolute error: 0.09
- Mean absolute percentage error: 2.5%
We are now using LightGBM’s boosted trees instead of scikit-learn’s random forest
- Pythia 1.0.0 is backwards compatible with the scikit-learn random forest predictor of Pythia version 0.0.1. This predictor is still available in predictors/predictor_sklearn_rf_v0.0.1.pckl
Breaking Changes
- The default predictor changed to the new LightGBM predictor (predictors/predictor_lgb_v1.0.0.pckl). Since this predictor was retrained using additional data, the predictions between previous versions and this version will likely differ. This introduces an additional dependency: LightGBM
- Identical sequences in the MSA:
  - per default: Pythia refuses to predict the difficulty for MSAs that contain identical sequences
  - new --removeDuplicates option: if the MSA contains duplicate sequences, Pythia stores a reduced alignment and predicts the difficulty for this reduced alignment
- The exceptions in msa.py changed: instead of ValueError, Pythia now raises a custom PyPythiaException.
- We changed the DataType type definition to an Enum instead of a string, see custom_types.py for more details.
- We renamed the predictor_path parameter in predictor.DifficulyPredictor to predictor_handle.
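The two behaviors for identical sequences can be illustrated with a short sketch. All names here (DuplicateSequencesError, prepare_alignment, the remove_duplicates argument) are hypothetical and do not reflect PyPythia's actual API, and keeping the first occurrence of each sequence is an assumption.

```python
class DuplicateSequencesError(Exception):
    """Hypothetical stand-in for a project-specific exception type."""


def prepare_alignment(sequences, remove_duplicates=False):
    """Return an alignment without identical sequences.

    sequences: dict mapping taxon name -> aligned sequence string.
    By default, duplicates cause an error; with remove_duplicates=True the
    first occurrence of each sequence is kept and later copies are dropped.
    """
    reduced = {}
    seen = set()
    for name, seq in sequences.items():
        if seq in seen:
            continue
        seen.add(seq)
        reduced[name] = seq
    if len(reduced) == len(sequences):
        # No duplicates found, nothing to do.
        return dict(sequences)
    if not remove_duplicates:
        dropped = len(sequences) - len(reduced)
        raise DuplicateSequencesError(
            f"The alignment contains {dropped} duplicate sequence(s)."
        )
    return reduced


msa = {"t1": "ACGT", "t2": "ACGT", "t3": "ACGA"}
print(prepare_alignment(msa, remove_duplicates=True))
```

Raising a dedicated exception type rather than a generic ValueError lets calling code catch alignment-specific problems without masking unrelated errors.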
Minor Changes
- Improved logging for command line interface
- new --quiet mode to suppress intermediate information
- predictor.DifficulyPredictor now accepts a set of features in its constructor, allowing predictions with experimental difficulty predictors that were trained using a different set of features than the PyPythia predictors
Full Changelog: 0.0.1...1.0.0