Python package for converting Apache Spark ML pipelines to PMML.
This package is a thin PySpark wrapper for the JPMML-SparkML library.
- Apache Spark 3.0.X, 3.1.X, 3.2.X, 3.3.X, 3.4.X or 3.5.X.
- Python 2.7, 3.4 or newer.
Install a release version from PyPI:
pip install pyspark2pmml
Alternatively, install the latest snapshot version from GitHub:
pip install --upgrade git+https://github.com/jpmml/pyspark2pmml.git
PySpark2PMML must be paired with JPMML-SparkML based on the following compatibility matrix:
Apache Spark version | JPMML-SparkML branch | Latest JPMML-SparkML version |
---|---|---|
3.0.X | 2.0.X | 2.0.3 |
3.1.X | 2.1.X | 2.1.3 |
3.2.X | 2.2.X | 2.2.3 |
3.3.X | 2.3.X | 2.3.2 |
3.4.X | 2.4.X | 2.4.1 |
3.5.X | master | 2.5.0 |
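The matrix can also be consulted programmatically. The following is a minimal sketch, not part of PySpark2PMML: the dictionary is transcribed from the table above, and the names `JPMML_SPARKML_VERSIONS` and `jpmml_sparkml_version` are hypothetical helpers introduced only for illustration.

```python
# Hypothetical helper, not part of PySpark2PMML.
# Keys are Apache Spark major.minor versions; values are the latest
# JPMML-SparkML versions as listed in the compatibility matrix above.
JPMML_SPARKML_VERSIONS = {
    "3.0": "2.0.3",
    "3.1": "2.1.3",
    "3.2": "2.2.3",
    "3.3": "2.3.2",
    "3.4": "2.4.1",
    "3.5": "2.5.0",
}


def jpmml_sparkml_version(spark_version):
    """Return the JPMML-SparkML version listed for a Spark version such as '3.4.1'."""
    # Only the major.minor part selects the JPMML-SparkML branch.
    major_minor = ".".join(spark_version.split(".")[:2])
    if major_minor not in JPMML_SPARKML_VERSIONS:
        raise ValueError(
            "No JPMML-SparkML version listed for Apache Spark " + spark_version
        )
    return JPMML_SPARKML_VERSIONS[major_minor]
```

Inside a running session, `jpmml_sparkml_version(spark.version)` would look up the entry for the active Apache Spark installation.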
Launch PySpark; use the --packages command-line option to specify the coordinates of relevant JPMML-SparkML modules:
- org.jpmml:pmml-sparkml:${version} - Core module.
- org.jpmml:pmml-sparkml-lightgbm:${version} - LightGBM via SynapseML extension module.
- org.jpmml:pmml-sparkml-xgboost:${version} - XGBoost via XGBoost4J-Spark extension module.
Launching core:
pyspark --packages org.jpmml:pmml-sparkml:${version}
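When more than one module is needed, the --packages option takes a comma-separated list of Maven coordinates. The sketch below assembles that list from the module names given above; `packages_argument` is a hypothetical helper name, not something PySpark2PMML provides.

```python
# Hypothetical helper, not part of PySpark2PMML.
def packages_argument(version, modules=("pmml-sparkml",)):
    """Build the value of the --packages option for the given JPMML-SparkML modules."""
    # Every module shares the org.jpmml group ID and the same version.
    return ",".join(
        "org.jpmml:{}:{}".format(module, version) for module in modules
    )


# Core module plus the XGBoost extension module.
print(packages_argument("2.5.0", ["pmml-sparkml", "pmml-sparkml-xgboost"]))
```

The same string should also be usable as the value of Spark's `spark.jars.packages` configuration property when the session is created from a script rather than from the pyspark shell.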
Fitting a Spark ML pipeline:
from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import RFormula
df = spark.read.csv("Iris.csv", header = True, inferSchema = True)
formula = RFormula(formula = "Species ~ .")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages = [formula, classifier])
pipelineModel = pipeline.fit(df)
Exporting the fitted Spark ML pipeline to a PMML file:
from pyspark2pmml import PMMLBuilder
pmmlBuilder = PMMLBuilder(sc, df, pipelineModel)
pmmlBuilder.buildFile("DecisionTreeIris.pmml")
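A quick way to sanity-check the exported file is to parse it with the Python standard library. This is only a sketch under stated assumptions: `summarize_pmml` and `local_name` are hypothetical names, and the code relies solely on the `PMML` root element and the `DataField` elements defined by the PMML standard. It is not a schema validation.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_pmml(source):
    """Summarize a PMML document given as a file path or a file-like object."""
    root = ET.parse(source).getroot()
    if local_name(root.tag) != "PMML":
        raise ValueError("Not a PMML document: root element is " + root.tag)
    return {
        # Value of the version attribute on the root element.
        "version": root.get("version"),
        # Names of the direct children of the root element, in document order.
        "elements": [local_name(child.tag) for child in root],
        # Names of all declared data fields.
        "data_fields": [
            element.get("name")
            for element in root.iter()
            if local_name(element.tag) == "DataField"
        ],
    }
```

For the example above, `summarize_pmml("DecisionTreeIris.pmml")` would be expected to list the Iris columns among the data fields.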
The representation of individual Spark ML pipeline stages can be customized via conversion options:
from pyspark2pmml import PMMLBuilder
classifierModel = pipelineModel.stages[1]
pmmlBuilder = PMMLBuilder(sc, df, pipelineModel) \
.putOption(classifierModel, "compact", False) \
.putOption(classifierModel, "estimate_featureImportances", True)
pmmlBuilder.buildFile("DecisionTreeIris.pmml")
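When several options apply to the same stage, they can be kept in a dictionary and applied in a loop. The sketch below uses only the `putOption(stage, key, value)` call shown above; `put_options` is a hypothetical helper name, not part of PySpark2PMML.

```python
# Hypothetical helper, not part of PySpark2PMML.
def put_options(pmml_builder, stage, options):
    """Apply every key/value pair in `options` to `stage` and return the builder."""
    for key, value in options.items():
        # Same call as in the chained example above, one option at a time.
        pmml_builder.putOption(stage, key, value)
    return pmml_builder
```

With the names from the example above, the call would read `put_options(PMMLBuilder(sc, df, pipelineModel), classifierModel, {"compact": False, "estimate_featureImportances": True})`.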
PySpark2PMML is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.
If you would like to use PySpark2PMML in a proprietary software project, then it is possible to enter into a licensing agreement which makes PySpark2PMML available under the terms and conditions of the BSD 3-Clause License instead.
PySpark2PMML is developed and maintained by Openscoring Ltd, Estonia.
Interested in using Java PMML API software in your company? Please contact [email protected]