PySpark2PMML

Python package for converting Apache Spark ML pipelines to PMML.

Features

This package is a thin PySpark wrapper for the JPMML-SparkML library.

Prerequisites

Apache Spark 3.0.X, 3.1.X, 3.2.X, 3.3.X, 3.4.X or 3.5.X.
Python 2.7, 3.4 or newer.

Installation

Install a release version from PyPI:

pip install pyspark2pmml

Alternatively, install the latest snapshot version from GitHub:

pip install --upgrade git+https://github.com/jpmml/pyspark2pmml.git
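
To confirm which version ended up installed, the distribution metadata can be queried from Python. This is a minimal sketch; the helper name `installed_version` is hypothetical, and it assumes Python 3.8 or newer, where `importlib.metadata` is part of the standard library.

```python
from importlib import metadata


def installed_version(distribution="pyspark2pmml"):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed PySpark2PMML version, or None if the package is missing.
print(installed_version())
```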

Configuration and usage

PySpark2PMML must be paired with JPMML-SparkML based on the following compatibility matrix:

| Apache Spark version | JPMML-SparkML branch | Latest JPMML-SparkML version |
|---|---|---|
| 3.0.X | `2.0.X` | 2.0.3 |
| 3.1.X | `2.1.X` | 2.1.3 |
| 3.2.X | `2.2.X` | 2.2.3 |
| 3.3.X | `2.3.X` | 2.3.2 |
| 3.4.X | `2.4.X` | 2.4.1 |
| 3.5.X | `master` | 2.5.0 |
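
The matrix can also be expressed as a small lookup, which may help when the Apache Spark version is only known at run time. This is a sketch, not part of PySpark2PMML: the names `LATEST_JPMML_SPARKML` and `jpmml_sparkml_version` are hypothetical, and the version numbers are copied from the matrix above, so they go stale as new releases appear.

```python
# Latest JPMML-SparkML version per Apache Spark minor version,
# copied from the compatibility matrix above.
LATEST_JPMML_SPARKML = {
    "3.0": "2.0.3",
    "3.1": "2.1.3",
    "3.2": "2.2.3",
    "3.3": "2.3.2",
    "3.4": "2.4.1",
    "3.5": "2.5.0",
}


def jpmml_sparkml_version(spark_version):
    """Map a full Apache Spark version such as '3.5.1' to a JPMML-SparkML version."""
    # Only the major and minor components decide the pairing.
    major_minor = ".".join(spark_version.split(".")[:2])
    try:
        return LATEST_JPMML_SPARKML[major_minor]
    except KeyError:
        raise ValueError(
            "No JPMML-SparkML version listed for Apache Spark " + spark_version
        )
```

In a running session the argument would typically be `spark.version`.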

Launch PySpark with the --packages command-line option, which specifies the Maven coordinates of the relevant JPMML-SparkML modules. Replace ${version} with the JPMML-SparkML version chosen from the matrix above:

org.jpmml:pmml-sparkml:${version} - Core module.
org.jpmml:pmml-sparkml-lightgbm:${version} - LightGBM via SynapseML extension module.
org.jpmml:pmml-sparkml-xgboost:${version} - XGBoost via XGBoost4J-Spark extension module.

Launching PySpark with the core module only:

pyspark --packages org.jpmml:pmml-sparkml:${version}
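
When the session is created from a Python script rather than the `pyspark` launcher, the same coordinates can be supplied through the `spark.jars.packages` configuration property, which is the configuration counterpart of `--packages`. The sketch below rests on two assumptions: that the extension modules share the core module's version, and that no Spark session is already running, since the property only takes effect when the JVM starts. The helper names `packages_value` and `build_session` are hypothetical.

```python
def packages_value(version, extensions=()):
    """Build a comma-separated list of JPMML-SparkML Maven coordinates.

    `extensions` holds module suffixes such as "lightgbm" or "xgboost".
    """
    modules = ["pmml-sparkml"] + ["pmml-sparkml-" + ext for ext in extensions]
    return ",".join("org.jpmml:{}:{}".format(module, version) for module in modules)


def build_session(version, extensions=()):
    """Create a SparkSession that resolves the JPMML-SparkML modules at startup."""
    # Imported lazily so that packages_value() stays usable without PySpark.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .config("spark.jars.packages", packages_value(version, extensions))
        .getOrCreate()
    )
```

For example, `packages_value("2.5.0", ["xgboost"])` yields a value that can also be pasted after `--packages` on the command line.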

Fitting a Spark ML pipeline:

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import RFormula

df = spark.read.csv("Iris.csv", header = True, inferSchema = True)

formula = RFormula(formula = "Species ~ .")
classifier = DecisionTreeClassifier()
pipeline = Pipeline(stages = [formula, classifier])
pipelineModel = pipeline.fit(df)

Exporting the fitted Spark ML pipeline to a PMML file:

from pyspark2pmml import PMMLBuilder

pmmlBuilder = PMMLBuilder(sc, df, pipelineModel)

pmmlBuilder.buildFile("DecisionTreeIris.pmml")
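
A quick sanity check of the exported file can be done with the standard library alone. The sketch below only confirms that the file is well-formed XML with a `PMML` root element and lists its top-level elements; it does not validate the document against the PMML schema. The helper names `local_name` and `summarize_pmml` are hypothetical.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def summarize_pmml(path):
    """Return the PMML version attribute and the names of the top-level elements."""
    root = ET.parse(path).getroot()
    if local_name(root.tag) != "PMML":
        raise ValueError("Root element is not PMML: " + root.tag)
    return {
        "version": root.get("version"),
        "elements": [local_name(child.tag) for child in root],
    }
```

Calling `summarize_pmml("DecisionTreeIris.pmml")` after the export should list the header, the data dictionary and the model element produced by the converter.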

The representation of individual Spark ML pipeline stages can be customized via conversion options:

from pyspark2pmml import PMMLBuilder

classifierModel = pipelineModel.stages[1]

pmmlBuilder = PMMLBuilder(sc, df, pipelineModel) \
	.putOption(classifierModel, "compact", False) \
	.putOption(classifierModel, "estimate_featureImportances", True)

pmmlBuilder.buildFile("DecisionTreeIris.pmml")
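
When several options are kept in a dictionary, the chained `putOption` calls can be replaced by a loop. This is a sketch with a hypothetical helper name, `apply_options`; it assumes only what the chaining above already shows, namely that `putOption` returns a builder.

```python
def apply_options(pmml_builder, stage, options):
    """Apply every key/value pair in `options` to one pipeline stage."""
    for key, value in options.items():
        # putOption returns a builder, so the result is carried into the next call.
        pmml_builder = pmml_builder.putOption(stage, key, value)
    return pmml_builder
```

With the objects from the example above, the call would read `apply_options(PMMLBuilder(sc, df, pipelineModel), classifierModel, {"compact": False, "estimate_featureImportances": True})`.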

License

PySpark2PMML is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.

If you would like to use PySpark2PMML in a proprietary software project, then it is possible to enter into a licensing agreement which makes PySpark2PMML available under the terms and conditions of the BSD 3-Clause License instead.

Additional information

PySpark2PMML is developed and maintained by Openscoring Ltd, Estonia.

Interested in using Java PMML API software in your company? Please contact [email protected]
