
Off-by-One Category Error in PMML Conversion #141

Open
tristers-at-square opened this issue Feb 3, 2025 · 7 comments

@tristers-at-square

tristers-at-square commented Feb 3, 2025

Hello,

I am trying to convert a SparkML pipeline into PMML. It contains the following stages (see the sketch after the list):

  1. SQLTransformer
  2. StringIndexer
  3. OneHotEncoder
  4. VectorAssembler
  5. SparseToDenseTransformer
  6. XGBoostClassificationModel
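
For concreteness, the pipeline looks roughly like this in Scala (a sketch only: column names, the SQL statement, and parameter values are illustrative placeholders, and the SparseToDenseTransformer setter names are assumed):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, SQLTransformer, StringIndexer, VectorAssembler}
import org.jpmml.sparkml.feature.SparseToDenseTransformer
import ml.dmlc.xgboost4j.scala.spark.XGBoostClassifier

// 1. Derive/select columns via SQL (statement is a placeholder)
val sqlTransformer = new SQLTransformer()
  .setStatement("SELECT * FROM __THIS__")

// 2. Index the categorical string columns
val indexer = new StringIndexer()
  .setInputCols(Array("cat1", "cat2"))
  .setOutputCols(Array("cat1_idx", "cat2_idx"))
  .setHandleInvalid("keep")

// 3. One hot-encode the category indices
val ohe = new OneHotEncoder()
  .setInputCols(Array("cat1_idx", "cat2_idx"))
  .setOutputCols(Array("cat1_vec", "cat2_vec"))
  .setHandleInvalid("keep")
  .setDropLast(true)

// 4. Assemble everything into a single feature vector
val assembler = new VectorAssembler()
  .setInputCols(Array("num1", "num2", "cat1_vec", "cat2_vec"))
  .setOutputCol("featuresSparse")

// 5. Densify the feature vector for XGBoost (setter names assumed)
val sparse2dense = new SparseToDenseTransformer()
  .setInputCol("featuresSparse")
  .setOutputCol("features")

// 6. Train the XGBoost classifier
val classifier = new XGBoostClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")

val pipeline = new Pipeline()
  .setStages(Array(sqlTransformer, indexer, ohe, assembler, sparse2dense, classifier))
```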

During the PMML conversion process, it throws this error:

java.lang.IndexOutOfBoundsException: Index: 13022, Size: 12993
at java.util.ArrayList.rangeCheck(ArrayList.java:659)
at java.util.ArrayList.get(ArrayList.java:435)
at org.jpmml.converter.Schema.getFeature(Schema.java:141)
at org.jpmml.xgboost.RegTree.encodeNode(RegTree.java:285)
at org.jpmml.xgboost.RegTree.encodeNode(RegTree.java:435)
at org.jpmml.xgboost.RegTree.encodeNode(RegTree.java:435)
at org.jpmml.xgboost.RegTree.encodeNode(RegTree.java:435)
at org.jpmml.xgboost.RegTree.encodeTreeModel(RegTree.java:267)
at org.jpmml.xgboost.ObjFunction.createMiningModel(ObjFunction.java:155)
at org.jpmml.xgboost.BinomialLogisticRegression.encodeMiningModel(BinomialLogisticRegression.java:46)
at org.jpmml.xgboost.Classification.encodeMiningModel(Classification.java:76)
at org.jpmml.xgboost.GBTree.encodeMiningModel(GBTree.java:189)
at org.jpmml.xgboost.Learner.encodeMiningModel(Learner.java:596)
at org.jpmml.sparkml.xgboost.BoosterUtil.encodeBooster(BoosterUtil.java:119)
at org.jpmml.sparkml.xgboost.XGBoostClassificationModelConverter.encodeModel(XGBoostClassificationModelConverter.java:41)
at org.jpmml.sparkml.xgboost.XGBoostClassificationModelConverter.encodeModel(XGBoostClassificationModelConverter.java:29)
at org.jpmml.sparkml.ModelConverter.registerModel(ModelConverter.java:108)
at org.jpmml.sparkml.PMMLBuilder.build(PMMLBuilder.java:125)
at org.jpmml.sparkml.PMMLBuilder.buildFile(PMMLBuilder.java:298)
...

To debug, I took the SparkML pipeline (in its native format and without the XGBoost model stage at the end) and transformed a row of data. The number of output features is 13076, not 12993 as the PMML conversion process seems to think. I dug into the data and realized that there are exactly 83 categorical features. 13076 - 83 = 12993. This suggests that the invalid category is not being handled correctly somewhere.
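
The check looked roughly like this (a sketch, reusing the stage definitions from the pipeline sketch above; `trainingDf` is assumed to be the training DataFrame):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.linalg.Vector

// Fit the pipeline without the final XGBoost stage, transform a row,
// and inspect the width of the assembled feature vector.
val preprocessingModel = new Pipeline()
  .setStages(Array(sqlTransformer, indexer, ohe, assembler, sparse2dense))
  .fit(trainingDf)

val width = preprocessingModel.transform(trainingDf.limit(1))
  .select("features")
  .head()
  .getAs[Vector](0)
  .size

println(s"Output feature count: $width")  // prints 13076 here; the converter expects 12993
```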

My OneHotEncoder class uses handleInvalid=keep and dropLast=True. My StringIndexer uses handleInvalid=keep as well.

I am using the following libraries:

  • pmml-model-1.6.11
  • pmml-model-metro-1.6.11
  • pmml-converter-1.5.12
  • xgboost4j_2.12-1.3.1
  • xgboost4j-spark_2.12-1.3.1
  • pmml-sparkml-2.4.3
  • pmml-xgboost-1.8.3
  • pmml-sparkml-xgboost-2.3.1

Any thoughts on what could be wrong?

I also tried to create a PMML of just the stages before the XGBoostClassificationModel in order to debug this, but JPMML seems to have some smart optimizations that essentially turn the StringIndexer, VectorAssembler, and SparseToDenseTransformer stages into no-ops. Otherwise, I would have PMML-ized just those stages and checked the feature cardinality at that point, before it is seen by the XGBoost model.
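
The attempt looked like this (a sketch; `preprocessingModel` and `trainingDf` as above):

```scala
import java.io.File
import org.jpmml.sparkml.PMMLBuilder

// Convert only the fitted pre-processing stages. JPMML-SparkML optimizes these
// stages away, so the resulting PMML does not expose the intermediate feature cardinality.
new PMMLBuilder(trainingDf.schema, preprocessingModel)
  .buildFile(new File("preprocessing.pmml"))
```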

@vruusmann
Member

I am using the following libraries:

Your JPMML library stack is nicely up-to-date, but the XGBoost4J stack is lagging behind (1.3 vs the latest 2.1). I know this is beyond your ability to change, but if you could update to XGBoost 2+, then you could go for native XGBoost categorical splits, and get rid of the OneHotEncoder plus SparseToDenseTransformer steps of the pipeline, which should cut resource usage by a lot.

The number of output features is 13076, not 12993 as the PMML conversion process seems to think. I dug into the data and realized that there are exactly 83 categorical features. 13076 - 83 = 12993.

I would have also suggested looking into the (one hot-)encoding of categorical features as the most likely culprit, but you already figured that out yourself.

If you want to keep debugging, then simply insert some logging statements around this place - print out the number of features after each pipeline step:
https://github.com/jpmml/jpmml-sparkml/blob/2.4.3/pmml-sparkml/src/main/java/org/jpmml/sparkml/PMMLBuilder.java#L119
https://github.com/jpmml/jpmml-sparkml/blob/master/pmml-sparkml/src/main/java/org/jpmml/sparkml/SparkMLEncoder.java#L144
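
Alternatively, the same numbers can be read off on the Spark side, without patching the converter (a sketch; `pipelineModel` and `sampleDf` are assumed to exist):

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.linalg.{SQLDataTypes, Vector}
import org.apache.spark.sql.DataFrame

// Print the width of every vector-typed column after each fitted pipeline stage.
def printVectorWidths(pipelineModel: PipelineModel, sampleDf: DataFrame): Unit = {
  var df = sampleDf
  for (stage <- pipelineModel.stages) {
    df = stage.transform(df)
    val widths = df.schema.fields
      .filter(_.dataType == SQLDataTypes.VectorType)
      .map(field => s"${field.name}=${df.head().getAs[Vector](field.name).size}")
    println(s"After ${stage.uid}: ${widths.mkString(", ")}")
  }
}
```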

I'm currently performing a major version upgrade across all JPMML libraries (migrating from JDK 8 to JDK 11). As part of the upgrade, I'll add proper logging capabilities into the framework. I've been bitten by a lack thereof myself.

My OneHotEncoder class uses handleInvalid=keep and dropLast=True. My StringIndexer uses handleInvalid=keep as well.

I'll try to reproduce the reported behaviour locally. This is exactly the configuration that I will be needing.

One thing about this pipeline is that your estimator step is XGBoost, which has native missing value support. There is a slight possibility that the JPMML-XGBoost library (or its current integration with JPMML-SparkML) is performing an "extra filtering" to drop one binary indicator level that it thinks is representing missing values.

Does your pipeline work if you substitute XGBoost with a "regular" Apache SparkML decision tree estimator, such as DecisionTreeRegressor or RandomForestRegressor? These estimators don't "suffer from" such extra missing value support, so they might succeed (meaning the number of JPMML features and Apache SparkML features would be the same).

TLDR: If you replace XGBoost with Apache SparkML's built-in gradient-boosted trees estimator (GBTClassifier), does this particular error disappear? If so, that would suggest that the missing category level is related to missing values. If the error persists, then I'd need to investigate the combined behaviour of StringIndexer and OneHotEncoder.
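
For instance (a sketch, reusing the stage names from the pipeline sketch in the issue description):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.GBTClassifier

// Same pipeline, with the XGBoost stage swapped for Spark's built-in
// gradient-boosted trees estimator.
val gbt = new GBTClassifier()
  .setFeaturesCol("features")
  .setLabelCol("label")

val pipeline = new Pipeline()
  .setStages(Array(sqlTransformer, indexer, ohe, assembler, sparse2dense, gbt))
```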

@vruusmann
Member

I'm testing with this example code:
https://github.com/jpmml/jpmml-sparkml/blob/2.4.3/pmml-sparkml-xgboost/src/test/resources/XGBoostAudit.scala

The conversion keeps succeeding when making the following changes to it:

  • indexer.setHandleInvalid("keep") (line 22)
  • ohe.setDropLast(true) (line 23)

However, the conversion starts to fail after this change:

  • ohe.setHandleInvalid("keep") (line 23)

So, the quick fix is to avoid OneHotEncoder#setHandleInvalid("keep").
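
In code form (variable names as in the linked example):

```scala
// Modifications to XGBoostAudit.scala (lines 22-23 in the 2.4.3 sources):
indexer.setHandleInvalid("keep")  // conversion still succeeds
ohe.setDropLast(true)             // conversion still succeeds
ohe.setHandleInvalid("keep")      // conversion starts to fail
```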

@vruusmann
Member

Hey @tristers-at-square - two questions about your use case:

  • Does the dataset contain missing values?
  • What is the purpose of setting OneHotEncoder#setHandleInvalid("keep")? Is it there to make the OHE step accept missing values? In other words, is the "handleInvalid" in your pipeline about supporting missing values, and really not about invalid values per se ("invalid value" - a value not seen in the training dataset)?

Looking at this other example, I remembered that there is a org.jpmml.sparkml.feature.InvalidCategoryTransformer pseudo-transformer class available. It's designed to take a StringIndexer#setHandleInvalid("keep") output column, and then restore missing values as true missing values (e.g. Double#NaN) so that the XGBoost estimator would identify them correctly.

See this example:
https://github.com/jpmml/jpmml-sparkml/blob/2.4.3/pmml-sparkml-xgboost/src/test/resources/XGBoostAuditNA.scala

Here's InvalidCategoryTransformer usage:
https://github.com/jpmml/jpmml-sparkml/blob/2.4.3/pmml-sparkml-xgboost/src/test/resources/XGBoostAuditNA.scala#L24-L25

If you take your pipeline and re-configure it as follows, then perhaps it works (a code sketch follows the list):

  1. StringIndexer, with setHandleInvalid("keep") - replaces missing values with the __unknown pseudo-category level.
  2. OneHotEncoder, with setHandleInvalid("error") and setDropLast(false) - accepts the __unknown pseudo-category, and passes it on (instead of dropping it from the last position).
  3. InvalidCategoryTransformer - drops the __unknown pseudo-category, thereby restoring missing values to their original Double#NaN representation, which is the form most suitable for feeding into XGBoost.
  4. VectorAssembler etc., as usual.

TLDR: Replace OneHotEncoder#setDropLast(true) with an InvalidCategoryTransformer step.
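
A rough sketch of this re-configuration (column names are placeholders; the InvalidCategoryTransformer setter names are assumed to mirror the linked example):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.jpmml.sparkml.feature.InvalidCategoryTransformer

val indexer = new StringIndexer()
  .setInputCols(Array("cat1", "cat2"))
  .setOutputCols(Array("cat1_idx", "cat2_idx"))
  .setHandleInvalid("keep")                       // missing/invalid -> __unknown level

val ohe = new OneHotEncoder()
  .setInputCols(Array("cat1_idx", "cat2_idx"))
  .setOutputCols(Array("cat1_vec", "cat2_vec"))
  .setHandleInvalid("error")
  .setDropLast(false)                             // keep the __unknown indicator

val invalidCategory = new InvalidCategoryTransformer()
  .setInputCols(Array("cat1_vec", "cat2_vec"))    // setter names assumed
  .setOutputCols(Array("cat1_fixed", "cat2_fixed"))

val assembler = new VectorAssembler()
  .setInputCols(Array("num1", "num2", "cat1_fixed", "cat2_fixed"))
  .setOutputCol("features")

val pipeline = new Pipeline()
  .setStages(Array(indexer, ohe, invalidCategory, assembler /*, estimator */))
```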

@vruusmann
Member

TLDR: Replace OneHotEncoder#setDropLast(true) with an InvalidCategoryTransformer step.

My quick experimentation around this idea suggests that InvalidCategoryTransformer is designed to operate on nominal columns such as StringIndexer output columns. It currently rejects non-nominal input columns such as OneHotEncoder output columns.

Looks like this issue needs to be fixed "classically", by ensuring that the OHE#handleInvalid attribute is properly dealt with.

The InvalidCategoryTransformer will come in handy when working with XGBoost 2.X (which accepts StringIndexer output columns without explicit one hot-encoding).

@tristers-at-square
Author

To make debugging easier, I created some synthetic data with the following properties:

  • 3 numeric columns.
  • 3 categorical columns with 3, 4, and 5 known categories, respectively.
  • 1 binary label.

My StringIndexerModel uses handleInvalid set to keep. My OneHotEncoderModel uses handleInvalid set to keep and dropLast set to true.

The PMML conversion process thinks there are only 15 features, and produces an out-of-bounds exception when the XGBoost model is converted. I, too, would have expected there to be only 15 features based on these settings (3 numeric features, 3+4+5=12 total categorical features): each column gets one extra category to handle unseen values, and dropLast then removes the last category.

However, I am confused by the behavior of dropLast on SparkML's OneHotEncoder. It doesn't actually seem to drop the last column (at least not when handleInvalid is set to keep). The native SparkML pipeline model produces 18 features (3 numeric features and 4+5+6=15 categorical features). I am not sure whether that's an issue with the SparkML library itself, or whether my understanding of dropLast / handleInvalid is incorrect.

It seems, then, that the PMML converter is doing the "right" thing in some sense.

The problem is that I am using a PySpark preprocessing pipeline to preprocess my data. This data gets stored somewhere (like S3). Then a separate process trains the XGBoost model on the already pre-processed data. However, the XGBoost model sees 18 features in the dataset, because that's what the native SparkML pipeline produces. I am not using XGBoost4J-Spark to train the model, but rather just the core XGBoost library in distributed mode. So my guess is that the PMML conversion process would work if I were using XGBoost4J-Spark in conjunction with the SparkML preprocessing pipeline.

For the record, the PMML conversion works when I set handleInvalid=error and dropLast=True, but I absolutely need handleInvalid=keep.

@tristers-at-square
Author

tristers-at-square commented Feb 5, 2025

Also tested this with handleInvalid=error and dropLast=True. The number of features produced by the preprocessing pipeline is now 12 (3 numeric features, 9 categorical features). This behavior is seemingly inconsistent in SparkML.

If dropLast=True and handleInvalid=keep, then per categorical column:

num categorical features = num known categories + 1 unknown category

If dropLast=True and handleInvalid=error, then per categorical column:

num categorical features = num known categories - 1 dropped category

When I think it should be (and the PMML converter also seems to think so):

num categorical features = num known categories + 1 unknown category - 1 dropped category
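
Concretely, with the three synthetic categorical columns above (3, 4, and 5 known categories):

```scala
val known = Seq(3, 4, 5)  // known categories per column

// Observed with dropLast=True and handleInvalid=keep: k + 1 indicators per column
known.map(_ + 1).sum      // 15 categorical (+ 3 numeric = 18 features)

// Observed with dropLast=True and handleInvalid=error: k - 1 indicators per column
known.map(_ - 1).sum      // 9 categorical (+ 3 numeric = 12 features)

// Expected (and what the PMML converter assumes): k + 1 - 1 = k per column
known.sum                 // 12 categorical (+ 3 numeric = 15 features)
```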

@vruusmann
Member

The number of features produced by the preprocessing pipeline is now 12 (3 numeric features, 9 categorical features). This behavior is seemingly inconsistent in SparkML.

We have triangulated the cause of this issue down to the OneHotEncoder#handleInvalid = "keep" option.

The JPMML-SparkML library currently does not use the value of this property (it only uses the dropLast property):
https://github.com/jpmml/jpmml-sparkml/blob/2.4.3/pmml-sparkml/src/main/java/org/jpmml/sparkml/feature/OneHotEncoderModelConverter.java#L97-L110

I'll fix it, and make a new release by the end of this week. I will also introduce some unit tests to cover typical StringIndexer and OneHotEncoder combinations, fitted on a wide range of Apache Spark versions (I wouldn't be surprised if the behaviour changes between 3.X versions).

For the record, the PMML conversion works when I set handleInvalid=error and dropLast=True, but I absolutely need handleInvalid=keep.

I haven't tested this in code but, conceptually, I would argue that if your pipeline already contains StringIndexer#handleInvalid = "keep", then any OneHotEncoder#handleInvalid = "keep" that follows it is effectively a no-op. The explanation is that StringIndexer catches all missing and invalid values, and maps them to a __unknown pseudo-category. The subsequent OneHotEncoder only gets to see the __unknown value in its input; it never sees a missing value or any other out-of-training-domain value.

Similarly, if you do OneHotEncoder#handleInvalid = "keep" followed by OneHotEncoder#dropLast = true, then you'd be creating an extra __unknown pseudo-category, which gets dropped right thereafter.

However, I must check the OneHotEncoder source code for special treatment of __unknown pseudo-category values. Maybe they evade the dropLast treatment (and maybe it's Apache Spark version dependent).
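
For what it's worth, the StringIndexer side of the argument is easy to verify (a sketch; `spark` is a live SparkSession):

```scala
import org.apache.spark.ml.feature.StringIndexer

val train = spark.createDataFrame(Seq(Tuple1("a"), Tuple1("b"))).toDF("x")
val test = spark.createDataFrame(Seq(Tuple1("a"), Tuple1("c"))).toDF("x")

val model = new StringIndexer()
  .setInputCol("x")
  .setOutputCol("x_idx")
  .setHandleInvalid("keep")
  .fit(train)

// "a" -> 0.0, "b" -> 1.0, and the unseen "c" -> 2.0 (= numLabels, the __unknown slot);
// the downstream OneHotEncoder never sees an out-of-domain input value.
model.transform(test).show()
```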

I am using a PySpark preprocessing pipeline to preprocess my data. Then a separate process trains the XGBoost model on the already pre-processed data.

Very interesting. So you generate the corresponding pieces of PMML (one for the pre-processing part, and another for the XGBoost part), and then stitch them together? Do you have a custom tool for that?

The JPMML software project does not provide such a "stitch PMMLs together" tool right now. Might add one, even if only for educational purposes.
