Off-by-One Category Error in PMML Conversion #141
Your JPMML library stack is nicely up-to-date, but the XGBoost4J stack is lagging behind.
I would also have suggested looking into the (one-hot) encoding of categorical features as the most likely culprit, but you already figured that out yourself. If you want to keep debugging, simply insert some logging statements around this place - print out the number of features after each pipeline step.

I'm currently performing a major version upgrade across all JPMML libraries (migrating from JDK 8 to JDK 11). As part of the upgrade, I'll add proper logging capabilities to the framework. I've been bitten by the lack thereof myself.
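The per-stage logging suggested above can be sketched in plain Python. This is a toy illustration, not the Spark or JPMML API: the stage functions below are hypothetical stand-ins, and the point is simply that printing the feature count after each stage pins a width mismatch to a specific step.

```python
def debug_pipeline(stages, row):
    """Push one row through each stage, logging the feature count after each."""
    for i, stage in enumerate(stages):
        row = stage(row)
        print(f"after stage {i}: {len(row)} features")
    return row

# Toy stand-ins for a StringIndexer-like and a OneHotEncoder-like stage:
index_stage = lambda row: row                    # indexing keeps the width
encode_stage = lambda row: row[:-1] + [0, 0, 0]  # one-hot encoding widens the vector

result = debug_pipeline([index_stage, encode_stage], [1.0, 2.0, 3.0])
```

If the count printed after the encoding stage disagrees with what the converter reports, the encoder step is where to look.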
I'll try to reproduce the reported behaviour locally. This is exactly the configuration that I will be needing.

One thing about this pipeline is that your estimator step is XGBoost, which has native missing value support. There is a slight possibility that the JPMML-XGBoost library (or its current integration with JPMML-SparkML) is performing an "extra filtering" step to drop one binary indicator level that it thinks represents missing values.

Does your pipeline work if you substitute XGBoost with a "regular" Apache SparkML decision tree estimator, such as DecisionTreeRegressor or RandomForestRegressor? These estimators don't "suffer from" such extra missing value support, so they might succeed (as in: the number of JPMML features and Apache SparkML features would be the same).

TLDR: If you replace XGBoost with Apache SparkML's built-in gradient boosting estimator, does this particular error disappear? If so, that would suggest that the missing category level is related to missing values. If the error persists, then I'd need to investigate further.
I'm testing with this example code:

The conversion keeps succeeding when making the following changes to it:
However, the conversion starts to fail after this change:
So, the quick fix is to avoid that change.
Hey @tristers-at-square - two questions about your use case:
Looking at this other example, I remembered that there is a

See this example:

Here's

If you take your pipeline, and re-configure it as follows, then perhaps it works:
TLDR: Replace
My quick experimentation around this idea suggests that

Looks like this issue needs to be fixed "classically", by ensuring that

The
To make debugging easier, I created some synthetic data with the following properties:
My StringIndexerModel uses handleInvalid set to keep. My OneHotEncoderModel uses handleInvalid set to keep and dropLast set to true. The PMML conversion process thinks there are only 15 features and produces an out-of-bounds exception when the XGBoost model is converted.

I would also have expected there to be only 15 features based on those settings (3 numeric features, plus 3+4+5=12 categorical features): one extra category per feature to handle unseen values, with dropLast removing the last category. However, I am confused by the behavior of dropLast on SparkML's OneHotEncoder. It doesn't actually seem to drop the last column (at least not when handleInvalid is set to keep). The native SparkML pipeline model produces 18 features (3 numeric features and 4+5+6=15 categorical features). I am not sure if that's an issue with the SparkML library itself, or if my understanding of dropLast / handleInvalid is incorrect. It seems, then, that the PMML converter is doing the "right" thing in some sense.

The problem is that I am using a PySpark preprocessing pipeline to preprocess my data. This data gets stored somewhere (like S3). Then a separate process trains the XGBoost model on the already pre-processed data. The XGBoost model is seeing 18 features in the dataset, because that's what the native SparkML pipeline produces. I am not using XGBoost4J-Spark to train the model, but rather just the core XGBoost library in distributed mode. So my guess is that the PMML conversion process would work if I were using XGBoost4J-Spark in conjunction with the SparkML preprocessing pipeline.

For the record, the PMML conversion works when I set handleInvalid=error and dropLast=True, but I absolutely need handleInvalid=keep.
Also tested this with handleInvalid=error and dropLast=True. The number of features produced by the preprocessing pipeline is now 12 (3 numeric features, 9 categorical features). This behavior is seemingly inconsistent in SparkML:

- If dropLast=True and handleInvalid=keep: num categorical features = num known categories + 1 unknown category
- If dropLast=True and handleInvalid=error: num categorical features = num known categories - 1 dropped category

What I think it should be (and the PMML converter also seems to think it should be):

- num categorical features = num known categories + 1 unknown category - 1 dropped category
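The three width rules reported in this thread can be written down as a small pure-Python sketch (k is the number of known categories per string column; the numbers mirror the synthetic data described above, namely 3 numeric columns and string columns with 3, 4, and 5 known categories):

```python
def observed_width(k, handle_invalid):
    """Per-column output width as observed from SparkML's OneHotEncoder
    with dropLast=True, per the behaviour reported in this thread."""
    if handle_invalid == "keep":
        return k + 1          # unknown bucket added, nothing actually dropped
    return k - 1              # handleInvalid="error": last known category dropped

def expected_width(k):
    """What the PMML converter (and intuition) expects with dropLast=True."""
    return k + 1 - 1          # +1 unknown category, -1 dropped category

numeric = 3
known = [3, 4, 5]             # known categories of the three string columns

keep_total = numeric + sum(observed_width(k, "keep") for k in known)    # 18
error_total = numeric + sum(observed_width(k, "error") for k in known)  # 12
pmml_total = numeric + sum(expected_width(k) for k in known)            # 15
```

The totals reproduce all three numbers from the thread: 18 features from the native pipeline with handleInvalid=keep, 12 with handleInvalid=error, and the 15 that the PMML converter assumes.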
We have triangulated the cause of this issue down to the OneHotEncoderModel's dropLast property. The JPMML-SparkML library currently does not use the value of this property (it only uses the handleInvalid property). I'll fix it, and make a new release by the end of this week. Will also introduce some unit tests to cover typical configurations.
I haven't tested this in code but, conceptually, I would argue that if your pipeline already contains

Similarly, if you do

However, I must check with
Very interesting. You generate the corresponding pieces of PMML (one for the pre-processing part, and another for the XGBoost part), and then stitch them together? Do you have a custom tool for that?

The JPMML software project does not provide such a "stitch PMMLs together" tool right now. Might add one, even if only for educational purposes.
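Purely as an educational sketch of what such a tool might start from (none of this is existing JPMML functionality): the simplest possible notion of "stitching" is merging the DataDictionary sections of two PMML documents. The toy below ignores XML namespaces, the TransformationDictionary, and real model chaining via MiningModel; the two inline documents are hypothetical minimal examples.

```python
import xml.etree.ElementTree as ET

PREPROCESSING = """<PMML version="4.4">
  <DataDictionary>
    <DataField name="x1" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>"""

MODEL = """<PMML version="4.4">
  <DataDictionary>
    <DataField name="x2" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>"""

def stitch(first_xml, second_xml):
    """Naive merge: copy DataField entries from the second document's
    DataDictionary into the first, skipping duplicate names."""
    first = ET.fromstring(first_xml)
    second = ET.fromstring(second_xml)
    target = first.find("DataDictionary")
    existing = {f.get("name") for f in target}
    for field in second.find("DataDictionary"):
        if field.get("name") not in existing:
            target.append(field)
    return first

merged = stitch(PREPROCESSING, MODEL)
field_names = [f.get("name") for f in merged.find("DataDictionary")]
```

A real tool would also have to wire the pre-processing outputs into the model's MiningSchema, which is where the actual complexity lives.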
Hello,
I am trying to convert a SparkML pipeline into PMML. It contains the following stages:
During the PMML conversion process, it throws this error:
To debug, I took the SparkML pipeline (in its native format and without the XGBoost model stage at the end) and transformed a row of data. The number of output features is 13076, not 12993 as the PMML conversion process seems to think. I dug into the data and realized that there are exactly 83 categorical features. 13076 - 83 = 12993. This suggests that the invalid category is somehow not being handled correctly somewhere.
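A one-line sanity check of the arithmetic above: the width gap between the native pipeline and the converter equals exactly one missing level per categorical feature.

```python
native_width = 13076       # features emitted by the native SparkML pipeline
converter_width = 12993    # features the PMML converter believes exist
categorical_features = 83  # categorical columns found in the data

# The discrepancy is exactly one level per categorical feature.
gap = native_width - converter_width
print(gap, gap == categorical_features)
```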
My OneHotEncoder class uses handleInvalid=keep and dropLast=True. My StringIndexer uses handleInvalid=keep as well.
I am using the following libraries:
Any thoughts on what could be wrong?
I also tried to create a PMML of just the stages before the XGBoostClassificationModel in order to debug this, but JPMML seems to have some smart optimizations that essentially turn the StringIndexer, VectorAssembler, and SparseToDenseTransformer stages into no-ops. Otherwise, I would have PMML-ized just those stages and checked the cardinality at that point, before it is seen by the XGBoost model.