GLM: Plug values don't respect standardization #16543

Open
tomasfryda opened this issue Feb 12, 2025 · 0 comments
It appears that plug values are not standardized even when standardize=True is set, which I think should be considered a bug. It makes it really hard to impute, e.g., the median: the user would first have to standardize the frame themselves and then calculate the median from that, without knowing whether their standardization yields the same values as ours (we do it internally and I don't think we show the values anywhere).
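For illustration, the manual workaround described above could look roughly like this in base R. This is only a sketch: it assumes standardization means centering by the column mean and dividing by the sample standard deviation, which may not match what H2O does internally, and the variable names (num_cols, std_medians, std_medians_direct) are made up for the example.

```r
# Sketch: standardized-median plug values computed locally in base R.
# Assumption: centering by the mean and scaling by the sample SD; H2O's
# internal standardization may differ.
num_cols <- c("Sepal.Width", "Petal.Length", "Petal.Width")

# scale() centers each column by its mean and divides by its sample SD.
iris_std <- scale(iris[, num_cols])

# Median of each standardized column.
std_medians <- apply(iris_std, 2, median)

# Same result without building the standardized frame: standardization is an
# increasing affine map, so it commutes with the median.
std_medians_direct <- sapply(num_cols, function(col) {
  (median(iris[[col]]) - mean(iris[[col]])) / sd(iris[[col]])
})

print(std_medians)
```

Both routes give the same numbers here, but there is no way to check them against the values H2O actually uses, which is the problem.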

Steps to reproduce

h2o_iris <- as.h2o(iris)

glm_model_plug_values_standardized <- h2o.glm(y="Sepal.Length",
                                              training_frame=h2o_iris,
                                              missing_values_handling="PlugValues",
                                              plug_values=as.h2o(data.frame(
                                                Sepal.Width=0,
                                                Petal.Length=0,
                                                Petal.Width=0,
                                                Species="versicolor")),
                                              standardize = TRUE)
# Standardized model
glm_model_plug_values_standardized@model$coefficients
# Intercept     Species.setosa Species.versicolor  Species.virginica        Sepal.Width       Petal.Length        Petal.Width 
# 1.5834912          0.5716142          0.0000000         -0.2420645          0.5199133          0.7813752         -0.3135125 
h2o_iris[,c(-4,-5)]
#   Sepal.Length Sepal.Width Petal.Length
# 1          5.1         3.5          1.4
# 2          4.9         3.0          1.4
# 3          4.7         3.2          1.3
# 4          4.6         3.1          1.5
# 5          5.0         3.6          1.4
# 6          5.4         3.9          1.7

# Prediction for row 1: Intercept + b_SW*SW + b_PL*PL + b_PW*0 + 0 for versicolor
# (the 0 multiplying b_PW = -0.3135125 is the plug value; the actual Petal.Width in row 1 is 0.2)
abs(1.5834912+0.5199133*3.5+0.7813752*1.4 - predict(glm_model_plug_values_standardized, h2o_iris[,c(-4,-5)])[1]) # => 4.067362e-08

Obviously, Petal.Width == 0 is an outlier, and as such it should influence the GLM a lot. But since the plug values are not standardized, the 0 gets interpreted as an already standardized value and does not contribute to the prediction at all (β is multiplied by zero).
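To make the size of the gap concrete, here is a small base R sketch of the two readings of a plug value of 0 for Petal.Width. It assumes standardization is z = (x - mean) / sd with the sample standard deviation; that is an assumption about the scaling, not something taken from the H2O source, and to_z / from_z are hypothetical helper names.

```r
# Assumption: standardization is z = (x - mean) / sd with the sample SD.
pw_mean <- mean(iris$Petal.Width)
pw_sd   <- sd(iris$Petal.Width)

to_z   <- function(x) (x - pw_mean) / pw_sd   # raw -> standardized
from_z <- function(z) z * pw_sd + pw_mean     # standardized -> raw

# A raw Petal.Width of 0 lies well below the mean on the standardized scale.
z_of_raw_zero <- to_z(0)

# A standardized 0 maps back to the column mean on the raw scale.
raw_of_z_zero <- from_z(0)

print(c(z_of_raw_zero = z_of_raw_zero, raw_of_z_zero = raw_of_z_zero))
```

Under that assumption, a 0 that is read as already standardized stands for an average Petal.Width rather than for the extreme value the user supplied.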

@tomasfryda tomasfryda added the bug label Feb 12, 2025