Confusion on Warning when Scoring with Unimportant Features Excluded with GLM model #16363
Comments
Good point you have there. When the GLM coefficients for unimportant columns are zero, scoring should not require those columns to be present in the scoring dataset. To get around this problem for the time being, please add those columns back with random numbers so that the code will not complain. As an alternative, you can use makeGLMModel to include only the four useful predictors and their coefficients, build a new GLM model from those coefficients, and then save it as a MOJO. I will provide you with code later on how to do this.

This is not an easy issue to fix. The reason is that we have a base object that deals with all kinds of models (GBM, GLM, GAM, DL, etc.). The other models have no concept of GLM coefficients, so they always take the user's predictors and response and build a model from them. Likewise, some base objects used to generate the MOJO read every model's predictor names and response. So, when you include only the useful GLM predictors, it will ask where the other predictors are and throw an error.

Thanks, Wendy
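The first workaround above (padding the scoring frame with dummy values for the zero-coefficient columns) can be sketched with pandas before handing the frame to H2O. A minimal sketch; the column names and counts below are hypothetical, not taken from the issue:

```python
import numpy as np
import pandas as pd

# Hypothetical setup: the model was trained on 100 columns (x0 .. x99),
# but the scoring frame only carries the 4 useful predictors.
trained_columns = [f"x{i}" for i in range(100)]
useful = ["x0", "x1", "x2", "x3"]

rng = np.random.default_rng(0)
score_df = pd.DataFrame(rng.random((5, len(useful))), columns=useful)

# Pad the frame with the missing columns, filled with random numbers;
# their GLM coefficients are zero, so the values cannot affect predictions.
for col in trained_columns:
    if col not in score_df.columns:
        score_df[col] = rng.random(len(score_df))

score_df = score_df[trained_columns]  # restore the training column order
```

The padded frame can then be converted with `h2o.H2OFrame(score_df)` and scored as usual.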
Here is the complete code on how to build your model with many predictors and then generate a MOJO with only the important predictors. Here are my steps:

```python
import sys
import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator as glm
from h2o.estimators import H2OGenericEstimator
from sklearn.datasets import make_friedman1

X, y = make_friedman1(n_samples=10000, n_features=10, random_state=8451)
model1 = glm()  # model that uses all coefficients
model2 = glm()  # model that only uses 5 predictors because I pretend the
                # other predictors are useless and have coeff = 0
coef_model1 = model1.coef()  # grab all coefficients from model1
# generate a model with only the 5 predictors; the coefficient values
# are taken from the full model
model_with_good_predictors = glm.makeGLMModel(model=model2, coefs=coeff_dict)
X, y = make_friedman1(n_samples=100, n_features=5, random_state=8452)
# load the saved mojo as a generic model
generic_mojo_glm_from_file = H2OGenericEstimator.from_file(glm_mojo_model)
for ind in range(hdf_test.nrows):
    # if you check the contents of the two prediction frames,
    # they should be the same
    ...
```
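One step the excerpt leaves implicit is how `coeff_dict` is built from the full model's coefficients. A plain-Python sketch, with made-up coefficient values: keep the intercept plus every predictor whose coefficient is nonzero, and pass the result to `makeGLMModel`:

```python
# Hypothetical full coefficient dict, shaped like the one model.coef() returns:
coef_model1 = {
    "Intercept": 0.8,
    "x1": 1.2, "x2": -0.5, "x3": 2.0, "x4": 0.3, "x5": 1.7,
    "x6": 0.0, "x7": 0.0, "x8": 0.0, "x9": 0.0, "x10": 0.0,
}

# Keep the intercept and every predictor with a nonzero coefficient.
coeff_dict = {name: value for name, value in coef_model1.items()
              if name == "Intercept" or value != 0.0}

# coeff_dict now holds the intercept plus the 5 useful predictors only.
```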
Can you work with the code I sent you?
H2O version, Operating System and Environment
I am running H2O on Databricks with the following cluster settings:
and the following version:
Description
We are training an H2O GeneralizedLinearEstimator model on a dataframe that has 100 columns, only 4 of which are actually used to compute `y`; the remaining features are independent of `y` (i.e. unimportant). We used the following code to generate the data:

Since this is a model that does internal variable selection, we explored which features the model actually deemed important using `model.varimp()`, and were curious what would happen when we scored using a subset of the data with only those relevant columns. When scoring using only the 4 relevant columns, we received the following warning message for every other column that was deemed irrelevant:

Expected Behavior
We would assume H2O would not need to append back the unimportant columns, but rather score on the subset of the data (i.e. 4 vs. 100 columns) for speed and cost efficiency.
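The expectation above can be checked with plain linear algebra: a predictor whose GLM coefficient is exactly zero contributes nothing to the linear predictor, so dropping those columns cannot change the score. A minimal NumPy sketch (the column indices and coefficient values are made up for illustration, not taken from the model in the issue):

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 8, 100

X = rng.standard_normal((n_rows, n_cols))
coefs = np.zeros(n_cols)
important = [3, 17, 42, 88]             # hypothetical indices of the 4 useful predictors
coefs[important] = [1.2, -0.5, 2.0, 0.3]
intercept = 0.7

# Full linear predictor using all 100 columns ...
pred_full = intercept + X @ coefs
# ... versus scoring with only the 4 columns whose coefficients are nonzero.
pred_subset = intercept + X[:, important] @ coefs[important]

# Identical: the zero-coefficient columns contribute nothing.
assert np.allclose(pred_full, pred_subset)
```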
Steps to reproduce
Here is the code used to reproduce this warning message:
%pip install --quiet h2o
Upload logs
Output of
h2o.download_all_logs()
h2ologs_20240813_061208.zip