
Confusion on Warning when Scoring with Unimportant Features Excluded with GLM model #16363

Open
ANNIKADAHLMANN-8451 opened this issue Aug 13, 2024 · 3 comments

@ANNIKADAHLMANN-8451

H2O version, Operating System and Environment
I am running H2O on Databricks with the following cluster settings:

  • Single User Cluster
  • 13.3 LTS (Apache Spark 3.4.1, Scala 2.12)

and the following version:

  • H2O: 3.46.0.4

Description
We are training an H2O GeneralizedLinearEstimator model on a dataframe with 100 columns, only 4 of which are actually used to compute y; the remaining features are independent of y (i.e. unimportant). We generate the data with the following code:

from sklearn.datasets import make_friedman1
import pandas as pd

X, y = make_friedman1(n_samples=1000, n_features=100, random_state=8451)
df = pd.DataFrame(X)
df['y'] = y

df.head()

Since this is a model that does internal variable selection, we explored which features the model actually deemed important using model.varimp() and were curious what would happen if we scored on a subset of the data containing only those relevant columns. When scoring using only the 4 relevant columns, we received the following warning message for every column that was deemed irrelevant:

/local_disk0/.ephemeral_nfs/envs/pythonEnv-4602915f-61f7-4bcd-8c98-f8e3a654a43b/lib/python3.10/site-packages/h2o/job.py:81: UserWarning: Test/Validation dataset is missing column '5': substituting in a column of NaN
  warnings.warn(w)

Expected Behavior
We would expect H2O not to need the unimportant columns appended back, but rather to score on the subset of data (i.e. 4 vs. 100 columns) for speed and cost efficiency.
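For intuition on the expected behavior, here is a toy sketch in plain Python (not the H2O API, with made-up coefficient values) of why a linear model cannot depend on columns whose coefficients are exactly zero:

```python
# Toy illustration: a GLM prediction is x . beta + intercept, so features
# whose coefficient is exactly zero cannot change the score.
coefs = {"0": 1.5, "1": -2.0, "5": 0.0}  # hypothetical fitted coefficients
intercept = 0.3

def predict(row):
    # Treat absent features as 0.0 -- harmless when their coefficient is 0.
    return intercept + sum(b * row.get(name, 0.0) for name, b in coefs.items())

full_row = {"0": 1.0, "1": 2.0, "5": 123.4}  # all columns present
subset_row = {"0": 1.0, "1": 2.0}            # zero-coefficient column dropped

print(predict(full_row) == predict(subset_row))  # True
```

In principle, then, a GLM MOJO could score a frame missing the zero-coefficient columns without any substitution at all.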

Steps to reproduce
Here is the code used to reproduce this warning message:

%pip install --quiet h2o

import h2o
from h2o.estimators import H2OGeneralizedLinearEstimator
from sklearn.datasets import make_friedman1
import pandas as pd

h2o.init()

X, y = make_friedman1(n_samples=1000, n_features=100, random_state=8451)
df = pd.DataFrame(X)
df['y'] = y
hdf = h2o.H2OFrame(df)

predictors = hdf.columns
response = "y"
predictors.remove(response)

model = H2OGeneralizedLinearEstimator()
model.train(x=predictors, y=response, training_frame=hdf)

X_tst, y_tst = make_friedman1(n_samples=1000, n_features=100, random_state=8452) # intentionally using a different random state to generate a different sample
tst = pd.DataFrame(X_tst)
tst['y'] = y_tst

X_tst_subset = tst[[0, 1, 3, 4]] # relevant features as revealed by model.varimp()
X_tst_subset_hf = h2o.H2OFrame(X_tst_subset)
subset_preds = model.predict(X_tst_subset_hf) # THIS IS WHAT TRIGGERS THE WARNING MESSAGE FOR EACH UNIMPORTANT COLUMN

Upload logs
Output of h2o.download_all_logs() h2ologs_20240813_061208.zip

@wendycwong
Contributor

@ANNIKADAHLMANN-8451

Good point. When the GLM coefficients for unimportant columns are zero, scoring should not require those columns to be present in the dataset.

To get around this problem for the time being, please just add those columns back (random numbers are fine) so that the code will not complain.

As an alternative, you can use makeGLMModel to build a new GLM model containing only the four useful predictors and their coefficients, and then save that as a MOJO. I will provide you with code later showing how to do this.

This is not an easy issue to fix. The reason is that we have a base object that handles all kinds of models (GBM, GLM, GAM, DL, etc.). The other models have no concept of GLM coefficients, so they always take the user's predictors and response and build a model from them.

Similarly, the base objects used for MOJO scoring read every model's predictor names and response. So when you include only the useful GLM predictors, it will complain that the other predictors are missing and throw an error.
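One way to add the missing columns back before converting to an H2OFrame is a pandas reindex. This is a sketch with a hypothetical 6-column training schema (the real one has 100 columns); the NaN fill mirrors the substitution H2O performs when it warns:

```python
import pandas as pd

# Hypothetical training schema (the issue's real schema has 100 columns).
train_cols = ["0", "1", "2", "3", "4", "5"]

# Test frame containing only the columns the model actually uses.
tst_subset = pd.DataFrame({"0": [0.1, 0.2], "1": [0.3, 0.4],
                           "3": [0.5, 0.6], "4": [0.7, 0.8]})

# Reindex to the full schema; the missing columns are filled with NaN,
# which is the same substitution H2O makes when it emits the warning.
tst_full = tst_subset.reindex(columns=train_cols)
```

With `tst_full` converted via `h2o.H2OFrame`, every column the model expects is present, so no missing-column warnings should fire.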

Thanks, Wendy
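The schema check described above can be sketched roughly as follows (a hypothetical simplification, not H2O's actual implementation): the scorer validates the incoming frame against the stored predictor names before any coefficients are consulted, which is why a warning fires per missing column regardless of its coefficient.

```python
import warnings

def validate_frame(stored_predictors, frame_columns):
    """Warn for every stored predictor missing from the scoring frame,
    without ever looking at the model's coefficients."""
    missing = [p for p in stored_predictors if p not in frame_columns]
    for p in missing:
        warnings.warn(f"Test/Validation dataset is missing column '{p}': "
                      "substituting in a column of NaN")
    return missing

missing = validate_frame(["0", "1", "2", "3"], ["0", "1"])  # -> ["2", "3"]
```

Because this validation lives in the model-agnostic base layer, skipping zero-coefficient columns would require GLM-specific logic there.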

@wendycwong
Contributor

wendycwong commented Sep 4, 2024

@ANNIKADAHLMANN-8451

Here is the complete code showing how to build your model with many predictors and then generate a MOJO with only the important predictors. Here are my steps:

  1. Generate the data (copied from your code).
  2. Train an H2O model with 10 features (yours has 100 columns) as model1.
  3. Grab the coefficients of only 5 predictors (I pretend the other 5 are useless and their coefficients are zero).
  4. Generate a new H2O model with only the 5 predictors grabbed from model1. I first train a new GLM model with only 5 predictors (model2) and then call makeGLMModel to generate a GLM model with the correct coefficients (model_with_good_predictors).
  5. Save model_with_good_predictors as a MOJO.
  6. Generate a new test dataset with only 5 predictors.
  7. Load the MOJO as a generic model and generate predictions on the new test dataset. You could use MOJO predict directly; I use a generic model to make it easy to compare prediction results.
  8. Generate predictions with model_with_good_predictors.
  9. Compare the predictions from steps 7 and 8; they should be the same.

Here is the complete code.

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator as glm
from sklearn.datasets import make_friedman1
import pandas as pd
import tempfile
from h2o.estimators import H2OGenericEstimator

X, y = make_friedman1(n_samples = 10000, n_features = 10, random_state=8451)
df = pd.DataFrame(X)
df['y'] = y
hdf = h2o.H2OFrame(df)
predictors = hdf.columns
response = "y"
predictors.remove(response)

model1 = glm() # model that uses all predictors
model1.train(x=predictors, y=response, training_frame=hdf)

model2 = glm() # model that only uses 5 predictors because I pretend the other predictors are useless and have coeff = 0
model2.train(x=["0", "1", "2", "3", "4"], y=response, training_frame=hdf)

coef_model1 = model1.coef() # grab all coefficients from model1
coeff_dict = {"0": coef_model1["0"], "1": coef_model1["1"], "2": coef_model1["2"], "3": coef_model1["3"],
              "4": coef_model1["4"], "Intercept": coef_model1["Intercept"]} # grab the coefficients we care about

model_with_good_predictors = glm.makeGLMModel(model=model2, coefs=coeff_dict) # generate model with only 5 predictors and the coefficient values are from full model
tmpdir = tempfile.mkdtemp()
glm_mojo_model = model_with_good_predictors.download_mojo(tmpdir) # save to mojo

X, y = make_friedman1(n_samples = 100, n_features = 5, random_state=8452)
df = pd.DataFrame(X)
df['y'] = y
hdf_test = h2o.H2OFrame(df) # generate test dataset with only 5 predictors

generic_mojo_glm_from_file = H2OGenericEstimator.from_file(glm_mojo_model) # load mojo as generic model
predict_mojo = generic_mojo_glm_from_file.predict(hdf_test)
predict_model = model_with_good_predictors.predict(hdf_test)

for ind in range(hdf_test.nrows): # the two prediction frames should have identical contents
    assert abs(predict_mojo[ind, 0] - predict_model[ind, 0]) < 1e-10

@wendycwong
Contributor

@ANNIKADAHLMANN-8451

Can you work with the code I sent you?
