Enhance Documentation on information about reproducibility #16554

Open
audreyleteve opened this issue Feb 14, 2025 · 1 comment
Labels: none yet
Milestone: 3.46.0.7
Enhance the documentation with information and tips on reproducibility, both in general and for each algorithm, as is already available for GBM.

audreyleteve commented Feb 18, 2025

Following are general tips for reproducibility in h2o-3.
These are relevant for most of the algorithms, which are deterministic when run on a single node. For GBM on multiple nodes, you can check this page.

During Modeling:

  • Ensure that the seed parameter is set whenever a model is instantiated. H2O-3 sets a seed automatically and stores the same seed in the MOJO file, but it is still recommended to set the seed explicitly if you want control over and traceability of model training reproducibility.
  • When AutoML is used, the seed parameter must also be set accordingly.
  • While setting the seed parameter typically drives reproducibility, there are exceptions. For example, Deep Learning can be forced to be more reproducible by setting reproducible=True (but it is significantly slower); alternatively, you can exclude deep learning models by adding 'DeepLearning' to the 'exclude_algos' parameter.
  • H2O-3 AutoML also tries to fit as many models as it can in the allotted time to improve performance, so if setting the seed does not resolve a reproducibility issue, switching to the 'max_models' parameter in place of 'max_runtime_secs' can reduce the risk further, e.g. aml = H2OAutoML(seed=1234, exclude_algos=["DeepLearning"], max_models=20) (see the sketch after this list).
  • Most reproducibility issues in modeling (and eventually in production scoring) arise when the datasets being used are not exact copies of each other. Ensure that the data being used is always consistent and complete. For modeling, how the source file is split into training, validation, and test sets is critical.
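
Below is a minimal sketch of the AutoML call from the tips above, assuming an H2O-3 Python environment and a hypothetical train.csv with a response column named 'target':

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Hypothetical dataset and response column; substitute your own.
train = h2o.import_file("train.csv")
x = [c for c in train.columns if c != "target"]

# Fixed seed, deep learning excluded, and max_models instead of
# max_runtime_secs so the same set of models is attempted on every run.
aml = H2OAutoML(seed=1234, exclude_algos=["DeepLearning"], max_models=20)
aml.train(x=x, y="target", training_frame=train)
print(aml.leaderboard)
```

Using max_models rather than a time budget removes the dependence on machine speed, which otherwise changes how many models fit into the allotted time.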

During MOJO Deployment:

  • The first obvious thing to check is that the exact MOJO file produced during modeling is the one being deployed into production. As much as possible, do not try to reproduce a MOJO file by modeling from scratch in a different environment, as countless variations in another environment can easily lead to a different MOJO file.
  • It is recommended to verify that the MOJO file intended for production is the right one by checking that the file has the same size, same date, and same hash value (see the sketch after this list).
  • The production environment needs to be as similar as possible to the environment in which the MOJO file is tested for reproducibility. The best practice is to test the MOJO file in a staging environment that is exactly the same as the production environment before deploying the MOJO into production. This includes the same amount of memory, the same number of CPUs, and the same software versions.
  • In a Kubernetes environment, ensure that the resource requests and limits for a pod are consistent to avoid variations in memory and CPU allocation.
  • When deploying the MOJO model, ensure that the h2o-genmodel.jar file is included. This file is required for scoring and contains the necessary readers and interpreters for the MOJO model. The h2o-genmodel.jar file should be the same version as the one used during model training.
  • If deploying on AWS, transfer the MOJO file into the /tmp folder of the instance before launching. Ensure that the instance has the necessary permissions to access the MOJO file if it is stored in S3.
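
As a minimal sketch of the hash check mentioned above (file paths are hypothetical), a SHA-256 mismatch means the staged artifact is not the file produced by training:

```python
import hashlib

def sha256(path, chunk_size=1 << 20):
    """Hash the file in chunks so large MOJOs need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the MOJO archived from training against the one staged for deployment.
if sha256("model_from_training.zip") != sha256("model_staged_for_prod.zip"):
    raise SystemExit("MOJO files differ; do not deploy")
```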

Treatment of Data Files:

  • It is critical to ensure that the input data format during production inference matches the format used during model training. This includes the same feature names and data preprocessing steps.
  • In many typical scenarios, the step where data is curated, transformed, and new features are engineered introduces the most risk to reproducibility. Rather than treating the data preparation process itself as a reproducible event, it is recommended to use the dataset that is fed into scoring as the checkpoint for reproducing results.
  • Check that the same exact data file is being used when comparing scoring results, just before it is fed into the MOJO pipeline for scoring. The true test is to compare the hash value of the data file against the original one just before MOJO scoring is done (see the sketch after this list).
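
A sketch of that comparison, assuming the h2o Python package is installed; h2o.mojo_predict_pandas shells out to h2o-genmodel.jar, so both artifacts should be the same versions used at training time (paths are hypothetical):

```python
import hashlib
import pandas as pd
import h2o

def sha256(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Confirm the scoring input is byte-for-byte identical to the reference copy;
# otherwise any difference in predictions is meaningless.
assert sha256("scoring_input.csv") == sha256("reference_input.csv"), \
    "input data differs from the reference copy"

# Score locally through the MOJO and the matching h2o-genmodel.jar.
frame = pd.read_csv("scoring_input.csv")
preds = h2o.mojo_predict_pandas(frame, "model.zip",
                                genmodel_jar_path="h2o-genmodel.jar")
print(preds.head())
```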

Variability from Clusters:

  • Note that there is an extra level of complexity whenever a cluster is used. Clusters and multiple nodes are challenging for reproducibility due to possible hardware variability; parallel and concurrent processes that can introduce variability; floating point arithmetic; data distribution when the ordering of data matters; model randomness when seeds are not set properly; and dependencies on libraries or components whose versions differ.
  • In a single-node cluster, in a model training scenario, only one file should be used; importing and using multiple files may impact reproducibility. In addition, all parameters used in training must be exactly the same, and the same exact seed must be used wherever and whenever data sampling is done.
  • In multi-node clusters, in addition to a fixed seed and identical parameters, the cluster and node configuration must be exactly the same, i.e. the same number of nodes, the same number of CPU cores per node, and the same leader node used to start the training process (see the sketch after this list).
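
A small sketch of recording the cluster topology alongside a training run so two runs can be compared, assuming the h2o Python package; cloud_size is assumed here to be the node count exposed by the cluster-status schema, and the expected node count is hypothetical:

```python
import h2o

h2o.init()
cluster = h2o.cluster()

# Record what this run actually executed on.
print("H2O version:", cluster.version)
print("Cloud size :", cluster.cloud_size)  # number of nodes in the cluster
cluster.show_status(detailed=True)         # per-node cores, memory, etc.

# Hypothetical expected topology; abort if it differs from the original run.
EXPECTED_NODES = 4
if cluster.cloud_size != EXPECTED_NODES:
    raise SystemExit("cluster topology changed; results may not reproduce")
```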

In the strictest sense, reproducibility cannot be guaranteed, as there will be exceptions due to floating point operations, but the above are best-effort measures.

valenad1 added this to the 3.46.0.7 milestone Feb 18, 2025