[c++] enhance error handling for forced splits file loading #6832

KYash03 · 2025-02-16T21:38:02Z

jameslamb · 2025-02-16T22:32:49Z

Thanks for working on this, can you please add some tests that cover these exceptions?

KYash03 · 2025-02-17T01:11:38Z

@microsoft-github-policy-service agree

jameslamb

Thanks for working on this! The general approach looks good and the error messages are informative. Nice idea thinking about "file exists but cannot be parsed" as a separate case too!

But I think this deserves some more careful consideration to be sure that we don't end up introducing a requirement on the file indicated by forcedsplits_filename also existing at scoring (prediction) time.

jameslamb · 2025-02-17T03:28:49Z

src/boosting/gbdt.cpp

+    if (!forced_splits_file.good()) {
+      Log::Warning("Forced splits file '%s' does not exist. Forced splits will be ignored.",
+                  config->forcedsplits_filename.c_str());


I think this should be a fatal error at training time... if I'm training a model and expecting specific splits to be used, I'd prefer a big loud error to a training run wasting time and compute resources only to produce a model that accidentally does not look like what I'd wanted.

HOWEVER... I think GBDT::Init() and/or GBDT::ResetConfig() will also be called when you load a model at scoring time, and at scoring time we wouldn't want to get a fatal error because of a missing or malformed file which is only supposed to affect training.

I'm not certain how to resolve that. Can you please investigate that and propose something?

It would probably be helpful to add tests for these different conditions. You can do this in Python for this purpose. Or if you don't have time / interest, I can push some tests here and then you could work on making them pass?

So to be clear, the behavior I want to see is:

training time:

- forcedsplits_filename file does not exist or is not readable --> ERROR
- forcedsplits_filename is not valid JSON --> ERROR

prediction / scoring time:

- forcedsplits_filename file does not exist or is not readable --> no log output, no errors
- forcedsplits_filename is not valid JSON --> no log output, no errors
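The four cases listed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `LoadMode`, `JsonValidator`, and `LoadForcedSplits` are hypothetical names that do not exist in LightGBM, the JSON check is passed in as a callback instead of calling a real parser, and a thrown `std::runtime_error` stands in for a fatal error.

```cpp
#include <cassert>
#include <fstream>
#include <functional>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical mode switch (assumption: the caller knows which mode it is in).
enum class LoadMode { kTraining, kPrediction };

// Hypothetical callback standing in for a real JSON parser.
using JsonValidator = std::function<bool(const std::string&)>;

// Returns true and fills *contents when the file was read and validated.
// Training: a missing, unreadable, or invalid file throws.
// Prediction: the same problems are skipped silently and false is returned.
bool LoadForcedSplits(const std::string& filename, LoadMode mode,
                      const JsonValidator& is_valid_json,
                      std::string* contents) {
  contents->clear();
  if (filename.empty()) return false;  // no forced splits requested
  const bool strict = (mode == LoadMode::kTraining);

  std::ifstream file(filename.c_str());
  if (!file.good()) {
    if (strict) {
      throw std::runtime_error("Forced splits file '" + filename +
                               "' does not exist or is not readable.");
    }
    return false;
  }

  std::stringstream buffer;
  buffer << file.rdbuf();
  if (!is_valid_json(buffer.str())) {
    if (strict) {
      throw std::runtime_error("Forced splits file '" + filename +
                               "' is not valid JSON.");
    }
    return false;
  }

  *contents = buffer.str();
  return true;
}
```

The open question in this thread is where the `mode` value would come from; the sketch simply assumes the caller can supply it.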

KYash03 · 2025-02-18

We could add a flag to the GBDT class to indicate the current mode.

This is what I was thinking:

    bool is_training_ = false;

    // Turn the flag on at the start of training, and off at the end.
    void GBDT::Train() {
      is_training_ = true;
      // ... regular training code ...
      is_training_ = false;
    }

    // In Init() and ResetConfig(), handle the file as follows:
    if (is_training_) {
      // Stop with an error if anything is wrong.
    } else {
      // Simply continue if there are issues.
    }

Regarding the tests, I'd be happy to write them!

jameslamb · 2025-02-18

Thanks very much. It is not that simple.

For example, there are many workflows where training and prediction are done in the same process, using the same Booster. So a single property is_training_ is not going to work.
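A toy sketch of one way such a flag can fall short when a single object is used for setup, training, and prediction. `ToyBooster` is hypothetical and is not LightGBM's GBDT class; the call order shown (setup before training, reconfiguration and prediction afterwards on the same object) is an assumption made for illustration only.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a booster object shared by training and prediction.
class ToyBooster {
 public:
  void Init() { Record("Init"); }
  void Train() {
    is_training_ = true;   // flag is only set while Train() is running
    Record("Train");
    is_training_ = false;
  }
  void ResetConfig() { Record("ResetConfig"); }
  void Predict() { Record("Predict"); }

  const std::vector<std::string>& log() const { return log_; }

 private:
  // Records which method ran and what the flag looked like at that moment.
  void Record(const std::string& where) {
    log_.push_back(where + (is_training_ ? ":training" : ":not-training"));
  }

  bool is_training_ = false;
  std::vector<std::string> log_;
};
```

In this toy, `Init()` and `ResetConfig()` always observe the flag as off, because they run outside `Train()`. A check placed in them cannot distinguish "setting up for training" from "setting up for prediction" by looking at the flag alone.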

There are also multiple APIs for training.

LightGBM/src/boosting/gbdt.cpp, line 237 in 3fad53b:

    void GBDT::Train(int snapshot_freq, const std::string& model_output_path) {

LightGBM/src/boosting/gbdt.cpp, line 344 in 3fad53b:

    bool GBDT::TrainOneIter(const score_t* gradients, const score_t* hessians) {

And we'd also want to be careful to not introduce this type of checking on every boosting round, as that would hurt performance.
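One possible shape for keeping the per-round cost low, shown only as a sketch: run the expensive validation once and remember that it ran, so later rounds pay for a single boolean test. `ToyTrainer`, its members, and the choice to re-validate after `ResetConfig()` are all hypothetical and are not something proposed or agreed on in this thread.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Toy stand-in for a training loop; not LightGBM's GBDT class.
class ToyTrainer {
 public:
  // `validate` is a hypothetical callback that would check the forced splits
  // file and raise an error if something is wrong.
  explicit ToyTrainer(std::function<void()> validate)
      : validate_(std::move(validate)) {}

  void TrainOneIter() {
    if (!validated_) {  // cheap boolean test on every round
      validate_();      // expensive check runs at most once per config
      validated_ = true;
    }
    ++iterations_;
  }

  // Assumption: a config change may point at a different file, so the next
  // round validates again.
  void ResetConfig() { validated_ = false; }

  int iterations() const { return iterations_; }

 private:
  std::function<void()> validate_;
  bool validated_ = false;
  int iterations_ = 0;
};
```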

Maybe @shiyu1994 could help us figure out where to put a check like this.

Also referencing this related PR to help: #5653

KYash03 requested review from guolinke, jameslamb, shiyu1994, jmoralez, borchero and StrikerRUS as code owners February 16, 2025 21:38

KYash03 force-pushed the fix/forcedsplits-file-error branch from 0cd73e5 to 7e35462 Compare February 16, 2025 21:38

KYash03 mentioned this pull request Feb 16, 2025

[c++] forcedsplits_filename pointing at a non-existent file is silently ignored #6830

Open

jameslamb added the fix label Feb 16, 2025

jameslamb added the in progress label Feb 16, 2025

KYash03 force-pushed the fix/forcedsplits-file-error branch from 7e35462 to 05430e5 Compare February 17, 2025 01:08

[gbdt] enhance error handling for forced splits file loading

133cc75

KYash03 force-pushed the fix/forcedsplits-file-error branch from 05430e5 to 133cc75 Compare February 17, 2025 01:13

jameslamb changed the title ~~[gbdt] enhance error handling for forced splits file loading~~ [c++] enhance error handling for forced splits file loading Feb 17, 2025

jameslamb requested changes Feb 18, 2025

Merge branch 'microsoft:master' into fix/forcedsplits-file-error

c1ace38
