Replicate result of cv splits with the solution of a fit #791

bsaldivaremc2 · 2025-02-13T15:26:00Z

Best regards.

I have read and tried the approaches from these previous issues:
Fit best model on new data in Optuna mode
Saving mljar automl model for future use

I want to manually replicate the results found with the normal fit (Compete, Optuna or another mode).
For instance, I have already run fit in Optuna mode with some custom cv_indices, and the results were stored.
cv_indices has 5 elements (one train/validation split per fold), so I want to do:

mean_metric = 0
for train_index, val_index in cv_indices:
    model = AutoML(...)  # something with the results path
    model.fit(X[train_index, :], y[train_index])
    y_pred = model.predict_proba(X[val_index, :])
    mean_metric += some_metric(y[val_index], y_pred)
print(mean_metric / len(cv_indices))

The printed value should be similar to the result reported during the fit:

import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

# Initialize AutoML with custom CV
automl = AutoML(
    mode="Optuna",  # Or "Explain" / "Perform"
    ml_task="binary_classification",
    results_path=f"{DATADIR}AutoML_Optuna_Results_2",
    validation_strategy={
        "validation_type": "custom",
        "custom_cv": cv_indices
    },
    eval_metric="auc",
    optuna_time_budget=60 * 5,
)

# Fit the model
automl.fit(xdf, target, cv=cv_indices)

However, the result I get is higher, as if the model had already seen the validation data and it were therefore already part of the "training set".
I appreciate your help.


pplonski · 2025-02-14T14:12:02Z

Hi @bsaldivaremc2

I'm not sure what you are trying to do. Could you please provide more context? Are you trying to call AutoML 5 times, once on each fold, and compare it with AutoML running on all folds?

bsaldivaremc2 · 2025-02-14T14:34:56Z

Yes.
Given some custom cv_indices (cross-validation splits), I want to run AutoML to find the best model or models (the final solution).
At the end I get a report of how the final solution performed across the custom cv_indices. Let's say roc_auc=0.8.
Once this is done, I want to use the final solution's configuration to manually fit and predict across all the splits in cv_indices, and obtain an average roc_auc equivalent to the one reported while AutoML was searching for the solution.

The equivalent of:

from tpot import TPOTClassifier

# Initialize and train TPOT classifier
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, scoring='roc_auc', cv=cv_indices, random_state=42)
tpot.fit(X,y)

# Export the best pipeline
tpot.export('best_tpot_pipeline.py') # "solution"

Using the solution code from the best_tpot_pipeline.py:

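from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive
from tqdm import tqdm

metric = 0
for ti, vi in tqdm(cv_indices):
    # Build a fresh pipeline for every fold
    exported_pipeline = make_pipeline(
        StackingEstimator(estimator=LinearSVC(C=1.0, dual=True, loss="hinge", penalty="l2", tol=0.001)),
        RandomForestClassifier(bootstrap=False, criterion="gini", max_features=0.1, min_samples_leaf=19, min_samples_split=20, n_estimators=100)
    )
    # Fix random state for all the steps in exported pipeline
    set_param_recursive(exported_pipeline.steps, 'random_state', 42)
    exported_pipeline.fit(X[ti], y[ti])
    preds = exported_pipeline.predict(X[vi])
    # get_roc_auc: helper not shown here
    metric += get_roc_auc(y[vi], preds)
final_metric = metric / len(cv_indices)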

Here I expect final_metric to be equivalent to the value reported by AutoML (I understand that my example uses TPOT, sorry).
Thank you in advance.

pplonski · 2025-02-17T09:57:51Z

Hi @bsaldivaremc2,

It should be possible to extract each step from AutoML and run the training for each fold manually. We don't have a feature that automatically extracts the pipeline steps for a selected model, so you need to debug the selected model to get each step of the pipeline.
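For illustration, running each fold manually could look like the sketch below once the pipeline steps are known. It is not the AutoML pipeline itself: the scaler plus logistic regression is a placeholder for whatever steps are extracted from the selected model, and the synthetic data and the names splitter, fold_scores and y_score are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with the real features and target.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Custom splits as a list of (train_index, val_index) pairs, 5 folds.
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_indices = list(splitter.split(X, y))

fold_scores = []
for train_index, val_index in cv_indices:
    # Placeholder pipeline (assumption): stands in for the extracted steps.
    pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    # A fresh pipeline per fold, fitted on that fold's training rows only.
    pipeline.fit(X[train_index], y[train_index])
    # AUC is computed from class-1 probabilities, not hard labels.
    y_score = pipeline.predict_proba(X[val_index])[:, 1]
    fold_scores.append(roc_auc_score(y[val_index], y_score))

mean_auc = float(np.mean(fold_scores))
print(f"per-fold AUC: {np.round(fold_scores, 3)}, mean: {mean_auc:.3f}")
```

The important detail is that every fold gets a newly constructed pipeline, and all fitted steps (including preprocessing) only ever see that fold's training rows.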

bsaldivaremc2 · 2025-02-17T10:33:27Z

Hi, thanks for the reply. Just to be sure I am communicating my question correctly, this is what I want to do:

import json
import pandas as pd
from sklearn.metrics import roc_auc_score
from supervised.automl import AutoML
from tqdm import tqdm

DATADIR = "/content/"
# Initialize AutoML with custom CV
automl = AutoML(
    mode="Optuna",
    ml_task="binary_classification",
    results_path=f"{DATADIR}AutoML_Optuna_Results_2",
    validation_strategy={
        "validation_type": "custom",
        "custom_cv": cv_indices
    },
    eval_metric="auc",
    optuna_time_budget=60 * 5,
)

# Fit the model
automl.fit(xdf, target, cv=cv_indices)  # e.g. auc=0.8

# Load best solution
with open('/content/AutoML_Optuna_Results_2/optuna/optuna.json') as f:
    my_params = json.load(f)

# Fit on new data
y_trues, y_preds = [], []
mean_auc = 0

# Note: D is the same results directory used by the first fit above
D = "/content/AutoML_Optuna_Results_2/"
for ti, vi in tqdm(cv_indices):
    model = AutoML(mode='Optuna', results_path=D, optuna_init_params=my_params)
    model.fit(xdf[ti], target[ti])
    y_pred = model.predict_proba(xdf[vi])
    y_true = target[vi]
    y_trues.append(y_true)
    y_preds.append(y_pred[:, 1])
    mean_auc += roc_auc_score(y_true, y_pred[:, 1])
mean_auc /= len(cv_indices)  # expecting the same as the result in fit, 0.8

When I run this, I always get a mean_auc higher than the one found during fit, as if the model already knew all the folds. I suspect that in the final loop I am progressively giving the model new data to fit on, so it ends up learning all folds.

Could you please point out if I am doing this wrong?
Thank you again

pplonski · 2025-02-17T10:47:24Z

@bsaldivaremc2 when you run AutoML for the second time, you have a smaller dataset... Do you want to have a quick chat on Google Meet? Send me an email at [email protected]
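The symptom described above, a manual per-fold score higher than the one reported during fit, is the pattern one would expect if the model used for prediction had already been trained on the validation rows. The thread does not confirm this as the cause. One thing that may be worth checking, as an assumption rather than a confirmed diagnosis, is whether pointing results_path at the directory of the first run makes AutoML load the already trained models instead of refitting on the fold. The sketch below uses scikit-learn only (not AutoML) and random labels to show how large the effect of such leakage can be; all names in it are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = rng.integers(0, 2, size=400)  # random labels: there is no real signal to learn

cv_indices = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))

# Leaky setup: one model fitted on ALL rows, later scored on each validation fold.
leaky_model = DecisionTreeClassifier(random_state=0).fit(X, y)

honest_scores, leaky_scores = [], []
for train_index, val_index in cv_indices:
    # Honest setup: a fresh model that only sees this fold's training rows.
    honest_model = DecisionTreeClassifier(random_state=0)
    honest_model.fit(X[train_index], y[train_index])

    honest_proba = honest_model.predict_proba(X[val_index])[:, 1]
    leaky_proba = leaky_model.predict_proba(X[val_index])[:, 1]

    honest_scores.append(roc_auc_score(y[val_index], honest_proba))
    leaky_scores.append(roc_auc_score(y[val_index], leaky_proba))

honest_auc = float(np.mean(honest_scores))
leaky_auc = float(np.mean(leaky_scores))
print(f"honest mean AUC: {honest_auc:.3f}, leaky mean AUC: {leaky_auc:.3f}")
```

Because the labels are random, the honest score stays near chance, while the model that has memorized every row scores almost perfectly on the same folds. Using a separate, empty results directory for each fold would be one way to rule this out in the AutoML loop.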
