Replicate result of cv splits with the solution of a fit #791

Open
bsaldivaremc2 opened this issue Feb 13, 2025 · 5 comments

Comments

@bsaldivaremc2

Hello,

I read and tested the previous issues:
Fit best model on new data in Optuna mode
Saving mljar automl model for future use

I want to manually replicate the results found with the normal fit (Compete, Optuna, or another mode).
For instance, I have already run fit in Optuna mode with some custom cv_indices, and the results were stored.
cv_indices has 5 elements, so I want to do:

mean_metric = 0
for train_index, val_index in cv_indices:
    model = AutoML(...something with the path)
    model.fit(X[train_index, :], y[train_index])
    y_pred = model.predict_proba(X[val_index, :])
    mean_metric += some_metric(y[val_index], y_pred)
print(mean_metric / len(cv_indices))

The printed value should be similar to the result reported during the original fit:

import pandas as pd
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML

# Initialize AutoML with custom CV
automl = AutoML(
    mode="Optuna",  # Or "Explain" / "Perform"
    ml_task="binary_classification",
    results_path=f"{DATADIR}AutoML_Optuna_Results_2",
    validation_strategy={
        "validation_type": "custom",
        "custom_cv": cv_indices
    },
    eval_metric="auc",
    optuna_time_budget=60 * 5
)

# Fit the model
automl.fit(xdf, target, cv=cv_indices)

Nonetheless, the result I get is higher, as if the model had already seen the validation data as part of its training set.
I appreciate your help.
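
A minimal sketch of the symptom described above, using scikit-learn stand-ins rather than AutoML. The synthetic data, the RandomForestClassifier and all variable names are illustrative assumptions, not taken from the issue. It compares a model that is refitted on each training fold with a model that was fitted once on all rows and is then scored on the same validation folds:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic, noisy binary problem (illustrative data only).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
cv_indices = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y))

# A model fitted once on ALL rows has already seen every validation fold.
leaky_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

honest_scores, leaky_scores = [], []
for train_index, val_index in cv_indices:
    # Honest: refit from scratch on the training rows of this fold only.
    fold_model = RandomForestClassifier(n_estimators=100, random_state=0)
    fold_model.fit(X[train_index], y[train_index])
    honest_scores.append(
        roc_auc_score(y[val_index], fold_model.predict_proba(X[val_index])[:, 1])
    )
    # Leaky: score the model that was trained on these very rows.
    leaky_scores.append(
        roc_auc_score(y[val_index], leaky_model.predict_proba(X[val_index])[:, 1])
    )

honest_auc = float(np.mean(honest_scores))
leaky_auc = float(np.mean(leaky_scores))
print(f"honest mean AUC: {honest_auc:.3f}, leaky mean AUC: {leaky_auc:.3f}")
```

If a manual loop yields a score closer to the "leaky" number than to the one reported during the search, it may be worth checking whether the object used for prediction was really refitted on the training fold alone.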

@pplonski
Contributor

Hi @bsaldivaremc2

I'm not sure what you are trying to do. Could you please provide more context? Are you trying to call AutoML 5 times, once per fold, and compare that with AutoML running on all folds?

@bsaldivaremc2
Author

Yes.
Given some custom cv_indices (cross-validation splits), I want to run AutoML to find the best model or models (the final solution).
At the end I get a report of how the final solution performed across the custom cv_indices, let's say roc_auc=0.8.
Once this is done, I want to use the final solution's configuration to manually fit and predict on every split in cv_indices, and obtain an average roc_auc equivalent to the one reported while AutoML was searching for the solution.

The equivalent with TPOT would be:

from tpot import TPOTClassifier

# Initialize and train TPOT classifier
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, scoring='roc_auc', cv=cv_indices, random_state=42)
tpot.fit(X,y)

# Export the best pipeline
tpot.export('best_tpot_pipeline.py') # "solution"

Using the solution code from best_tpot_pipeline.py:

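from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive
from tqdm import tqdm

metric = 0
for ti, vi in tqdm(cv_indices):
    exported_pipeline = make_pipeline(
        StackingEstimator(estimator=LinearSVC(C=1.0, dual=True, loss="hinge", penalty="l2", tol=0.001)),
        RandomForestClassifier(bootstrap=False, criterion="gini", max_features=0.1, min_samples_leaf=19, min_samples_split=20, n_estimators=100)
    )
    # Fix random state for all the steps in exported pipeline
    set_param_recursive(exported_pipeline.steps, 'random_state', 42)
    exported_pipeline.fit(X[ti], y[ti])
    # ROC AUC is computed from scores, so use class-1 probabilities rather than hard labels
    preds = exported_pipeline.predict_proba(X[vi])[:, 1]
    metric += roc_auc_score(y[vi], preds)
final_metric = metric / len(cv_indices)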

Here, I expect final_metric to be equivalent to the value reported by AutoML (I understand that my example uses TPOT, sorry).
Thank you in advance.
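
The expectation stated above can be checked in isolation with scikit-learn. The following is a minimal sketch under stated assumptions: the synthetic data, the RandomForestClassifier and the helper make_model are illustrative and not part of the issue. With a fixed configuration and a fixed seed, a manual loop over the custom splits gives the same mean ROC AUC as the cross-validated score computed by cross_val_score on those splits:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
# Custom splits as a list of (train_indices, validation_indices) pairs.
cv_indices = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))


def make_model():
    # Fixed configuration and seed, so every refit is reproducible.
    return RandomForestClassifier(n_estimators=50, random_state=42)


# Score "reported" by a search-style evaluation over the custom splits.
reported = float(
    cross_val_score(make_model(), X, y, scoring="roc_auc", cv=cv_indices).mean()
)

# Manual replication: a fresh model per fold, fitted only on the training rows.
fold_scores = []
for ti, vi in cv_indices:
    model = make_model().fit(X[ti], y[ti])
    fold_scores.append(roc_auc_score(y[vi], model.predict_proba(X[vi])[:, 1]))
manual = float(np.mean(fold_scores))

print(f"reported: {reported:.6f}, manual: {manual:.6f}")
```

The two numbers agree only because every fold starts from a fresh, unfitted estimator with the same settings. Anything carried over between folds or from the original search breaks that equivalence.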

@pplonski
Contributor

Hi @bsaldivaremc2,

It should be possible to extract each step from AutoML and run the training for each fold manually. We don't have a feature that automatically extracts the pipeline steps for a selected model, so you need to debug the selected model to get each step of the pipeline.
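
As a starting point for that kind of inspection, here is a hedged sketch that gathers per-model JSON files from a results directory. It assumes the directory holds one sub-folder per trained model, each with a file named framework.json. That layout and file name are assumptions about the on-disk format, not something confirmed in this thread, and the helper collect_model_configs is hypothetical:

```python
import json
import tempfile
from pathlib import Path


def collect_model_configs(results_path, filename="framework.json"):
    """Return {model_folder_name: parsed JSON} for each sub-folder containing `filename`."""
    configs = {}
    for config_file in sorted(Path(results_path).glob(f"*/{filename}")):
        with open(config_file) as fh:
            configs[config_file.parent.name] = json.load(fh)
    return configs


# Tiny fake results directory so the sketch runs without any AutoML output on disk.
with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "1_Example_Model"  # made-up folder name
    model_dir.mkdir()
    # Made-up content; a real file would describe the preprocessing and learner.
    (model_dir / "framework.json").write_text(json.dumps({"params": {"seed": 1}}))
    demo_configs = collect_model_configs(tmp)

print(demo_configs)
```

Pointing collect_model_configs at the real results_path would show which files exist and what each one records. Rebuilding the preprocessing and learner steps from that information is still manual work.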

@bsaldivaremc2
Author

Hi, thanks for the reply. Just to be sure I am communicating my question correctly, this is what I want to do:

import json
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from supervised.automl import AutoML
from tqdm import tqdm

DATADIR = "/content/"
# Initialize AutoML with custom CV
automl = AutoML(
    mode="Optuna",
    ml_task="binary_classification",
    results_path=f"{DATADIR}AutoML_Optuna_Results_2",
    validation_strategy={
        "validation_type": "custom",
        "custom_cv": cv_indices
    },
    eval_metric="auc",
    optuna_time_budget=60 * 5
)

# Fit the model
automl.fit(xdf, target, cv=cv_indices)  # e.g. auc=0.8

# Load best solution
with open("/content/AutoML_Optuna_Results_2/optuna/optuna.json") as f:
    my_params = json.load(f)

# Fit on new data
y_trues, y_preds = [], []
mean_auc = 0

D = "/content/AutoML_Optuna_Results_2/"  # same results_path as the run above
for ti, vi in tqdm(cv_indices):
    model = AutoML(mode="Optuna", results_path=D, optuna_init_params=my_params)
    model.fit(xdf[ti], target[ti])
    y_pred = model.predict_proba(xdf[vi])
    y_true = target[vi]
    y_trues.append(y_true)
    y_preds.append(y_pred[:, 1])
    mean_auc += roc_auc_score(y_true, y_pred[:, 1])
mean_auc /= len(cv_indices)  # expected to match the result from fit, 0.8

When I run this, I always get a mean_auc higher than the one found during fit, as if the model already knew all the folds. My impression is that, in the loop at the end, I am giving the model new data to fit on progressively, so it ends up learning every fold.

Could you please point out if I am doing this wrongly?
Thank you again

@pplonski
Contributor

@bsaldivaremc2 when you run AutoML a second time, you have a smaller dataset ... Do you want to have a quick chat on Google Meet? Send me an email at [email protected]
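
One thing that may be worth ruling out in the loop from the previous comment, offered as an assumption rather than something confirmed in this thread: every fold reuses the results_path of the run that was already fitted on all the data. If artifacts stored there are picked up instead of being retrained, the predictions would come from models that have seen every fold. The sketch below gives each fold its own directory. The helper evaluate_per_fold and the stand-in factory are hypothetical, and LogisticRegression is used only so the example runs without mljar-supervised:

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def evaluate_per_fold(make_model, X, y, cv_indices, base_dir):
    """Fit a fresh model per fold, each with its own results directory.

    Returns the mean validation AUC and the list of per-fold paths.
    """
    scores, paths = [], []
    for fold, (ti, vi) in enumerate(cv_indices):
        fold_path = os.path.join(base_dir, f"fold_{fold}")  # never the original results_path
        model = make_model(fold_path)
        model.fit(X[ti], y[ti])
        scores.append(roc_auc_score(y[vi], model.predict_proba(X[vi])[:, 1]))
        paths.append(fold_path)
    return float(np.mean(scores)), paths


def make_stand_in(results_path):
    # Stand-in factory: ignores the path and returns a plain scikit-learn model.
    return LogisticRegression(max_iter=1000)


X, y = make_classification(n_samples=200, n_features=10, random_state=0)
cv_indices = list(StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y))
with tempfile.TemporaryDirectory() as tmp:
    mean_auc, fold_paths = evaluate_per_fold(make_stand_in, X, y, cv_indices, tmp)
print(f"mean AUC over {len(fold_paths)} folds: {mean_auc:.3f}")
```

With AutoML, the factory could be something like `lambda path: AutoML(mode="Optuna", results_path=path, optuna_init_params=my_params)`, reusing the arguments from the loop in the issue. Even then, as noted above, each per-fold fit sees fewer rows than the original run, so a small difference from the reported score would not be surprising.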
