automatic detection of multioutput datasets #1001
Open: jhmenke wants to merge 7 commits into EpistasisLab:development from jhmenke:development
Commits
3dd6475 automatic detection of multioutput datasets (jhmenke)
e017fbe Merge branch 'master' of https://github.com/EpistasisLab/tpot into de… (jhmenke)
81c0079 bugfix in operator_utils (jhmenke)
ae72c92 enable automatic multioutput for default regressors/classifiers (jhmenke)
2f5140a Merge branch 'development' into development (jhmenke)
124ec4b Update operator_utils.py (weixuanfu)
a86fdf9 Update operator_utils.py (weixuanfu)
@@ -491,8 +491,7 @@ def _setup_toolbox(self):
         self._toolbox.register('expr_mut', self._gen_grow_safe, min_=self._min, max_=self._max)
         self._toolbox.register('mutate', self._random_mutation_operator)
 
-
-    def _fit_init(self):
+    def _fit_init(self, multi_output_target: bool = False):
         # initialization for fit function
         if not self.warm_start or not hasattr(self, '_pareto_front'):
             self._pop = []
@@ -501,6 +500,35 @@ def _fit_init(self):
         self._last_optimized_pareto_front_n_gens = 0
         self._setup_config(self.config_dict)
 
+        if multi_output_target:
+            single_output_classifiers = [
+                'sklearn.naive_bayes.MultinomialNB',
+                'sklearn.svm.LinearSVC',
+                'xgboost.XGBClassifier'
+            ]
+            single_output_regressors = [
+                'sklearn.ensemble.AdaBoostRegressor',
+                'sklearn.linear_model.LassoLarsCV',
+                'sklearn.linear_model.ElasticNetCV',
+                'sklearn.svm.LinearSVR',
+                'xgboost.XGBRegressor',
+                'sklearn.linear_model.SGDRegressor'
+            ]
+            for model in list(self._config_dict.keys()):
+                if model in single_output_classifiers:
+                    if 'sklearn.multioutput.MultiOutputClassifier' not in self._config_dict.keys():
+                        self._config_dict['sklearn.multioutput.MultiOutputClassifier'] = {"estimator": {}}
+                    self._config_dict['sklearn.multioutput.MultiOutputClassifier']['estimator'][model] = self._config_dict[model]

Review comment: There is only one sklearn.multioutput.MultiOutputClassifier in self._config_dict and

+                    self._config_dict.pop(model, None)
+                elif model in single_output_regressors:
+                    if 'sklearn.multioutput.MultiOutputRegressor' not in self._config_dict.keys():
+                        self._config_dict['sklearn.multioutput.MultiOutputRegressor'] = {"estimator": {}}
+                    if model == 'sklearn.linear_model.ElasticNetCV':
+                        self._config_dict['sklearn.linear_model.MultiTaskElasticNetCV'] = self._config_dict[model]
+                    else:
+                        self._config_dict['sklearn.multioutput.MultiOutputRegressor']['estimator'][model] = self._config_dict[model]
+                    self._config_dict.pop(model, None)
+
         self._setup_template(self.template)
 
         self.operators = []
@@ -622,7 +650,7 @@ def fit(self, features, target, sample_weight=None, groups=None):
             Returns a copy of the fitted TPOT object
 
         """
-        self._fit_init()
+        self._fit_init(multi_output_target=len(target.shape) > 1 and target.shape[1] > 1)
         features, target = self._check_dataset(features, target, sample_weight)
 
 
@@ -792,10 +820,11 @@ def _update_top_pipeline(self):
             if not self._optimized_pipeline:
                 raise RuntimeError('There was an error in the TPOT optimization '
                                    'process. This could be because the data was '
-                                   'not formatted properly, or because data for '
+                                   'not formatted properly, because data for '
                                    'a regression problem was provided to the '
-                                   'TPOTClassifier object. Please make sure you '
-                                   'passed the data to TPOT correctly.')
+                                   'TPOTClassifier object, or an error in a '
+                                   'custom scoring function. Please make sure '
+                                   'you passed the data to TPOT correctly.')
             else:
                 pareto_front_wvalues = [pipeline_scores.wvalues[1] for pipeline_scores in self._pareto_front.keys]
                 if not self._last_optimized_pareto_front:
@@ -1157,7 +1186,7 @@ def _check_dataset(self, features, target, sample_weight=None):
 
         try:
             if target is not None:
-                X, y = check_X_y(features, target, accept_sparse=True, dtype=None)
+                X, y = check_X_y(features, target, accept_sparse=True, dtype=None, multi_output=len(target.shape) > 1 and target.shape[1] > 1)
                 if self._imputed:
                     return X, y
                 else:
Modifying _config_dict may not work when a user supplies a customized configuration instead of the default one. So I think a practical way is to modify the _compile_to_sklearn function. If multi_output_target is True, then sklearn_pipeline = MultiOutputClassifier(estimator=sklearn_pipeline) or sklearn_pipeline = MultiOutputRegressor(estimator=sklearn_pipeline). I think this may be a more general solution for multioutput datasets.

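
The wrapping suggested in the comment above could look roughly like this in plain scikit-learn. This is only a sketch under assumptions: the pipeline steps and the random data are made up, and it does not show how TPOT's `_compile_to_sklearn` actually builds its pipelines.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

rng = np.random.RandomState(0)
X = rng.rand(40, 4)
y = rng.rand(40, 2)  # two regression targets per sample

# Stand-in for a compiled pipeline; LinearSVR only handles a 1-D target on its own.
sklearn_pipeline = make_pipeline(StandardScaler(), LinearSVR(max_iter=10000, random_state=0))

multi_output_target = y.ndim > 1 and y.shape[1] > 1
if multi_output_target:
    # Wrap the whole pipeline, so one clone of it is fitted per target column.
    sklearn_pipeline = MultiOutputRegressor(estimator=sklearn_pipeline)

sklearn_pipeline.fit(X, y)
predictions = sklearn_pipeline.predict(X)
```

Wrapping the finished pipeline does not depend on which estimators appear in the configuration, which is presumably why it was proposed as the more general option.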
This seems better. To be honest, I didn't find a good place to put my code, so I settled on the _fit_init function.

Okay, I looked into it, but the code would be a mess. Several functions would have to take multi_output_target as a new argument (most of them in export_utils.py), since they don't have access to the data or the TPOT object.
IMO, _fit_init seems to be the least intrusive place to include the checks.

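
To make the trade-off concrete, the `_fit_init` approach defended above boils down to rewriting the configuration dictionary before the search starts. The sketch below mirrors that idea on a toy dictionary. The function name, the shortened estimator list, and the hyperparameter values are assumptions for illustration, and the PR's special case for `ElasticNetCV` is left out.

```python
import copy

# Subset of the single-output regressors named in the diff, for illustration only.
SINGLE_OUTPUT_REGRESSORS = {
    'sklearn.svm.LinearSVR',
    'xgboost.XGBRegressor',
}
WRAPPER = 'sklearn.multioutput.MultiOutputRegressor'


def wrap_single_output_regressors(config_dict):
    """Return a copy of config_dict with single-output regressors nested under the wrapper."""
    new_config = copy.deepcopy(config_dict)
    # Iterate over a snapshot of the keys because the dict is modified in the loop.
    for model in list(new_config.keys()):
        if model in SINGLE_OUTPUT_REGRESSORS:
            wrapper_entry = new_config.setdefault(WRAPPER, {'estimator': {}})
            wrapper_entry['estimator'][model] = new_config.pop(model)
    return new_config


# Toy configuration with made-up hyperparameter grids.
toy_config = {
    'sklearn.svm.LinearSVR': {'C': [0.1, 1.0]},
    'sklearn.ensemble.RandomForestRegressor': {'n_estimators': [100]},
}
rewritten = wrap_single_output_regressors(toy_config)
```

Only estimators on the hard-coded list are moved under the wrapper. A custom configuration containing some other single-output estimator would pass through unchanged, which is the limitation raised earlier in the thread.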
Thank you for looking into it. You are right. I think TPOT's exported code should also include MultiOutputRegressor/MultiOutputClassifier, which would require changing a lot of code in TPOT. I will look into it when I get some time next week.

Any updates?

@jhmenke I am sorry for overlooking this. I have not had a chance to look into this issue recently due to my busy schedule. I agree that TPOT needs some major changes to include MultiOutputRegressor/MultiOutputClassifier. I may get some time in March to add those changes. You are welcome to push any changes in the meantime.

Could we use my PR as a temporary fix until you have time to thoroughly refactor the code? I can prepare an update with the current development branch.

Sorry for the delay. I think we can use it as a temporary solution in a minor release.

Okay, I merged the current master/development into this. It should be good to go as an interim solution then.