
Feat(EstimatorReport): Display the feature permutation importance #1319

Open · Tracked by #1314
MarieSacksick opened this issue Feb 14, 2025 · 9 comments · May be fixed by #1365
Labels: enhancement (New feature or request)

@MarieSacksick (Contributor) commented Feb 14, 2025

Is your feature request related to a problem? Please describe.

As a Data Scientist, to explain my model and understand the problem I'm trying to solve, I need to check feature importance via a permutation method. This should be available for any kind of model.

Describe the solution you'd like

df = report.feature_importance.feature_permutation(scoring="a scoring method")  # renders a dataframe
display = report.feature_importance.plot.feature_permutation(scoring="a scoring method")  # renders a display
display.plot()
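
For context, a minimal sketch of what such an accessor might delegate to under the hood, assuming the report exposes a fitted estimator and a test split; the attribute names estimator_, X_test, and y_test are assumptions, not the confirmed skore API:

# Hypothetical internals: wrap scikit-learn's permutation_importance and
# return the raw per-repeat importances as a dataframe.
import pandas as pd
from sklearn.inspection import permutation_importance

def feature_permutation(report, scoring="r2"):
    result = permutation_importance(
        report.estimator_, report.X_test, report.y_test, scoring=scoring
    )
    # result.importances has shape (n_features, n_repeats);
    # index by feature name, assuming X_test is a dataframe.
    return pd.DataFrame(result.importances, index=report.X_test.columns)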

Describe alternatives you've considered, if relevant

Later, if the report object accumulates too many accessors, we could group the feature-importance methods together and add a parameter selecting which type of feature importance to display.

Additional context

Part of the epic #1314.

@MarieSacksick MarieSacksick added enhancement New feature or request needs-triage This has been recently submitted and needs attention labels Feb 14, 2025
@MarieSacksick MarieSacksick added this to the skore 0.8 milestone Feb 14, 2025
@auguste-probabl (Contributor)
What is the difference with #1323?

@MarieSacksick (Contributor, Author)
I forgot to change the title of #1323, thanks!

@MarieSacksick MarieSacksick removed the needs-triage This has been recently submitted and needs attention label Feb 17, 2025
@auguste-probabl (Contributor)
Which data should be used to compute the permutation importance? Should we accept arguments like data_source="test"/"train"/"X_y"?

@MarieSacksick (Contributor, Author)
By default, test; but yes, adding this data_source parameter would be perfect!
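
A hedged sketch of the call shapes this parameter could allow, mirroring the data_source convention used by the report's other accessors; the external-data spelling with X and y keyword arguments is an assumption:

report.feature_importance.feature_permutation(data_source="test")   # default
report.feature_importance.feature_permutation(data_source="train")
report.feature_importance.feature_permutation(data_source="X_y", X=X_new, y=y_new)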

@auguste-probabl (Contributor)
permutation_importance accepts a scoring parameter, which can be a list of metrics. In that case the result looks like this:

{
    'r2': {
        'importances_mean': array([1.72819206, 0.07236024, 0.0503269 ]),
        'importances_std': array([0.37087406, 0.00928599, 0.00382661]),
        'importances': array([[1.99305438, 1.27627584, 1.41891128, 1.66448695, 2.28823186],
                              [0.06935003, 0.09033894, 0.07040989, 0.06812194, 0.06358039],
                              [0.04762681, 0.055217  , 0.04907298, 0.04538669, 0.05433103]])
    },
    'neg_root_mean_squared_error': {
        'importances_mean': array([139.1296646 ,  28.57762622,  23.86150889]),
        'importances_std': array([ 14.93272773,   1.77354259,   0.90637781]),
        'importances': array([[150.26937093, 120.24946953, 126.79102571, 137.32546909, 161.01298776],
                              [ 28.03071877,  31.99251603,  28.24409869,  27.78141644,  26.83938118],
                              [ 23.22932701,  25.01193447,  23.57936469,  22.6764554 ,  24.81046286]])
    }
}
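
For reference, a sketch of the kind of call that yields a multi-metric result like the one above; the estimator and data names are placeholders:

from sklearn.inspection import permutation_importance

result = permutation_importance(
    estimator,          # any fitted estimator
    X_test, y_test,     # held-out data
    scoring=["r2", "neg_root_mean_squared_error"],
    n_repeats=5,        # the default, shown for clarity
)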

Can you give an example of what you'd expect the dataframe to look like?

@MarieSacksick (Contributor, Author)
We can output something similar to what we have in the ComparisonReport or in the CrossValidationReport: several rows for several scores, with the features as columns. It's not very pretty; I would expect at most 5 scorings and at least 10 features, which would make it more logical to put the long list on the row index, but this way it stays consistent.

[image: example of the expected dataframe layout]

@auguste-probabl (Contributor)
Here is what I currently have:

Repeat                                   Repeat #0   Repeat #1   Repeat #2   Repeat #3   Repeat #4
Metric                      Feature
r2                          Feature #0    1.993054    1.276276    1.418911    1.664487    2.288232
                            Feature #1    0.069350    0.090339    0.070410    0.068122    0.063580
                            Feature #2    0.047627    0.055217    0.049073    0.045387    0.054331
neg_root_mean_squared_error Feature #0  150.269371  120.249470  126.791026  137.325469  161.012988
                            Feature #1   28.030719   31.992516   28.244099   27.781416   26.839381
                            Feature #2   23.229327   25.011934   23.579365   22.676455   24.810463
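
A sketch of how such a frame could be assembled from the multi-metric result dict shown earlier; the level and label names are illustrative:

import pandas as pd

frames = {
    metric: pd.DataFrame(
        bunch["importances"],
        index=[f"Feature #{i}" for i in range(bunch["importances"].shape[0])],
        columns=[f"Repeat #{i}" for i in range(bunch["importances"].shape[1])],
    )
    for metric, bunch in result.items()
}
# Concatenating a dict of frames yields the (Metric, Feature) MultiIndex.
df = pd.concat(frames, names=["Metric", "Feature"])
df.columns.name = "Repeat"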

@auguste-probabl (Contributor)
permutation_importance accepts a random_state parameter, which is None by default (so calling the function returns a different result every time).
Right now the plan is to cache calls, so this behaviour is inconvenient. Should we:

  • Impose a random_state?
  • Only cache if random_state is given?
  • Stop caching?

@MarieSacksick (Contributor, Author)
Good point!
I'd like to keep caching because it's a nice feature, particularly for artefacts that require a lot of computing time, and feature importance can be one of them. I don't like imposing a random state (I find it unfriendly), nor deciding on one ourselves if the user doesn't provide one (I find it surprising).
So I'd go for your second option!
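
A minimal sketch of that second option, assuming a dict-based cache on the report; the _cache attribute, key scheme, and helper name are hypothetical:

def feature_permutation(self, scoring=None, random_state=None):
    key = ("feature_permutation", str(scoring), random_state)
    if random_state is not None and key in self._cache:
        return self._cache[key]
    result = self._compute_feature_permutation(scoring, random_state)
    if random_state is not None:
        # Only cache deterministic results; with random_state=None every
        # call legitimately returns a fresh permutation.
        self._cache[key] = result
    return result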
