Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rfc (api): Revisit the API of put and get #1269

Open
glemaitre opened this issue Jan 31, 2025 · 13 comments
Open

rfc (api): Revisit the API of put and get #1269

glemaitre opened this issue Jan 31, 2025 · 13 comments

Comments

@glemaitre
Copy link
Member

glemaitre commented Jan 31, 2025

In the past, we had many discussions where @GaelVaroquaux or myself complain with the needs to use .put and specified the name of an item to store it (and get it).

I came #949 to mention that I wanted my CrossValidationReport accessible from project such that this operation happen. However, one limitation as mentioned by @auguste-probabl is that native Python type cannot benefit from this API. There is also discussion that you can use the same class into different ways that could be confusing.

In an IRL with @adrinjalali, an API that seems much better would be to have a dict/Bunch UX to deal with this use case:

project.put("my_int, 1)

would be replaced by:

project.artifacts.my_int = 1

or for a report

report = CrossValidationReport(...)
project.put("estimator_report, report)

would be replaced by:

project.artifacts.estimator_report = CrossValidateRerport(...)

To access, the latest version of an items:

# pushing to the extreme without storing the artifact in another variable
project.artifacts.estimator_report.metrics.roc.plot()

would work.

All of those would work if we consider that skore.open set implicitly the "run" for which we store and access the items. We would still need .get if one wants to access an item from a different run where we come back to the discussion on having skore.get_metadata or similar to be able to query the database.

@auguste-probabl
Copy link
Contributor

So

project.artifacts.estimator_report.metrics.roc.plot()

would not store anything in skore? It feels confusing that some things are saved and some aren't

@auguste-probabl
Copy link
Contributor

With the

project.artifacts.estimator_report = CrossValidateReport(...)

version I don't know how a user would add metadata about their object (see #889)

Also, what key/ID would we give the report inside skore so that the user can find it again?

@glemaitre
Copy link
Member Author

would not store anything in skore? It feels confusing that some things are saved and some aren't

An assignment with = is equivalent to put and accessing the attribute is equivalent to get. So to put, it will look like:

project.artifacts.roc_plot = project.artifacts.estimator_report.metrics.roc.plot()

@glemaitre
Copy link
Member Author

glemaitre commented Feb 3, 2025

Also, what key/ID would we give the report inside skore so that the user can find it again?

The name of the attribute that you used during the assignment.

@glemaitre
Copy link
Member Author

I don't know how a user would add metadata about their object (see #889)

I'm not sure either, each stored item could be a dict with a metadata key that you can edit.

project.artifacts.estimator_report = CrossValidateReport(...)
project.artifacts.estimator_report.metadata = ...

It would mean that under the hood, it would be something explicitly equivalent to:

project.artifacts.estimator_report = {
    "item": CrossValidateReport(...)
    "metadata": ...
}

but overloading the __get_items__

@thomass-dev
Copy link
Collaborator

thomass-dev commented Feb 3, 2025

It would mean that under the hood, it would be something explicitly equivalent to:

This is a bit strange, because calling project.artifacts.estimator_report, the user wants i suppose to access the CrossValidateReport, not a wrapper {"item": CrossValidateReporter(...)}.

We can make metadata live alongside it, such as project.metadata.estimator_report = [...].
Otherwise, i can't see how the user can access the item without the need to type project.artifacts.estimator_report.item or project.artifacts.estimator_report["item"]

@glemaitre
Copy link
Member Author

glemaitre commented Feb 3, 2025

This is a bit strange, because calling project.artifacts.estimator_report, the user wants i suppose to access the CrossValidationReport, not a wrapper.

I agree with you. My thought would be more to maybe overload the __getitem__ of this special underlying dict to return the item directly when calling something like project.artifacts.estimator_report.

We can make metadata live alongside it, such as project.metadata.estimator_report = [...]

This is an alternative that is cleaner and would require less workaround, I like it.

@thomass-dev
Copy link
Collaborator

thomass-dev commented Feb 3, 2025

I agree with you. My thought would be more to maybe overload the getitem of this special underlying dict to return the item directly when calling something like project.artifacts.estimator_report.

I don't see how it's technically possible to return project.artifacts.estimator_report["item"] when calling project.artifacts.estimator_report but project.artifacts.estimator_report["metadata"] on project.artifacts.estimator_report.metadata.

The last one is called on the result of the first one, so we have to store metadata directly in the CrossValidateReport for this case.

@thomass-dev
Copy link
Collaborator

thomass-dev commented Feb 3, 2025

Just a comment: with an API like this, we should be explicit that the user can't modify his objects without reassignment.
An example like below can't work:

project.artifacts.my_dict_object = {"a": {"b": 0}}
project.artifact.my_dict_object["a"]["b"] = 1

print(project.artifact.my_dict_object) -> {"a": {"b": 0}}

I don't think we want to deal with that.

@glemaitre
Copy link
Member Author

glemaitre commented Feb 3, 2025

I don't see how it's technically possible to return project.artifacts.estimator_report["item"] when calling project.artifacts.estimator_report but project.artifacts.estimator_report["metadata"] on project.artifacts.estimator_report.metadata.

There might always be a way in Python

from collections import UserDict


class DataItem(UserDict):
    def __init__(self):
        super().__init__()
        self.data = {'data': None, 'metadata': {}}

    def __setitem__(self, key, value):
        if key in ('data', 'metadata'):
            self.data[key] = value
        else:
            self.data['data'][key] = value

    def __getitem__(self, key):
        if key in ('data', 'metadata'):
            return self.data[key]
        return self.data['data'][key]

    def __setattr__(self, name, value):
        if name == 'data' and isinstance(value, dict):
            # This handles UserDict's internal 'data' attribute
            super().__setattr__(name, value)
        elif name in ('data', 'metadata'):
            self.data[name] = value
        else:
            self.data['data'] = value

    def __getattr__(self, name):
        if name in ('data', 'metadata'):
            return self.data[name]
        # Delegate attribute access to the stored object
        return getattr(self.data['data'], name)

    def __str__(self):
        return str(self.data['data'])

    def __repr__(self):
        return repr(self.data['data'])


class Item(UserDict):
    def __init__(self):
        super().__init__()
        self.data = {}

    def __getattr__(self, name):
        if name not in self.data:
            self.data[name] = DataItem()
        return self.data[name]

    def __setattr__(self, name, value):
        if name == 'data':
            super().__setattr__(name, value)
        else:
            if name not in self.data:
                self.data[name] = DataItem()
            self.data[name].data = value

class A:
    def __init__(self):
        self.artifacts = Item()


class B:
    def func(self):
        print("hello world")
a = A()

a.artifacts.example_1 = B()
print(a.artifacts.example_1.data)  # {'data': <__main__.B object at 0x106313410>, 'metadata': {}}
print(a.artifacts.example_1)  # <__main__.B object at 0x106313410>

a.artifacts.example_1.metadata = {"created": "2025-02-03"}
print(a.artifacts.example_1.metadata)  # {"created": "2025-02-03"}

print(a.artifacts.example_1)  # <__main__.B object at 0x106313410>
print(a.artifacts.example_1.data)  # {'data': <__main__.B object at 0x1062fc440>, 'metadata': {'created': '2025-02-03'}}
a.artifacts.example_1.func()  # "hello world"

will provide

{'data': <__main__.B object at 0x106313410>, 'metadata': {}}
<__main__.B object at 0x106313410>
{'created': '2025-02-03'}
<__main__.B object at 0x106313410>
{'data': <__main__.B object at 0x1062fc440>, 'metadata': {'created': '2025-02-03'}}
hello world

When opening the "details", I think I much prefer your explicit solution with separate metadata by far :)

@thomass-dev
Copy link
Collaborator

thomass-dev commented Feb 3, 2025

@glemaitre You are changing the type of the return of a.artifacts.example_1, which is not an instance of B but an instance of DataItem. So you are not returning the exact object that have been put.

For instance, i can't do

a = A()
a.artifacts.my_int = 1

# [...]
my_int = a.artifacts.my_int
my_int + 1

So i persist, i don't know how to do in a clean way without overloading the original object before it is returned.
I don't think we want to use __getattr__ to overload every functions of every classes.
What we need here, is a sort of __call__ magic function, without the need of () 😆 .

@glemaitre
Copy link
Member Author

glemaitre commented Feb 3, 2025

which is not an instance of B but an instance of DataItem

Indeed, I just try to hide as much as possible to the user. So the next step is to overload all operators to delegate (not sure what will be the next case :)).

But all this complexity tell me that project.metadata is really nice :)

@thomass-dev
Copy link
Collaborator

thomass-dev commented Feb 3, 2025

Indeed, I just try to hide as much as possible to the user. So the next step is to overload all operators to delegate (not sure what will be the next case :)).

This is fundamentally a no-go haha. Yes, i think we should focus on project.metadata.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants