Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change how kedro import datasets and raise explicit error when dependencies is missing #3272

Merged
merged 20 commits into from
Nov 9, 2023

Conversation

noklam
Copy link
Contributor

@noklam noklam commented Nov 6, 2023

NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.

Description

Fix #2943

How are datasets can be imported?

  1. Explicit import in Python -
from kedro_datasets.biosequence import BioSequenceDataSet
  1. Via kedro's _load_ob -

    kedro/kedro/io/core.py

    Lines 381 to 389 in 38dfe6f

    if isinstance(class_obj, str):
    if len(class_obj.strip(".")) != len(class_obj):
    raise DataSetError(
    "'type' class path does not support relative "
    "paths or paths ending with a dot."
    )
    class_paths = (prefix + class_obj for prefix in _DEFAULT_PACKAGES)
    trials = (_load_obj(class_path) for class_path in class_paths)

In particular, kedro try to import datasets from a _DEFAULT_PACKAGES iteratively

_DEFAULT_PACKAGES = ["kedro.io.", "kedro_datasets.", "kedro.extras.datasets.", ""]

For example, when user do catalog.load("companies"), the following happens

  1. import from kedro.io.pandas - ModuleNotFoundError
  2. kedro_datasets.pandas is found and exit the loop

The actual import mechanism is hidden in kedro.io.core's parse_dataset_definition which has extra try/except with _load_obj.

To test this, you can do this.

kedro new -s pandas-iris
pip uninstall pandas
kedro run

or you can directly trigger this import mechanism via the private method.

from kedro.io.core import parse_dataset_definition
config = {'type': 'pandas.CSVDataset',
           'filepath': '/01_raw/iris.csv'}


parse_dataset_definition(config)

Development notes

The main change is to differentiate two different scenarios.
Noted that since kedro_datasets moved to lazy loading, hasattr can trigger a ModuleNotFoundError.

Questions for reviewer

  1. I remove the AttributeError in load_obj, honestly I don't think that add any more information and make the code obscure.

Previously the error was hidden because ModulenotFoundError can be two cases:

  1. The datasets doesn't exist. i.e. `kedro_datasets.fake_dataset.x
  2. The datasets exist, but the underlying dependencies is missing (i.e. pandas)

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@noklam noklam force-pushed the noklam/make-import-failures-in-2943 branch 2 times, most recently from face399 to 93dc1a9 Compare November 6, 2023 07:05
@noklam noklam changed the base branch from main to develop November 6, 2023 07:15
@noklam
Copy link
Contributor Author

noklam commented Nov 6, 2023

Github Action is not triggered, not sure why.

Signed-off-by: Nok Chan <[email protected]>
@noklam noklam linked an issue Nov 6, 2023 that may be closed by this pull request
@noklam noklam self-assigned this Nov 6, 2023
* Replace a `StopIteration` exception with for-else 

Signed-off-by: Nok Lam Chan <[email protected]>

* fix the logic

Signed-off-by: Nok <[email protected]>

* bug fix

Signed-off-by: Nok <[email protected]>

---------

Signed-off-by: Nok Lam Chan <[email protected]>
Signed-off-by: Nok <[email protected]>
@astrojuanlu
Copy link
Member

Could exception groups help here? https://github.com/agronholm/exceptiongroup

Signed-off-by: Nok Chan <[email protected]>
Comment on lines -26 to -27
if not hasattr(module_obj, obj_name):
raise AttributeError(f"Object '{obj_name}' cannot be loaded from '{obj_path}'.")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can now create a ModuleNotFoundError due to lazy-loading. I decide to remove it because it adds more complexity but the error message itself isn't helpful at all.

@noklam
Copy link
Contributor Author

noklam commented Nov 7, 2023

When dependencies is missing.

DatasetError: No module named 'pandas' Please see the documentation on how to install relevant dependencies for kedro_datasets.pandas.CSVDataset:
https://kedro.readthedocs.io/en/stable/kedro_project_setup/dependencies.html

When we have a typo.

DatasetError: Class 'pandaxx.CSVDataset' not found or one of its dependencies has not been installed.

@astrojuanlu I'd like to get some review for the current approach. We can change the error message (or in a separate PR if we are changing all of them).

ps. All tests passed except 1 because I updated what load_obj do, currently it is only used in two place, load_obj itself is a weird function, it is the only function in kedro.utils which was created 4 years ago.

@noklam noklam marked this pull request as ready for review November 7, 2023 10:32
@astrojuanlu
Copy link
Member

I confirm this now disambiguates both cases 👍🏽

The error messages could be:

@merelcht
Copy link
Member

merelcht commented Nov 7, 2023

I confirm this now disambiguates both cases 👍🏽

I can also confirm this 🙂

The error messages could be:

  • Typo: "Class 'biosequence.BiosequenceDataSet' not found" (hence remove the "or one of its dependencies has not been installed"?)

For this one, should we add anything about checking if kedro-datasets is installed?

👍

Signed-off-by: Nok Chan <[email protected]>
@noklam noklam changed the title [DRAFT] - Change how kedro import datasets and raise explicit error when dependencies is missing Change how kedro import datasets and raise explicit error when dependencies is missing Nov 8, 2023
Signed-off-by: Nok Chan <[email protected]>
@noklam
Copy link
Contributor Author

noklam commented Nov 8, 2023

Typo: "Class 'biosequence.BiosequenceDataSet' not found" (hence remove the "or one of its dependencies has not been installed"?)
@astrojuanlu We still need to consider custom dataset or plugin, missing package can be a valid cause. Consider this case.

In [4]: from kedro.io.core import parse_dataset_definition
   ...: config = {'type': 'kedro_mlflow.fake.fake.CSVDataset',
   ...:            'filepath': '/01_raw/iris.csv'}
   ...: 
   ...: parse_dataset_definition(config)

DatasetError: Class 'kedro_mlflow.fake.fake.CSVDataset' not found, is this a typo?

  • kedro-datasets is mandatory for 0.19.0, I don't think we need to worry about this @merelcht
  • custom datasets / plugins always requires full path, so the missing dependencies should be a good error message already.

Cc @datajoely for some inspiration

Comment on lines +379 to +386
dataset_type = config.pop("type")
if isinstance(dataset_type, str):
if len(dataset_type.strip(".")) != len(dataset_type):
raise DatasetError(
"'type' class path does not support relative "
"paths or paths ending with a dot."
)
class_paths = (prefix + class_obj for prefix in _DEFAULT_PACKAGES)
class_paths = (prefix + dataset_type for prefix in _DEFAULT_PACKAGES)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactor the name because class_obj is overloaded and confusing. Originally it is a str but load as a class later, it's better to make it explicit.

Comment on lines +439 to +442
f"{exc}. Please see the documentation on how to "
f"install relevant dependencies for {class_path}:\n"
f"https://kedro.readthedocs.io/en/stable/"
f"kedro_project_setup/dependencies.html"
f"https://docs.kedro.org/en/stable/kedro_project_setup/"
f"dependencies.html#install-dependencies-related-to-the-data-catalog"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated the link according to review suggestion

Comment on lines +269 to +294
def test_dataset_name_typo(self, mocker):
# If the module doesn't exist, it return None instead ModuleNotFoundError
mocker.patch("kedro.io.core.load_obj", return_value=None)
dataset_name = "lAmbDaDaTAsET"

with pytest.raises(
DatasetError, match=f"Class '{dataset_name}' not found, is this a typo?"
):
parse_dataset_definition({"type": dataset_name})

def test_dataset_missing_dependencies(self, mocker):
# If the module is found but import the dataset trigger ModuleNotFoundError
dataset_name = "LambdaDataset"

def side_effect_function(value):
if "__all__" in value:
return [dataset_name]
else:
raise ModuleNotFoundError

mocker.patch("kedro.io.core.load_obj", side_effect=side_effect_function)

pattern = "Please see the documentation on how to install relevant dependencies"
with pytest.raises(DatasetError, match=pattern):
parse_dataset_definition({"type": dataset_name})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some tests for testing the two common scenarios.

@datajoely
Copy link
Contributor

So two thought -

  1. Great work- we've tried fixing this before and this is a MUCH better solution
  2. I wonder if we could include this part in @amandakys debugging research:

from kedro.io.core import parse_dataset_definition

One of the common debugging patterns I find myself supporting users is as follows:

  • A user is struggling to configure a YAML definition of a catalog entry (often with something tricky like credentials)
  • To isolate if the error is in YAML config or DataSet class implementation I often ask people to open a notebook and try instantiating the class imperatively.
  • Could we do something along the lines of generating some boilerplate for people to iterate against?
  • Once you have a working Python implementation could we have a .to_yaml() method?

Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried breaking this in different ways and I'm satisfied with the outcome 👍🏽 Thanks @noklam!

Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work @noklam, this is a huge improvement! 👍

@noklam noklam merged commit 45bfddb into develop Nov 9, 2023
34 of 35 checks passed
@noklam noklam deleted the noklam/make-import-failures-in-2943 branch November 9, 2023 23:46
@noklam noklam mentioned this pull request Dec 7, 2023
7 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make import failures in kedro-datasets clearer
4 participants