Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make import failures in kedro-datasets clearer #2943

Closed
astrojuanlu opened this issue Mar 22, 2023 · 20 comments · Fixed by #3273 or #3272
Closed

Make import failures in kedro-datasets clearer #2943

astrojuanlu opened this issue Mar 22, 2023 · 20 comments · Fixed by #3273 or #3272
Assignees

Comments

@astrojuanlu
Copy link
Member

astrojuanlu commented Mar 22, 2023

Description

Note

See #2943 (comment) for current status

Disambiguate the "module does not exist" error from the ImportError in messages like:

DatasetError: An exception occurred when parsing config for dataset 'companies':
Class 'polars.CSVDataSet' not found or one of its dependencies has not been installed.

This task will require investigation into how the error messages are swallowed, which will inform the proper implementation. Probably change:

def _load_obj(class_path: str) -> object | None:

Context

This week I've been battling some puzzling documentation errors and after a while I noticed that, if the dependencies of a particular dataset are not present, the ImportError is swallowed silently. Examples:

https://github.com/kedro-org/kedro-plugins/blob/b8881d113f8082ff03e0233db3ae4557a4c32547/kedro-datasets/kedro_datasets/biosequence/__init__.py#L7-L8

https://github.com/kedro-org/kedro-plugins/blob/b8881d113f8082ff03e0233db3ae4557a4c32547/kedro-datasets/kedro_datasets/networkx/__init__.py#L8-L15

This was done in https://github.com/quantumblacklabs/private-kedro/pull/575 to solve https://github.com/quantumblacklabs/private-kedro/issues/563 at the same time dependencies were moved to extras_require.

I see how not suppressing these errors could be extremely annoying back then, because kedro.io used to re-export all the datasets in its __init__.py:

https://github.com/quantumblacklabs/private-kedro/blob/f7dd2478aec4de1b46afbaded9bce3c69bff6304/kedro/io/__init__.py#L29-L47

# kedro/io/__init__.py

"""``kedro.io`` provides functionality to read and write to a
number of data sets. At core of the library is ``AbstractDataSet``
which allows implementation of various ``AbstractDataSet``s.
"""

from .cached_dataset import CachedDataSet  # NOQA
from .core import AbstractDataSet  # NOQA
from .core import AbstractVersionedDataSet  # NOQA
from .core import DataSetAlreadyExistsError  # NOQA
from .core import DataSetError  # NOQA
from .core import DataSetNotFoundError  # NOQA
from .core import Version  # NOQA
from .data_catalog import DataCatalog  # NOQA
from .data_catalog_with_default import DataCatalogWithDefault  # NOQA
from .lambda_data_set import LambdaDataSet  # NOQA
from .memory_data_set import MemoryDataSet  # NOQA
from .partitioned_data_set import IncrementalDataSet  # NOQA
from .partitioned_data_set import PartitionedDataSet  # NOQA
from .transformers import AbstractTransformer  # NOQA

However, now our __init__.py is empty and datasets are meant to be imported separately:

https://github.com/kedro-org/kedro-plugins/blob/b8881d113f8082ff03e0233db3ae4557a4c32547/kedro-datasets/kedro_datasets/__init__.py#L1-L3

So I think it would be much better if we did not silence those import errors.

More context

If one dependency is missing, the user would get an unhelpful "module X has no attribute Y" when trying to import a dataset rather than an actual error:

> pip uninstall biopython                          (kedro-dev) 
Found existing installation: biopython 1.81
Uninstalling biopython-1.81:
  Would remove:
    /Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/Bio/*
    /Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/BioSQL/*
    /Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/biopython-1.81.dist-info/*
Proceed (Y/n)? y
  Successfully uninstalled biopython-1.81
> python                                           (kedro-dev) 
Python 3.10.9 | packaged by conda-forge | (main, Feb  2 2023, 20:26:08) [Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from kedro_datasets.biosequence import BioSequenceDataSet
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: cannot import name 'BioSequenceDataSet' from 'kedro_datasets.biosequence' (/Users/juan_cano/.micromamba/envs/kedro-dev/lib/python3.10/site-packages/kedro_datasets/biosequence/__init__.py)
@noklam
Copy link
Contributor

noklam commented Mar 30, 2023

Last time I saw this I was thinking the same, it turns out to be not trivial because of this. Not sure if things are still the same but worth mentioning.

kedro/kedro/io/core.py

Lines 381 to 389 in 38dfe6f

if isinstance(class_obj, str):
if len(class_obj.strip(".")) != len(class_obj):
raise DataSetError(
"'type' class path does not support relative "
"paths or paths ending with a dot."
)
class_paths = (prefix + class_obj for prefix in _DEFAULT_PACKAGES)
trials = (_load_obj(class_path) for class_path in class_paths)

@astrojuanlu
Copy link
Member Author

Thanks @noklam - I see that an error saying "some of the dependencies have not been installed" is raised if the desired class is the list of_DEFAULT_PACKAGES (which includes "", hence allowing the user to type the full import path) is exhausted:

kedro/kedro/io/core.py

Lines 392 to 396 in 38dfe6f

except StopIteration as exc:
raise DataSetError(
f"Class '{class_obj}' not found or one of its dependencies "
f"has not been installed."
) from exc

I see how we would not like to bubble all ImportErrors to the user because in most cases, only 1 of those prefixes will have the desired class. Maybe we'd want to bubble up ModuleNotFoundError errors, but definitely I'd have to think this through.

@noklam
Copy link
Contributor

noklam commented Mar 30, 2023

@astrojuanlu I think I come to similar conclusion, actually I am not sure why the issue disappear I thought I created it but I couldn't find it...

@noklam
Copy link
Contributor

noklam commented Mar 30, 2023

I would push for this if it's possible. Definitely one of the most annoying thing and as a beginner you will hit this issue for sure (almost as annoying as the DataSet vs Dataset typo issue)

@astrojuanlu astrojuanlu changed the title Do not silently hide import errors in kedro-datasets Import failures in kedro-datasets do not give clear instructions of what went wrong Aug 17, 2023
@astrojuanlu
Copy link
Member Author

Updated the title of this issue and adding more evidence:

https://linen-slack.kedro.org/t/14153932/hey-team-i-m-getting-the-following-exception-an-exception-oc#8b6debb1-ed6f-4a16-9440-5a03d0370b07

Hey team, I'm getting the following exception:

An exception occurred when parsing config for dataset 'raw_web@delta':
Object 'DeltaTableDataSet' cannot be loaded from 'kedro.extras.datasets.spark'. Please see the documentation on how to install relevant dependencies for kedro.extras.datasets.spark.DeltaTableDataSet:
<https://kedro.readthedocs.io/en/stable/kedro_project_setup/dependencies.html>

If I go to the link, it doesn't tell me what I should be installing.
I already have the following in the requirements.txt: kedro-datasets[spark,delta]

Indeed, https://kedro.readthedocs.io/en/stable/kedro_project_setup/dependencies.html contains a lot and is difficult to parse.

I believe the same thing happened just this morning to @NeroOkwa.

Maybe it would be better if the underlying ImportError could be shown.

In any case, this is something that should be changed on Kedro, so I'm moving the issue.

@astrojuanlu astrojuanlu transferred this issue from kedro-org/kedro-plugins Aug 17, 2023
@noklam
Copy link
Contributor

noklam commented Aug 17, 2023

Remove all the with suppress(ImportError):.

Without reading the full context, are you aware of this is merged already? I think the error is surfaced now already (with kedro-datasets). I haven't fully tested it tho.

@astrojuanlu
Copy link
Member Author

This is what I see now:

DatasetError: An exception occurred when parsing config for dataset 'companies':
Class 'polars.CSVDataSet' not found or one of its dependencies has not been installed.

doesn't tell me whether I made a typo, or what dependency I need to install. Am I missing something? This is with kedro-datasets 1.5.2

@noklam
Copy link
Contributor

noklam commented Aug 17, 2023

Will revisit that when we fixed the dataset installation issue. If this is not the case yet, I think we are in a good position to solve this finally.

@merelcht merelcht changed the title Import failures in kedro-datasets do not give clear instructions of what went wrong Make import failures in kedro-datasets clearer Aug 29, 2023
@merelcht merelcht moved this to To Do in Kedro Framework Sep 5, 2023
@noklam noklam moved this from To Do to In Progress in Kedro Framework Nov 6, 2023
@noklam
Copy link
Contributor

noklam commented Nov 6, 2023

I want to have more clarification of this ticket. Is this a kedro or kedro-datasets ticket? The description and discussion is slightly outdated because kedro-org/kedro-plugins#277 change how we import library.

The original error mentioned in the description is something in kedro rather than kedro-datasets

DatasetError: An exception occurred when parsing config for dataset 'companies':
Class 'polars.CSVDataSet' not found or one of its dependencies has not been installed.

from kedro_datasets.biosequence import BioSequenceDataSet

ModuleNotFoundError: No module named 'Bio'

@astrojuanlu Would you be able to give me an fail cases so I can investigate this?

@astrojuanlu
Copy link
Member Author

As far as I understand, the logic to import datasets still lives in Kedro:

kedro/kedro/io/core.py

Lines 389 to 395 in 93dc1a9

try:
class_obj = next(obj for obj in trials if obj is not None)
except StopIteration as exc:
raise DatasetError(
f"Class '{class_obj}' not found or one of its dependencies "
f"has not been installed."
) from exc

So in principle I think we can keep the ticket here, but up to you.

Let me give a reproducer.

@astrojuanlu
Copy link
Member Author

❯ kedro info                                                                                                                                                                                    (kedro310) 
As an open-source project, we collect usage analytics. 
We cannot see nor store information contained in a Kedro project. 
You can find out more by reading our privacy notice: 
https://github.com/kedro-org/kedro-plugins/tree/main/kedro-telemetry#privacy-notice 
Do you opt into usage analytics?  [y/N]: N
You have opted out of product usage analytics, so none will be collected.

 _            _
| | _____  __| |_ __ ___
| |/ / _ \/ _` | '__/ _ \
|   <  __/ (_| | | | (_) |
|_|\_\___|\__,_|_|  \___/
v0.18.14

Kedro is a Python framework for
creating reproducible, maintainable
and modular data science code.

Installed plugins:
kedro_telemetry: 0.2.4 (entry points:cli_hooks,hooks)
kedro_viz: 6.3.1 (entry points:global,line_magic)
❯ pip list | grep datasets                                                                                                                                                                      (kedro310) 
kedro-datasets                1.8.0
  1. kedro new --starter=pandas-iris
  2. kedro run (everything works fine)
  3. Change type: biosequence.BiosequenceDataSet in catalog.yml (deliberate typo)
  4. kedro run:
DatasetError: An exception occurred when parsing config for dataset 'example_iris_data':
Class 'biosequence.BiosequenceDataSet' not found or one of its dependencies has not been installed.
  1. (confusion)
  2. pip install kedro-datasets[biosequence]
  3. kedro run
DatasetError: An exception occurred when parsing config for dataset 'example_iris_data':
Class 'biosequence.BiosequenceDataSet' not found or one of its dependencies has not been installed.
  1. (confusion)
  2. change to type: biosequence.BioSequenceDataSet
  3. kedro run works ✔️

(notice that soon this will happen with DataSet vs Dataset

@github-project-automation github-project-automation bot moved this from In Progress to Done in Kedro Framework Nov 6, 2023
@noklam noklam reopened this Nov 6, 2023
@github-project-automation github-project-automation bot moved this from Done to In Progress in Kedro Framework Nov 6, 2023
@noklam
Copy link
Contributor

noklam commented Nov 6, 2023

😅 from now on I may just repharse partially fix to part of #xxx.

@noklam noklam moved this from In Progress to In Review in Kedro Framework Nov 8, 2023
@github-project-automation github-project-automation bot moved this from In Review to Done in Kedro Framework Nov 13, 2023
@astrojuanlu
Copy link
Member Author

Today we found another example of this giving an unhelpful error message:

pandas was present but kedro-datasets was not, so we got the "is this a typo?" error instead of the obvious "install kedro-datasets".

This was on top of kedro-org/kedro-plugins#402 (since that fix hasn't been released yet) and kedro-org/kedro-plugins#313.

@noklam should I open a new issue?

@noklam
Copy link
Contributor

noklam commented Feb 14, 2024

Do you have an example of that? Just checking how does this happen.

@astrojuanlu
Copy link
Member Author

@iamelijahko could you provide a series of steps to arrive to the error you saw initially?

@iamelijahko
Copy link

iamelijahko commented Feb 14, 2024

Screenshot 2024-02-14 at 17 39 56

The screenshot depicts a sequence where the installation of a Kedro-related package with pandas support is attempted but fails due to missing system-level dependencies (HDF5). Following this, an attempt to start a Kedro IPython session also fails due to an issue with loading dataset definitions, which may or may not be related to the prior installation issue.

  1. I tried to install kedro-datasets[pandas] with pip for Kedro and pandas use.
  2. Installation needed more packages. One, probably the tables package, required HDF5 and failed to build.
  3. The failure was due to missing HDF5 headers; the error mentioned 'H5public.h' not found.
  4. After this, I tried to start a Kedro IPython session, which didn't work.
  5. Kedro couldn't load the project because of a DatasetError.

@astrojuanlu
Copy link
Member Author

The installation issue was tracked in kedro-org/kedro-plugins#402

For clarity, the problem here is that the "is this a typo?" error message seems to hint that the dataset name is mistyped, when in fact the dependencies are missing.

I'm just reopening.

@astrojuanlu astrojuanlu reopened this Feb 14, 2024
@github-project-automation github-project-automation bot moved this from Done to In Progress in Kedro Framework Feb 14, 2024
@noklam
Copy link
Contributor

noklam commented Feb 14, 2024

For the record, the problem that we used to have is fixed, this is failing for a separate reason. This happened because the installation fails, so NOTHING is installed, not just hdf5. It is prompting for typo because kedro-datasets doesn't exist, it's not because of individual dataset dependency.

You will see same error if you put type: RANDOM.padnas.CSVDataset. We have two options here:

  1. Either treat kedro-datasets as a special case, and we check separately it kedro-datasets is importable.
  2. Remove the warning for typo, this is still not good because it will not warn "kedro-datasets" is not installed either.

I tend to think this is a niche case and the priority is lower. I remove the estimate and move it back to inbox

@astrojuanlu
Copy link
Member Author

Addressed in #3952 @ankatiyar ?

@astrojuanlu
Copy link
Member Author

I'm guessing so. Closing, but feel free to reopen if there's anything left.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment