kedro marimo + better programmatic setup for non jupyter/ipython envs #4440

Open
lucharo opened this issue Jan 24, 2025 · 6 comments
Labels
Community Issue/PR opened by the open-source community Issue: Feature Request New feature or improvement to existing feature

Comments

@lucharo

lucharo commented Jan 24, 2025

Description

I'm frustrated by Kedro's heavy reliance on IPython magics (%reload_kedro) for notebook setup. While this works for Jupyter, it creates barriers for:

  • Modern notebook interfaces like marimo that don't support IPython magics
  • Script-based workflows where magics aren't available, though kedro run is often used for those
  • Preferring explicit setup over magic injections

I first encountered this issue when importing data from the catalog in a marimo notebook. If kedro-project/ is the root of my project, my notebook sits in kedro-project/nbs/notebook.py. I find that if I run the notebook from kedro-project/ or from kedro-project/nbs/ the behaviour isn't the same. I currently use this code to load the config:

from pathlib import Path

from kedro.config import OmegaConfigLoader
from kedro.framework.project import settings
from kedro.io import DataCatalog

conf_loader = OmegaConfigLoader(
    conf_source=Path(__file__).parent / settings.CONF_SOURCE,
    default_run_env="base",
)

catalog = DataCatalog.from_config(
    conf_loader["catalog"], credentials=conf_loader["credentials"]
)
mytable = catalog.load("mytable")

from kedro-project/nbs:

marimo edit notebook.py

-> ❌ fails as it thinks the filepath for mytable is relative to the current directory instead of the project's root

from kedro-project:

marimo edit nbs/notebook.py

-> ✅ works fine because we are in the project's root

my catalog looks something like:

mytable:
  type: pandas.SQLQueryDataset
  credentials: postgres_dwh
  filepath: conf/base/sql/mytable.sql
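
The difference between the two launch directories can be reproduced without Kedro. This is a minimal sketch, assuming the dataset resolves the relative filepath against the current working directory, which is what the failure above suggests. The directory layout is recreated in a temporary folder, and the helper name exists_from is made up for illustration:

```python
import os
import tempfile
from pathlib import Path

# Same relative path as the catalog entry above.
RELATIVE_SQL = Path("conf/base/sql/mytable.sql")


def exists_from(cwd: Path) -> bool:
    """Return True if RELATIVE_SQL points at a file when resolved from `cwd`."""
    previous = Path.cwd()
    os.chdir(cwd)
    try:
        # resolve() anchors a relative path at the current working directory.
        return RELATIVE_SQL.resolve().is_file()
    finally:
        os.chdir(previous)


with tempfile.TemporaryDirectory() as tmp:
    # Recreate kedro-project/ with an nbs/ folder and the SQL file.
    root = Path(tmp).resolve() / "kedro-project"
    (root / "nbs").mkdir(parents=True)
    sql_file = root / RELATIVE_SQL
    sql_file.parent.mkdir(parents=True)
    sql_file.write_text("SELECT 1;", encoding="utf-8")

    found_from_root = exists_from(root)         # like `marimo edit nbs/notebook.py`
    found_from_nbs = exists_from(root / "nbs")  # like `marimo edit notebook.py`
```

Under that assumption, the file is found only when the process starts in the project root, which matches the two outcomes described above.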

Context

The goal is to use kedro in marimo notebooks more easily. It has been suggested that I parametrise the catalog with a file path, but I don't think that's a very nice solution. Hopefully this can be handled automatically, without magics, outside of jupyter/ipython.

Possible Implementation

I've dug around the %reload_kedro magic code and I think we could reuse some of it. I haven't fully tested this, but I think something like the following should work:

from pathlib import Path

from kedro.framework.session import KedroSession
from kedro.framework.startup import bootstrap_project
from kedro.utils import _find_kedro_project

# Get project root (_find_kedro_project is currently a private helper)
notebook_path = Path(__file__).resolve().parent
project_root = _find_kedro_project(notebook_path)

# Configure the project (settings, pipelines) before creating a session
bootstrap_project(project_root)

# Now create the session
with KedroSession.create(project_path=project_root) as session:
    context = session.load_context()
    catalog = context.catalog
    weekly_sales = catalog.load("mytable")

The point here is to make _find_kedro_project available in the public API and document this way of setting up a kedro session for non jupyter users.
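
For illustration, the kind of lookup such a public helper would perform can be sketched with the standard library alone. This is not Kedro's implementation: it assumes a project root can be recognised by a pyproject.toml that contains a [tool.kedro] section, and the name find_project_root is hypothetical:

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "[tool.kedro]") -> Optional[Path]:
    """Walk up from `start`; return the first directory whose pyproject.toml mentions `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        pyproject = candidate / "pyproject.toml"
        if pyproject.is_file() and marker in pyproject.read_text(encoding="utf-8"):
            return candidate
    # No matching pyproject.toml anywhere above `start`.
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Fake project: kedro-project/pyproject.toml plus a nested nbs/ folder.
    root = Path(tmp).resolve() / "kedro-project"
    (root / "nbs").mkdir(parents=True)
    (root / "pyproject.toml").write_text("[tool.kedro]\n", encoding="utf-8")

    # Starting from the notebook folder still finds the project root.
    found = find_project_root(root / "nbs")
    found_is_root = found == root
```

A notebook could call such a helper with Path(__file__).resolve().parent and pass the result to the session setup, whichever directory the editor was launched from.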

PS

  • I could be wrong about some of the assumptions I have made, perhaps the issue is specific to catalog.load but I think kedro could benefit from a programmatic magic-free way of connecting to the project's config/catalog
  • I connected with @mscolnick from the marimo team on perhaps introducing kedro marimo command, I think it would be a nice merging of 2 worlds.
@lucharo lucharo added the Issue: Feature Request New feature or improvement to existing feature label Jan 24, 2025
@merelcht merelcht added the Community Issue/PR opened by the open-source community label Jan 24, 2025
@github-project-automation github-project-automation bot moved this to Wizard inbox in Kedro Wizard 🪄 Jan 24, 2025
@noklam
Contributor

noklam commented Jan 24, 2025

I read the thread in marimo but I cannot signup/login to Discord so I can only comment here.

Most of what the magic function does is make sure the root path is set correctly. That matters little when you use absolute paths, but with relative paths Python cares where you actually execute the command from.

The function itself is rather simple; it's mostly syntactic sugar to get started quickly, but I suppose we should have some documentation on how to do this programmatically. As long as the session is created in the correct path, it should work.

What exactly is failing from marimo? In any case I think it will be an easy fix.

https://docs.kedro.org/en/stable/api/kedro.framework.session.session.KedroSession.html#kedro.framework.session.session.KedroSession

The main way to run kedro programmatically is to create a Kedro Session; this is what the magic command does behind the scenes. The link above provides an example. Basically, you need to provide the correct path to the bootstrap function and you should be good to go.
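
Since the working directory is what matters for relative paths, one possible workaround is to switch to the project root only while the session is in use. This is a minimal standard-library sketch, not something Kedro provides; the name working_directory is made up for illustration:

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def working_directory(path: Path) -> Iterator[Path]:
    """Temporarily switch the process working directory to `path`."""
    previous = Path.cwd()
    os.chdir(path)
    try:
        yield Path.cwd()
    finally:
        # Always restore the original directory, even if the body raises.
        os.chdir(previous)


with tempfile.TemporaryDirectory() as tmp:
    before = Path.cwd()
    with working_directory(Path(tmp)) as inside:
        # Session creation and catalog.load(...) calls would go here.
        inside_matches = inside == Path(tmp).resolve()
    restored = Path.cwd() == before
```

Changing the working directory affects the whole process, so this suits a single notebook or script better than code that runs several threads.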

@deepyaman
Member

Agree with @noklam that you should be able to create your KedroSession even without the IPython magic: https://docs.kedro.org/en/stable/kedro_project_setup/session.html

The magic just makes it slightly more convenient.

I connected with @mscolnick from the marimo team on perhaps introducing kedro marimo command, I think it would be a nice merging of 2 worlds.

This is much broader scoped, but IMO would be very valuable. I actually started looking into this last year, but it would be very helpful to try to figure out what we want to accomplish. In the past, we had kedro jupyter convert (eventually deprecated due to low use), but the goal was to move people away from Jupyter notebooks into more production-ready data pipelines.

Marimo is different (I love the kedro marimo edit phrasing) in that Marimo code is just Python. Furthermore, Marimo notebooks following best practices (wrapping cells in functions, clearly defining inputs and outputs) map well to Kedro nodes and pipelines.
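
To make the mapping concrete, here is a rough sketch of how functions with clearly named inputs and outputs already describe a small DAG. It uses neither Kedro nor Marimo APIs. NodeSpec, to_node_spec and execution_order are hypothetical names, and the sketch assumes that parameter names double as dataset names and that the graph has no cycles:

```python
import inspect
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class NodeSpec:
    name: str
    inputs: Tuple[str, ...]
    output: str


def to_node_spec(func: Callable, output: str) -> NodeSpec:
    """Treat the function's parameter names as the node's dataset inputs."""
    inputs = tuple(inspect.signature(func).parameters)
    return NodeSpec(name=func.__name__, inputs=inputs, output=output)


def execution_order(specs: List[NodeSpec]) -> List[str]:
    """Order node names so each node runs after the producers of its inputs."""
    produced_by = {spec.output: spec.name for spec in specs}
    by_name = {spec.name: spec for spec in specs}
    ordered: List[str] = []

    def visit(name: str) -> None:
        if name in ordered:
            return
        for dataset in by_name[name].inputs:
            if dataset in produced_by:  # skip external inputs such as raw data
                visit(produced_by[dataset])
        ordered.append(name)

    for spec in specs:
        visit(spec.name)
    return ordered


def clean(raw_sales):
    return [value for value in raw_sales if value is not None]


def total(clean_sales):
    return sum(clean_sales)


# Declared out of order on purpose; the dataset names determine the order.
specs = [to_node_spec(total, "sales_total"), to_node_spec(clean, "clean_sales")]
order = execution_order(specs)
```

Here order places clean before total because total consumes the dataset that clean produces, which is the same information a pipeline definition needs.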

Obviously needs some more thought, but I think the right way to think about this might actually be:

  1. Programmatically convert a Kedro pipeline to a Marimo notebook (and view using the Marimo UI)

    There are some things that need to be figured out here, like what should map to a notebook (a single registered pipeline? a single modular pipeline?). I think a modular pipeline is a good start for a PoC.

    Also, I've always wondered how to best represent datasets. Does Marimo have a standardized way of representing external resources? Else this can just be Python function calls to read and write the data.

    Finally, does it have to be a .py file, or is there some in-memory representation (the AST)? One-way conversion is a very poor user experience, which brings us to...

  2. Figure out how to write back modifications to the Kedro project

A kedro marimo export could be a nice (easy) piece of functionality to add on, but this would honestly be the least useful thing to have.

@lucharo would also love to hear if this vision for a kedro marimo command—essentially Marimo as an editor for pipelines—is aligned with what would be useful to you, and what the Marimo team (@mscolnick @akshayka) are thinking/would be excited about.

Disclaimer: This is very much stream-of-consciousness and based on what I remember of looking into doing this in the past; may edit later.

@astrojuanlu
Member

Thanks a lot @lucharo for opening this issue!

I want to log that this is ultimately connected to Kedro's magic handling of relative paths in catalog.yml, first reported in #1934, discussed at length in a Tech Design session that didn't reach any substantive consensus (#2924 (comment)), and also in #2965.

Your solution using _find_kedro_project looks appropriate 👍🏼 and maybe indeed the first step should be making it part of the public API.

@akshayka

Does Marimo have a standardized way of representing external resources? Else this can just be Python function calls to read and write the data.

We don't currently, though it's an interesting idea.

Finally, does it have to be a .py file, or is there some in-memory representation (the AST)?

marimo does have an in-memory representation. There is no public API yet, but for the purposes of this integration we could experiment using internal APIs if needed

Figure out how to write back modifications to the Kedro project ... if this vision for a kedro marimo command—essentially Marimo as an editor for pipelines—is aligned with what would be useful to you, and what the Marimo team (@mscolnick @akshayka) are thinking/would be excited about.

"marimo as an editor for pipelines" is an exciting vision indeed. @lucharo, curious if it resonates with you?

@lucharo
Author

lucharo commented Feb 3, 2025

Hi all, apologies for the delayed response. I am glad that opening this issue has sparked enthusiasm from both the Kedro and Marimo teams.

I see this issue breaking down into 3 sub-issues/potential Kedro-Marimo crossovers, let me explain:

  1. Better compatibility of the Kedro catalog with general scripts as well as non-Jupyter notebook environments (e.g. Marimo; I don't know of any others tbh). This would include making the _find_kedro_project function part of the public API and adding a piece of documentation on how to locate the config/catalog automatically within a Kedro project (without using IPython magics). I will try out the KedroSession + _find_kedro_project logic and see if I find any issues @noklam

  2. what @deepyaman brings up. I am a big fan of moving Kedro users away from jupyter notebooks. Currently, in my team, pipelines are developed in Jupyter and then turned into kedro pipelines/scripts. It's a very sub-optimal process and the development to production gap is too long. I am a big fan of kedro-only workflows (dev + production all in one).

Kedro and Marimo are already similar enough for the symbiosis that @deepyaman suggests to happen, you both define DAGs and isolate nodes within functions. @deepyaman suggests turning kedro into marimo notebooks and I would like to suggest doing the opposite. Support kedro as an output format for marimo: marimo export kedro pipeline.py

Finally "marimo as an editor for pipelines" is indeed an exciting vision @akshayka, does it align with marimo's roadmap/mission?
