[Prototype] Node grouping #4427

ankatiyar · 2025-01-17T15:08:30Z

Description

Draft PR to demo - Resolve #4376

Development notes

Add attribute to pipeline to get dependencies and node grouped by namespace. (Suggestion on the name of the attribute welcome!)

This PR offers a version of the information from the group_by_namespace() method added in kedro-airflow in kedro-org/kedro-plugins#981

Questions to be considered:

Does it make sense to have this API in the Pipeline class?
Does the output of this make sense? Dict is returned with the following keys:
- name: name of the node or namespace
- type: either node or namespace
- nodes: list - either one node for nodes that are not namespaced, or a list of nodes under that namespace
- dependencies: either node or the namespace that this node depends on
Will probably split and parameterise the test

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

Read the contributing guidelines
Signed off each commit with a Developer Certificate of Origin (DCO)
Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the RELEASE.md file
Added tests to cover my changes
Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Ankita Katiyar <[email protected]>

DimedS

Thanks for the PR, @ankatiyar! That’s a great algorithm!

I have a few questions and comments:

I think the Pipeline class is an excellent place for this API.
I agree with you that we should not return just the names but also include object types. This would allow the plugin to understand what exactly needs to be executed (e.g., a node or a namespace).
Additionally, I think it would be more beneficial to return a topologically sorted list (similar to how pipeline.nodes currently works) instead of a dictionary. This way, the list can be executed in order.

Specifically, the format could be:
[object_name, object_type, full_list_of_nodes], where:
- object_name: The name of the object, such as a namespace or a node.
- object_type: The type of the object, either a namespace or a node.
- full_list_of_nodes: A list of all nodes included under the object_name. For example, if the object_name is a namespace, it will contain all the nodes within that namespace. If the object_name is a node, this list would just contain the node itself.
- list_of_dependencies (to consider): Perhaps we should include the list of dependencies in the same list, rather than separating it into another method. This way, all the information needed for deployment would be available in a single call.
Example structure:
```
[
    [ns1, namespace, [n1, n2, n3]],  # Namespace containing nodes n1, n2, n3
    [n4, node, [n4]]                 # Single node n4
]
```
Each element in the list would represent one deployment step that plugin should create. The full_list_of_nodes is included for informational purposes and isn’t required for execution.

We could also consider whether it would be beneficial to avoid coding the logic for handling node grouping separately in each plugin. Instead, we could provide a generic API in Kedro, allowing plugins to query Kedro for what should be deployed. This API could accept optional parameters, such as the type of node grouping, and return a list of objects to be deployed. The plugin’s responsibility would then simply be to take this list and handle the pipeline conversion, streamlining the process.

Lastly, a small question: can node.name be equal to namespace.name? If so, how would the algorithm handle this scenario?

DimedS · 2025-01-27T12:07:14Z

kedro/pipeline/pipeline.py

+        Returns:
+            The pipeline nodes dependencies grouped by namespace.
+        """
+        node_dependencies_by_namespace = defaultdict(dict)


minor: maybe should be defaultdict(set) to avoid lines 407-409

Signed-off-by: Ankita Katiyar <[email protected]>

DimedS

Thanks for addressing the comments, @ankatiyar! It looks good to me - let's see how it works with Airflow!

Signed-off-by: Ankita Katiyar <[email protected]>

merelcht

This looks good and I do think it fits within Pipeline.

Am I correct in thinking that if we add this method in the framework there's then no longer a need to add methods into plugins as well, like you now did in airflow in https://github.com/kedro-org/kedro-plugins/pull/981/files?

I was wondering what the "type" is used for?

ankatiyar · 2025-02-04T11:31:14Z

Am I correct in thinking that if we add this method in the framework there's then no longer a need to add methods into plugins as well, like you now did in airflow in https://github.com/kedro-org/kedro-plugins/pull/981/files?

There would still need to be some code added to the plugins to call this method but the plugins wouldn't need to implement the logic to figure out the groups and the dependencies

I was wondering what the "type" is used for?

I believe it was for plugins to determine if the group is a node to then call kedro run --nodes <nodename> or kedro run --namespace <namespace> but I think we could do without it too and simply pass a list of nodes to kedro run --nodes <node_1>,<node_2>

merelcht · 2025-02-04T14:09:29Z

There would still need to be some code added to the plugins to call this method but the plugins wouldn't need to implement the logic to figure out the groups and the dependencies

Yes that makes sense 👍

I believe it was for plugins to determine if the group is a node to then call kedro run --nodes <nodename> or kedro run --namespace <namespace> but I think we could do without it too and simply pass a list of nodes to kedro run --nodes <node_1>,<node_2>

Ah okay. I don't think it functionally makes a difference, but might be easier for the user to understand what's happening if we clearly differentiate between --namespace and --nodes. Or is this not something the user will see?

I've played around with the airflow prototype a bit, and one thing I feel like is important here is to make sure that the namespace grouping is actually executable. If a user makes a mistake in grouping and for example forgets to add a node in a namespace which is required for the rest to run, we should error out as early on as possible so that the user doesn't have to go through a whole deployment and only then discover the grouping was wrong.

ElenaKhaustova

Left some comments on complexity, otherwise looks good to me

ElenaKhaustova · 2025-02-04T17:42:41Z

kedro/pipeline/pipeline.py

+                grouped_nodes[key] = {}
+                grouped_nodes[key]["name"] = key
+                grouped_nodes[key]["type"] = "namespace" if node.namespace else "node"
+            grouped_nodes[key]["nodes"] = [*grouped_nodes[key].get("nodes", []), node]


I think we have O(Nˆ2) complexity here, as every time we add a node, we recreate a list based on what was already there. So maybe it makes sense to initialize grouped_nodes[key]["nodes"] and then just append to it.

ElenaKhaustova · 2025-02-04T17:43:15Z

kedro/pipeline/pipeline.py

+                    continue
+                else:
+                    dependencies.add(parent.name)
+            grouped_nodes[key]["dependencies"] = (


And here is the same as the above

ElenaKhaustova · 2025-02-04T18:00:54Z

I think it would be more beneficial to return a topologically sorted list (similar to how pipeline.nodes currently works) instead of a dictionary. This way, the list can be executed in order.

Since dictionaries are ordered by default starting from python3.8 the resulting dictionary is topologically sorted as we build it by iterating topologically sorted nodes. Also, dictionary representation looks more convenient to me as the whole namespace can be accessed.

ElenaKhaustova · 2025-02-04T18:05:26Z

I've played around with the airflow prototype a bit, and one thing I feel like is important here is to make sure that the namespace grouping is actually executable. If a user makes a mistake in grouping and for example forgets to add a node in a namespace which is required for the rest to run

That's a valid point

Add deployment related attributes

2ff33b2

Signed-off-by: Ankita Katiyar <[email protected]>

ankatiyar requested a review from DimedS January 17, 2025 15:08

DimedS reviewed Jan 27, 2025

View reviewed changes

ankatiyar added 2 commits January 27, 2025 16:29

Merge branch 'main' into grouping

0d6461a

Update with feedback

ef4d664

Signed-off-by: Ankita Katiyar <[email protected]>

DimedS approved these changes Jan 30, 2025

View reviewed changes

ankatiyar added 2 commits January 31, 2025 16:48

Minor formatting

a299ef3

Signed-off-by: Ankita Katiyar <[email protected]>

Add test

b79f1f5

Signed-off-by: Ankita Katiyar <[email protected]>

ankatiyar marked this pull request as ready for review February 3, 2025 15:50

ankatiyar requested a review from merelcht as a code owner February 3, 2025 15:50

ankatiyar and others added 2 commits February 3, 2025 15:51

Update test

a8b1f45

Signed-off-by: Ankita Katiyar <[email protected]>

Merge branch 'main' into grouping

9d736b9

ankatiyar changed the title ~~[Draft] Node grouping~~ [Prototype] Node grouping Feb 3, 2025

ankatiyar requested review from lrcouto and ElenaKhaustova February 3, 2025 15:52

merelcht reviewed Feb 4, 2025

View reviewed changes

ElenaKhaustova reviewed Feb 4, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Prototype] Node grouping #4427

[Prototype] Node grouping #4427

ankatiyar commented Jan 17, 2025 •

edited

Loading

DimedS left a comment

DimedS Jan 27, 2025

DimedS left a comment

merelcht left a comment

ankatiyar commented Feb 4, 2025

merelcht commented Feb 4, 2025

ElenaKhaustova left a comment

ElenaKhaustova Feb 4, 2025

ElenaKhaustova Feb 4, 2025

ElenaKhaustova commented Feb 4, 2025

ElenaKhaustova commented Feb 4, 2025

[Prototype] Node grouping #4427

Are you sure you want to change the base?

[Prototype] Node grouping #4427

Conversation

ankatiyar commented Jan 17, 2025 • edited Loading

Description

Development notes

Developer Certificate of Origin

Checklist

DimedS left a comment

Choose a reason for hiding this comment

DimedS Jan 27, 2025

Choose a reason for hiding this comment

DimedS left a comment

Choose a reason for hiding this comment

merelcht left a comment

Choose a reason for hiding this comment

ankatiyar commented Feb 4, 2025

merelcht commented Feb 4, 2025

ElenaKhaustova left a comment

Choose a reason for hiding this comment

ElenaKhaustova Feb 4, 2025

Choose a reason for hiding this comment

ElenaKhaustova Feb 4, 2025

Choose a reason for hiding this comment

ElenaKhaustova commented Feb 4, 2025

ElenaKhaustova commented Feb 4, 2025

ankatiyar commented Jan 17, 2025 •

edited

Loading