Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Prototype] Node grouping #4427

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

[Prototype] Node grouping #4427

wants to merge 7 commits into from

Conversation

ankatiyar
Copy link
Contributor

@ankatiyar ankatiyar commented Jan 17, 2025

Description

Draft PR to demo - Resolve #4376

Development notes

Add attribute to pipeline to get dependencies and node grouped by namespace. (Suggestion on the name of the attribute welcome!)

This PR offers a version of the information from the group_by_namespace() method added in kedro-airflow in kedro-org/kedro-plugins#981

Questions to be considered:

  • Does it make sense to have this API in the Pipeline class?
  • Does the output of this make sense? Dict is returned with the following keys:
    • name: name of the node or namespace
    • type: either node or namespace
    • nodes: list - either one node for nodes that are not namespaced, or a list of nodes under that namespace
    • dependencies: either node or the namespace that this node depends on
  • Will probably split and parameterise the test

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@ankatiyar ankatiyar requested a review from DimedS January 17, 2025 15:08
Copy link
Member

@DimedS DimedS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, @ankatiyar! That’s a great algorithm!

I have a few questions and comments:

  1. I think the Pipeline class is an excellent place for this API.

  2. I agree with you that we should not return just the names but also include object types. This would allow the plugin to understand what exactly needs to be executed (e.g., a node or a namespace).

  3. Additionally, I think it would be more beneficial to return a topologically sorted list (similar to how pipeline.nodes currently works) instead of a dictionary. This way, the list can be executed in order.

    Specifically, the format could be:
    [object_name, object_type, full_list_of_nodes], where:

    • object_name: The name of the object, such as a namespace or a node.
    • object_type: The type of the object, either a namespace or a node.
    • full_list_of_nodes: A list of all nodes included under the object_name. For example, if the object_name is a namespace, it will contain all the nodes within that namespace. If the object_name is a node, this list would just contain the node itself.
    • list_of_dependencies (to consider): Perhaps we should include the list of dependencies in the same list, rather than separating it into another method. This way, all the information needed for deployment would be available in a single call.

    Example structure:

    [
        [ns1, namespace, [n1, n2, n3]],  # Namespace containing nodes n1, n2, n3
        [n4, node, [n4]]                 # Single node n4
    ]

    Each element in the list would represent one deployment step that plugin should create. The full_list_of_nodes is included for informational purposes and isn’t required for execution.

We could also consider whether it would be beneficial to avoid coding the logic for handling node grouping separately in each plugin. Instead, we could provide a generic API in Kedro, allowing plugins to query Kedro for what should be deployed. This API could accept optional parameters, such as the type of node grouping, and return a list of objects to be deployed. The plugin’s responsibility would then simply be to take this list and handle the pipeline conversion, streamlining the process.

Lastly, a small question: can node.name be equal to namespace.name? If so, how would the algorithm handle this scenario?

Returns:
The pipeline nodes dependencies grouped by namespace.
"""
node_dependencies_by_namespace = defaultdict(dict)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: maybe should be defaultdict(set) to avoid lines 407-409

Copy link
Member

@DimedS DimedS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing the comments, @ankatiyar! It looks good to me - let's see how it works with Airflow!

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
@ankatiyar ankatiyar marked this pull request as ready for review February 3, 2025 15:50
@ankatiyar ankatiyar requested a review from merelcht as a code owner February 3, 2025 15:50
@ankatiyar ankatiyar changed the title [Draft] Node grouping [Prototype] Node grouping Feb 3, 2025
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good and I do think it fits within Pipeline.

Am I correct in thinking that if we add this method in the framework there's then no longer a need to add methods into plugins as well, like you now did in airflow in https://github.com/kedro-org/kedro-plugins/pull/981/files?

I was wondering what the "type" is used for?

@ankatiyar
Copy link
Contributor Author

Am I correct in thinking that if we add this method in the framework there's then no longer a need to add methods into plugins as well, like you now did in airflow in https://github.com/kedro-org/kedro-plugins/pull/981/files?

There would still need to be some code added to the plugins to call this method but the plugins wouldn't need to implement the logic to figure out the groups and the dependencies

I was wondering what the "type" is used for?

I believe it was for plugins to determine if the group is a node to then call kedro run --nodes <nodename> or kedro run --namespace <namespace> but I think we could do without it too and simply pass a list of nodes to kedro run --nodes <node_1>,<node_2>

@merelcht
Copy link
Member

merelcht commented Feb 4, 2025

There would still need to be some code added to the plugins to call this method but the plugins wouldn't need to implement the logic to figure out the groups and the dependencies

Yes that makes sense 👍

I believe it was for plugins to determine if the group is a node to then call kedro run --nodes <nodename> or kedro run --namespace <namespace> but I think we could do without it too and simply pass a list of nodes to kedro run --nodes <node_1>,<node_2>

Ah okay. I don't think it functionally makes a difference, but might be easier for the user to understand what's happening if we clearly differentiate between --namespace and --nodes. Or is this not something the user will see?

I've played around with the airflow prototype a bit, and one thing I feel like is important here is to make sure that the namespace grouping is actually executable. If a user makes a mistake in grouping and for example forgets to add a node in a namespace which is required for the rest to run, we should error out as early on as possible so that the user doesn't have to go through a whole deployment and only then discover the grouping was wrong.

Copy link
Contributor

@ElenaKhaustova ElenaKhaustova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments on complexity, otherwise looks good to me

grouped_nodes[key] = {}
grouped_nodes[key]["name"] = key
grouped_nodes[key]["type"] = "namespace" if node.namespace else "node"
grouped_nodes[key]["nodes"] = [*grouped_nodes[key].get("nodes", []), node]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have O(Nˆ2) complexity here, as every time we add a node, we recreate a list based on what was already there. So maybe it makes sense to initialize grouped_nodes[key]["nodes"] and then just append to it.

continue
else:
dependencies.add(parent.name)
grouped_nodes[key]["dependencies"] = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here is the same as the above

@ElenaKhaustova
Copy link
Contributor

I think it would be more beneficial to return a topologically sorted list (similar to how pipeline.nodes currently works) instead of a dictionary. This way, the list can be executed in order.

Since dictionaries are ordered by default starting from python3.8 the resulting dictionary is topologically sorted as we build it by iterating topologically sorted nodes. Also, dictionary representation looks more convenient to me as the whole namespace can be accessed.

@ElenaKhaustova
Copy link
Contributor

I've played around with the airflow prototype a bit, and one thing I feel like is important here is to make sure that the namespace grouping is actually executable. If a user makes a mistake in grouping and for example forgets to add a node in a namespace which is required for the rest to run

That's a valid point

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Prototype for node grouping deployment solution using namespaces
4 participants