Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add task sla and timeout support. #1317

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

t0momi219
Copy link
Contributor

Description

In Airflow, both DAGs and tasks can have timeout and SLA times specified. Since dbt models likely have varying expected execution times for each layer, there could be cases where users want to apply SLAs individually to each node.

How about having Cosmos retrieve timeout and SLA times from the node metadata and apply them individually when rendering nodes?

I’ve added parameters to RenderConfig to specify Timeout and SLA. Enabling these allows individual timeouts and SLAs to be applied to tasks.

Specifically, the expected time is specified in the model's config, which will be read accordingly.

version: 2
models:
  - name: stg_customers
    config:
      model_timeout: 30 -- New
  - name: stg_orders
    config:
      model_sla: 60 -- New
RenderConfig(
    model_timeout=True,
    model_sla=True,
)

Related Issue(s)

closes #1316

Breaking Change?

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works

@dosubot dosubot bot added the size:S This PR changes 10-29 lines, ignoring generated files. label Nov 12, 2024
@dosubot dosubot bot added the area:rendering Related to rendering, like Jinja, Airflow tasks, etc label Nov 12, 2024
Comment on lines +83 to +84
model_timeout: bool = False
model_sla: bool = False
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR adds two parameters, but I’m unsure if it’s advisable to add too many parameters to Cosmos without careful consideration.

I’d like to request a review. What do you think? If you have any preferable ideas, please let me know. I’ll make adjustments accordingly and add documentation and tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering of whether we need this config. Can we not directly use timeouts if they're specified & otherwise not set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You’re absolutely right. Instead of adding parameters, it would have been better to set Timeout and SLA automatically only when metadata is present.

I wonder why I didn't think of it... thank you very much.

Copy link
Contributor

@pankajkoti pankajkoti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Can we show an example of this in one of the DAGs and potentially also mention about it somewhere in our docs?

logger.error(f'model_timeout: {node.config["model_timeout"]} in values')
args["execution_timeout"] = timedelta(seconds=int(node.config["model_timeout"]))
if model_sla and "model_sla" in node.config.keys():
args["sla"] = timedelta(seconds=int(node.config["model_sla"]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Airflow SLAs are known to have few issues.

They mentioned in AF 3 meeting notes that they would be dropping SLAs of AF 3 & come up with something better in later versions in AF 3+. I am thinking if we should we hold on adding SLA support here? WDYT?

It was agreed that for now, we will mark it for removal in Airflow 3.0 and adding it back in 3.1. We will assess again in the coming dev calls if something changes. Elad has also offered to help if needed.

Copy link
Contributor

@pankajkoti pankajkoti Nov 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issues with the current SLA implementation are outlined here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=247828059#AIP57RefactorSLAFeature-ProblemsintheCurrentState.

I’m concerned that if we add SLA support in Cosmos now, any problems stemming from the underlying implementation could lead users to report it as a Cosmos error rather than an Airflow issue. Additionally, since SLAs are set to be removed in Airflow 3 and are undergoing a planned refactor per this AIP, it may be best to hold off on adding this for now, IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand well. I’ll consider SLA support to be out of scope. I wasn’t aware of the upcoming features planned for Airflow 3.0. Thank you for letting me know.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:rendering Related to rendering, like Jinja, Airflow tasks, etc size:S This PR changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] Task timeout and SLA support
2 participants