Add task sla and timeout support. #1317

t0momi219 · 2024-11-12T07:36:00Z

Description

In Airflow, both DAGs and tasks can have a timeout and an SLA. Since the expected execution time of dbt models likely varies by layer, users may want to apply SLAs to each node individually.

How about having Cosmos retrieve timeout and SLA times from the node metadata and apply them individually when rendering nodes?

I've added two parameters to RenderConfig, one for timeouts and one for SLAs. Enabling them applies per-model timeouts and SLAs to the individual tasks.

Specifically, the expected time (in seconds) is set in each model's config, and Cosmos reads it when rendering the corresponding task.

```yaml
version: 2
models:
  - name: stg_customers
    config:
      model_timeout: 30  # New
  - name: stg_orders
    config:
      model_sla: 60  # New
```

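```python
from cosmos import RenderConfig

# model_timeout and model_sla are the flags proposed in this PR.
render_config = RenderConfig(
    model_timeout=True,
    model_sla=True,
)
```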

Related Issue(s)

closes #1316

Breaking Change?

Checklist

I have made corresponding changes to the documentation (if required)
I have added tests that prove my fix is effective or that my feature works

t0momi219 · 2024-11-12T07:40:20Z

cosmos/config.py

```diff
+    model_timeout: bool = False
+    model_sla: bool = False
```


This PR adds two parameters, but I'm unsure whether it's advisable to add more parameters to Cosmos without careful consideration.

I'd like to request a review. What do you think? If you have a better approach in mind, please let me know. I'll adjust accordingly and add documentation and tests.

I'm wondering whether we need this config. Can we not apply the timeouts directly when they're specified, and leave them unset otherwise?

You're absolutely right. Rather than adding parameters, it would be better to set the timeout and SLA automatically, only when the metadata is present.

I'm not sure why I didn't think of that. Thank you very much.

pankajkoti

Good idea. Can we show an example of this in one of the DAGs and potentially also mention it somewhere in our docs?

pankajkoti · 2024-11-12T07:45:30Z

cosmos/airflow/graph.py

```diff
+                logger.error(f'model_timeout: {node.config["model_timeout"]} in values')
+                args["execution_timeout"] = timedelta(seconds=int(node.config["model_timeout"]))
+            if model_sla and "model_sla" in node.config.keys():
+                args["sla"] = timedelta(seconds=int(node.config["model_sla"]))
```


Airflow SLAs are known to have a number of issues.

The Airflow 3 meeting notes mention that SLAs will be dropped in Airflow 3, with something better to follow in a later 3.x release. Should we hold off on adding SLA support here? WDYT?

It was agreed that, for now, we will mark it for removal in Airflow 3.0 and add it back in 3.1. We will reassess in the coming dev calls if anything changes. Elad has also offered to help if needed.

The issues with the current SLA implementation are outlined here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=247828059#AIP57RefactorSLAFeature-ProblemsintheCurrentState.

I’m concerned that if we add SLA support in Cosmos now, any problems stemming from the underlying implementation could lead users to report it as a Cosmos error rather than an Airflow issue. Additionally, since SLAs are set to be removed in Airflow 3 and are undergoing a planned refactor per this AIP, it may be best to hold off on adding this for now, IMO.

Understood. I'll treat SLA support as out of scope. I wasn't aware of the changes planned for Airflow 3.0. Thank you for letting me know.
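Taken together, the two review outcomes (no new `RenderConfig` flags, and SLA support out of scope) reduce the rendering change to setting `execution_timeout` only when a model defines `model_timeout`. A minimal sketch, assuming the model config is a plain dict and values are seconds; `parse_timeout_seconds` and `timing_args_from_config` are hypothetical names, and the positive-value check is an added assumption the PR does not include.

```python
from datetime import timedelta
from typing import Any, Dict, Optional


def parse_timeout_seconds(raw: Any) -> Optional[timedelta]:
    """Convert a model_timeout value (seconds) into a timedelta.

    Returns None when the value is absent. Raises ValueError if int() cannot
    parse it or the result is not positive (assumption: the PR only calls int()).
    """
    if raw is None:
        return None
    seconds = int(raw)
    if seconds <= 0:
        raise ValueError(f"model_timeout must be positive, got {raw!r}")
    return timedelta(seconds=seconds)


def timing_args_from_config(node_config: Dict[str, Any]) -> Dict[str, timedelta]:
    """Hypothetical helper: no opt-in flag, the key's presence is the switch."""
    args: Dict[str, timedelta] = {}
    timeout = parse_timeout_seconds(node_config.get("model_timeout"))
    if timeout is not None:
        args["execution_timeout"] = timeout
    # model_sla is intentionally ignored: SLA support is out of scope for now.
    return args


print(timing_args_from_config({"model_timeout": "30"}))
# -> {'execution_timeout': datetime.timedelta(seconds=30)}
print(timing_args_from_config({"model_sla": 60}))
# -> {}
```

Using the key's presence as the switch means projects that never set `model_timeout` render exactly as before, which is what makes the extra flags unnecessary.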

Add task sla and timeout support.

5e54e08

dosubot bot added the size:S label ("This PR changes 10-29 lines, ignoring generated files.") on Nov 12, 2024

t0momi219 had a problem deploying to external on November 12, 2024, 07:36 with GitHub Actions (Error)

🎨 [pre-commit.ci] Auto format from pre-commit.com hooks

054e291

dosubot bot added the area:rendering label ("Related to rendering, like Jinja, Airflow tasks, etc") on Nov 12, 2024

pre-commit-ci bot requested a deployment to external on November 12, 2024, 07:36 (Waiting)

t0momi219 commented Nov 12, 2024


pankajkoti reviewed Nov 12, 2024


pankajkoti requested review from tatiana and pankajastro November 12, 2024 07:48

t0momi219 mentioned this pull request Nov 13, 2024

To support task display_name #1278 (Open, 2 tasks)
