-
Notifications
You must be signed in to change notification settings - Fork 170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add task sla and timeout support. #1317
base: main
Are you sure you want to change the base?
Conversation
model_timeout: bool = False | ||
model_sla: bool = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR adds two parameters, but I’m unsure if it’s advisable to add too many parameters to Cosmos without careful consideration.
I’d like to request a review. What do you think? If you have any preferable ideas, please let me know. I’ll make adjustments accordingly and add documentation and tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering of whether we need this config. Can we not directly use timeouts if they're specified & otherwise not set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You’re absolutely right. Instead of adding parameters, it would have been better to set Timeout and SLA automatically only when metadata is present.
I wonder why I didn't think of it... thank you very much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea. Can we show an example of this in one of the DAGs and potentially also mention about it somewhere in our docs?
logger.error(f'model_timeout: {node.config["model_timeout"]} in values') | ||
args["execution_timeout"] = timedelta(seconds=int(node.config["model_timeout"])) | ||
if model_sla and "model_sla" in node.config.keys(): | ||
args["sla"] = timedelta(seconds=int(node.config["model_sla"])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Airflow SLAs are known to have few issues.
They mentioned in AF 3 meeting notes that they would be dropping SLAs of AF 3 & come up with something better in later versions in AF 3+. I am thinking if we should we hold on adding SLA support here? WDYT?
It was agreed that for now, we will mark it for removal in Airflow 3.0 and adding it back in 3.1. We will assess again in the coming dev calls if something changes. Elad has also offered to help if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The issues with the current SLA implementation are outlined here: https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=247828059#AIP57RefactorSLAFeature-ProblemsintheCurrentState.
I’m concerned that if we add SLA support in Cosmos now, any problems stemming from the underlying implementation could lead users to report it as a Cosmos error rather than an Airflow issue. Additionally, since SLAs are set to be removed in Airflow 3 and are undergoing a planned refactor per this AIP, it may be best to hold off on adding this for now, IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand well. I’ll consider SLA support to be out of scope. I wasn’t aware of the upcoming features planned for Airflow 3.0. Thank you for letting me know.
Description
In Airflow, both DAGs and tasks can have timeout and SLA times specified. Since dbt models likely have varying expected execution times for each layer, there could be cases where users want to apply SLAs individually to each node.
How about having Cosmos retrieve timeout and SLA times from the node metadata and apply them individually when rendering nodes?
I’ve added parameters to RenderConfig to specify Timeout and SLA. Enabling these allows individual timeouts and SLAs to be applied to tasks.
Specifically, the expected time is specified in the model's config, which will be read accordingly.
Related Issue(s)
closes #1316
Breaking Change?
Checklist