AIP-82 Save references between assets and triggers #43826

Open

vincbeck wants to merge 1 commit into apache:main from aws-mwaa:vincbeck/aip-82-save-references

+113 −8

Contributor

vincbeck commented Nov 8, 2024

Resolves #42510.

This PR adds a new attribute, watchers, to the Asset class and saves references between assets and triggers in the DB. For example:

trigger = SqsSensorTrigger(sqs_queue="my_queue")
asset = Asset("example_asset_watchers", watchers=[trigger])

with DAG(
    dag_id="example_dataset_watcher",
    schedule=[asset],
    catchup=False,
):
    task = EmptyOperator(task_id="task",)

    chain(task)

This PR creates the trigger in the DB if it does not exist and saves the reference between the asset and the trigger.


vincbeck requested review from uranusjr, Lee-W and dstandish

November 8, 2024 14:54

vincbeck requested review from jedcunningham, ephraimbuddy, XD-DENG and ashb as code owners

November 8, 2024 14:54

boring-cyborg bot added area:Scheduler area:task-sdk labels

vincbeck commented

View reviewed changes

task_sdk/src/airflow/sdk/definitions/dag.py Show resolved Hide resolved

vincbeck force-pushed the vincbeck/aip-82-save-references branch from 947e028 to 598bc69 Compare

November 8, 2024 16:01

Contributor Author

vincbeck commented Nov 8, 2024 (edited)

@Lee-W @uranusjr When working on it I realized that assets are added in the DB from DAG definition but never removed (or at least I did not see the code). Meaning, as a DAG author if I define an asset in my DAG and then later on remove it, the asset is never removed from the DB. Am I wrong? If not, is it intended?

Member

Lee-W commented Nov 9, 2024

> @Lee-W @uranusjr When working on it I realized that assets are added in the DB from DAG definition but never removed (or at least I did not see the code). Meaning, as a DAG author if I define an asset in my DAG and then later on remove it, the asset is never removed from the DB. Am I wrong? If not, is it intended?

Yep, this is by design as of now. To keep the asset history.

vincbeck force-pushed the vincbeck/aip-82-save-references branch 2 times, most recently from d27cab5 to 682b713 Compare

November 12, 2024 15:29

Contributor Author

vincbeck commented Nov 12, 2024

> > @Lee-W @uranusjr When working on it I realized that assets are added in the DB from DAG definition but never removed (or at least I did not see the code). Meaning, as a DAG author if I define an asset in my DAG and then later on remove it, the asset is never removed from the DB. Am I wrong? If not, is it intended?

> Yep, this is by design as of now. To keep the asset history.

Alright, thank you. I handled it then: I now remove the references between an asset and its triggers if the asset is no longer used.

Contributor Author

vincbeck commented Nov 13, 2024

@Lee-W Any chance you can review it? You have some experience around assets that could be interesting to have :)

Member

Lee-W commented Nov 14, 2024

> @Lee-W Any chance you can review it? You have some experience around assets that could be interesting to have :)

Sure thing :) Will take a look later today

Lee-W reviewed

airflow/dag_processing/collection.py Outdated

+                              # Create the trigger in the DB if it does not exist
+                              if not trigger_model:
+                                  trigger_model = Trigger.from_object(trigger_class_path_to_asset_dict[trigger_class_path])
+                                  session.add(trigger_model)

Member

Lee-W Nov 14, 2024

Not sure whether collecting all the models together and using add_all would be better 🤔

Member

uranusjr Nov 14, 2024

Collecting all model objects first is cleaner code IMO; this loop + add + append approach is a lot more difficult to read. Also the repeated scalar + limit call is not very performant; it is better to select all the existing triggers first in one query.

Contributor Author

vincbeck Nov 14, 2024

Thank you for your suggestions, I tried to apply them. Please let me know if this is what you had in mind.
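
For readers following this thread, the batching idea can be sketched roughly as follows. This is not the code from the PR: it uses the standard-library sqlite3 module instead of SQLAlchemy, and the `triggers` table, its `classpath` column, and the `ensure_triggers` helper are all hypothetical names chosen for illustration. The point is only the shape of the approach: one query to find what already exists, then one bulk insert for what is missing, instead of a lookup and an add per trigger.

```python
import sqlite3


def ensure_triggers(conn, classpaths):
    """Return {classpath: id}, inserting rows only for classpaths not yet stored.

    Hypothetical helper: table and column names are assumptions, not Airflow's schema.
    """
    wanted = sorted(set(classpaths))
    if not wanted:
        return {}
    placeholders = ",".join("?" * len(wanted))
    # One SELECT covering every candidate, instead of one lookup per trigger.
    existing = {
        row[0]
        for row in conn.execute(
            f"SELECT classpath FROM triggers WHERE classpath IN ({placeholders})",
            wanted,
        )
    }
    missing = [(cp,) for cp in wanted if cp not in existing]
    # One bulk insert, playing the role that session.add_all(...) plays in SQLAlchemy.
    conn.executemany("INSERT INTO triggers (classpath) VALUES (?)", missing)
    return {
        cp: id_
        for id_, cp in conn.execute(
            f"SELECT id, classpath FROM triggers WHERE classpath IN ({placeholders})",
            wanted,
        )
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triggers (id INTEGER PRIMARY KEY, classpath TEXT UNIQUE)")
conn.execute("INSERT INTO triggers (classpath) VALUES ('pkg.SqsSensorTrigger')")
ids = ensure_triggers(conn, ["pkg.SqsSensorTrigger", "pkg.OtherTrigger"])
```

Calling `ensure_triggers` a second time with the same class paths inserts nothing, since every candidate is found by the first query.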

uranusjr reviewed

View reviewed changes

airflow/dag_processing/collection.py Outdated Show resolved Hide resolved

uranusjr reviewed

View reviewed changes

airflow/dag_processing/collection.py Outdated

Comment on lines 471 to 477

+                      # Remove references from assets no longer used
+                      all_assets = session.scalars(select(AssetModel))
+                      # orphan_assets = set()
+                      for asset_model in all_assets:
+                          if (asset_model.name, asset_model.uri) not in self.assets:
+                              asset_model.triggers = []
+                              # orphan_assets.add(asset_model.id)

Member

uranusjr Nov 14, 2024

Do we need to do this actively? What happens if we just leave those associations there?

Contributor Author

vincbeck Nov 14, 2024

Then the trigger will keep updating the asset whenever events occur. More importantly, if we keep the association between the asset and the trigger, it will be impossible to clean up these triggers. I want to be able to remove triggers that are not used (meaning, not associated with any task or asset). Otherwise they will keep polling an external resource indefinitely, which could be very costly.

On that same topic, when doing some testing, I noticed that this function is called per DAG (am I wrong?). As a consequence, this piece of code removes the associations I just created before. I need to fix that.

Contributor Author

vincbeck Nov 14, 2024

Fixed
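
The two concerns in this thread (clearing references on unused assets, and not clearing them based on a single DAG's view) can be illustrated with a small in-memory sketch. It is not the PR's implementation: plain dicts and sets stand in for the DB models, and `clear_orphan_references` is a hypothetical name. It also ignores task-to-trigger associations, which the real cleanup would have to consider as well.

```python
def clear_orphan_references(asset_triggers, dag_schedules):
    """Drop trigger references for assets that no DAG schedules on.

    asset_triggers: dict mapping asset name -> set of trigger ids (mutated in place)
    dag_schedules:  dict mapping dag_id -> set of asset names used as schedule
    Returns the trigger ids that are no longer referenced by any asset.
    """
    # Union over *all* DAGs. Looking only at the DAG currently being processed
    # would wipe references that other DAGs still rely on.
    used = set().union(*dag_schedules.values())
    before = set().union(*asset_triggers.values())
    for name, triggers in asset_triggers.items():
        if name not in used:
            triggers.clear()
    after = set().union(*asset_triggers.values())
    return before - after


asset_triggers = {"a": {1}, "b": {1, 2}, "c": {3}}
dag_schedules = {"dag1": {"a"}, "dag2": set()}
unreferenced = clear_orphan_references(asset_triggers, dag_schedules)
```

In this example only asset "a" is still scheduled on, so triggers 2 and 3 end up unreferenced and become candidates for removal, while trigger 1 stays because "a" still points to it.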

vincbeck force-pushed the vincbeck/aip-82-save-references branch from 682b713 to c2660ea Compare

November 14, 2024 19:27

vincbeck requested a review from hussein-awala as a code owner

November 14, 2024 19:27

vincbeck force-pushed the vincbeck/aip-82-save-references branch 3 times, most recently from fe5d227 to 9543a51 Compare

November 14, 2024 20:24


          AIP-82 Save references between assets and triggers

c4c5c3e

vincbeck force-pushed the vincbeck/aip-82-save-references branch from 9543a51 to c4c5c3e Compare

November 14, 2024 20:59

Lee-W reviewed

View reviewed changes

airflow/dag_processing/collection.py

+                              ]
+                      # Remove references from assets no longer used
+                      orphan_assets = session.scalars(

Member

Lee-W Nov 15, 2024

@uranusjr do we need to check AssetActive here?

Member

uranusjr Nov 15, 2024

An asset without an AssetActive entry is not referenced anywhere, and triggering an event on such an asset will therefore simply do nothing. So not checking AssetActive here is not useful in practice, but maybe theoretically a possibility? It depends on what we want the user to be able to do, I guess. @vincbeck Do you think a user should be able to trigger an event on an asset that does not actually exist in any DAGs?

Contributor Author

vincbeck Nov 15, 2024

Interesting, I did not know about the notion of AssetActive, maybe I could use it.

> Do you think a user should be able to trigger an event on an asset that does not actually exist in any DAGs?

Absolutely not, that's what I am doing (or trying to do) here, but even further. orphan_assets contains all the assets not used by any DAG as a schedule. In other words, no DAG uses an asset in orphan_assets as a schedule condition. I am removing all references from these assets since they are not used to schedule DAGs. The way I understand it, all assets with an AssetActive entry are a subset of orphan_assets, right?

uranusjr reviewed

View reviewed changes

airflow/assets/__init__.py

+                      *,
+                      group: str = "",
+                      extra: dict | None = None,
+                      watchers: list[BaseTrigger] | None = None,

Member

uranusjr Nov 15, 2024

I wonder if watcher is a good name for this. What do we expect this to do? If I understand AIP-82 correctly, an external event would fire the trigger, and the trigger would create events for assets associated to it.

Assuming my understanding is correct, the triggers here are not watchers of the asset; rather, the asset watches the triggers. The relationship is the other way around. So it is probably better to call this watch instead? Or maybe this attribute should live on the trigger instead, something like

asset = Asset("example_asset_watchers")

trigger = SqsSensorTrigger(sqs_queue="my_queue", trigger=[asset])

DAG(..., schedule=[asset])

Tell me what you think on this.

Contributor Author

vincbeck Nov 15, 2024

Naming ... so hard haha. I see your point.

The reason I called it watchers is that the triggers will watch some external resource and send events on updates. In that sense, to me, the triggers are watchers. I am not strongly against watch if you think it makes more sense. To be very honest, between watchers and watch I don't mind, I think both of them make sense.

However, I definitely want the attribute on the asset class. I think it makes more sense, and it is a more deliberate choice for the user to say: I have this asset and I want this asset to be updated when these triggers fire.
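
To make the direction of the relationship discussed here concrete, the sketch below models it with stand-in classes. `FakeTrigger`, `FakeAsset`, `build_trigger_index`, and `assets_to_update` are hypothetical names, not Airflow APIs; the only thing taken from the PR is that the user declares triggers on the asset via `watchers`. The lookup that runs when a trigger fires goes the other way, from trigger to assets, which is the inversion uranusjr points out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FakeTrigger:
    """Stand-in for a trigger; frozen so instances are hashable."""
    name: str


@dataclass
class FakeAsset:
    """Stand-in for an asset declaring the triggers it is updated by."""
    name: str
    watchers: list = field(default_factory=list)


def build_trigger_index(assets):
    """Invert the asset -> triggers declaration into trigger -> asset names."""
    index = {}
    for asset in assets:
        for trigger in asset.watchers:
            index.setdefault(trigger, []).append(asset.name)
    return index


def assets_to_update(index, fired):
    """Names of the assets that should receive an event when `fired` fires."""
    return index.get(fired, [])


sqs = FakeTrigger("sqs:my_queue")
assets = [
    FakeAsset("orders", watchers=[sqs]),
    FakeAsset("invoices", watchers=[sqs]),
    FakeAsset("unwatched"),
]
index = build_trigger_index(assets)
updated = assets_to_update(index, sqs)
```

Whichever name is chosen for the attribute, the stored association is the same many-to-many mapping; the naming question is only about which side the user writes it on.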

Reviewers

Lee-W Lee-W left review comments

uranusjr uranusjr left review comments

dstandish Awaiting requested review from dstandish dstandish is a code owner

jedcunningham Awaiting requested review from jedcunningham jedcunningham is a code owner

ephraimbuddy Awaiting requested review from ephraimbuddy ephraimbuddy is a code owner

XD-DENG Awaiting requested review from XD-DENG XD-DENG is a code owner

ashb Awaiting requested review from ashb ashb is a code owner

hussein-awala Awaiting requested review from hussein-awala hussein-awala is a code owner

At least 1 approving review is required to merge this pull request.

Labels

area:Scheduler area:task-sdk