
Add support for loading manifest from cloud store using Airflow Object Store #1109

Merged: 11 commits, Jul 24, 2024

Conversation

Contributor

@pankajkoti commented Jul 21, 2024

Summary

This PR introduces the capability to load `manifest.json` files from various cloud storage services using Airflow's Object Store integration. Supported cloud storage services include AWS S3, Google Cloud Storage (GCS), and Azure Blob Storage. The feature allows seamless integration with remote paths, providing greater flexibility and scalability for managing dbt projects.

Key Changes

  1. New parameters in DbtDag and DbtTaskGroup:
    manifest_path: accepts both local paths and remote URLs (e.g., S3, GCS, Azure Blob Storage).
    manifest_conn_id: (optional) an Airflow connection ID for accessing the remote path.
  2. Automatic detection of storage type:
    The storage service is identified automatically from the scheme of the provided URL (e.g., s3://, gs://, abfs://) via the Airflow Object Store integration.
  3. If a manifest_conn_id is provided, it is used to fetch the necessary credentials.
  4. If no manifest_conn_id is provided, the default connection ID for the identified scheme is used.
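The detection and connection-resolution logic in points 2–4 can be sketched as follows. This is an illustrative sketch only, not the actual Cosmos implementation: the helper name and the scheme-to-connection mapping are assumptions (the real default connection IDs come from the respective Airflow providers).

```python
from typing import Optional, Tuple
from urllib.parse import urlparse

# Illustrative mapping from URL scheme to a default Airflow connection ID.
DEFAULT_CONN_IDS = {
    "s3": "aws_default",
    "gs": "google_cloud_default",
    "abfs": "wasb_default",
}


def resolve_manifest_source(
    manifest_path: str, manifest_conn_id: Optional[str] = None
) -> Tuple[Optional[str], Optional[str]]:
    """Return (scheme, conn_id) for a manifest path; both are None for local paths."""
    scheme = urlparse(manifest_path).scheme
    if not scheme:
        # A plain local path such as /path/to/manifest.json has no URL scheme.
        return None, None
    # An explicit manifest_conn_id wins; otherwise fall back to the
    # default connection ID registered for the detected scheme.
    conn_id = manifest_conn_id or DEFAULT_CONN_IDS.get(scheme)
    return scheme, conn_id
```

For example, `resolve_manifest_source("s3://bucket/manifest.json")` would detect the `s3` scheme and fall back to the default AWS connection, while passing an explicit `manifest_conn_id` overrides that fallback.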

Validation and Error Handling

  1. Validates the existence of the manifest.json file when a path is specified.
  2. Raises an appropriate error if a remote manifest_path is given but the minimum required Airflow version, 2.8 (which introduced the Object Store feature), is not available.
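The Airflow version guard described in point 2 can be sketched like this. The function name, error type, and message are illustrative assumptions; Cosmos has its own error class and version check.

```python
def ensure_object_storage_support(airflow_version: str, manifest_path: str) -> None:
    """Raise if a remote manifest path is used on Airflow < 2.8 (no Object Store)."""
    # Any path with a URL scheme separator is treated as remote for this sketch.
    is_remote = "://" in manifest_path
    major, minor = (int(part) for part in airflow_version.split(".")[:2])
    if is_remote and (major, minor) < (2, 8):
        raise RuntimeError(
            "Remote manifest paths require Airflow >= 2.8 (Object Store feature); "
            f"found Airflow {airflow_version}."
        )
```

Local paths pass through unchanged on any Airflow version, preserving backward compatibility.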

Backward Compatibility

Ensures compatibility with existing workflows that use local paths for the manifest.json.

How to Use

  1. Local path:
```
DbtDag(
    project_config=ProjectConfig(
        manifest_path="/path/to/local/manifest.json",
    ),
    ...
)
```
  2. Remote path (e.g., S3):
```
DbtDag(
    project_config=ProjectConfig(
        manifest_path="s3://bucket/path/to/manifest.json",
        manifest_conn_id="aws_s3_conn",
    ),
    ...
)
```
  3. Remote path without an explicit connection ID:
```
DbtDag(
    project_config=ProjectConfig(
        manifest_path="gs://bucket/path/to/manifest.json",
        # No manifest_conn_id provided, will use default Airflow GCS connection `google_cloud_default`
    ),
    ...
)
```

Additional Notes

  1. Ensure that the required Airflow version (2.8 or later) is used to take advantage of the Object Store features.
  2. Review the updated documentation for detailed usage instructions and examples.

Testing

  1. Added unit tests covering various scenarios, including local paths and remote paths with and without manifest_conn_id.
  2. Verified integration with different cloud storage services (AWS S3, GCS, Azure Blob Storage).
  3. Ensured backward compatibility with existing local path workflows.

Related Issue(s)

closes: #448

Breaking Change?

No.

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works


netlify bot commented Jul 21, 2024

Deploy Preview for sunny-pastelito-5ecb04 ready!

🔨 Latest commit: ecee4db
🔍 Latest deploy log: https://app.netlify.com/sites/sunny-pastelito-5ecb04/deploys/66a0a330e04d37000869023c
😎 Deploy Preview: https://deploy-preview-1109--sunny-pastelito-5ecb04.netlify.app


codecov bot commented Jul 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.50%. Comparing base (236488b) to head (77022c4).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1109      +/-   ##
==========================================
+ Coverage   96.48%   96.50%   +0.02%     
==========================================
  Files          64       64              
  Lines        3301     3321      +20     
==========================================
+ Hits         3185     3205      +20     
  Misses        116      116              


(Resolved review comments on pyproject.toml and .github/workflows/test.yml)
Collaborator

@tatiana left a comment

It looks great, @pankajkoti; great work!

I know this PR is still in draft, but some early feedback:

  • docs: we need to update the docs (both in ProjectConfig and in the manifest parsing method), mentioning the minimum versions of Cosmos/Airflow that support remote manifest locations
  • dependencies: probably worth having extras specific to each cloud provider, similar to Airflow, so we do not need to introduce new extras if we add further improvements per cloud provider

@tatiana tatiana added this to the Cosmos 1.6.0 milestone Jul 23, 2024
Collaborator

@tatiana left a comment

This is a long-awaited feature, @pankajkoti; thanks a lot for addressing all the feedback! I left three minor comments, and then we're good to merge.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Jul 24, 2024
@pankajkoti pankajkoti merged commit b8a14a3 into main Jul 24, 2024
46 of 62 checks passed
@pankajkoti pankajkoti deleted the 448-remote-manifest-support branch July 24, 2024 09:19
dwreeves pushed a commit to dwreeves/astronomer-cosmos that referenced this pull request Jul 31, 2024
…t Store (astronomer#1109)
Collaborator

dwreeves commented Aug 1, 2024

Hi, I just noticed this while testing some other code; this change caused my code to fail.

I don't think it is appropriate for Cosmos to create explicit dependencies on the AWS, GCP, and Azure providers, which is what this code does by eagerly loading the providers at the module level. This will be a disruption for a lot of users, and it also adds bloat to environments that genuinely don't need it.

Collaborator

dwreeves commented Aug 1, 2024

I've opened PR #1132 to address the issue raised in the comment above.

pankajkoti pushed a commit that referenced this pull request Aug 2, 2024
Making an update to #1109, which introduced module-level imports of
optional dependencies. This is inappropriate as it will break if the
user does not have them installed, and indeed the user really does not
need them installed if they are not relying on them directly.

This PR lazy-loads the imports so that it does not impact users who do
not need them.

In the upath library, `az:`, `adl:`, `abfs:` and `abfss:` are all valid schemes, although Airflow only references the latter three in its code (https://github.com/apache/airflow/blob/e3824eaaba7eada9a807f7a2f9f89d977a210e15/airflow/providers/microsoft/azure/fs/adls.py#L29), so `adl:`, `abfs:` and `abfss:` have also been added to the list of supported schemes.
dwreeves added a commit to dwreeves/astronomer-cosmos that referenced this pull request Aug 4, 2024
…#1109 (astronomer#1132)
tatiana pushed a commit that referenced this pull request Aug 14, 2024
pankajkoti added a commit that referenced this pull request Aug 20, 2024
New Features

* Add support for loading manifest from cloud stores using Airflow
Object Storage by @pankajkoti in #1109
* Cache ``package-lock.yml`` file by @pankajastro in #1086
* Support persisting the ``LoadMode.VIRTUALENV`` directory by @tatiana
in #1079
* Add support to store and fetch ``dbt ls`` cache in remote stores by
@pankajkoti in #1147
* Add default source nodes rendering by @arojasb3 in #1107
* Add Teradata ``ProfileMapping`` by @sc250072 in #1077

Enhancements

* Add ``DatabricksOauthProfileMapping`` profile by @CorsettiS in #1091
* Use ``dbt ls`` as the default parser when ``profile_config`` is
provided by @pankajastro in #1101
* Add task owner to dbt operators by @wornjs in #1082
* Extend Cosmos custom selector to support + when using paths and tags
by @mvictoria in #1150
* Simplify logging by @dwreeves in #1108

Bug fixes

* Fix Teradata ``ProfileMapping`` target invalid issue by @sc250072 in
#1088
* Fix empty tag in case of custom parser by @pankajastro in #1100
* Fix ``dbt deps`` of ``LoadMode.DBT_LS`` should use
``ProjectConfig.dbt_vars`` by @tatiana in #1114
* Fix import handling by lazy loading hooks introduced in PR #1109 by
@dwreeves in #1132
* Fix Airflow 2.10 regression and add Airflow 2.10 in test matrix by
@pankajastro in #1162

Docs

* Fix typo in azure-container-instance docs by @pankajastro in #1106
* Use Airflow trademark as it has been registered by @pankajastro in
#1105

Others

* Run some example DAGs in Kubernetes execution mode in CI by
@pankajastro in #1127
* Install requirements.txt by default during dev env spin up by
@CorsettiS in #1099
* Remove ``DbtGraph.current_version`` dead code by @tatiana in #1111
* Disable test for Airflow-2.5 and Python-3.11 combination in CI by
@pankajastro in #1124
* Pre-commit hook updates in #1074, #1113, #1125, #1144, #1154,  #1167

---------

Co-authored-by: Pankaj Koti <[email protected]>
Co-authored-by: Pankaj Singh <[email protected]>
Labels

area:config Related to configuration, like YAML files, environment variables, or executer configuration
area:docs Relating to documentation, changes, fixes, improvement
area:testing Related to testing, like unit tests, integration tests, etc
lgtm This PR has been approved by a maintainer
size:L This PR changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Support loading manifest from S3/GCS/Azure BlobStorage
3 participants