Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better Azure blob storage support #1842

Merged

Conversation

Tom-Newton
Copy link
Contributor

@Tom-Newton Tom-Newton commented Sep 19, 2023

TL;DR

Support Azure blob storage for storing metadata and raw data including structured datasets, without interrupting other standard Azure authentication. Also ensure storage_options are set consistently for all uses of fsspec.

Type

Its debatable. The original github issue was a bug for fixing Azure authentication but it could be argued that Azure blob storage support is a new feature.

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested - Not really sure what is expected but I have used these change to deploy real workflows.
  • Unit tests added - I added a bit of new coverage, e.g. initialising the AzureBlobFilesystem.
  • Code documentation added - There are code comments, but I think ideally there would be a page in the documentation for Azure.
  • Any pending items have an associated Issue - There aren't any

Complete description

Added a new component to DataConfig for Azure similar to the ones for S3 and GCS. This allows configuring the flytekit's storage account without disrupting other Azure auth that may be used by flyte task implementations. This is done by setting the environment variables FLYTE_AZURE_STORAGE_ACCOUNT_NAME, FLYTE_AZURE_STORAGE_ACCOUNT_KEY, FLYTE_AZURE_TENANT_ID, FLYTE_AZURE_CLIENT_ID, FLYTE_AZURE_CLIENT_SECRET. You can set these using a pod template so that authentication to your flyte storage account works across your whole flyte deployment, inrespective of what authentication each task uses.

Use DataConfig to construct fsspec storage options in data_persistence.py. We reuse the same function for fetchign storage_options elsewhere inlcuding for encode and decode of structured datasets. This ensures local, S3, GCS and Azure work wherever fsspec is used.

You should specify configmap.core.propeller.rawoutput-prefix to be in the format abfs://<container-name>/<path-within-container>.

Tracking Issue

Related discussion was on flyteorg/flyte#3962 but technically that issues was about workload identity which was mostly fixed by #1813

Follow-up issue

NA

@welcome
Copy link

welcome bot commented Sep 19, 2023

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if not, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Thomas Newton <[email protected]>
@kumare3
Copy link
Contributor

kumare3 commented Sep 21, 2023

cc @fiedlerNr9

@codecov
Copy link

codecov bot commented Sep 21, 2023

Codecov Report

Attention: 27 lines in your changes are missing coverage. Please review.

Comparison is base (1737048) 18.73% compared to head (72a440e) 54.74%.
Report is 27 commits behind head on master.

Additional details and impacted files
@@             Coverage Diff             @@
##           master    #1842       +/-   ##
===========================================
+ Coverage   18.73%   54.74%   +36.01%     
===========================================
  Files         332      293       -39     
  Lines       31398    22092     -9306     
  Branches     3082     2167      -915     
===========================================
+ Hits         5882    12094     +6212     
+ Misses      25432     9849    -15583     
- Partials       84      149       +65     
Files Coverage Δ
flytekit/configuration/__init__.py 73.42% <100.00%> (+1.91%) ⬆️
flytekit/configuration/internal.py 80.26% <100.00%> (-0.26%) ⬇️
...t-polars/flytekitplugins/polars/sd_transformers.py 100.00% <100.00%> (ø)
flytekit/types/structured/basic_dfs.py 37.96% <25.00%> (+37.96%) ⬆️
flytekit/core/data_persistence.py 38.88% <30.00%> (+4.29%) ⬆️

... and 359 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@Tom-Newton Tom-Newton force-pushed the tomnewton/add_azure_data_config branch from aea52be to 2d4f54b Compare September 23, 2023 01:06
kwargs["anon"] = False
return fsspec.filesystem(protocol, **kwargs) # type: ignore

# Preserve old behavior of returning None for file systems that don't have an explicit anonymous option.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the type hint on get_filesystem_for_path it looks like we assume get_filesystem never returns None so I thought probably best to delete this. If anyone thinks its important I'm happy to add it back.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wild-endeavor I think you wanted to delete this already? So this seems like a good idea

Signed-off-by: Thomas Newton <[email protected]>
@Tom-Newton Tom-Newton force-pushed the tomnewton/add_azure_data_config branch from 9a110b4 to b689b3c Compare September 25, 2023 13:55
Signed-off-by: Thomas Newton <[email protected]>
@Tom-Newton Tom-Newton force-pushed the tomnewton/add_azure_data_config branch from de3a74b to 5dd00ed Compare September 25, 2023 22:01
@Tom-Newton Tom-Newton marked this pull request as ready for review September 26, 2023 09:17
@Tom-Newton
Copy link
Contributor Author

Tom-Newton commented Sep 26, 2023

Hmm... it looks like a couple of my new tests are failing on the windows builds.

E           Expected: to_parquet('abfs://container/parquet_df/00000', coerce_timestamps=<ANY>, allow_truncated_timestamps=<ANY>, storage_options={'account_name': 'accountname_from_storage_options'})
E           Actual: to_parquet('abfs://container/parquet_df\\00000', coerce_timestamps='us', allow_truncated_timestamps=False, storage_options={'account_name': 'accountname_from_storage_options'})

Asserting on the path was not really the primary purpose of this test so I will probably make the assertion just on the storage_options. I suspect this would work because blob storage blob names can be arbitrary strings but normally forward slashes / are used, and various tools will expect / to be the directory seporator.

@fiedlerNr9
Copy link
Contributor

tested on a live cluster and works like expected - also with workload identities 👍

@Tom-Newton
Copy link
Contributor Author

There are 3 test suites that timeout out after 6 hours. It looks like they all failed in the Cache pip stage so I don't think these are related to my changes.

Copy link
Contributor

@wild-endeavor wild-endeavor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some nits

flytekit/core/data_persistence.py Outdated Show resolved Hide resolved
flytekit/configuration/internal.py Show resolved Hide resolved
flytekit/core/data_persistence.py Outdated Show resolved Hide resolved
Signed-off-by: Thomas Newton <[email protected]>
@wild-endeavor wild-endeavor merged commit 9b71dbf into flyteorg:master Sep 28, 2023
@welcome
Copy link

welcome bot commented Sep 28, 2023

Congrats on merging your first pull request! 🎉

Future-Outlier pushed a commit to Future-Outlier/flytekit that referenced this pull request Oct 3, 2023
Signed-off-by: Thomas Newton <[email protected]>
Signed-off-by: Future Outlier <[email protected]>
@Tom-Newton Tom-Newton deleted the tomnewton/add_azure_data_config branch November 30, 2023 15:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants