Use fsspec as a general data fetch back end #102
tsjackson-noaa started this conversation in Ideas
I'm starting this discussion to upload my notes on fsspec, a Python library for accessing data on remote filesystems that was spun off from work done in Dask and Apache Arrow. Its main selling point is that it implements a single, unified interface for transferring data from a variety of remote filesystems. It's used by Intake to let you seamlessly load catalog query results into dask.
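To make the "unified interface" point concrete, here's a minimal sketch (the URL is a placeholder, not real data): the same `fsspec.open()` call works for local paths, HTTP, S3, etc., with the protocol prefix selecting the backend.

```python
import fsspec

# One call signature for every backend; the protocol prefix
# ("file://", "https://", "s3://", ...) picks the filesystem class.
with fsspec.open("https://example.com/data/sample.nc", "rb") as f:
    header = f.read(4)  # read the first bytes of the remote file
```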
I think the framework should continue to move toward using Intake for data queries (or rather, wrapping it as one of multiple possible query interfaces), so it makes sense to investigate fsspec as the corresponding way to fetch remote data. This becomes especially important if we want to implement a more efficient "dask-only" execution pipeline, in which data is handed off from the framework to PODs as in-memory xarray Datasets instead of being written to netCDF files that are then re-opened. This is one of the design advantages of OM4labs.
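For illustration, a hand-off like that could look roughly like the following. This is a sketch, not framework code: the bucket/path is hypothetical, and it assumes the s3fs package is installed for the `s3://` protocol.

```python
import fsspec
import xarray as xr

# Open the remote file object directly and give xarray an in-memory
# Dataset, skipping the download-to-netCDF-then-reopen round trip.
# (Bucket and path are placeholders; anon=True is an s3fs option
# for public buckets.)
of = fsspec.open("s3://some-bucket/model/ts.mon.nc", anon=True)
with of as f:
    # may need engine="h5netcdf" depending on the file's netCDF format
    ds = xr.open_dataset(f)
    print(ds.data_vars)
```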
Currently available backends (protocols) for fsspec are documented in the fsspec docs, both built-in implementations and known third-party ones. In particular, I think use of fsspec would provide the most straightforward path to running MDTF PODs on cloud-hosted data (S3, GCS, Azure storage), but that should be spun off into its own thread.
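You can also enumerate the protocols fsspec knows about at runtime via its registry; which backends are actually usable depends on which optional packages (s3fs, gcsfs, adlfs, ...) are installed.

```python
from fsspec.registry import known_implementations

# Maps protocol name -> implementing class (and, if not installed,
# the error message naming the package that provides it).
for proto in sorted(known_implementations):
    print(proto)
```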
A fringe benefit is that fsspec implements local caching of downloaded data from all sources ("for free," via protocol chaining). In the MDTF package, caching is currently implemented manually, in an ad-hoc (i.e., per-DataSource) way. fsspec also supports asynchronous reads/writes in some cases, which will probably be a main avenue for getting better performance out of the framework.
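As a sketch of the chaining syntax (the URL and cache directory are placeholders): prefixing a URL with `filecache::` transparently caches the remote file on local disk, with options for each protocol in the chain passed as keyword dicts.

```python
import fsspec

# "filecache::" chains a local on-disk cache in front of any remote
# protocol; a second read is served from cache_storage, not the network.
with fsspec.open(
    "filecache::https://example.com/data/sample.nc",
    filecache={"cache_storage": "/tmp/fsspec_cache"},
) as f:
    data = f.read()
```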