Use fsspec as a general data fetch back end #102
tsjackson-noaa started this conversation in Ideas
I'm starting this discussion to upload my notes on fsspec, a Python library for accessing data on remote filesystems that was spun off from work done in Dask and Apache Arrow. Its main selling point is that it implements a single, unified interface for transferring data from a variety of remote filesystems. It's used by Intake to let you seamlessly load catalog query results into dask.
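To make the "unified interface" point concrete, here's a minimal sketch (the URL is a placeholder, not real data): the same `fsspec.open()` call works for local paths, HTTP, S3, etc., with the protocol prefix selecting the backend.

```python
import fsspec

# One call signature for every backend; the protocol prefix
# ("file://", "https://", "s3://", ...) picks the filesystem class.
with fsspec.open("https://example.com/data/sample.nc", "rb") as f:
    header = f.read(4)  # read the first bytes of the remote file
```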
I think the framework should continue to move toward using Intake for data queries (or rather, wrapping it as one of multiple possible query interfaces), so it makes sense to investigate fsspec as the corresponding way to fetch remote data. This becomes especially important if we want to implement a more efficient "dask-only" execution pipeline, in which data is handed off from the framework to PODs as in-memory xarray Datasets instead of being written to netCDF files that are then re-opened. This is one of the design advantages of OM4labs.
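For illustration, a hand-off like that could look roughly like the following. This is a sketch, not framework code: the bucket/path is hypothetical, and it assumes the s3fs package is installed for the `s3://` protocol.

```python
import fsspec
import xarray as xr

# Open the remote file object directly and give xarray an in-memory
# Dataset, skipping the download-to-netCDF-then-reopen round trip.
# (Bucket and path are placeholders; anon=True is an s3fs option
# for public buckets.)
of = fsspec.open("s3://some-bucket/model/ts.mon.nc", anon=True)
with of as f:
    # may need engine="h5netcdf" depending on the file's netCDF format
    ds = xr.open_dataset(f)
    print(ds.data_vars)
```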
Currently available backends (protocols) for fsspec are documented in the fsspec docs, both built-in implementations and known third-party ones. In particular, I think use of fsspec would provide the most straightforward path to running MDTF PODs on cloud-hosted data (S3, GCS, Azure storage), but that should be spun off into its own thread.
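You can also enumerate the protocols fsspec knows about at runtime via its registry; which backends are actually usable depends on which optional packages (s3fs, gcsfs, adlfs, ...) are installed.

```python
from fsspec.registry import known_implementations

# Maps protocol name -> implementing class (and, if not installed,
# the error message naming the package that provides it).
for proto in sorted(known_implementations):
    print(proto)
```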
A fringe benefit is that fsspec implements local caching of downloaded data from all sources ("for free," via protocol chaining). In the MDTF package, caching is currently implemented manually, in an ad-hoc (i.e., per-DataSource) way. fsspec also supports asynchronous reads/writes in some cases, which will probably be a main avenue for getting better performance out of the framework.
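As a sketch of the chaining syntax (the URL and cache directory are placeholders): prefixing a URL with `filecache::` transparently caches the remote file on local disk, with options for each protocol in the chain passed as keyword dicts.

```python
import fsspec

# "filecache::" chains a local on-disk cache in front of any remote
# protocol; a second read is served from cache_storage, not the network.
with fsspec.open(
    "filecache::https://example.com/data/sample.nc",
    filecache={"cache_storage": "/tmp/fsspec_cache"},
) as f:
    data = f.read()
```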