
add http-sync #1177 (Open)

martindurant wants to merge 5 commits into master
Conversation

@martindurant (Member)

@philippjfr this was for you

@philippjfr

This is great thanks. I'll try to test this asap. Is there anything you have to enable to switch to the sync implementation?

@martindurant (Member, Author)

You will need to either instantiate the class explicitly yourself, or call fsspec.register_implementation for both "http" and "https" so that it displaces the standard async version when going through the standard API, which is what something like pd.read_csv does.
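For illustration, the registration step might look like this (a sketch only: the import path `fsspec.implementations.http_sync.HTTPFileSystem` is an assumed location for this PR's class, not confirmed by the diff):

```python
import fsspec

# Hypothetical module path for the sync implementation added in this PR.
SYNC_CLS = "fsspec.implementations.http_sync.HTTPFileSystem"

# clobber=True replaces the default (async) registration for these protocols.
fsspec.register_implementation("http", SYNC_CLS, clobber=True)
fsspec.register_implementation("https", SYNC_CLS, clobber=True)

# From here on, standard APIs that resolve URLs through fsspec
# (e.g. pd.read_csv("https://...")) would pick up the sync class.
```

Registration by string path is lazy, so the module is only imported when a filesystem for that protocol is first requested.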

Mostly copied from async

Question: should async FS use sync/requests stuff for things that don't multiplex?
We would be able to share code.
martindurant marked this pull request as ready for review, March 11, 2023 20:56
@martindurant (Member, Author)

Would appreciate someone knowledgeable to chime in: how to test the pyodide functionality here?

Also, should this be a separate package, or should it instead share code with the normal HTTPFileSystem, with which it of course has a lot in common?

Last question: I implement my own version of requests on top of JS here, but there are others out there. Could we use one of those, so that the CPython and JS versions of this FS look identical? I suspect not...

@martindurant (Member, Author)

Considering merging this as-is, but experimental and unsupported, so that it can be included in a release for wider testing.

@lobis (Contributor), Nov 13, 2023

Hello @martindurant, I think this PR would be enough so that fsspec can be used to get http files using pyodide, right? Are there any plans to merge this PR in the near future? We would be very interested in this feature and can provide testing. Thanks!

@martindurant (Member, Author)

Yes, I was of the opinion that this PR was at least ready for serious testing. Unfortunately, I don't really know how to test it within CI - binary blocking transfers must happen in a webworker (but pyscript has made running python code in a worker smoother recently). There are probably unfinished features. Also, it's ugly to copy so much code, but perhaps that's not important.

There has been some renewed interest in getting this in, and I would dearly love to demo a pydata flow (e.g., intake-fsspec-pandas, or even fastparquet) in the browser. Note that https://github.com/koenvo/pyodide-http is probably better established than my requests shim, however - we should at least look at how they do things and find the sharp edges.

@wachsylon

@martindurant I recently used this http-sync filesystem successfully with pyodide. Is this an outdated approach, especially with zarr v3? If not, can we somehow make it usable with the reference filesystem, and create something similar for an s3 filesystem? That would enhance pyodide support directly. Thanks a lot!

@martindurant (Member, Author)

Is this an outdated approach, especially with zarr v3?

Zarr v3 is async internally and runs an event loop on a thread, in a way similar to how fsspec does for async FSs. This is definitely not compatible with pyodide! However, there is an async interface to zarr groups/arrays too, so it's possible that #960 would work, or wrapping the implementation in this PR with AsyncFileSystemWrapper, but I am not sure.

I don't believe the zarr team has any thoughts about pyodide, I would raise an issue on zarr-python.

@martindurant (Member, Author)

That pyodide URL does load the interface, but no data. Can it be filled in with a public (no HTTP header needed) URL just to prove the technology stack?

@wachsylon

Well, the idea is that one can open any openly-accessible native zarr URL with this app. I am not sure about the cross-reference setting... If it is not working for you, I can provide an example with a preload. In the meantime, this example URL should work.

@martindurant (Member, Author), Jan 25, 2025

Says

    {
      detail: "Not Found"
    }

@wachsylon

Yes, but it is a consolidated zarr dataset. E.g. URL/.zmetadata works.
Anyway, you specify that URL in the zarr text input and click open (without options).

@martindurant (Member, Author)

Yes, but it is a consolidated zarr dataset. E.g. URL/.zmetadata works.

OK, now I follow you - it works nicely!

@wachsylon, Jan 27, 2025

So what would you recommend to proceed?

  1. Wrap this implementation in an AsyncFileSystemWrapper. This sounds rather easy. Do you have an example?
  2. Use the pyodide HTTP FS/aiohttp shim from #960? I fear that this may not work. When I use the http-sync implementation in the chunk URLs of the reference filesystem, the reference filesystem still uses async methods to get those chunks. I guess that would also happen if I replaced it with a js-filesystem.

For both cases, I would be very unsure about how to set it up right or even implement it.
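For what it's worth, the reference-filesystem mechanics in question can be sketched against the in-memory backend (a sketch only: `memory://` stands in for the js/sync HTTP backend, and the refs dict is a made-up example mapping keys to inline bytes or `[url, offset, length]` chunk pointers):

```python
import fsspec

# Stand-in target store; in the real setup this would be remote HTTP data.
mem = fsspec.filesystem("memory")
mem.pipe_file("/chunks.bin", b"0123456789")

refs = {
    ".zattrs": b"{}",                           # inline content
    "data/0.0": ["memory://chunks.bin", 2, 4],  # [url, offset, length]
}

# The reference FS resolves chunk keys via the backend named by
# remote_protocol; swapping in a sync or js http backend changes only that.
fs = fsspec.filesystem("reference", fo=refs, remote_protocol="memory")
print(fs.cat("data/0.0"))  # b"2345"
```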

@martindurant (Member, Author)

I think in the long run, zarr v3's async API coupled with option 2) is the way to go. Sync fetching of data with potentially very many chunks will not be scalable and will suffer from sequential latency (although it's questionable how many bytes we should load into the browser anyway!). I haven't tried it, but it shouldn't be too hard, even with a reference FS layer over js-http (with asynchronous=True). The top-level functions triggered by user interaction will need to be async too.
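The sequential-latency point can be illustrated with a toy model (the `fetch` coroutine is hypothetical, standing in for one HTTP round trip; no real I/O happens):

```python
import asyncio

async def fetch(key, latency=0.02):
    # Stand-in for one HTTP round trip taking roughly `latency` seconds.
    await asyncio.sleep(latency)
    return key

async def main():
    keys = [f"data/0.{i}" for i in range(20)]
    # Concurrent: total wall time is ~one latency, not 20 latencies.
    # A sync filesystem must pay the 20 latencies one after another.
    return await asyncio.gather(*(fetch(k) for k in keys))

chunks = asyncio.run(main())
print(len(chunks))  # 20
```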

I still think that the sync implementation here is useful, though, and your zarr2 example shows this nicely. It will also be useful for any pydata API that has no all-async code path, which is most of them (e.g., pd.read_parquet, intake reading a YAML catalog).
