Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lazy=true seems unnecessarily lazy? #823

Open
asinghvi17 opened this issue Nov 18, 2024 · 7 comments
Open

lazy=true seems unnecessarily lazy? #823

asinghvi17 opened this issue Nov 18, 2024 · 7 comments

Comments

@asinghvi17
Copy link
Collaborator

Currently, lazy = true defaults to always creating a FileArray, which means that any read requires re-opening the dataset, and calling GDAL again if using e.g a VRT.

Consider this benchmark on a VRT of Copernicus DEM. dem is the raster, opened with lazy = true:

julia> @be dem[$(1:10), $(1:10)] seconds=2
Benchmark: 3 samples with 1 evaluation
        927.189 ms (107 allocs: 6.812 KiB)
        939.257 ms (107 allocs: 6.812 KiB)
        941.771 ms (107 allocs: 6.812 KiB)

That's pretty bad. Now, see what happens when I simply run the benchmark inside open(dem):

julia> open(dem) do d
       @be d[$(1:10), $(1:10)] seconds=2
       end
Benchmark: 877 samples with 1 evaluation
 min    2.165 ms (76 allocs: 4.125 KiB)
 median 2.177 ms (76 allocs: 4.125 KiB)
 mean   2.185 ms (76 allocs: 4.125 KiB)
 max    3.119 ms (76 allocs: 4.125 KiB)

Users will initially try lazy = true, see that the performance is horrible, and move on with their lives - when we could have 2 OOM better performance on reads!

Since we have finalizers in Julia, it seems like we could make lazy = true at least open the GDAL dataset, and lazy = false would eagerly read into memory as it currently does? Or is there something I am missing here?

@asinghvi17 asinghvi17 added bug Something isn't working performance needs thought and removed bug Something isn't working labels Nov 18, 2024
@felixcremer
Copy link
Contributor

This is a difference of use cases. If you have only a few files it is no problem to keep the file open in the background, but with a few hundred you get memory issues and at some point GDAL cant handle it anymore.

@asinghvi17
Copy link
Collaborator Author

Aha, didn't realize that case was also a thing. Too used to working with Kerchunk I guess :D

We probably need a lot more documentation around lazy, open and read then. open also doesn't work by itself (it needs an associated function or open(ras) do A) so allowing fast_ras = open(ras) would be great as well.

@rafaqz
Copy link
Owner

rafaqz commented Nov 18, 2024

Yeah, Rasters has a "no open files" policy.

I had a lot of issues hitting is open file limits early on. Like 400 and and that's it on Linux.

So they are really lazy.

We should add a Raster constructor with a function argument so it combines with open smoothly. But not sure I want to allow actually open files. With GDAL that means hanging C objects.

@asinghvi17
Copy link
Collaborator Author

asinghvi17 commented Nov 18, 2024

Isn't that what IDataset is for in ArchGDAL, for instance? It has an attached finalizer. Any ArchGDAL/GeoDataFrames-read geometry is in a similar position...

@rafaqz
Copy link
Owner

rafaqz commented Nov 18, 2024

Yep. It doesn't get around file limits... And it's relying on people using close. Currently Rasters doesn't make you think about things like that.

@asinghvi17
Copy link
Collaborator Author

asinghvi17 commented Nov 18, 2024

relying on people using close

sorry to belabour this point - but isn't the point of a finalizer that you don't need to use close? The first GC sweep after the object goes out of scope will do it for you. For interactive REPL use it would leave a hanging pointer, but then something was always going to do that, maybe it was your geometries or something else.

I get your point about file limits though. Maybe this is something to do only if lazy = false, or maybe we can have some other keyword that controls this...

@rafaqz
Copy link
Owner

rafaqz commented Nov 18, 2024

You don't need to use close if you don't care about open files I guess.

We could add open that just gives you the GDAL array wrapped in a Raster...

But what's wrong with using an open closure?

I never find myself limited by this. I don't really get the use case.

(ArchGDAL is also designed to work like this everywhere and prefer closures where possible)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants