Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Rasters.sample #791

Merged
merged 31 commits into from
Oct 16, 2024
Merged

Add Rasters.sample #791

merged 31 commits into from
Oct 16, 2024

Conversation

tiemvanderdeure
Copy link
Contributor

@tiemvanderdeure tiemvanderdeure commented Oct 11, 2024

This PR adds a Rasters.sample function in a StatsBase extension, as suggested here: #771

The keywords correspond to extract keywords. Some of the internals are also re-used.

I would suggest not exporting this to avoid confusion with StatsBase.sample, which has a totally different API.

If skipmissing = true, Rasters.sample and extract will always return the same types - otherwise it depends on the element type of the Raster. I tried to avoid creating Union{Missing, ...} as much as possible.

Closes #771

@tiemvanderdeure
Copy link
Contributor Author

tiemvanderdeure commented Oct 11, 2024

Rasters.sample is about just as fast as extract.

What slows it down in some cases is the conversion that happens if skipmissing = true to get rid of a Union type.

Just something like

A = [missing 7.0f0; 2.0f0 missing]
ga = Raster(A, (X(1.0:1:2.0), Y(1.0:1:2.0)); missingval=missing) 
Rasters.sample(ga, 1_000_000, skipmissing = true)

takes 0.3 seconds because of this:
image

@tiemvanderdeure
Copy link
Contributor Author

Tests fail because of [5453] signal 11 (128): Segmentation fault. What a joy.

tiemvanderdeure and others added 3 commits October 11, 2024 16:13
Co-authored-by: Rafael Schouten <[email protected]>
Co-authored-by: Rafael Schouten <[email protected]>
ext/RastersStatsBaseExt/RastersStatsBaseExt.jl Outdated Show resolved Hide resolved
ext/RastersStatsBaseExt/sample.jl Show resolved Hide resolved
ext/RastersStatsBaseExt/sample.jl Outdated Show resolved Hide resolved
ext/RastersStatsBaseExt/sample.jl Outdated Show resolved Hide resolved
ext/RastersStatsBaseExt/sample.jl Outdated Show resolved Hide resolved
@rafaqz
Copy link
Owner

rafaqz commented Oct 11, 2024

What slows it down in some cases is the conversion that happens if skipmissing = true to get rid of a Union type.

Do we know the answer from the type? What do you think is stopping it propagating?

It could be your g::Type instead of Type{G} I commented on in the other PR. Lets workshop it over there rather than splitting the disscussion. We probably need a benchmark in #790 to show this problem

(The segmentation fault is of course GDAL)

@tiemvanderdeure
Copy link
Contributor Author

Sorry for all the mess with the rebases. I added an extra keyword to _rowtype to distinguish between whether index/geometry fields may be missing, and wether values extracted from the raster may be missing. The former is never the case in sample. So now we can completely reuse the same internal.

@tiemvanderdeure
Copy link
Contributor Author

tiemvanderdeure commented Oct 12, 2024

Somehow NamedTuple(x[idx])[K] is not type stable and I can't quite figure out why...

@tiemvanderdeure
Copy link
Contributor Author

Was going to say that looks dubious. Probably we should subset the stack with st[K] before we get to this method to reduce indexing. Then you can just do:

x[idx]

I tried this but it failed tests, because st[K] does not necessarily have all the dimensions of st...

@tiemvanderdeure
Copy link
Contributor Author

This should be type stable now in all cases.

I would still like to change the geometry keywords, which now is not very flexible. It could be nice to allow users to specify the type of the geometry column, and which dimensions should be used. So something like geometry = (Y=Y, X=X) would return a Y-X NamedTuple with data from the Y and X dimensions.

@tiemvanderdeure
Copy link
Contributor Author

Tests are still failing on a segmentation fault in resample 😭

@rafaqz
Copy link
Owner

rafaqz commented Oct 13, 2024

Ahh that's interesting about the dimensions. I guess we need to be consistent and return a table of the same length every time.

You can put the resample tests last to get around them for now. But we will need to deal with them before bumping a version.

@rafaqz
Copy link
Owner

rafaqz commented Oct 13, 2024

We can totally let users specify the geometry like that too.

@tiemvanderdeure
Copy link
Contributor Author

Ahh that's interesting about the dimensions. I guess we need to be consistent and return a table of the same length every time.

It's how we designed extract so I think it's good to be consistent. extract(stack, point; names) is not the same as extract(stack[names], points).

@rafaqz
Copy link
Owner

rafaqz commented Oct 14, 2024

Maybe we should put that exact line in the extract docs

Comment on lines 51 to 52
_getindex(::Type{T}, x::AbstractRasterStack{<:Any, NT}, points, idx) where {T, NT} =
RA._maybe_add_fields(T, NT(x[RA.commondims(idx, x)]), points[RA.commondims(idx, points)], val(idx))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicitly using the eltype of the stack like this was the only way I could find in which this would compile away. But it looks somehow ugly and I don't know if there's a better way?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like what? Is this comment on the wrong line?

@tiemvanderdeure
Copy link
Contributor Author

Maybe we should put that exact line in the extract docs

Maybe. This feature justifies the having the names keyword in the first place, but it probably has a limited number of use cases.

@tiemvanderdeure
Copy link
Contributor Author

I added the option to specify a geometry type, even though it now only works for Tuple and NamedTuple. For other types we can't just call convert, so it's not so easy to make it more generic.

broadcast(I -> _getindex(T, x, points, I), indices)

_getindex(::Type{T}, x::AbstractRasterStack{<:Any, NT}, points, idx) where {T, NT} =
RA._maybe_add_fields(T, NT(x[RA.commondims(idx, x)]), points[RA.commondims(idx, points)], val(idx))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's weird this NT trick is needed, maybe DD stack getindex stability is broken on 1.11?

Looks like something needs to be forced somewhere

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, nothing else worked. Even AbstractRasterStack{K} and then NamedTuple{K} wasn't type stable.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No that makes sense, it's the eltype that's not propagating.

We might need some more Base.assume_effects in DD indexing

@rafaqz
Copy link
Owner

rafaqz commented Oct 14, 2024

Is there a test for the NamedTuple geometry?

@tiemvanderdeure
Copy link
Contributor Author

Is there a test for the NamedTuple geometry?

Yes, although it's pretty rudimentary.

    @test typeof(first(
        Rasters.sample(StableRNG(123), test, 2, geometry = (X = X, Y = Y))
    ).geometry) <: NamedTuple{(:X, :Y)}

@rafaqz
Copy link
Owner

rafaqz commented Oct 14, 2024

Ok that's fine, I like the syntax of just passing the tuple. Does it work for symbols and constructed dims too?

Probably want to convert those values to constructed basedims for type stability, Types get lost in the tuple

@tiemvanderdeure
Copy link
Contributor Author

It just calls commondims, so I think constructed dims yes, symbols no. This is parsed and a _True() or _False(), a dimtuple, and a type are passed to _sample, so that should all be type stable.

NamedTuple types still allocate though, idk if/how to avoid that?

@rafaqz
Copy link
Owner

rafaqz commented Oct 14, 2024

That's what I mean. For NamedTuple you need to pass through a NamedTuple of constructed dimensions, not types or symbols. That can be type stable making NamedTuple points, the others can't be

DD always tries to convert to constructed dims asap for this reason. If it can inline or constprop to the point of construction it can be totally stable using types. Otherwise it's super bad

@tiemvanderdeure
Copy link
Contributor Author

I just added a dispatch so Rasters.sample(myraster) also works and returns just a NamedTuple. I don't have anything else to add here. The methods tests are before resample so if it segfaults again then tests have passed.

@rafaqz
Copy link
Owner

rafaqz commented Oct 16, 2024

Amazing let's merge

@rafaqz rafaqz merged commit 1f00143 into rafaqz:main Oct 16, 2024
1 of 2 checks passed
@rafaqz
Copy link
Owner

rafaqz commented Oct 16, 2024

Ohh maybe some links in the methods page of the docs would have helped

@tiemvanderdeure
Copy link
Contributor Author

True, I forgot about that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add random sampling functionality (like R's terra::spatSample())
3 participants