Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What should predict.sdmTMB return when offset is left NULL? #274

Open
seananderson opened this issue Nov 3, 2023 · 1 comment
Open

What should predict.sdmTMB return when offset is left NULL? #274

seananderson opened this issue Nov 3, 2023 · 1 comment

Comments

@seananderson
Copy link
Member

I just realized that glm() and glmmTMB both always include the offset in the prediction and always return the prediction on the original data no matter what newdata is if the offset argument is used instead of the offset() formula. This seems kind of crazy to me, but I guess that's the standard.

library(sdmTMB)
dat <- subset(dogfish, catch_weight > 0)

m <- glm(catch_weight ~ 1, data = dat, family = Gamma("log"), offset = log(dat$area_swept))
p1 <- predict(m)
p2 <- predict(m, newdata = dat[1:2,])
#> Warning in predictor + offset: longer object length is not a multiple of
#> shorter object length

p1[1:10]
#>        1        4        5        6        7        9       10       11 
#> 4.863265 5.035115 4.798726 5.035115 4.923889 4.798726 5.035115 4.863265 
#>       12       13 
#> 4.923889 4.798726
p2[1:10]
#>  [1] 4.863265 5.035115 4.798726 5.035115 4.923889 4.798726 5.035115 4.863265
#>  [9] 4.923889 4.798726


m <- glmmTMB::glmmTMB(catch_weight ~ 1, data = dat, family = Gamma("log"), offset = log(dat$area_swept))
p1 <- predict(m)
p2 <- predict(m, newdata = dat[1:2,])

p1[1:10]
#>  [1] 4.863263 5.035113 4.798724 5.035113 4.923887 4.798724 5.035113 4.863263
#>  [9] 4.923887 4.798724
p2[1:10]
#>  [1] 4.863263 5.035113 4.798724 5.035113 4.923887 4.798724 5.035113 4.863263
#>  [9] 4.923887 4.798724

Created on 2023-11-03 with reprex v2.0.2

That changes if you put the offset in the formula, which sdmTMB does not currently support:

library(sdmTMB)
dat <- subset(dogfish, catch_weight > 0)

m <- glm(catch_weight ~ 1 + offset(area_swept), data = dat, family = Gamma("log"))
p1 <- predict(m)
nd <- dat[1:2,]
nd$area_swept <- 0
p2 <- predict(m, newdata = nd)

p1[1:10]
#>        1        4        5        6        7        9       10       11 
#> 4.970636 4.989926 4.964206 4.989926 4.977066 4.964206 4.989926 4.970636 
#>       12       13 
#> 4.977066 4.964206
p2
#>        1        4 
#> 4.867756 4.867756

m <- glmmTMB::glmmTMB(catch_weight ~ 1 + offset(area_swept), data = dat, family = Gamma("log"))
p1 <- predict(m)
p2 <- predict(m, newdata = nd)

p1[1:10]
#>  [1] 4.970636 4.989926 4.964206 4.989926 4.977066 4.964206 4.989926 4.970636
#>  [9] 4.977066 4.964206
p2
#> [1] 4.867756 4.867756

Created on 2023-11-03 with reprex v2.0.2

We need the ability to set the offset to 0 for index standardization, among other uses. It's obvious what to do when newdata and offset are specified in predict.sdmTMB(). But what about:

newdata = NULL, offset = NULL: predict on original data with offset = 0
newdata = df, offset = NULL: predict on df with offset = 0
newdata = NULL, offset = some_offset: predict on original data with offset = some_offset; this was wrong and was being returned as original data and original offset

I assume I should do the above. For one, people have code that depends on it. I never realized the difference in the formula vs. argument offset specification.

Just writing this out to document the thought process.

@Lewis-Barnett-NOAA
Copy link
Collaborator

Ah, this is timely as I was just trying to help a colleague that was having trouble estimating marginal effects with ggeffects. The error was indicating different offset lengths.

Including the offset in the formula always seems quite inelegant.

Speaking of offsets more generally, what is the best example of how different the results are with an effort offset vs modeling say cpue with 0 offset? I imagine it depends on the disparity in effort values? Or is the stats theory justification strong enough to indicate that this could substantially change important outputs or inference?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants