Why is imputeBadAgeModel fitted using bad age data subset? #48

Open
CeresBarros opened this issue May 12, 2020 · 7 comments

@CeresBarros
Member

In LandR::makeAndCleanInitialCohortData, used in Biomass_borealDataPrep, why is the model that imputes bad ages being fit with the data subset that has the bad ages, instead of the data subset that has good ages?

outAge <- Cache(statsModel, modelFn = imputeBadAgeModel,
                    uniqueEcoregionGroups = .sortDotsUnderscoreFirst(as.character(unique(cohortDataMissingAgeUnique$initialEcoregionCode))),
                    .specialData = cohortDataMissingAgeUnique,
                    omitArgs = ".specialData")
CeresBarros added the "question" (Further information is requested) label on May 12, 2020
@CeresBarros
Member Author

CeresBarros commented May 12, 2020

In addition, the metadata for P(sim)$imputeBadAgeModel states:
"Model and formula used for imputing ages that are either missing or do not match well with Biomass or Cover. Specifically, if Biomass or Cover is 0, but age is not, then age will be imputed. Similarly, if Age is 0 and either Biomass or Cover is not, then age will be imputed."

However, the subsetting of "bad age" data in LandR::makeAndCleanInitialCohortData uses:

cohortDataMissingAge <- cohortData[, hasBadAge :=
                                     #(age == 0 & cover > 0)#| # ok because cover can be >0 with biomass = 0
                                     (age > 0 & cover == 0) |
                                     is.na(age) #|
                                     #(B > 0 & age == 0) |
                                     #(B == 0 & age > 0)
][hasBadAge == TRUE]#, by = "pixelIndex"]

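So it seems to me that "Similarly, if Age is 0 and either Biomass or Cover is not, then age will be imputed" is not accurate: the (age == 0 & cover > 0) and (B > 0 & age == 0) conditions are commented out, so only (age > 0 & cover == 0) and missing (NA) ages are flagged.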
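To make the mismatch concrete, here is a minimal sketch on a toy table. The data values and the column names hasBadAgeCode and hasBadAgeDoc are invented for illustration; hasBadAgeDoc is only my reading of the metadata text, not code from LandR.

```r
library(data.table)

# Toy cohort table; all values are invented for illustration.
cohortData <- data.table(
  pixelIndex = 1:6,
  age   = c(0, 40, NA, 0, 60, 25),
  cover = c(30, 0, 20, 0, 50, 10),
  B     = c(0, 0, 100, 0, 2000, 0)
)

# Rule as currently coded: age > 0 with zero cover, or missing age.
cohortData[, hasBadAgeCode := (age > 0 & cover == 0) | is.na(age)]

# Rule as the metadata text reads (an assumed interpretation):
# age is missing, or age > 0 while biomass or cover is 0,
# or age == 0 while biomass or cover is > 0.
cohortData[, hasBadAgeDoc := is.na(age) |
             (age > 0 & (B == 0 | cover == 0)) |
             (age == 0 & (B > 0 | cover > 0))]

# Pixels where the two rules disagree.
disagree <- cohortData[hasBadAgeCode != hasBadAgeDoc, pixelIndex]
print(disagree)
```

In this toy table the two rules disagree on the pixel with age 0 and non-zero cover, and on the pixel with non-zero age and cover but zero biomass.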

@achubaty
Collaborator

@CeresBarros, has this been resolved by the various changes over the last few months?

@CeresBarros
Member Author

No :/

@CeresBarros
Member Author

P(sim)$imputeBadAgeModel now agrees with the code in LandR::makeAndCleanInitialCohortData:
Model and formula used for imputing ages that are either missing or do not match well with biomass or cover. Specifically, if biomass or cover is 0, but age is not, or if age is missing (NA), then age will be imputed.
Note that age is zeroed where total biomass is 0 in LandR:::.createCohortData, which is run before makeAndCleanInitialCohortData

However, I'm still puzzled by the age data that is used to fit the model.

@CeresBarros
Member Author

CeresBarros commented Oct 20, 2022

Digging deeper:
Before the model is fitted, the cohortDataMissingAgeUnique object is stripped of all columns and rows except the unique combinations of "initialEcoregionCode" and "speciesCode":

cohortDataMissingAgeUnique <- unique(
  cohortDataMissingAge,
  by = c("initialEcoregionCode", "speciesCode")
)[
  , .(initialEcoregionCode, speciesCode)
]

After this, the data is added back to these combos, from the original cohortData:

cohortDataMissingAgeUnique <- cohortDataMissingAgeUnique[
  cohortData,
  on = c("initialEcoregionCode", "speciesCode"), nomatch = 0
]
cohortDataMissingAgeUnique <- cohortDataMissingAgeUnique[!is.na(cohortDataMissingAgeUnique$age)]

However, since the "bad" age rows were never removed from cohortData, the join adds them back (except for NA ages, which are excluded by the last line above). So it seems to me that the bad ages with (age > 0 & cover == 0) are being used to fit the model that will later impute/overwrite these same ages.
@eliotmcintire, since you wrote this, I guess you're the best person to ask: is there a reason why it is done like this? Were there maybe not enough data points per "initialEcoregionCode" and "speciesCode" combo if the bad ages were excluded from fitting?
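A small reproduction of the join on toy data, in case it helps the discussion. The table values and the object names combos, asCoded and goodOnly are invented for illustration; the goodOnly variant is just one possible alternative, not something LandR does, and it may leave too few rows per combo, which is the open question.

```r
library(data.table)

# Toy cohort table; all values are invented for illustration.
cohortData <- data.table(
  initialEcoregionCode = c("e1", "e1", "e1", "e1", "e2"),
  speciesCode          = rep("Pice_mar", 5),
  age   = c(NA, 80, 50, 120, 30),
  cover = c(20, 0, 40, 60, 10)
)

# Flag bad ages with the rule currently used in the code.
cohortData[, hasBadAge := (age > 0 & cover == 0) | is.na(age)]
cohortDataMissingAge <- cohortData[hasBadAge == TRUE]

# Keep only the unique ecoregion/species combos that have bad ages.
combos <- unique(
  cohortDataMissingAge,
  by = c("initialEcoregionCode", "speciesCode")
)[, .(initialEcoregionCode, speciesCode)]

# Join back to the full cohortData, as in the snippets above:
# the (age > 0 & cover == 0) row returns, only NA ages are dropped.
asCoded <- combos[cohortData,
                  on = c("initialEcoregionCode", "speciesCode"), nomatch = 0]
asCoded <- asCoded[!is.na(age)]

# Hypothetical alternative: join back to good-age rows only.
goodOnly <- combos[cohortData[hasBadAge == FALSE],
                   on = c("initialEcoregionCode", "speciesCode"), nomatch = 0]

print(asCoded)
print(goodOnly)
```

Here asCoded still contains the row with age 80 and cover 0, whereas goodOnly keeps only the rows with ages 50 and 120.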

@eliotmcintire
Collaborator

eliotmcintire commented Oct 21, 2022 via email

@CeresBarros
Member Author

No worries. We'll have to revisit it soon then and make a decision (with comments ;) ).
