-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Soften buildstock.csv validation for ComStock compatibility #384
Conversation
I'm not exactly aware of the situation here and so may be completely misinterpreting what's going on here... But note that ResStock also has columns in the buildstock.csv that are not used for simulation and are only used for sampling. These are often called meta-parameters and they are still added to the options_lookup.tsv to prevent validation errors. (For example, HVAC Cooling Type doesn't provide direct arguments for an OpenStudio measure, rather it's only used as a dependency for downstream parameters in sampling.) Is it problematic for ComStock to likewise just add these meta-parameters to the options_lookup.tsv? Simply bypassing the validation can obviously mean that legitimate errors are missed. |
@rajeee I think this looks good with your proposed change. For ComStock, we'll figure out how to add checks later |
…heck_compatibility
Minimum allowed coverage is Generated by 🐒 cobertura-action against 1bcecac |
@nmerket @shorowit I took at look at implementing your suggestion of adding no-op options_lookup.tsv rows for the unused buildstock.csv columns. This would require adding ~70,000 additional rows to the options_lookup.tsv, ~60k of which are just the unique set of census tract IDs. I think this would make that file even more difficult to edit. So my preference at this point is to just skip the validation for ComStock. |
Would it make more sense to leave the validation on but issue a warning for each column that is not found? It may be 70k rows, but probably not that many columns. ResStock similarly has thousands of no-op rows in their options_lookup.tsv, so they might also be interested in removing them. I think it would be a one-line change from: buildstockbatch/buildstockbatch/base.py Lines 341 to 343 in 42c2cb1
to:
|
This would result in large number of false warnings which would drown other legitimate warnings (It's false warning if we expect to see them in correct options_lookup.tsv). It will also desensitize people from warnings. I am not a fan of this approach. Can we have a middle ground? Change the validation so that, if a TSV is missing in options_lookup.tsv we ignore it and give it a pass. But if it is present, all of the options in the TSV should be present in the options_lookup.tsv? |
@rajeee I think your suggestion makes sense: if a column is found, definitely useful to check that all of the column values are found in the options_lookup. For ComStock, this would at least catch the worst errors |
@rajeee Are you saying you want this?
That would prevent false warnings, but also legitimate issues, no? Maybe what's needed is a way for ResStock and ComStock to define which columns are allowed/expected to be missing. Could be hardcoded into BSB or defined in the yaml. If a column isn't found, and it's not on the list of approved columns that can be missing, then an error is still thrown. |
Yes. That's what I was suggesting. True, it will miss some legitimate issues. It's a compromise. We won't be losing a lot though. The TSVs we won't be validating are the ones that don't have any measure impact. They are basically just pass-through into the output. So, maybe it's not too bad to let the TSV themselves the be single source of truth on what those outputs are going to be. We might have to adjust the ResStock integrity check to allow for this. |
@shorowit there are 57 intentionally unused/no-op columns in the ComStock buildstock.csv that show up as warnings (see below). Most of these are intermediate steps in the sampling that are useful. If we made the change you suggested, then I can reduce the number of warnings dramatically by adding most of the column/value pairs to the options_lookup.tsv and would only need to add ~800 rows to do so. The only remaining warnings would be for the census tract, county, PUMA, and weather file name columns.
|
I guess I don't agree. This came up enough times that we added these error checks. It can easily come up when dealing with different branches/forks of ResStock with different sets of TSVs, and then throw in developers having to resolve complicated merge conflicts at times. These error checks catch problems up front and prevent obscure downstream error messages that only get seen once the simulations start. |
Can we get these downstream errors even for TSVs that don't pass any arguments to any measure? Those would be the only TSVs we won't be validating with the proposed approach. |
We could get errors if a TSV should pass arguments to a measure but doesn't. But if its arguments aren't used, then I'd say you are correct that we can't get downstream errors for those TSVs. Sorry, I didn't realize your proposed approach would only validate TSVs with arguments; I guess I don't fully understand the proposal. |
Looks like I spoke wrong on that one. Currently, ResStock considers F4 to be incorrect, but ComStock considers it to be correct. There are two proposals here. Proposal 2: (from Andrew originally) Keep status quo and skip validation for comstock. ComStock loses out in chance to catch incorrect TSVs early on, but ResStock gets to keep its convention of F4=incorrect and catch almost all cases of incorrect TSVs. Note that we currently do not have a way to catch E3, E4 in buildstockbatch. Though it is options_lookup error not a characteristics TSV error. So, if the shortcomings of proposal 1 is not acceptable for ResStock, maybe we could apply proposal 1 only to ComStock. That way, the validator for ResStock catches D3, D4, F3 and F4 and for ComStock it will catch D3 and D4 and miss out on F3 (F4 is correct for ComStock.) |
Proposal 3: Define which columns are allowed/expected to be missing for ResStock and ComStock. Could be |
Please do not hard-code columns into buildstockbatch. It is difficult for ComStock to keep up with buildstockbatch changes as-is, the more we can avoid having new BSB versions due to added/removed/changed buildstock.csv or options_lookup.tsv changes. |
Building on this proposal, instead of maintaining two different files (option_lookup.tsv and the proposed new text file) to keep track of required/not-required TSVs, can we introduce a wildcard "*" in options_lookup.tsv to refer to any option. So, option_lookup.tsv would look like:
This will only increase the number of rows in option_lookup.tsv slightly for ComStock (only the TSV name for the currently missing TSVs needs to be added instead of one row for each of its options). Maybe this is even more complicating things, but I kinda prefer this over maintaining one extra text file. Thoughts? |
@rajeee I'm happy with either your proposal of the no-op columns being defined with an * or listing out the no-op options for the columns. The main thing for ComStock is not throwing blocking errors when a column is found in the buildstock.csv but not in the options_lookup.tsv |
I like the idea of using wildcards, @rajeee! Seems like a nice compromise. |
Wildcards sound like a great solution. I also agree that we should avoid hardcoding things into buildstockbatch. |
Glad we reached an agreement with the wildcard approach!
You are planning to add the no-op TSV into options_lookup with * for ComStock after we implement this approach, right? If so, we can still throw blocking error when a column is found in buildstock.csv but not in options_lookup.tsv. Or, are you still considering to leave them out? |
Yeah, if the * becomes a valid option then I will list all the no-op columns in the buildstock.csv |
@asparke2 I added the wildcard feature. Please check. |
…heck_compatibility
if "*" in param_option_dict[column]: | ||
continue # skip validating options when wildcard is present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
buildstockbatch/schemas/v0.3.yaml
Outdated
type: str(required=True) | ||
type: enum('residential_quota', 'residential_quota_downselect', 'precomputed', 'commercial_sobol', required=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather not do this. When someone goes to add a sampler, they'll have to remember to update this. I'll add a validation function that checks that the sampler actually exists instead.
It looks like we're already checking and validating whether the sampler exists here. buildstockbatch/buildstockbatch/base.py Lines 309 to 315 in baf2eb6
I removed the redundant enumerations in the yaml schema. @asparke2 I think this one looks good. I can merge if it comes back happy from CI. Do you want to do any additional testing of ComStock with it before I merge? |
@nmerket @rajeee ComStock uses the string "NA" in the buildstock.csv and options_lookup.tsv. I believe that we never noticed this before because the measures that use "NA" interpret as a no-op, but now that all options are being validated, I think that changing I don't know if ResStock uses "NA" anywhere. |
@nmerket @rajeee the simulations are all failing because there are no valid options for the intentionally-unused parameters in the Intentionally-unused column in Error in
I think we will need to change this ComStock code and this ResStock code to know about the wildcard |
This also makes me wonder if we should change |
Soften buildstock.csv validation to throw warnings instead of errors for ComStock compatibility. The buildstock.csv file in ComStock contains some columns that are intermediate steps in the sampling process which are intentionally not used for simulation. We keep these intermediate sampling steps because they help explain the sampling process, and we join some of these into the results.csv data because they are useful to users of the dataset.