Soften buildstock.csv validation for ComStock compatibility #384

asparke2 · 2023-08-04T13:10:13Z

Soften buildstock.csv validation to throw warnings instead of errors for ComStock compatibility. The buildstock.csv file in ComStock contains some columns that are intermediate steps in the sampling process which are intentionally not used for simulation. We keep these intermediate sampling steps because they help explain the sampling process, and we join some of these into the results.csv data because they are useful to users of the dataset.

…tibility

shorowit · 2023-08-08T18:00:32Z

I'm not exactly aware of the situation here and so may be completely misinterpreting what's going on here... But note that ResStock also has columns in the buildstock.csv that are not used for simulation and are only used for sampling. These are often called meta-parameters and they are still added to the options_lookup.tsv to prevent validation errors. (For example, HVAC Cooling Type doesn't provide direct arguments for an OpenStudio measure, rather it's only used as a dependency for downstream parameters in sampling.) Is it problematic for ComStock to likewise just add these meta-parameters to the options_lookup.tsv? Simply bypassing the validation can obviously mean that legitimate errors are missed.

asparke2 · 2023-08-11T20:20:35Z

@rajeee I think this looks good with your proposed change. For ComStock, we'll figure out how to add checks later

…heck_compatibility

github-actions · 2023-08-16T20:46:41Z

File	Coverage
All files	`85%`	✅
base.py	`89%`	✅
eagle.py	`79%`	✅
exc.py	`57%`	✅
local.py	`50%`	✅
postprocessing.py	`84%`	✅
utils.py	`96%`	✅
sampler/base.py	`79%`	✅
sampler/downselect.py	`33%`	✅
sampler/precomputed.py	`93%`	✅
sampler/residential_quota.py	`61%`	✅
test/test_docker.py	`33%`	✅
test/test_validation.py	`97%`	✅
workflow_generator/base.py	`90%`	✅
workflow_generator/commercial.py	`53%`	✅
workflow_generator/residential_hpxml.py	`84%`	✅

Minimum allowed coverage is 33%

Generated by 🐒 cobertura-action against 1bcecac

nmerket · 2023-08-21T15:43:40Z

@asparke2 This looks fine to me. I would've preferred for that validation to be inside the sampler class for better organization, but I see why it wasn't. My only comment echoes @shorowit's above. Any thoughts about that?

asparke2 · 2023-08-30T12:34:34Z

@nmerket @shorowit I took at look at implementing your suggestion of adding no-op options_lookup.tsv rows for the unused buildstock.csv columns. This would require adding ~70,000 additional rows to the options_lookup.tsv, ~60k of which are just the unique set of census tract IDs. I think this would make that file even more difficult to edit. So my preference at this point is to just skip the validation for ComStock.

shorowit · 2023-08-30T14:56:38Z

Would it make more sense to leave the validation on but issue a warning for each column that is not found? It may be 70k rows, but probably not that many columns. ResStock similarly has thousands of no-op rows in their options_lookup.tsv, so they might also be interested in removing them.

I think it would be a one-line change from:

buildstockbatch/buildstockbatch/base.py

Lines 341 to 343 in 42c2cb1

    
           if column not in param_option_dict: 
        
               errors.append(f'Column {column} in buildstock_csv is not available in options_lookup.tsv') 
        
               continue

to:

            if column not in param_option_dict:
                logger.warning(f'Column {column} in buildstock_csv is not available in options_lookup.tsv')
                continue

rajeee · 2023-08-30T15:36:42Z

Would it make more sense to leave the validation on but issue a warning for each column that is not found? It may be 70k rows, but probably not that many columns. ResStock similarly has thousands of no-op rows in their options_lookup.tsv, so they might also be interested in removing them.

I think it would be a one-line change from:

buildstockbatch/buildstockbatch/base.py

Lines 341 to 343 in 42c2cb1

if column not in param_option_dict:

errors.append(f'Column {column} in buildstock_csv is not available in options_lookup.tsv')

continue

to:
            if column not in param_option_dict:
                logger.warning(f'Column {column} in buildstock_csv is not available in options_lookup.tsv')
                continue

This would result in large number of false warnings which would drown other legitimate warnings (It's false warning if we expect to see them in correct options_lookup.tsv). It will also desensitize people from warnings. I am not a fan of this approach.

Can we have a middle ground? Change the validation so that, if a TSV is missing in options_lookup.tsv we ignore it and give it a pass. But if it is present, all of the options in the TSV should be present in the options_lookup.tsv?

asparke2 · 2023-08-30T15:41:47Z

@rajeee I think your suggestion makes sense: if a column is found, definitely useful to check that all of the column values are found in the options_lookup. For ComStock, this would at least catch the worst errors

shorowit · 2023-08-30T15:43:51Z

@rajeee Are you saying you want this?

            if column not in param_option_dict:
                continue

That would prevent false warnings, but also legitimate issues, no?

Maybe what's needed is a way for ResStock and ComStock to define which columns are allowed/expected to be missing. Could be hardcoded into BSB or defined in the yaml. If a column isn't found, and it's not on the list of approved columns that can be missing, then an error is still thrown.

rajeee · 2023-08-30T15:50:22Z

@rajeee Are you saying you want this?
            if column not in param_option_dict:
                continue
That would prevent false warnings, but also legitimate issues, no?

Maybe what's needed is a way for ResStock and ComStock to define which columns are allowed/expected to be missing. Could be hardcoded into BSB or defined in the yaml. If a column isn't found, and it's not on the list of approved columns that can be missing, then an error is still thrown.

Yes. That's what I was suggesting. True, it will miss some legitimate issues. It's a compromise.

We won't be losing a lot though. The TSVs we won't be validating are the ones that don't have any measure impact. They are basically just pass-through into the output. So, maybe it's not too bad to let the TSV themselves the be single source of truth on what those outputs are going to be. We might have to adjust the ResStock integrity check to allow for this.

asparke2 · 2023-08-30T15:53:21Z

@shorowit there are 57 intentionally unused/no-op columns in the ComStock buildstock.csv that show up as warnings (see below). Most of these are intermediate steps in the sampling that are useful. If we made the change you suggested, then I can reduce the number of warnings dramatically by adding most of the column/value pairs to the options_lookup.tsv and would only need to add ~800 rows to do so. The only remaining warnings would be for the census tract, county, PUMA, and weather file name columns.

buildstock.csv column	Number of unique values in column
nhgis_tract_gisjoin	57,721
nhgis_county_gisjoin	3,049
resstock_county_id	3,023
nhgis_puma_gisjoin	2,351
resstock_puma_id	2,337
weather_file_2018	1,033
weather_file_TMY3	880
year_built	219
reeds_balancing_area	134
year_bin_of_last_hvac_replacement	83
state_id	51
state_abbreviation	51
state_name	51
year_bin_of_last_walls_replacement	42
year_bin_of_original_building_construction	42
year_bin_of_last_windows_replacement	42
year_bin_of_last_roof_replacement	42
year_bin_of_last_service_water_heating_replacement	36
year_bin_of_last_exterior_lighting_replacement	36
year_bin_of_last_interior_equipment_replacement	34
year_bin_of_last_interior_lighting_replacement	26
american_housing_survey_region	25
climate_zone_ashrae_2004	17
energy_code_in_force_during_last_walls_replacement	15
energy_code_in_force_during_last_windows_replacement	15
energy_code_followed_during_last_windows_replacement	15
energy_code_in_force_during_last_hvac_replacement	15
energy_code_in_force_during_original_building_construction	15
energy_code_in_force_during_last_roof_replacement	15
energy_code_in_force_during_last_exterior_lighting_replacement	14
plugload_sch_base_peak_ratio_type	13
energy_code_in_force_during_last_interior_equipment_replacement	13
energy_code_in_force_during_last_interior_lighting_replacement	13
energy_code_in_force_during_last_service_water_heating_replacement	13
ltg_sch_base_peak_ratio_type	13
building_shape	12
resstock_custom_region	11
census_division_name_recs	11
census_division_name	9
region	9
climate_zone_building_america	8
iso_region	8
ownership_status	5
purchase_input_responsibility	4
census_region_name	4
party_responsible_for_operation	4
energy_code_followed_during_last_interior_lighting_replacement	2
building_size_lighting_tech	2
energy_code_compliance_roof	1
energy_code_compliance_interior_equipment	1
energy_code_compliance_during_original_building_construction	1
energy_code_compliance_walls	1
energy_code_compliance_interior_lighting	1
energy_code_compliance_exterior_lighting	1
energy_code_compliance_service_water_heating	1
energy_code_compliance_windows	1
energy_code_compliance_hvac	1
Grand Total	71,592

shorowit · 2023-08-30T16:01:11Z

True, it will miss some legitimate issues. It's a compromise.

We won't be losing a lot though.

I guess I don't agree. This came up enough times that we added these error checks. It can easily come up when dealing with different branches/forks of ResStock with different sets of TSVs, and then throw in developers having to resolve complicated merge conflicts at times. These error checks catch problems up front and prevent obscure downstream error messages that only get seen once the simulations start.

rajeee · 2023-08-30T16:08:50Z

I guess I don't agree. This came up enough times that we added these error checks. It can easily come up when dealing with different branches/forks of ResStock with different sets of TSVs, and then throw in developers having to resolve complicated merge conflicts at times. These error checks catch problems up front and prevent obscure downstream error messages that only get seen once the simulations start.

Can we get these downstream errors even for TSVs that don't pass any arguments to any measure? Those would be the only TSVs we won't be validating with the proposed approach.
Regardless, I do agree that there is value in catching these errors up front (maybe someone modified one of these no-op TSV but introduced some typo), but am curious to the extent of impact not doing so has.

shorowit · 2023-08-30T16:58:23Z

Can we get these downstream errors even for TSVs that don't pass any arguments to any measure? Those would be the only TSVs we won't be validating with the proposed approach.

We could get errors if a TSV should pass arguments to a measure but doesn't. But if its arguments aren't used, then I'd say you are correct that we can't get downstream errors for those TSVs. Sorry, I didn't realize your proposed approach would only validate TSVs with arguments; I guess I don't fully understand the proposal.

rajeee · 2023-08-30T17:53:05Z

We could get errors if a TSV should pass arguments to a measure but doesn't. But if its arguments aren't used, then I'd say you are correct that we can't get downstream errors for those TSVs. Sorry, I didn't realize your proposed approach would only validate TSVs with arguments; I guess I don't fully understand the proposal.

Looks like I spoke wrong on that one.

Let's break it down.

Currently, ResStock considers F4 to be incorrect, but ComStock considers it to be correct.

There are two proposals here.
Proposal 1 (from me above): Change ResStock's interpretation of F4 to correct from incorrect. This won't cause downstream errors and we get to have much smaller options_lookup.tsv. This would also allow comstock to have some validation. I proposed doing that by simply ignoring all TSVs that are not present in options_lookup.tsv. This will result in incorrect TSVs in cell D3 and D4 to be still validated. But you are rightly noting that it will miss the incorrect TSV in F3 in the table above. I assumed that TSVs that need pass arguments would only be incorrect by not having all the rows (D3) or passing wrong arguments (E4) and didn't think about not being present at all (F3). I believe this is less likely mode of error, but still possible error nonetheless. So, it's a valid shortcoming of this approach.

Proposal 2: (from Andrew originally) Keep status quo and skip validation for comstock. ComStock loses out in chance to catch incorrect TSVs early on, but ResStock gets to keep its convention of F4=incorrect and catch almost all cases of incorrect TSVs.

Note that we currently do not have a way to catch E3, E4 in buildstockbatch. Though it is options_lookup error not a characteristics TSV error.

So, if the shortcomings of proposal 1 is not acceptable for ResStock, maybe we could apply proposal 1 only to ComStock. That way, the validator for ResStock catches D3, D4, F3 and F4 and for ComStock it will catch D3 and D4 and miss out on F3 (F4 is correct for ComStock.)

shorowit · 2023-08-30T19:14:04Z

Proposal 3: Define which columns are allowed/expected to be missing for ResStock and ComStock. Could be ~~hardcoded into BSB or~~ defined in the yaml or some other text file. If a column isn't found, and it's not on the list of approved columns that can be missing, then an error is thrown.

asparke2 · 2023-08-30T20:04:08Z

Please do not hard-code columns into buildstockbatch. It is difficult for ComStock to keep up with buildstockbatch changes as-is, the more we can avoid having new BSB versions due to added/removed/changed buildstock.csv or options_lookup.tsv changes.

rajeee · 2023-08-30T20:26:20Z

Proposal 3: Define which columns are allowed/expected to be missing for ResStock and ComStock. Could be ~~hardcoded into BSB or~~ defined in the yaml or some other text file. If a column isn't found, and it's not on the list of approved columns that can be missing, then an error is thrown.

Building on this proposal, instead of maintaining two different files (option_lookup.tsv and the proposed new text file) to keep track of required/not-required TSVs, can we introduce a wildcard "*" in options_lookup.tsv to refer to any option. So, option_lookup.tsv would look like:

required_char1, option_1, <measure_arguments>
required_char1, option_2, <measure_arguments>
...
not_required_char_1,*, <no measure arguments>
not_required_char_2,*, <no measure arguments>

This will only increase the number of rows in option_lookup.tsv slightly for ComStock (only the TSV name for the currently missing TSVs needs to be added instead of one row for each of its options).

Maybe this is even more complicating things, but I kinda prefer this over maintaining one extra text file. Thoughts?

asparke2 · 2023-08-30T20:46:53Z

@rajeee I'm happy with either your proposal of the no-op columns being defined with an * or listing out the no-op options for the columns. The main thing for ComStock is not throwing blocking errors when a column is found in the buildstock.csv but not in the options_lookup.tsv

shorowit · 2023-08-30T21:03:46Z

I like the idea of using wildcards, @rajeee! Seems like a nice compromise.

nmerket · 2023-08-30T21:10:30Z

Wildcards sound like a great solution. I also agree that we should avoid hardcoding things into buildstockbatch.

rajeee · 2023-08-30T22:32:44Z

Glad we reached an agreement with the wildcard approach!

@rajeee I'm happy with either your proposal of the no-op columns being defined with an * or listing out the no-op options for the columns. The main thing for ComStock is not throwing blocking errors when a column is found in the buildstock.csv but not in the options_lookup.tsv

You are planning to add the no-op TSV into options_lookup with * for ComStock after we implement this approach, right? If so, we can still throw blocking error when a column is found in buildstock.csv but not in options_lookup.tsv. Or, are you still considering to leave them out?

asparke2 · 2023-08-31T03:09:06Z

Glad we reached an agreement with the wildcard approach!

@rajeee I'm happy with either your proposal of the no-op columns being defined with an * or listing out the no-op options for the columns. The main thing for ComStock is not throwing blocking errors when a column is found in the buildstock.csv but not in the options_lookup.tsv

You are planning to add the no-op TSV into options_lookup with * for ComStock after we implement this approach, right? If so, we can still throw blocking error when a column is found in buildstock.csv but not in options_lookup.tsv. Or, are you still considering to leave them out?

Yeah, if the * becomes a valid option then I will list all the no-op columns in the buildstock.csv

rajeee · 2023-09-02T00:25:47Z

@asparke2 I added the wildcard feature. Please check.

…heck_compatibility

nmerket · 2023-10-10T17:49:57Z

buildstockbatch/base.py

+            if "*" in param_option_dict[column]:
+                continue  # skip validating options when wildcard is present


nmerket · 2023-10-10T17:51:13Z

buildstockbatch/schemas/v0.3.yaml

-  type: str(required=True)
+  type: enum('residential_quota', 'residential_quota_downselect', 'precomputed', 'commercial_sobol', required=True)


I'd rather not do this. When someone goes to add a sampler, they'll have to remember to update this. I'll add a validation function that checks that the sampler actually exists instead.

nmerket · 2023-10-10T17:57:04Z

It looks like we're already checking and validating whether the sampler exists here.

buildstockbatch/buildstockbatch/base.py

Lines 309 to 315 in baf2eb6

    
           def validate_sampler(project_file): 
        
               cfg = get_project_configuration(project_file) 
        
               sampler_name = cfg['sampler']['type'] 
        
               try: 
        
                   Sampler = BuildStockBatchBase.get_sampler_class(sampler_name) 
        
               except AttributeError: 
        
                   raise ValidationError(f'Sampler class `{sampler_name}` is not available.')

I removed the redundant enumerations in the yaml schema.

@asparke2 I think this one looks good. I can merge if it comes back happy from CI. Do you want to do any additional testing of ComStock with it before I merge?

asparke2 · 2023-10-13T12:26:10Z

@nmerket @rajeee ComStock uses the string "NA" in the buildstock.csv and options_lookup.tsv. I believe that we never noticed this before because the measures that use "NA" interpret as a no-op, but now that all options are being validated, I think that changing read_csv to use: na_values=list(default_na_values - {"None", "NA"}) is the solution.

I don't know if ResStock uses "NA" anywhere.

asparke2 · 2023-10-13T13:07:48Z

@nmerket @rajeee the simulations are all failing because there are no valid options for the intentionally-unused parameters in the buildstock.csv file.

Intentionally-unused column in buildstock.csv: american_housing_survey_region

Error in out.osw file:

"step_errors" : 
[
   "Could not find parameter 'american_housing_survey_region' and option 'CBSA Riverside-San Bernardino-Ontario, CA' in /lib/resources/options_lookup.tsv.",

I think we will need to change this ComStock code and this ResStock code to know about the wildcard * and ignore that parameter for this code to work correctly.

asparke2 · 2023-10-13T13:15:20Z

This also makes me wonder if we should change * to something more descriptive like parameter_not_used_in_model_generation so that people editing the options_lookup.tsv are clearer about what this does and also the code in ResStock and ComStock that's parsing this are more obvious.

Soften buildstock.csv validation to throw warnings for ComStock compa…

24309df

…tibility

asparke2 requested a review from rajeee August 4, 2023 13:10

asparke2 assigned rajeee Aug 4, 2023

Handle buildstock_csv validation only for residential

4cc21fc

Merge remote-tracking branch 'origin/develop' into fix/comstock_err_c…

601b23c

…heck_compatibility

rajeee requested review from nmerket and removed request for rajeee August 17, 2023 20:02

Merge branch 'develop' into fix/comstock_err_check_compatibility

42c2cb1

Introduce wildcard option in options_lookup.tsv

2e1e1b1

rajeee added 2 commits September 1, 2023 22:54

Update changelog and schema changes

6ce1c32

Remove duplicated change log

dbb7974

nmerket added this to the v2023.10.0 milestone Oct 3, 2023

Merge remote-tracking branch 'origin/develop' into fix/comstock_err_c…

3d1044f

…heck_compatibility

nmerket reviewed Oct 10, 2023

View reviewed changes

removing sampler class enums from yaml schema

baf2eb6

Allow "NA" values in buildstock.csv for ComStock

1bcecac

nmerket approved these changes Oct 16, 2023

View reviewed changes

nmerket merged commit d93d3c2 into develop Oct 16, 2023
6 checks passed

nmerket deleted the fix/comstock_err_check_compatibility branch October 16, 2023 19:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Soften buildstock.csv validation for ComStock compatibility #384

Soften buildstock.csv validation for ComStock compatibility #384

asparke2 commented Aug 4, 2023

shorowit commented Aug 8, 2023

asparke2 commented Aug 11, 2023

github-actions bot commented Aug 16, 2023 •

edited

Loading

nmerket commented Aug 21, 2023

asparke2 commented Aug 30, 2023

shorowit commented Aug 30, 2023

rajeee commented Aug 30, 2023 •

edited

Loading

asparke2 commented Aug 30, 2023

shorowit commented Aug 30, 2023

rajeee commented Aug 30, 2023

asparke2 commented Aug 30, 2023

shorowit commented Aug 30, 2023

rajeee commented Aug 30, 2023 •

edited

Loading

shorowit commented Aug 30, 2023

rajeee commented Aug 30, 2023 •

edited

Loading

shorowit commented Aug 30, 2023 •

edited

Loading

asparke2 commented Aug 30, 2023

rajeee commented Aug 30, 2023 •

edited

Loading

asparke2 commented Aug 30, 2023

shorowit commented Aug 30, 2023

nmerket commented Aug 30, 2023

rajeee commented Aug 30, 2023

asparke2 commented Aug 31, 2023 •

edited

Loading

rajeee commented Sep 2, 2023

nmerket Oct 10, 2023

nmerket Oct 10, 2023

nmerket commented Oct 10, 2023

asparke2 commented Oct 13, 2023

asparke2 commented Oct 13, 2023

asparke2 commented Oct 13, 2023

		if "*" in param_option_dict[column]:
		continue # skip validating options when wildcard is present

		type: str(required=True)
		type: enum('residential_quota', 'residential_quota_downselect', 'precomputed', 'commercial_sobol', required=True)

Soften buildstock.csv validation for ComStock compatibility #384

Soften buildstock.csv validation for ComStock compatibility #384

Conversation

asparke2 commented Aug 4, 2023

shorowit commented Aug 8, 2023

asparke2 commented Aug 11, 2023

github-actions bot commented Aug 16, 2023 • edited Loading

nmerket commented Aug 21, 2023

asparke2 commented Aug 30, 2023

shorowit commented Aug 30, 2023

rajeee commented Aug 30, 2023 • edited Loading

asparke2 commented Aug 30, 2023

shorowit commented Aug 30, 2023

rajeee commented Aug 30, 2023

asparke2 commented Aug 30, 2023

shorowit commented Aug 30, 2023

rajeee commented Aug 30, 2023 • edited Loading

shorowit commented Aug 30, 2023

rajeee commented Aug 30, 2023 • edited Loading

shorowit commented Aug 30, 2023 • edited Loading

asparke2 commented Aug 30, 2023

rajeee commented Aug 30, 2023 • edited Loading

asparke2 commented Aug 30, 2023

shorowit commented Aug 30, 2023

nmerket commented Aug 30, 2023

rajeee commented Aug 30, 2023

asparke2 commented Aug 31, 2023 • edited Loading

rajeee commented Sep 2, 2023

nmerket Oct 10, 2023

Choose a reason for hiding this comment

nmerket Oct 10, 2023

Choose a reason for hiding this comment

nmerket commented Oct 10, 2023

asparke2 commented Oct 13, 2023

asparke2 commented Oct 13, 2023

asparke2 commented Oct 13, 2023

github-actions bot commented Aug 16, 2023 •

edited

Loading

rajeee commented Aug 30, 2023 •

edited

Loading

rajeee commented Aug 30, 2023 •

edited

Loading

rajeee commented Aug 30, 2023 •

edited

Loading

shorowit commented Aug 30, 2023 •

edited

Loading

rajeee commented Aug 30, 2023 •

edited

Loading

asparke2 commented Aug 31, 2023 •

edited

Loading