Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

107 scaling HN comparison to all LAs and refactoring of the code #107

Open
wants to merge 3 commits into
base: dev
Choose a base branch
from

Conversation

helloaidank
Copy link
Collaborator

@helloaidank helloaidank commented Jan 10, 2025


Description

This PR extends the comparison_of_hn_zones.py script to process all available Local Authorities (LAs), rather than just Liverpool. Additionally, the codebase has been refactored for better modularity, maintainability, and efficiency.

Key changes include:

Major Changes

Scaling to All LAs

  • The script now iterates through all LAs listed in hnz_config.py rather than being hardcoded to Liverpool.
  • Special handling for Greater Manchester and other grouped LAs.

Extracted Utility Functions:

  • Created log_save_utils.py for logging, path management, and file saving.
  • Created spatial_utils.py to handle spatial joins and CRS transformations.
  • Created hnz_comparison_analysis_utils.py for statistical calculations (e.g., MAE, suitability scores).
  • Created nesta_hp_suitability_utils.py to filter heat pump suitability data for the relevant LAs.
  • Moved configurable paths, parameters, and LAs to hnz_config.py.

Improvements to Efficiency and Code Structure

  • Used config-based file path management to avoid hardcoded paths.
  • Extracted file I/O and S3 operations into helper functions.

Outputs

Each LA now generates the following output files:

  • <la_name>_with_desnz_hn_lsoa.gpkg: Processed GeoDataFrame with LSOA spatial joins.
  • <la_name>_hp_suitability_lsoas.json: List of unique LSOA codes per LA.
  • <la_name>_hp_suitability_scores_with_desnz.parquet: Processed suitability scores with DESNZ pilot fraction and MAE.
  • script_output.log: Detailed logs on calculations and data processing.

Instructions for Reviewer

Testing the Script

To run the script and verify the outputs, use:

python asf_heat_pump_suitability/analysis/hn_zones/comparison_of_hn_zones.py --read_in_s3 --save_to_s3

This should output multiple data files in the relevant s3 bucket (s3://asf-heat-pump-suitability/evaluation/desnz_hn_zone_scores/) for all LAs.

Code Structure Review

Just have a look at the way I've refactored the code and make sure everything makes sense (utils and config done in the right manner, anything obvious missing in this refactoring).

Things to pay special attention to

Correctness of Outputs

  • Do the generated GPKG, JSON, and Parquet files contain expected data?
  • Are the MAE calculations and suitability scores consistent with previous results?
  • The plot_comparison_of_hn_zones.py should still work for the Liverpool outputs from this script, make sure that's the case.

Handling of Special Cases

  • For Greater Manchester, the script processes sub-LAs separately—please check if this works correctly. I think a secondary issue will come to the plotting of this region with the way the current plotting script is structured but I will come to that in the plotting PR.
  • LSOAs outside LA boundaries (but still in DESNZ HN zones) are not included for now—should they be? I guess ideally we would but I'm just wondering if at the moment we are happy enough with this validation of our model.

Thank you for taking the time to review this PR! Please do let me know if you have any questions.

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@helloaidank helloaidank changed the title scaling to all LAs and refactoring of the code 107 scaling to all LAs and refactoring of the code Jan 10, 2025
@helloaidank helloaidank changed the title 107 scaling to all LAs and refactoring of the code 107 scaling HN comparison to all LAs and refactoring of the code Jan 10, 2025
@helloaidank helloaidank marked this pull request as ready for review January 10, 2025 18:01
Copy link
Collaborator

@lizgzil lizgzil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @helloaidank from what I could see this looks great - thanks a lot! The script ran perfectly and the results are really encouraging, and I couldnt see any obvious bugs. Just a few suggestions.

# 6. After all LAs are processed, save the combined MAE data as CSV
mae_df = pl.DataFrame(la_mae_data)
csv_path = os.path.join(output_dir, "la_mae_data.csv")
mae_df.write_csv(csv_path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be good to save this to s3 too

Copy link
Collaborator

@crispy-wonton crispy-wonton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @helloaidank ,
Thanks for all this work, it looks good! I didn't test the code as Liz has done that. I didn't find any bugs or anything. I did leave some suggestions for moving functions to other existing files where they would fit in, plus a couple other minor suggestions.

I also think it might be useful to have these functions refactored so they can compare either the Nesta / conventional view scores to the HN zones. It would be interesting to see if the results are very different. This should definitely be in a separate PR and is a nice to have. Would you mind please adding it as an issue into the repo?

Thank you :D

@helloaidank
Copy link
Collaborator Author

Hi both @lizgzil and @crispy-wonton! Thank you both for reviewing the code , very useful suggestions, I've made the suggested changes and where I didn't, I've explained why not!

This took longer than I thought as it's clearly no longer morning 🤣

@helloaidank
Copy link
Collaborator Author

Hi @crispy-wonton and @lizgzil, reviving this PR, potentially it would be good to have a quick check if you're happy with the changes I've made and also explanations I've given. I've made some changes to some of the hp suitability util files and also implemented some more concise code.

Happy to discuss at stand-up as well.

Copy link
Collaborator

@crispy-wonton crispy-wonton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@helloaidank great work here - thanks for making the changes. Everything looking good and neat. I spotted a couple of minor changes to make, but all the logic looks good. I ran the following code as given in your PR description:

python asf_heat_pump_suitability/analysis/hn_zones/comparison_of_hn_zones.py --read_in_s3 --save_to_s3

It worked well and I could see it looping through LAs and producing outputs. I interrupted it after a few iterations to save time.

subfolder: str,
):
"""
Upload a local file to an S3 bucket, used for evaluation of our heat network suitability model using the DESNZ Heat Network pilot zones.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Upload a local file to an S3 bucket, used for evaluation of our heat network suitability model using the DESNZ Heat Network pilot zones.
Upload a local file to an S3 bucket.

Just to make the docstring more generic to match the function :)

local_file_path: str,
s3_bucket: str,
s3_key_dir: str,
save_to_s3: bool,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

save_to_s3 needs to be removed from here and the docstring

Comment on lines +48 to +51
s3_key_dir (str): S3 key (path) where the file should be uploaded.
save_to_s3 (bool): boolean which indicates whether to save or not the file to s3.
filename (str): The actual filename to store in S3.
subfolder (str): Subfolder within S3.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit confused by the diff between s3_key_dir and subfolder. Aren't these, plus file name, in their entirety referred to as keys?

Comment on lines +50 to +60
dict: A dictionary containing paths for LIVERPOOL_GPKG_PATH, LSOA_SHP_PATH, and NESTA_HP_SUITABILITY_PARQUET_PATH.
"""
paths = {}
if read_in_s3:
paths["LSOA_SHP_PATH"] = LSOA_SHP_PATH_S3
paths["NESTA_HP_SUITABILITY_PARQUET_PATH"] = NESTA_HPS_PARQUET_S3

else:
paths["LSOA_SHP_PATH"] = LSOA_SHP_PATH_LOCAL
paths["NESTA_HP_SUITABILITY_PARQUET_PATH"] = NESTA_HPS_PARQUET_LOCAL
return paths
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dict is missing the LIVERPOOL_GPKG_PATH. Not sure which of function or docstring needs updating.

Args:
gdf (gpd.GeoDataFrame): The GeoDataFrame to write.
output_dir (str): Local directory to save the GPKG.
filename_prefix (str): The prefix used for naming the GPKG file.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
filename_prefix (str): The prefix used for naming the GPKG file.
filename_prefix (str): The prefix used for naming the GPKG file. `filename_prefix` will be joined to `_with_desnz_hn_lsoa.gpkg` to generate full filename.

)

# Extract unique LSOA codes
unique_lsoa_codes = intersection_gdf["LSOA21CD"].dropna().unique().tolist()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any/a lot of rows that get dropped here because of drop na?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either way, it would be good to include a logging warning here to note how many rows are lost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants