Update measures for Nov 2024 #135

Merged
lizgzil merged 14 commits into dev from update-run-measures
Feb 18, 2025

lizgzil commented Nov 15, 2024


Note: the current GJE uses data and plots created in this PR

Description

This PR updates the OJO analysis to use the extra job adverts from Nov 2023 to Nov 2024.

  • Read and deduplicate the new data
  • Add new flows for the data update, which extract green measures for the new batch of data only and merge them with the existing measures
  • Update READMEs and configs
  • Replace the aggregation scripts with a quicker method in create_aggregated_data.py
  • Add a new script that formats all outputs for the GJE download option (with data descriptions)
  • Adjust the data processing to handle the new format of the datasets
  • Update all plotting notebooks and Flourish-ready data outputs
  • Add a temporal analysis notebook
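
To illustrate the "extract for the new data only, then merge with the old" idea from the list above, here is a minimal sketch. It is not the code in this PR: the `id` and `description` fields, the `extract_measures` stub, and the rule that new results overwrite old ones on overlapping ids are all assumptions made for the example.

```python
def deduplicate(adverts):
    """Keep the first advert seen for each id (the 'id' field is hypothetical)."""
    seen = set()
    unique = []
    for advert in adverts:
        if advert["id"] not in seen:
            seen.add(advert["id"])
            unique.append(advert)
    return unique


def extract_measures(adverts):
    """Hypothetical stand-in for the real green measures extraction."""
    return {a["id"]: {"n_words": len(a["description"].split())} for a in adverts}


def merge_measures(old_measures, new_measures):
    """Combine two {job_id: measures} dicts; new values win on shared ids (an assumption)."""
    return {**old_measures, **new_measures}


# Made-up inputs: measures from the previous run plus a new batch with one duplicate.
old_measures = {"a": {"n_words": 3}, "b": {"n_words": 1}}
new_adverts = [
    {"id": "b", "description": "solar panel installer"},
    {"id": "c", "description": "retail assistant"},
    {"id": "c", "description": "retail assistant"},
]

unique_new = deduplicate(new_adverts)
combined = merge_measures(old_measures, extract_measures(unique_new))
```

Only the new batch goes through the slow extraction step, which is the reason for keeping the update flow separate from a full re-run.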

Fixes # (issue)

#134 #132

In order to test the code in this PR you need to ...

Please pay special attention to ...

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

lizgzil changed the title from "Create new update flow for industries" to "Update measures for Nov 2024" on Nov 18, 2024
…upation flow for update
…d add a readme
…ojo_green_measures
…early aggregation of data to create_aggregated_data
… public consumption, and add these to the readme
…g in Flourish
Comment on lines +128 to +150
job_desc_chunks = list(partition_all(chunk_size, ojo_jobs_data))

t0 = time.time()
for i, job_desc_chunk in tqdm(enumerate(job_desc_chunks)):
    ind_green_measures_dict = im.get_measures(job_desc_chunk)
    save_to_s3(
        BUCKET_NAME,
        ind_green_measures_dict,
        os.path.join(
            inds_output_folder,
            f"ojo_newest_industry_green_measures_production_{production}_interim/{i}.json",
        ),
    )

# Read them back in and save altogether
ind_measures_locs = get_s3_data_paths(
    BUCKET_NAME,
    os.path.join(
        inds_output_folder,
        f"ojo_newest_industry_green_measures_production_{production}_interim",
    ),
    file_types=["*.json"],
)

@crispy-wonton here we process the job adverts in batches of 5000 (which takes a while to run) and save each batch's output to an interim file. At the end we read all the interim files back in and save them together as a single output.
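
For anyone who wants to try the pattern without S3 access, here is a rough local sketch of the same idea: process in chunks, write one interim file per chunk, then read everything back and combine. It writes to a temporary directory rather than a bucket. `get_measures` is a hypothetical placeholder for `im.get_measures`, the `partition_all` helper is a standard-library stand-in for the one imported in the flow, and the advert fields are made up.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path


def partition_all(n, iterable):
    """Yield successive tuples of at most n items (stdlib stand-in)."""
    iterator = iter(iterable)
    while chunk := tuple(islice(iterator, n)):
        yield chunk


def get_measures(job_adverts):
    """Hypothetical placeholder for im.get_measures: one result per advert id."""
    return {a["id"]: {"n_chars": len(a["description"])} for a in job_adverts}


def process_in_chunks(job_adverts, interim_dir, chunk_size=5000):
    """Write one interim JSON file per chunk, then read them all back and merge."""
    interim_dir = Path(interim_dir)
    interim_dir.mkdir(parents=True, exist_ok=True)

    # Slow step: each chunk is saved as soon as it has been processed.
    for i, chunk in enumerate(partition_all(chunk_size, job_adverts)):
        (interim_dir / f"{i}.json").write_text(json.dumps(get_measures(chunk)))

    # Read the interim files back in and combine them into one dictionary.
    merged = {}
    for path in sorted(interim_dir.glob("*.json")):
        merged.update(json.loads(path.read_text()))
    return merged


with tempfile.TemporaryDirectory() as tmp:
    adverts = [{"id": str(i), "description": "x" * i} for i in range(12)]
    all_measures = process_in_chunks(adverts, Path(tmp) / "interim", chunk_size=5)
    n_interim_files = len(list((Path(tmp) / "interim").glob("*.json")))
```

A side benefit of the interim files is that the results of completed chunks are already saved if the run fails partway through. Resuming from the last saved chunk would need extra logic that this sketch does not include.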

lizgzil merged commit 9002981 into dev on Feb 18, 2025
1 check passed
lizgzil deleted the update-run-measures branch on February 18, 2025 11:16

1 participant