Update measures for Nov 2024 (#135)
* Create new update flow for industries

* Update industry rerun information in readmes etc

* Update industry rerun information in readmes etc

* Create skills flow update

* remove saving embeddings

* Update readmes about the refresh, save things out to parquet, add occupation flow for update

* Reboot aggregation steps for new data refresh - delete old scripts and add a readme

* clean up create_aggregate_data and remove old functions from process_ojo_green_measures

* Rerun high level analysis notebook

* Add extra parts of the ESCO taxonomy to the name mapper, and add a yearly aggregation of data to create_aggregated_data

* Fix year-to-text issue, create a script to nicely format GJE data for public consumption, and add these to the readme

* Update plotting notebooks with creating datasets suitable for plotting in Flourish

* Remove betting shop managers from temporal change plots

* Change ojd_daps_skills version to version used locally
lizgzil authored Feb 18, 2025
1 parent c5a5f2f commit 9002981
Showing 27 changed files with 4,791 additions and 1,642 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,5 +1,6 @@
 .recipes/
 .cookiecutter/state/
+dap_prinz_green_jobs/notebooks/

 *.lock

49 changes: 35 additions & 14 deletions dap_prinz_green_jobs/analysis/ojo_analysis/README.md
@@ -4,38 +4,59 @@ This folder contains scripts to aggregate data at the SIC-, SOC- and region-level

### Skills formatting

To speed up the aggregation step we process the skills datasets into more manageable forms, creating tables with a single skill per row rather than all the skills per job advert. This is done by running:

```
python dap_prinz_green_jobs/analysis/ojo_analysis/process_full_skills_data.py
```
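
For intuition, the reshaping this step performs is similar in spirit to the sketch below (toy data and illustrative column names, not the script's actual interface):

```python
import pandas as pd

# Each row holds a job advert with a list of extracted skills; explode()
# turns this into one skill per row, which is much easier to aggregate.
adverts = pd.DataFrame(
    {
        "job_id": [1, 2],
        "skills": [["python", "teamwork"], ["welding"]],
    }
)
exploded = adverts.explode("skills").rename(columns={"skills": "skill_label"})
print(exploded)
```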

Going forward, information about the skills and the green-skill proportions is stored in three locations (a short loading sketch follows the list):

1. The skills extracted and mapped to ESCO (not just the green ones): `s3://prinz-green-jobs/outputs/data/ojo_application/deduplicated_sample/20241114/latest_update_20241114_skills.parquet` (columns `['id', 'skill_label', 'esco_label', 'esco_id']`). A smaller version of this file, with just the job advert and ESCO id columns, is in `s3://prinz-green-jobs/outputs/data/ojo_application/extracted_green_measures/20241118/ojo_all_skills_exploded.parquet`.
2. The green skills extracted and mapped to green ESCO: `outputs/data/ojo_application/extracted_green_measures/20241118/ojo_all_skills_green_measures_exploded_green.parquet` (columns `['job_id', 'skill_label', 'extracted_green_skill', 'extracted_green_skill_id', 'green_skill_preferred_name']`).
3. Information on the number of skills and the proportion of green skills: `outputs/data/ojo_application/extracted_green_measures/20241118/ojo_all_skills_green_measures_skill_metrics.parquet` (columns `['job_id', 'prop_green_with_hs', 'NUM_ORIG_ENTS', 'NUM_SPLIT_ENTS', 'num_all_skills_ojo', 'count_green_skills_no_hs', 'PROP_GREEN']`).
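
As a minimal sketch, these files could be read and joined for analysis like so (assuming `pandas` with `s3fs` and `pyarrow` installed, plus read access to the bucket; note the join key is `id` in the first file but `job_id` in the other two):

```python
import pandas as pd

PREFIX = (
    "s3://prinz-green-jobs/outputs/data/ojo_application/"
    "extracted_green_measures/20241118"
)

green_skills = pd.read_parquet(
    f"{PREFIX}/ojo_all_skills_green_measures_exploded_green.parquet"
)
metrics = pd.read_parquet(
    f"{PREFIX}/ojo_all_skills_green_measures_skill_metrics.parquet"
)

# Attach the per-advert skill metrics to every extracted green skill.
merged = green_skills.merge(metrics, on="job_id", how="left")
print(merged.columns.tolist())
```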

### Data aggregation

To aggregate OJO data with extracted green measures (as defined in `ojo_analysis.yaml`), run:

```
python dap_prinz_green_jobs/analysis/ojo_analysis/create_aggregated_data.py
```

to aggregate the data by SOC, SIC and ITL regions. This script draws on functions from `process_ojo_green_measures.py`.

This will also format the occupation-aggregated data into a form suitable for the Green Jobs Explorer website; these are very superficial changes, e.g. changing single to double quotation marks.
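
As a rough illustration of the kind of aggregation involved (hypothetical column names and toy data; the real logic lives in `create_aggregated_data.py` and `process_ojo_green_measures.py`):

```python
import pandas as pd

# One row per job advert, with its occupation code and green-skill proportion.
adverts = pd.DataFrame(
    {
        "soc_code": ["2136", "2136", "6131"],
        "prop_green_skills": [0.40, 0.20, 0.05],
    }
)

# Aggregate to one row per SOC code: advert counts and the mean proportion
# of green skills across that occupation's adverts.
by_soc = adverts.groupby("soc_code").agg(
    num_job_ads=("prop_green_skills", "size"),
    mean_prop_green_skills=("prop_green_skills", "mean"),
)
print(by_soc)
```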

### Finding similar occupations based on the skills asked for

In `create_aggregated_data.py` occupation similarities are also computed, using functions from `occupation_similarity.py`. A matrix of the proportions of all skills per occupation is created, and each pair of rows is compared using cosine similarity to find the occupations most alike in the skills they ask for. The output `occupation_aggregated_data_{DATE}_extra.csv` contains an additional column with the list of similar occupations.
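
A condensed sketch of that idea, with toy data (the real implementation is in `occupation_similarity.py`):

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Rows: occupations; columns: proportion of that occupation's adverts
# asking for each skill.
skill_props = pd.DataFrame(
    [[0.8, 0.1, 0.0], [0.7, 0.2, 0.1], [0.0, 0.1, 0.9]],
    index=["Data scientist", "Data engineer", "Roofer"],
    columns=["python", "teamwork", "manual labour"],
)

# Pairwise cosine similarity between the occupation rows.
sims = pd.DataFrame(
    cosine_similarity(skill_props.values),
    index=skill_props.index,
    columns=skill_props.index,
)

# For each occupation, the most similar other occupation.
most_similar = sims.apply(lambda row: row.drop(row.name).idxmax(), axis=1)
print(most_similar)
```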

### Final files

The previous scripts output many files to the `s3://prinz-green-jobs/outputs/data/ojo_application/extracted_green_measures/analysis/20241121/` folder; the most important for analysis are listed below (a loading example follows the list):

1. `occupation_aggregated_data_20241121_extra_gjeformat.csv`: The data which powers the Green Jobs Explorer website. This is the aggregated data per occupation (SOC_EXT), with occupations with fewer than 50 job adverts removed.
2. `industry_aggregated_data_20241121.csv`: The data aggregated by SIC.
3. `all_itl_aggregated_data_20241121.csv`: The data aggregated by each of ITL 1, 2 and 3.
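
For example, the occupation-level file could be loaded straight from S3 (assuming `pandas` and `s3fs`, and read access to the bucket):

```python
import pandas as pd

occupation_data = pd.read_csv(
    "s3://prinz-green-jobs/outputs/data/ojo_application/extracted_green_measures/"
    "analysis/20241121/occupation_aggregated_data_20241121_extra_gjeformat.csv"
)
print(occupation_data.shape)
```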

### Data for the Green Jobs Explorer download

Although the data that powers the GJE is produced in `create_aggregated_data.py`, there is an additional step to create a nicely formatted xlsx dataset with information sheets about the column names etc. This is for users to download.

This is created by running:

```
python dap_prinz_green_jobs/analysis/ojo_analysis/create_open_gje_data.py
```

It essentially renames columns, deletes some columns, and creates a data explanation sheet to go alongside the data.
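
A minimal sketch of writing a data sheet plus an explanation sheet to a single xlsx (hypothetical column names; assumes `pandas` with `openpyxl` installed):

```python
import pandas as pd

data = pd.DataFrame({"occupation": ["Roofer"], "prop_green_skills": [0.05]})
explanation = pd.DataFrame(
    {
        "column": ["occupation", "prop_green_skills"],
        "description": [
            "Occupation name (SOC_EXT)",
            "Mean proportion of green skills in the occupation's job adverts",
        ],
    }
)

# One workbook, two sheets: the data itself and what each column means.
with pd.ExcelWriter("occupation_aggregated_data_GJE.xlsx") as writer:
    data.to_excel(writer, sheet_name="data", index=False)
    explanation.to_excel(writer, sheet_name="column_descriptions", index=False)
```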

The outputs are saved to:

- `s3://nesta-open-data/green_jobs_explorer/occupation_aggregated_data_20241121_GJE.xlsx`
- `s3://nesta-open-data/green_jobs_explorer/industry_aggregated_data_20241121_GJE.xlsx`
- `s3://nesta-open-data/green_jobs_explorer/region_aggregated_data_20241121_GJE.xlsx`
125 changes: 0 additions & 125 deletions dap_prinz_green_jobs/analysis/ojo_analysis/aggregate_by_region.py

This file was deleted.

105 changes: 0 additions & 105 deletions dap_prinz_green_jobs/analysis/ojo_analysis/aggregate_by_sic.py

This file was deleted.

