
52 ngrams #55

Merged: 42 commits merged into dev on Jul 2, 2024

Conversation

@RFOxbury (Contributor) commented Jun 20, 2024


Description

This PR includes the scripts we used for sampling the EYP/similar occupations data and analysing ngram matches from that data, as well as the complete inference pipeline, find_job_quality.py.

Fixes #52, #49, #56, #58

Instructions for Reviewer

To test the code in this PR, you need to:

  • Glance over the scripts and README in analysis/mapping_evaluation/. I don't think you need to run these scripts, as they were an analysis from a specific point in time (sprint 7).
  • Run python dap_job_quality/pipeline/eyp/stratified_sample.py (it runs in test mode by default).
  • Run python dap_job_quality/pipeline/find_job_quality.py (it also runs in test mode by default).

Please pay special attention to:

  • Do the READMEs make sense? There's a lot of different stuff in this PR and I've tried to document it properly, but I've probably missed something.
  • find_job_quality.py: this is the most important script! Please also add suggestions for more sensible ways of doing things if you spot them. The script largely works with dataframes, but maybe it would be better to use Python data structures like dicts? I'm not sure what's most efficient.

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@RFOxbury RFOxbury marked this pull request as ready for review June 21, 2024 17:05
@lizgzil (Contributor) left a comment


This looks really great and find_job_quality.py ran and the results looked good! 🎉

I've left comments, but in summary:

➡️ A general comment: there is some repeated code, which I know is because of the iterative way of working on this, but for future clarity it would be nice to remove some of the repeats and make find_job_quality.py the core location for important functions.

  • Put split_text in utils/text_cleaning.py, then import it from that file into both find_job_quality and get_ngrams_and_matches.
  • In get_ngrams_and_matches.py, do from dap_job_quality.pipeline.find_job_quality import split_ngrams, extract_ngrams, match_to_lookup.

➡️ I think you could also add a main pipeline/README.md to explain what to run and what is outputted (you can just point to the dap_job_quality/pipeline/eyp/ngram_analysis/README.md where you explain the scripts in that folder).

➡️ I suggested a speed up for your embeddings.

➡️ dap_job_quality/pipeline/eyp/stratified_sample.py didn't produce the sample for me; that might just be because of the small sample size I used. I have suggested another way, but I may have misunderstood something!

➡️ In another PR (not for now) it would be great to refactor find_job_quality.py so that people can use it with the bare minimum of commands. For example, in the skills extractor, our functions let the user simply run:

from ojd_daps_skills.pipeline.extract_skills.extract_skills import ExtractSkills #import the module

es = ExtractSkills(config_name="extract_skills_toy", local=True) #instantiate with toy taxonomy configuration file

es.load() #load necessary models

job_adverts = [
    "The job involves communication skills and maths skills",
    "The job involves Excel skills. You will also need good presentation skills"
] #toy job advert examples

job_skills_matched = es.extract_skills(job_adverts) #match and extract skills to toy taxonomy

to extract skills. I think we can do this here too, but as it stands the user would need to add many lines of code as they aren't all enclosed in functions.
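To illustrate that suggested refactor, here is a minimal sketch of what a class-based wrapper around find_job_quality.py might look like. All names here (JobQuality, load, extract) are hypothetical placeholders mirroring the ExtractSkills pattern, not the PR's actual API:

```python
# Hypothetical sketch only: JobQuality, load() and extract() are illustrative
# names following the ExtractSkills pattern, not the real find_job_quality.py API.
class JobQuality:
    def __init__(self, config_name: str = "default"):
        self.config_name = config_name
        self._loaded = False

    def load(self):
        # The real pipeline would load the embedding model and ngram lookup here.
        self._loaded = True
        return self

    def extract(self, job_adverts):
        if not self._loaded:
            raise RuntimeError("call load() before extract()")
        # Placeholder: the real code would run ngram matching on each advert.
        return [{"advert": ad, "matches": []} for ad in job_adverts]
```

The user-facing flow is then three lines: instantiate, load() once, and call extract() on a list of adverts, with all the intermediate dataframe wrangling hidden inside the class.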

Review thread on dap_job_quality/pipeline/eyp/ngram_analysis/README.md (outdated, resolved)
Comment on lines +46 to +56:

splitter = StratifiedShuffleSplit(
    n_splits=1, test_size=args.sample_size, random_state=42
)

# Perform the stratified sampling
for train_index, test_index in splitter.split(
    all_job_ads, all_job_ads["stratify_col"]
):
    stratified_sample = all_job_ads.iloc[test_index]

stratified_sample.drop(columns=["stratify_col"], inplace=True)
Contributor:

This isn't working for me. I get the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/elizabethgallagher/miniconda3/envs/dap_job_quality/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 1841, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/Users/elizabethgallagher/miniconda3/envs/dap_job_quality/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 2258, in _iter_indices
    raise ValueError(
ValueError: The test_size = 100 should be greater or equal to the number of classes = 258

It also looks like the for loop only saves the final iteration's results into stratified_sample? I'm not used to using StratifiedShuffleSplit though, so maybe I've misunderstood something.

I was wondering whether the following would work instead? Since you have already provided the stratification by creating the stratify_col column, I'm not sure what StratifiedShuffleSplit is adding; I think you just need to group by it and sample from each group:

stratified_sample = all_job_ads.groupby('stratify_col').apply(lambda x: x.sample(min(args.sample_size, len(x))), include_groups=False).reset_index()

Author (RFOxbury):

This is a really good question and I had to check the answer. I think it's correct to use StratifiedShuffleSplit in this way: we've said that we want the test set to have size args.sample_size, and then in the for loop, we just take the data that has been earmarked for the test set.

With the code you've suggested, does the include_groups=False part mean that sample_size = the total size (not the size of each single group)?
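If it helps, my understanding (an assumption worth double-checking, not something established in this thread) is that the two approaches interpret the size differently: StratifiedShuffleSplit's test_size is the total sample size across all strata, while x.sample(...) inside a groupby samples per stratum, so the total is the sum over groups. A pure-Python sketch of the per-group behaviour:

```python
import random
from collections import defaultdict

def per_group_sample(rows, key, n_per_group, seed=42):
    """Sample up to n_per_group rows from each stratum.

    Mimics the groupby('stratify_col').apply(lambda x: x.sample(...)) idea:
    the sample size applies to each group, not to the whole dataset.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(n_per_group, len(members))))
    return sample
```

So with two strata of sizes 5 and 2 and n_per_group=3, this yields 3 + 2 = 5 rows, whereas StratifiedShuffleSplit with test_size=5 would split those 5 slots across strata in proportion to their sizes.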

Review thread on dap_job_quality/pipeline/ngrams/get_most_common_ngrams.py (outdated, resolved)
Comment on a new file (diff @@ -0,0 +1,74 @@, starting with import time):

Contributor:

I think this can be simplified and sped up if you use sentence_transformers:

import time
from typing import List

import torch
from sentence_transformers import SentenceTransformer

MAX_LENGTH = 81  # This is the 99th percentile of N tokens in job advert sentences :)

# Function to embed sentences
def embed_sentences(
    sentences: List[str],
    model_name: str = "jjzha/jobbert-base-cased",
    batch_size: int = 32,
    device: str = "cuda" if torch.cuda.is_available() else "cpu",
):
    start_time = time.time()
    jobbert_model = SentenceTransformer(model_name, device=device)
    jobbert_model.max_seq_length = MAX_LENGTH
    embeddings = jobbert_model.encode(sentences, batch_size=batch_size)

    elapsed_time = time.time() - start_time
    print(f"Batch size: {batch_size}, Time taken: {elapsed_time:.2f} seconds")

    return embeddings

I tested using your original function and this new one, and the embeddings look the same, apart from the sentence_transformers output showing more decimal places.

Original method: embed_sentences(jobs_df["sentences"].tolist()[0:10], JOBBERT, 64)
[Screenshot 2024-06-25 14:48:06]

New method: jobbert_model.encode(jobs_df["sentences"].tolist()[0:10], batch_size=64)
[Screenshot 2024-06-25 14:48:12]

The rest of your algorithm doesn't seem to mind whether the output is a tensor or a numpy array. The original method took 335.82 seconds to process 100 job adverts, and this suggested method took 164.43 seconds.

The final output from running find_job_quality.py (jq_df_filtered) on this sample also looks exactly the same:
[Screenshot 2024-06-25 14:59:17]

Author (RFOxbury):

This blows my mind! I thought you could only use SentenceTransformers with a sentence model (rather than a masking or classification one) 🤯 That's a pretty big time saving! Thanks Liz
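For anyone else surprised by this: my understanding (an assumption, not something stated in this thread) is that when SentenceTransformer is pointed at a plain masked-LM checkpoint, it attaches a pooling layer, typically mean pooling over token embeddings, to turn per-token outputs into a single sentence vector. A minimal pure-Python illustration of masked mean pooling:

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions masked out (mask == 0).

    token_embeddings: list of equal-length float vectors (one per token).
    attention_mask: list of 0/1 flags, 1 meaning the token is real (not padding).
    """
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    n_real = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            n_real += 1
            for i, value in enumerate(vec):
                sums[i] += value
    # Divide by the number of real tokens; padding does not affect the result.
    return [s / max(n_real, 1) for s in sums]
```

This is only a toy illustration of the pooling step; the real sentence_transformers implementation operates on batched tensors.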

@RFOxbury RFOxbury merged commit 6f03e2c into dev Jul 2, 2024
@RFOxbury RFOxbury deleted the 52-ngrams branch July 2, 2024 08:58
Successfully merging this pull request may close these issues: [JQ] calculating how much ngrams come up