
52 ngrams #55

Merged: 42 commits merged into dev on Jul 2, 2024

Conversation

@RFOxbury (Contributor) commented Jun 20, 2024


Description

This PR includes the scripts we used for sampling the EYP/similar occupations data and analysing ngram matches from that data, as well as the complete inference pipeline, find_job_quality.py.

Fixes #52, #49, #56, #58

Instructions for Reviewer

To test the code in this PR, you need to:

  • Glance over the scripts and README in analysis/mapping_evaluation/. I don't think you need to run these scripts, as they were an analysis from a specific point in time (sprint 7).
  • Run python dap_job_quality/pipeline/eyp/stratified_sample.py (it runs in test mode by default).
  • Run python dap_job_quality/pipeline/find_job_quality.py (it also runs in test mode by default).

Please pay special attention to:

  • Do the READMEs make sense? There's a lot of different stuff in this PR and I've tried to document it properly, but I've probably missed something.
  • find_job_quality.py: this is the most important script! Please also add suggestions for more sensible ways of doing things if you spot them. The script largely works with dataframes, but maybe it would be better to use Python data structures like dicts? I'm not sure what's most efficient.

Checklist:

  • I have refactored my code out from notebooks/
  • I have checked the code runs
  • I have tested the code
  • I have run pre-commit and addressed any issues not automatically fixed
  • I have merged any new changes from dev
  • I have documented the code
    • Major functions have docstrings
    • Appropriate information has been added to READMEs
  • I have explained this PR above
  • I have requested a code review

@RFOxbury RFOxbury marked this pull request as ready for review June 21, 2024 17:05
@lizgzil (Contributor) left a comment


This looks really great and find_job_quality.py ran and the results looked good! 🎉

I've left comments, but in summary:

➡️ A general comment: there is some repeated code, which I know is because of the iterative way of working on this, but for future clarity it would be nice to remove some of the repeats and make find_job_quality.py the core location for important functions.

  • Put split_text in utils/text_cleaning.py, then import it from that file into both find_job_quality and get_ngrams_and_matches.
  • In get_ngrams_and_matches.py, do from dap_job_quality.pipeline.find_job_quality import split_ngrams, extract_ngrams, match_to_lookup.

➡️ I think you could also add a main pipeline/README.md to explain what to run and what is outputted (you can just point to the dap_job_quality/pipeline/eyp/ngram_analysis/README.md where you explain the scripts in that folder).

➡️ I suggested a speed up for your embeddings.

➡️ dap_job_quality/pipeline/eyp/stratified_sample.py didn't produce the sample for me; that might just be because of the small sample size I used. I have suggested another way, but I may have misunderstood something!

➡️ In another PR (not for now) it would be great to refactor find_job_quality.py so that people can use it with the bare minimum of commands. For example, in the skills extractor, our functions let the user simply run:

from ojd_daps_skills.pipeline.extract_skills.extract_skills import ExtractSkills #import the module

es = ExtractSkills(config_name="extract_skills_toy", local=True) #instantiate with toy taxonomy configuration file

es.load() #load necessary models

job_adverts = [
    "The job involves communication skills and maths skills",
    "The job involves Excel skills. You will also need good presentation skills"
] #toy job advert examples

job_skills_matched = es.extract_skills(job_adverts) #match and extract skills to toy taxonomy

to extract skills. I think we can do this here too, but as it stands the user would need to add many lines of code as they aren't all enclosed in functions.
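To illustrate that suggested refactor, here is a minimal sketch of what a class-based wrapper around find_job_quality.py might look like. All names here (JobQuality, load, extract) are hypothetical placeholders mirroring the ExtractSkills pattern, not the PR's actual API:

```python
# Hypothetical sketch only: JobQuality, load() and extract() are illustrative
# names following the ExtractSkills pattern, not the real find_job_quality.py API.
class JobQuality:
    def __init__(self, config_name: str = "default"):
        self.config_name = config_name
        self._loaded = False

    def load(self):
        # The real pipeline would load the embedding model and ngram lookup here.
        self._loaded = True
        return self

    def extract(self, job_adverts):
        if not self._loaded:
            raise RuntimeError("call load() before extract()")
        # Placeholder: the real code would run ngram matching on each advert.
        return [{"advert": ad, "matches": []} for ad in job_adverts]
```

The user-facing flow is then three lines: instantiate, load() once, and call extract() on a list of adverts, with all the intermediate dataframe wrangling hidden inside the class.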

Review thread on dap_job_quality/pipeline/eyp/ngram_analysis/README.md (outdated, resolved)
Comment on lines +46 to +56:

splitter = StratifiedShuffleSplit(
    n_splits=1, test_size=args.sample_size, random_state=42
)

# Perform the stratified sampling
for train_index, test_index in splitter.split(
    all_job_ads, all_job_ads["stratify_col"]
):
    stratified_sample = all_job_ads.iloc[test_index]

stratified_sample.drop(columns=["stratify_col"], inplace=True)
Contributor:

This isn't working for me. I get the error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/elizabethgallagher/miniconda3/envs/dap_job_quality/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 1841, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/Users/elizabethgallagher/miniconda3/envs/dap_job_quality/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 2258, in _iter_indices
    raise ValueError(
ValueError: The test_size = 100 should be greater or equal to the number of classes = 258

It also looks like the for loop only saves the final iteration's results into stratified_sample? I'm not used to using StratifiedShuffleSplit though, so maybe I've misunderstood something.

I was wondering whether the following would work instead? Since you have already provided the stratification by creating the stratify_col column, I'm not sure what StratifiedShuffleSplit is adding; I think you just need to group by it and sample from each group:

stratified_sample = all_job_ads.groupby('stratify_col').apply(lambda x: x.sample(min(args.sample_size, len(x))), include_groups=False).reset_index()

Author (RFOxbury):

This is a really good question and I had to check the answer. I think it's correct to use StratifiedShuffleSplit in this way: we've said that we want the test set to have size args.sample_size, and then in the for loop, we just take the data that has been earmarked for the test set.

With the code you've suggested, does the include_groups=False part mean that sample_size = the total size (not the size of each single group)?
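If it helps, my understanding (an assumption worth double-checking, not something established in this thread) is that the two approaches interpret the size differently: StratifiedShuffleSplit's test_size is the total sample size across all strata, while x.sample(...) inside a groupby samples per stratum, so the total is the sum over groups. A pure-Python sketch of the per-group behaviour:

```python
import random
from collections import defaultdict

def per_group_sample(rows, key, n_per_group, seed=42):
    """Sample up to n_per_group rows from each stratum.

    Mimics the groupby('stratify_col').apply(lambda x: x.sample(...)) idea:
    the sample size applies to each group, not to the whole dataset.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(n_per_group, len(members))))
    return sample
```

So with two strata of sizes 5 and 2 and n_per_group=3, this yields 3 + 2 = 5 rows, whereas StratifiedShuffleSplit with test_size=5 would split those 5 slots across strata in proportion to their sizes.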

Review thread on dap_job_quality/pipeline/ngrams/get_most_common_ngrams.py (outdated, resolved)
Comment on a new file (diff @@ -0,0 +1,74 @@, starting with import time):

Contributor:

I think this can be simplified and sped up if you use sentence_transformers:

import time
from typing import List

import torch
from sentence_transformers import SentenceTransformer

MAX_LENGTH = 81  # This is the 99th percentile of N tokens in job advert sentences :)

# Function to embed sentences
def embed_sentences(
    sentences: List[str],
    model_name: str = "jjzha/jobbert-base-cased",
    batch_size: int = 32,
    device: str = "cuda" if torch.cuda.is_available() else "cpu",
):
    start_time = time.time()
    jobbert_model = SentenceTransformer(model_name, device=device)
    jobbert_model.max_seq_length = MAX_LENGTH
    embeddings = jobbert_model.encode(sentences, batch_size=batch_size)

    elapsed_time = time.time() - start_time
    print(f"Batch size: {batch_size}, Time taken: {elapsed_time:.2f} seconds")

    return embeddings

I tested using your original function and this new one, and the embeddings look the same, apart from the sentence_transformers output showing more decimal places.

Original method: embed_sentences(jobs_df["sentences"].tolist()[0:10], JOBBERT, 64)
[Screenshot 2024-06-25 14:48:06]

New method: jobbert_model.encode(jobs_df["sentences"].tolist()[0:10], batch_size=64)
[Screenshot 2024-06-25 14:48:12]

The rest of your algorithm doesn't seem to mind whether the output is a tensor or a numpy array. The original method took 335.82 seconds to process 100 job adverts, and this suggested method took 164.43 seconds.

The final output from running find_job_quality.py (jq_df_filtered) on this sample also looks exactly the same:
[Screenshot 2024-06-25 14:59:17]

Author (RFOxbury):

This blows my mind! I thought you could only use SentenceTransformers with a sentence model (rather than a masking or classification one) 🤯 That's a pretty big time saving! Thanks Liz
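For anyone else surprised by this: my understanding (an assumption, not something stated in this thread) is that when SentenceTransformer is pointed at a plain masked-LM checkpoint, it attaches a pooling layer, typically mean pooling over token embeddings, to turn per-token outputs into a single sentence vector. A minimal pure-Python illustration of masked mean pooling:

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring positions masked out (mask == 0).

    token_embeddings: list of equal-length float vectors (one per token).
    attention_mask: list of 0/1 flags, 1 meaning the token is real (not padding).
    """
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    n_real = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            n_real += 1
            for i, value in enumerate(vec):
                sums[i] += value
    # Divide by the number of real tokens; padding does not affect the result.
    return [s / max(n_real, 1) for s in sums]
```

This is only a toy illustration of the pooling step; the real sentence_transformers implementation operates on batched tensors.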

@RFOxbury RFOxbury merged commit 6f03e2c into dev Jul 2, 2024
@RFOxbury RFOxbury deleted the 52-ngrams branch July 2, 2024 08:58
Successfully merging this pull request may close these issues: [JQ] calculating how much ngrams come up