52 ngrams #55
Conversation
This looks really great — `find_job_quality.py` ran and the results looked good! 🎉

I've left comments, but in summary:

➡️ A general comment is that there is some repeated code, which I know is because of the iterative way of working on this, but for future clarity it'd be nice to remove some of the repeats and use `find_job_quality.py` as the core location for important functions. e.g.:
- put `split_text` in `utils/text_cleaning.py`
- it then needs importing from that file into `find_job_quality` and into `get_ngrams_and_matches`
- in `get_ngrams_and_matches.py` do `from dap_job_quality.pipeline.find_job_quality import split_ngrams, extract_ngrams, match_to_lookup`
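As a sketch of the suggested layout (the real `split_text` already lives in this repo — the implementation below is just an illustrative stand-in):

```python
# Hypothetical sketch of the refactor. The real split_text implementation
# lives in the repo; this stand-in only illustrates the shape.

# --- utils/text_cleaning.py ---
import re
from typing import List


def split_text(text: str) -> List[str]:
    """Stand-in: split a job advert into rough sentences."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# --- find_job_quality.py and get_ngrams_and_matches.py would then just do:
# from dap_job_quality.utils.text_cleaning import split_text

print(split_text("Great pay. Flexible hours!"))  # ['Great pay.', 'Flexible hours!']
```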
➡️ I think you could also add a main `pipeline/README.md` to explain what to run and what is outputted (you can just point to the `dap_job_quality/pipeline/eyp/ngram_analysis/README.md` where you explain the scripts in that folder).
➡️ I suggested a speed-up for your embeddings.

➡️ `dap_job_quality/pipeline/eyp/stratified_sample.py` didn't generate the sample for me — it might just be because of the small sample I used. I have suggested another way, but I may have misunderstood something!
➡️ In another PR (not for now) it would be great to refactor `find_job_quality.py` so that people can use it with the bare minimum of commands. e.g. in the skills extractor our functions allowed the user to simply run

```python
from ojd_daps_skills.pipeline.extract_skills.extract_skills import ExtractSkills  # import the module

es = ExtractSkills(config_name="extract_skills_toy", local=True)  # instantiate with toy taxonomy configuration file
es.load()  # load necessary models

job_adverts = [
    "The job involves communication skills and maths skills",
    "The job involves Excel skills. You will also need good presentation skills",
]  # toy job advert examples

job_skills_matched = es.extract_skills(job_adverts)  # match and extract skills to toy taxonomy
```

to extract skills. I think we can do this here too, but as it stands the user would need to add many lines of code as they aren't all enclosed in functions.
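A wrapper here could follow the same pattern. Everything below is a hypothetical sketch — the class name, config argument, and return shape are illustrative, not the repo's actual API:

```python
# Hypothetical sketch only: a class-based entry point for find_job_quality.py,
# mirroring the ExtractSkills API above. All names are illustrative.
from typing import Dict, List


class JobQuality:
    def __init__(self, config_name: str = "default", local: bool = True):
        self.config_name = config_name
        self.local = local
        self.model = None

    def load(self) -> None:
        # In the real pipeline this would load the embedding model and
        # the ngram lookup tables; here it's a placeholder.
        self.model = "loaded"

    def extract(self, job_adverts: List[str]) -> List[Dict]:
        # Real steps: split into sentences, embed, match ngrams to the
        # job-quality lookup. This stub returns an empty match list per advert.
        return [{"advert": ad, "matches": []} for ad in job_adverts]


jq = JobQuality(config_name="default", local=True)
jq.load()
results = jq.extract(["The role offers flexible working and training opportunities"])
print(len(results))  # 1
```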
Resolved review comments (outdated):

- `dap_job_quality/analysis/mapping_evaluation/matches_for_evaluation.py`
- `dap_job_quality/pipeline/eyp/ngram_analysis/extract_jq_sentences.py`
- `dap_job_quality/pipeline/eyp/ngram_analysis/get_ngrams_and_matches.py`
```python
splitter = StratifiedShuffleSplit(
    n_splits=1, test_size=args.sample_size, random_state=42
)

# Perform the stratified sampling
for train_index, test_index in splitter.split(
    all_job_ads, all_job_ads["stratify_col"]
):
    stratified_sample = all_job_ads.iloc[test_index]

stratified_sample.drop(columns=["stratify_col"], inplace=True)
```
This isn't working for me. I get the error:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/elizabethgallagher/miniconda3/envs/dap_job_quality/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 1841, in split
    for train, test in self._iter_indices(X, y, groups):
  File "/Users/elizabethgallagher/miniconda3/envs/dap_job_quality/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 2258, in _iter_indices
    raise ValueError(
ValueError: The test_size = 100 should be greater or equal to the number of classes = 258
```
It also looks like the for loop is only saving the final loop's results to `stratified_sample`? I'm not used to using `StratifiedShuffleSplit` though, so maybe I've misunderstood something?
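For what it's worth, the error is easy to reproduce on toy data (illustrative only): with stratification every class must appear in the test split, so `test_size` can't be smaller than the number of classes.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy frame with 5 classes but test_size=3: stratification is impossible
df = pd.DataFrame({"x": range(10), "stratify_col": list("aabbccddee")})
splitter = StratifiedShuffleSplit(n_splits=1, test_size=3, random_state=42)

err_msg = ""
try:
    # split() is a generator, so next() triggers the validation
    next(splitter.split(df, df["stratify_col"]))
except ValueError as err:
    err_msg = str(err)
print(err_msg)
```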
I was wondering whether the following would work instead? Since you have already provided the stratification by creating the `stratify_col` column, I'm not sure what `StratifiedShuffleSplit` is adding — I think you just need to group by it and sample from each group:

```python
stratified_sample = all_job_ads.groupby('stratify_col').apply(lambda x: x.sample(min(args.sample_size, len(x))), include_groups=False).reset_index()
```
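A toy run of that suggestion (illustrative data; I've dropped `include_groups=False` here since it needs pandas >= 2.2):

```python
import pandas as pd

# Toy stand-ins for all_job_ads and args.sample_size
all_job_ads = pd.DataFrame({"stratify_col": list("aaaabbbbcc"), "job_id": range(10)})
sample_size = 2

stratified_sample = (
    all_job_ads.groupby("stratify_col")
    .apply(lambda x: x.sample(min(sample_size, len(x)), random_state=0))
    .reset_index(drop=True)
)
# Each group contributes up to sample_size rows
print(sorted(stratified_sample["stratify_col"].value_counts().items()))  # [('a', 2), ('b', 2), ('c', 2)]
```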
This is a really good question and I had to check the answer. I think it's correct to use `StratifiedShuffleSplit` in this way: we've said that we want the test set to have size `args.sample_size`, and then in the for loop we just take the data that has been earmarked for the test set.

With the code you've suggested, does the `include_groups=False` part mean that `sample_size` = the total size (not the size of each single group)?
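One way to check the size question on toy data (illustrative only): with `StratifiedShuffleSplit`, `test_size` is the total sample size, whereas with groupby + sample the size applies per group. (`include_groups=False` only drops the grouping column from the frame passed to the lambda; it doesn't change the counts.)

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

df = pd.DataFrame({"stratify_col": list("aaaabbbbcccc"), "x": range(12)})

# StratifiedShuffleSplit: test_size is the TOTAL number of sampled rows
splitter = StratifiedShuffleSplit(n_splits=1, test_size=3, random_state=0)
_, test_idx = next(splitter.split(df, df["stratify_col"]))
print(len(test_idx))  # 3: one row per class, three rows in total

# groupby + sample: the size applies PER GROUP
per_group = df.groupby("stratify_col").apply(lambda g: g.sample(3, random_state=0))
print(len(per_group))  # 9: three rows from EACH of the three groups
```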
I think this can be simplified and quickened if you use `sentence_transformers`:

```python
import time
from typing import List

import torch
from sentence_transformers import SentenceTransformer

MAX_LENGTH = 81  # This is the 99th percentile of N tokens in job advert sentences :)


# Function to embed sentences
def embed_sentences(
    sentences: List[str],
    model_name: str = "jjzha/jobbert-base-cased",
    batch_size: int = 32,
    device: str = "cuda" if torch.cuda.is_available() else "cpu",
):
    start_time = time.time()
    jobbert_model = SentenceTransformer(model_name, device=device)
    jobbert_model.max_seq_length = MAX_LENGTH
    embeddings = jobbert_model.encode(sentences, batch_size=batch_size)
    elapsed_time = time.time() - start_time
    print(f"Batch size: {batch_size}, Time taken: {elapsed_time:.2f} seconds")
    return embeddings
```
I tested using your original function and this new one, and the embeddings look the same — the `sentence_transformers` output just has more decimal places.

Original method: `embed_sentences(jobs_df["sentences"].tolist()[0:10], JOBBERT, 64)`

[Screenshot 2024-06-25 at 14 48 06: original method's embedding output]

New method: `jobbert_model.encode(jobs_df["sentences"].tolist()[0:10], batch_size=64)`

The rest of your algorithm doesn't seem to mind whether the output is a tensor or a numpy array. The original method took 335.82 seconds to process 100 job adverts, and this suggested method took 164.43 seconds.

The final output from running `find_job_quality.py` (`jq_df_filtered`) on this sample also looks exactly the same.
This blows my mind! I thought you could only use `SentenceTransformers` with a sentence model (rather than a masking / classification etc. one) 🤯 That's a pretty big time saving! Thanks Liz
Resolved review comments (outdated):

- `dap_job_quality/pipeline/eyp/ngram_analysis/get_ngrams_and_matches.py` (two threads)
Co-authored-by: Liz G <[email protected]>
Description

This PR includes the scripts we used for sampling the EYP/similar occupations data and analysing ngram matches from that data, as well as the complete inference pipeline, `find_job_quality.py`.

Fixes #52 #49 #56 #58
Instructions for Reviewer

In order to test the code in this PR you need to ...

- `analysis/mapping_evaluation/` - I don't think you need to run these scripts; it was an analysis from a specific point in time (sprint 7)
- `python dap_job_quality/pipeline/eyp/stratified_sample.py` (it will run in test mode by default)
- `python dap_job_quality/pipeline/find_job_quality` (it will run in test mode by default)

Please pay special attention to ...

- `find_job_quality.py` - this is the most important script! Also, please add suggestions for more sensible ways of doing things if you spot them. This script largely works with dataframes, but maybe it would be better to use Python data structures like dicts? Not sure what's most efficient.

Checklist:
- I have refactored my code out of `notebooks/`
- I have run `pre-commit` and addressed any issues not automatically fixed
- I have merged any new changes from `dev`
- I have updated the `README`s