
Adding another possibility to provide the num_words in order to compute this dynamically #545

Merged

Conversation

JJorgeDSIC
Collaborator

When the vocabulary is built from all in-domain data plus the remaining words from other text sources, this job can now also accept `num_words` as a variable, obtained by subtracting the number of in-domain words from the originally specified `num_words`.

For example:


def get_vocab_prior_id(experiment_name: str, corpus_files: Dict[str, tk.Path], vocab_size: int, **kwargs) -> tk.Path:
    # Naming: "id" = in-domain, "od"/"ood" = out-of-domain.
    vocab_files_id = []
    corpus_files_od = {}
    # Split the corpora by whether their name carries the domain tag.
    for corpus_name, corpus_file in corpus_files.items():
        if "DOMAIN_TAG" in corpus_name:
            vocab_files_id.append(corpus_file)
        else:
            corpus_files_od[corpus_name] = corpus_file
    # Build the in-domain vocabulary first, capped at the full budget.
    vocabulary_job = VocabularyFromTextJob(file_paths=vocab_files_id, num_words=vocab_size)
    # The in-domain vocabulary size is only known once the job has run.
    voc_size_id = ComputeTextCorpusStatisticsJob(vocabulary_job.out_vocabulary).out_num_lines
    # Remaining budget for out-of-domain words, passed on as a variable.
    vocab_size_od = vocab_size - voc_size_id
    vocab_ood = get_vocab_combine_all(experiment_name, corpus_files_od, vocab_size_od)
    vocab_ood_id = ConcatenateJob([vocab_ood, vocabulary_job.out_vocabulary], zip_out=False, out_name="vocab").out
    return vocab_ood_id
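
To make the size arithmetic concrete, here is a minimal plain-Python sketch of the same idea, without any Sisyphus jobs. The helper names `top_words` and `combine_vocab` are hypothetical and not part of this repository, and ranking words by frequency is an assumption made only for illustration. The point is that the out-of-domain budget can only be computed after the in-domain vocabulary has been built.

```python
from collections import Counter
from typing import Iterable, List


def top_words(lines: Iterable[str], limit: int) -> List[str]:
    """Return up to `limit` most frequent whitespace-separated words."""
    counts = Counter(word for line in lines for word in line.split())
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:limit]]


def combine_vocab(
    in_domain: Iterable[str],
    out_of_domain: Iterable[str],
    vocab_size: int,
) -> List[str]:
    """Hypothetical stand-in for the in-domain-first vocabulary construction."""
    in_domain_vocab = top_words(in_domain, vocab_size)
    # The remaining budget depends on the in-domain result, so it cannot be
    # fixed in advance; this is the value passed on as a variable in the PR.
    remaining = vocab_size - len(in_domain_vocab)
    out_of_domain_vocab = top_words(out_of_domain, remaining)
    # Plain concatenation, out-of-domain part first, as in the snippet above.
    return out_of_domain_vocab + in_domain_vocab


if __name__ == "__main__":
    print(combine_vocab(["a b a", "c"], ["x x y", "z z z"], vocab_size=5))
```

This sketch does not deduplicate: a word that occurs in both sources would appear twice in the result.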

@JJorgeDSIC JJorgeDSIC merged commit 5bfefda into main Nov 4, 2024
4 checks passed
3 participants