Adding another possibility to provide the num_words in order to compute this dynamically #545

JJorgeDSIC · 2024-09-24T14:23:03Z

When the vocabulary is built from all in-domain data plus the remaining words from other text sources, this job can now also receive a variable as num_words. That variable is obtained by subtracting the number of in-domain words from the originally specified num_words.

For example:


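from typing import Dict

from sisyphus import tk

# The job classes and get_vocab_combine_all are assumed to be imported or
# defined elsewhere in the surrounding setup.


def get_vocab_prior_id(experiment_name: str, corpus_files: Dict[str, tk.Path], vocab_size: int, **kwargs) -> tk.Path:
    # Split the corpora into in-domain (id) and out-of-domain (od) files.
    vocab_files_id = []
    corpus_files_od = {}
    for corpus_name, corpus_file in corpus_files.items():
        if "DOMAIN_TAG" in corpus_name:
            vocab_files_id.append(corpus_file)
        else:
            corpus_files_od[corpus_name] = corpus_file
    # Build the in-domain vocabulary first; it may contain fewer than vocab_size words.
    vocabulary_job = VocabularyFromTextJob(file_paths=vocab_files_id, num_words=vocab_size)
    # One word per line, so the line count is the in-domain vocabulary size (a variable).
    voc_size_id = ComputeTextCorpusStatisticsJob(vocabulary_job.out_vocabulary).out_num_lines
    # Remaining budget for out-of-domain words, computed dynamically.
    vocab_size_od = vocab_size - voc_size_id
    vocab_ood = get_vocab_combine_all(experiment_name, corpus_files_od, vocab_size_od)
    vocab_ood_id = ConcatenateJob([vocab_ood, vocabulary_job.out_vocabulary], zip_out=False, out_name="vocab").out
    return vocab_ood_id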
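
The budget arithmetic in the snippet above can be illustrated without Sisyphus. The following is a minimal plain-Python sketch, not the job's implementation: `top_words` and `build_vocab` are hypothetical helper names, corpora are plain lists of strings, and skipping out-of-domain words that are already in the in-domain vocabulary is an assumption of this sketch that the snippet above does not show.

```python
from collections import Counter
from typing import FrozenSet, Iterable, List


def top_words(lines: Iterable[str], num_words: int, exclude: FrozenSet[str] = frozenset()) -> List[str]:
    """Return up to num_words most frequent words, skipping words in exclude."""
    counts = Counter(w for line in lines for w in line.split() if w not in exclude)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # A non-positive budget yields an empty list.
    return [word for word, _ in ranked[: max(num_words, 0)]]


def build_vocab(in_domain: List[str], out_of_domain: List[str], vocab_size: int) -> List[str]:
    """Take in-domain words first, then fill the remaining budget from other text."""
    id_vocab = top_words(in_domain, vocab_size)
    # The out-of-domain budget is only known once the in-domain vocabulary exists.
    remaining = vocab_size - len(id_vocab)
    # Assumption of this sketch: do not spend budget on words already covered.
    ood_vocab = top_words(out_of_domain, remaining, exclude=frozenset(id_vocab))
    # Same order as the concatenation above: out-of-domain part first.
    return ood_vocab + id_vocab


if __name__ == "__main__":
    print(build_vocab(["a b a", "c"], ["d d e", "a f f f"], vocab_size=5))
```

In the Sisyphus setting the in-domain size is only available after the vocabulary job has run, which is why the job needs to accept num_words as a variable instead of a fixed integer.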

lm/vocabulary.py

Co-authored-by: michelwi <[email protected]>

Adding another possibility to provide the num_words in order to compute this dynamically

e2796a6

michelwi reviewed Sep 24, 2024

lm/vocabulary.py (outdated, resolved)

michelwi requested review from christophmluscher, NeoLegends, JackTemaki, michelwi and Atticus1806 September 24, 2024 14:44

Update lm/vocabulary.py

10ebe79

Co-authored-by: michelwi <[email protected]>

JackTemaki approved these changes Oct 8, 2024

michelwi approved these changes Oct 8, 2024

JJorgeDSIC merged commit 5bfefda into main Nov 4, 2024
4 checks passed
