Script to automatically split off eval set #1525

mattyding · 2024-09-16T08:33:54Z

What changes are proposed in this pull request?

See go/sweeps-eval for more context. Corresponding MAPI changes are contained in https://github.com/databricks-mosaic/mcloud/pull/4562.

How is this tested?

Added unit tests.

The predecessor to the unit tests was this bash script, which I used to manually inspect the file outputs. It runs the same cases as the unit tests, but invokes the scripts/... file instead of the command_utils/... one. I confirmed that the file outputs looked OK.
_test.txt

Everything passes:

mattyding · 2024-09-16T08:40:38Z

llmfoundry/data/finetuning/tasks.py

+            )
+            if _is_empty_or_nonexistent(dirpath=dataset_name):
+                log.error("Failed to safely load the dataset from HF Hub.")
+                raise InvalidFileExtensionError(


Moved HF safe download logic into a separate function so I could reuse it. Code is basically unchanged.

Unrelated to the meat of the PR, but per Daniel's request, I tried refactoring further to get rid of the try-except block. I don't think it can be done without significant added complexity. You want to barrier/sync before throwing this InvalidFileExtensionError; however, the check must be nested within the download logic so that only rank 0 encounters it. So you can't both separate out this logic and have a graceful exit without try-except. :'(
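To make the constraint concrete, here is a minimal sketch of the pattern, not the actual code in tasks.py. The function name `safe_download` and its callable parameters are hypothetical stand-ins (in the real code the barrier and rank would presumably come from the distributed utilities), and the error class is redefined locally so the snippet runs on its own.

```python
from typing import Callable, Optional


class InvalidFileExtensionError(Exception):
    """Local stand-in for the llm-foundry error of the same name."""


def safe_download(
    rank: int,
    download: Callable[[], None],
    is_empty_or_nonexistent: Callable[[], bool],
    barrier: Callable[[], None],
) -> None:
    """Download on rank 0 only, and sync every rank before an error escapes.

    All parameters are hypothetical placeholders for illustration.
    """
    error: Optional[Exception] = None
    if rank == 0:
        try:
            download()
            if is_empty_or_nonexistent():
                raise InvalidFileExtensionError(
                    'No usable files were downloaded.',
                )
        except Exception as e:
            # Hold the error so rank 0 still reaches the barrier below.
            error = e
    # Every rank arrives here, so nobody is left waiting on rank 0.
    barrier()
    if error is not None:
        raise error
```

Without the try-except, rank 0 would raise before reaching the barrier and the other ranks would wait on it, which is why the check and the deferred raise have to live together inside the download logic.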

mattyding · 2024-09-16T08:41:54Z

llmfoundry/command_utils/data_prep/split_eval_set.py

+import numpy as np
+from typing import Optional
+
+import composer.utils as utils


The import is formatted this way due to mocking difficulties; see https://bhfsteve.blogspot.com/2012/06/patching-tip-using-mocks-in-python-unit.html
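A small illustration of the patching issue, using the standard library's `os` module as a stand-in for `composer.utils` (the function names below are made up for the example). Patching an attribute on a module only affects callers that look the name up through the module at call time; a name bound earlier with `from ... import ...` keeps pointing at the original object.

```python
import os
from os import getcwd
from unittest.mock import patch


def cwd_via_module() -> str:
    # Looks up `getcwd` on the `os` module at call time, so a patch applies.
    return os.getcwd()


def cwd_via_name() -> str:
    # Uses the name bound at import time, so patching `os.getcwd` is not seen.
    return getcwd()


with patch('os.getcwd', return_value='/mocked'):
    module_result = cwd_via_module()
    name_result = cwd_via_name()

print(module_result)  # the mocked value
print(name_result)    # the real working directory
```

Importing the module under an alias, as in `import composer.utils as utils`, keeps the lookup going through the module, so a test can patch the function in one place.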


mattyding added 5 commits September 12, 2024 13:31

refactor hf download

fe27b8d

split_eval_set skeleton

18859b1

splitting script

f29ef67

error handling and testing

3d9d51f

undo autoformat

4e7b357


KuuCi and others added 24 commits September 16, 2024 20:54

Replace FSDP args (#1517)

83ab9c3

Co-authored-by: v-chen_data

enable correct padding_idx for embedding layers (#1527)

0114f33

Revert "Replace FSDP args" (#1533)

9a1b78b

Delete unneeded inner base model in PEFT HF Checkpointer (#1532)

7a23f60

Add deprecation warning to fsdp_config (#1530)

2e3d14f

Co-authored-by: v-chen_data

Fix reuse kv cache for torch attention (#1539)

d7c7822

Error on text dataset file not found (#1534)

14cff66

Make ICL tasks not required for eval (#1540)

a2c0507

Bumping flash attention version to 2.6.3 and adding option for softcap in attention and lm_head logits. (#1374)

85403c0

Register mosaic logger (#1542)

f377090

Hfcheckpointer optional generation config (#1543)

d85c83b

Co-authored-by: v-chen_data

Bump composer version to 0.25.0 (#1546)

275a2a4

Bump streaming version to 0.9.0 (#1550)

151a2e2

Bump version to 0.13.0.dev0 (#1549)

722526d

Add proper user error for accessing schema (#1548)

c786def

Co-authored-by: v-chen_data

Validate Cluster Access Mode (#1551)

e6b8d14

Co-authored-by: v-chen_data, Daniel King

Update mcli yamls (#1552)

dc58bb7

Use allenai/c4 instead of c4 dataset (#1554)

3b1fc4a

Co-authored-by: Eitan Turok

Tensor Parallelism (#1521)

ee45600

Co-authored-by: Eitan Turok, Mihir Patel

Insufficient Permissions Error when trying to access table (#1555)

107d246

Co-authored-by: v-chen_data

Add NoOp optimizer (#1560)

4202a06

Deterministic GCRP Errors (#1559)

0ad6ab4

Co-authored-by: v-chen_data

Simplify CL API (#1510)

bdc58b3

Reapply #1389 (#1561)

30cdd67

b-chu and others added 14 commits October 1, 2024 11:14

Add dataset swap callback (#1536)

ec4cafd

Add error to catch more unknown example types (#1562)

b517297

Add FileExtensionNotFoundError (#1564)

8cf3d87

Add InvalidConversationError (#1565)

a462f03

Release docker img (#1547)

24fec79

Co-authored-by: v-chen_data

Revert FT dataloader changes from #1561, keep #1564 (#1566)

214305f

Cleanup TP (#1556)

4bbb4a5

Co-authored-by: Eitan Turok

Changes for dataset swap callback (#1569)

788c1f5

refactor hf download

56e4573

split_eval_set skeleton

983a32d

splitting script

d3d587d

error handling and testing

b921b30

undo autoformat

7925001

Merge branch 'split-eval-set' of github.com:mosaicml/llm-foundry into split-eval-set

7b47ae6

mattyding closed this Oct 5, 2024
