Math RL data preparation #368
base: main
Conversation
Signed-off-by: Igor Gitman <[email protected]>
```
@@ -350,3 +350,14 @@ def compute_chunk_ids(chunk_ids: list[int] | str, num_chunks: int) -> list[int]
        assert chunk_id >= 0, "Run ids should have 1-based indexing"

    return chunk_ids


def prefill_judgement(data_point: dict) -> str | None:
```
Should we keep this function here, or is it more appropriate for the training folder? Maybe a `utils.py` for the training folder makes sense.
So this is supposed to be a basic judge model which only cares about the surface form, and should be task agnostic. Is this preferred because it's lightweight and generic?
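For context, a minimal sketch of what such a surface-form judge could look like (an assumed implementation; the hunk above only shows the signature, and the `expected_answer` key and judgement string format are assumptions):

```python
def prefill_judgement(data_point: dict) -> str | None:
    # Compare answers purely by surface form; anything ambiguous is
    # deferred to the LLM judge by returning None.
    predicted = data_point.get("predicted_answer")
    expected = data_point.get("expected_answer")  # assumed key name
    if predicted is None or expected is None:
        return None
    if str(predicted).strip() == str(expected).strip():
        return "Judgement: correct"  # assumed judgement string format
    return None
```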
```diff
@@ -171,4 +171,4 @@ def test_openmathinstruct2():

     assert (
         expected_md5 == output_md5
-    ), "MD5 hashes do not match, something is wrong with nemo_skills/finetuning/prepare_sft_data.py"
+    ), "MD5 hashes do not match, something is wrong with nemo_skills/finetuning/prepare_data.py"
```
), "MD5 hashes do not match, something is wrong with nemo_skills/finetuning/prepare_data.py" | |
), "MD5 hashes do not match, something is wrong with nemo_skills/training/prepare_data.py" |
```
@@ -0,0 +1,90 @@
processors_to_run: all
```
To avoid clutter we can move all the config files into a `config` folder.
- "majority_votes" | ||
|
||
# this will optimize processors inside to avoid serializing data to disk | ||
- _target_: nemo_skills.training.data_preparation_utils.merge_processor.MergeProcessor |
This might be a good addition for `math_sft.yaml` and `code_sft.yaml` as well. We can do it later, but it's good to know about this functionality.
```python
# take only required keys from the input if exclude_optional_keys is True
output_sample = {}
if not self.exclude_optional_keys:
    output_sample = json.loads(line)
```
This `if`/`elif` structure is testing for two very different things. Is there a missing `else` for the first `if`? Can the second `elif` be an `if` in itself?
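To illustrate the point (a hypothetical sketch; the `elif` branch in question is not visible in the hunk above): when two conditions are unrelated, independent `if` statements read more clearly than an `if`/`elif` chain.

```python
import json

# Hypothetical restructuring; the second condition is a placeholder, since
# the elif branch under discussion is not shown in the diff above.
def build_sample(line: str, exclude_optional_keys: bool) -> dict:
    # take only required keys from the input if exclude_optional_keys is True
    output_sample = {}
    if not exclude_optional_keys:
        output_sample = json.loads(line)

    # Formerly an elif: an unrelated check reads better as its own statement.
    if not output_sample:
        output_sample = {"placeholder": True}
    return output_sample
```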
```python
        if judgement is not None:
            prefilled_judgements.append(judgement)
            prefilled_indices.add(len(data_points) - 1)

    llm = get_model(server_type="trtllm")
    prompt = get_prompt('judge/math', 'qwen-instruct')
```
`server_type`, `prompt_config`, and `prompt_template` should not be hardcoded in the function. Ideally, make them parameters.
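A minimal sketch of that parameterization, keeping the current values as defaults (hypothetical; the PR's actual signature may differ):

```python
from nemo_skills.inference.server.model import get_model
from nemo_skills.prompt.utils import get_prompt

# Hypothetical sketch: expose the hardcoded values as keyword arguments so
# existing callers keep working while other setups can override them.
def reward_func(
    queries: list[str],
    prompts: list[str],
    prompt_metadata: list[dict],
    server_type: str = "trtllm",
    prompt_config: str = "judge/math",
    prompt_template: str = "qwen-instruct",
):
    llm = get_model(server_type=server_type)
    prompt = get_prompt(prompt_config, prompt_template)
    ...
```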
"predicted_answer": extract_answer(query), | ||
} | ||
) | ||
judgement = prefill_judgement(data_points[-1]) |
Prefilling can ideally save computation, but right now these data points are still being passed to the LLM. For cases where the judgement is True/Correct, we can drop them from the LLM batch; for the ones where it is None or False, we can rely on the LLM as a judge.
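A sketch of the suggested flow (assumptions: it reuses the `data_points` and `prefilled_indices` names from the hunk above, stores prefilled judgements in a dict keyed by index rather than the list in the diff, and `run_llm_judge` is a stand-in for the PR's batched LLM call):

```python
# Toy example data; in the PR these come from the loop shown above.
data_points = [{"predicted_answer": "42"}, {"predicted_answer": "7"}]
prefilled_indices = {0}
prefilled = {0: "Judgement: correct"}  # index -> judgement, instead of a list

def run_llm_judge(points: list[dict]) -> list[str]:
    # Stand-in for the PR's batched LLM judge call (assumption, not a real API).
    return ["Judgement: correct"] * len(points)

# Send only the undecided points to the LLM...
to_judge = [dp for i, dp in enumerate(data_points) if i not in prefilled_indices]
llm_results = iter(run_llm_judge(to_judge))

# ...then merge prefilled and LLM judgements back into the original order.
judgements = [
    prefilled[i] if i in prefilled_indices else next(llm_results)
    for i in range(len(data_points))
]
```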
```python
from nemo_skills.code_execution.math_grader import extract_answer
from nemo_skills.evaluation.metrics.utils import is_correct_judgement
from nemo_skills.inference.server.model import get_model
from nemo_skills.prompt.utils import get_prompt
from nemo_skills.utils import prefill_judgement


def reward_func(queries: list[str], prompts: list[str], prompt_metadata: list[dict]):
```
Add a one-line description stating that the function assigns a binary score to the prompt-response/problem-response pairs.
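For example, a docstring along these lines (suggested wording only, not the author's):

```python
def reward_func(queries: list[str], prompts: list[str], prompt_metadata: list[dict]):
    """Assign a binary correctness score to each problem-response pair."""
```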
I am confused by the parameter `prompts`. It's passed as a parameter, and then the variable `prompts` is created inside the function on line 29. The other variables could also use better names; e.g., `queries` could be `generations`/`responses`.
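For illustration, a possible renaming (hypothetical; the final names are the author's call):

```python
# 'responses' marks the model generations being scored, and the judge
# prompts built inside the function get their own name instead of
# rebinding the 'prompts' parameter.
def reward_func(responses: list[str], prompts: list[str], prompt_metadata: list[dict]):
    judge_prompts = []  # fill per response; avoids shadowing 'prompts'
    ...
```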
Breaking change: `prepare_sft_data` is renamed to `prepare_data` as it now covers more cases.