
[Example] Fix llm_finetune example json part #3461

Merged
merged 1 commit into from
Jul 13, 2023

Conversation

chongxiaoc
Contributor

The llm_finetune example uses an outdated JSON format; this PR fixes it.

However, my test environment uses Ray 2.5 and still hits a failure in the model-training part.

@chongxiaoc chongxiaoc changed the title [Example] Fix llm_finetune example [Example] Fix llm_finetune example json part Jul 13, 2023
@arnavgarg1
Contributor

Hey! Thanks for the contribution.

What was the issue you ran into when using the LLM as an encoder? It should work as-is as long as the model type is ECD.

Currently, Ludwig master only supports Ray 2.4.

@chongxiaoc
Contributor Author

chongxiaoc commented Jul 13, 2023

> What was the issue you ran into when using the LLM as an encoder? It should work as-is as long as the model type is ECD.

JSON validation error.
In that case, maybe only this part is necessary:

trainer:
  type: finetune

@chongxiaoc chongxiaoc force-pushed the fix_llm_example branch 2 times, most recently from 634f20f to 85f4b42 Compare July 13, 2023 00:24
@github-actions

github-actions bot commented Jul 13, 2023

Unit Test Results

  4 files  ±0    4 suites  ±0   55m 4s ⏱️ -51s
34 tests ±0  27 ✔️  - 2    7 💤 +2  0 ±0 
68 runs  ±0  54 ✔️  - 4  14 💤 +4  0 ±0 

Results for commit bcdf349. ± Comparison against base commit 60f1416.

This pull request skips 2 tests.
tests.regression_tests.benchmark.test_model_performance ‑ test_performance[ames_housing.gbm.yaml]
tests.regression_tests.benchmark.test_model_performance ‑ test_performance[mercedes_benz_greener.gbm.yaml]

♻️ This comment has been updated with latest results.

@arnavgarg1
Contributor

Hmm, I'm not sure you need to add trainer: type: finetune here, since that is specifically reserved for LLM model types. In this case, the config defaults to ECD, which has just one trainer type.

What is the exact issue you're running into with the existing config?

@arnavgarg1 arnavgarg1 self-requested a review July 13, 2023 15:56
@chongxiaoc
Contributor Author

chongxiaoc commented Jul 13, 2023

@arnavgarg1
The error comes from the lora field in this example, so I had to convert it to the llm training case. (I reverted my changes, since ecd is not working.)

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /repo/llm_ray/train_imdb_ray.py:43 in <module>                                                   │
│                                                                                                  │
│   40 )                                                                                           │
│   41                                                                                             │
│   42 # Define Ludwig model object that drive model training                                      │
│ ❱ 43 model = LudwigModel(config=config, logging_level=logging.INFO)                              │
│   44                                                                                             │
│   45 # initiate model training                                                                   │
│   46 (                                                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.9/dist-packages/ludwig/api.py:312 in __init__                             │
│                                                                                                  │
│    309 │   │   self._user_config = upgrade_config_dict_to_latest_version(config_dict)            │
│    310 │   │                                                                                     │
│    311 │   │   # Initialize the config object                                                    │
│ ❱  312 │   │   self.config_obj = ModelConfig.from_dict(self._user_config)                        │
│    313 │   │                                                                                     │
│    314 │   │   # setup logging                                                                   │
│    315 │   │   self.set_logging_level(logging_level)                                             │
│                                                                                                  │
│ /usr/local/lib/python3.9/dist-packages/ludwig/schema/model_types/base.py:136 in from_dict        │
│                                                                                                  │
│   133 │   │   # TODO: Set `additionalProperties=False` for all Ludwig schema, and look into pa   │
│   134 │   │   # marshmallow.load(), which raises an error for unknown fields during deserializ   │
│   135 │   │   # https://marshmallow.readthedocs.io/en/stable/marshmallow.schema.html#marshmall   │
│ ❱ 136 │   │   check_schema(config)                                                               │
│   137 │   │                                                                                      │
│   138 │   │   cls = model_type_schema_registry[model_type]                                       │
│   139 │   │   schema = cls.get_class_schema()()                                                  │
│                                                                                                  │
│ /usr/local/lib/python3.9/dist-packages/ludwig/config_validation/validation.py:76 in check_schema │
│                                                                                                  │
│   73 │   │   │   error = e                                                                       │
│   74 │                                                                                           │
│   75 │   if error is not None:                                                                   │
│ ❱ 76 │   │   raise ConfigValidationError(f"Failed to validate JSON schema for config. Error:     │
│   77                                                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ConfigValidationError: Failed to validate JSON schema for config. Error: 'lora' is not of type 'object'
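Based only on the error message above, the config likely supplied lora as a scalar where the schema expects an object. A hypothetical before/after sketch (the key name and nesting are illustrative, not copied from the example file):

```yaml
# Hypothetical illustration of the mismatch reported by check_schema.
# Before: a scalar value, which fails with "'lora' is not of type 'object'"
adapter: lora

# After: an object with an explicit type field, which satisfies the schema
adapter:
  type: lora
```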

@chongxiaoc
Contributor Author

chongxiaoc commented Jul 13, 2023

I got a Ray 2.3 Docker image; with the above config changed to llm, it failed in padding as below:

ray.exceptions.RayTaskError(IndexError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=3597, ip=10.78.204.13, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f24880c83a0>)
  File "/usr/local/lib/python3.9/dist-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.9/dist-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/backend/ray.py", line 479, in <lambda>
    lambda config: train_fn(**config),
  File "/usr/local/lib/python3.9/dist-packages/ludwig/backend/ray.py", line 206, in train_fn
    results = trainer.train(train_shard, val_shard, test_shard, return_state_dict=True, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/distributed/base.py", line 152, in wrapped
    res = fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/trainers/trainer.py", line 761, in train
    should_break = self._train_loop(
  File "/usr/local/lib/python3.9/dist-packages/ludwig/trainers/trainer.py", line 891, in _train_loop
    loss, all_losses = self.train_step(inputs, targets, should_step=should_step)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/trainers/trainer.py", line 263, in train_step
    model_outputs = self.dist_model((inputs, targets))
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py", line 1724, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/models/llm.py", line 288, in forward
    self.model_inputs, self.attention_masks = generate_merged_ids(
  File "/usr/local/lib/python3.9/dist-packages/ludwig/utils/llm_utils.py", line 282, in generate_merged_ids
    target_id_sample_no_padding = remove_left_padding(target_id_sample, tokenizer)[0]
  File "/usr/local/lib/python3.9/dist-packages/ludwig/utils/llm_utils.py", line 99, in remove_left_padding
    input_ids_no_bos = input_ids_no_padding[bos_idx:].unsqueeze(0)
IndexError: Dimension specified as 0 but tensor has no dimensions
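The IndexError above comes from indexing into a degenerate tensor after padding removal. A minimal pure-Python sketch of the failure mode (this is an illustrative analogue, not Ludwig's actual remove_left_padding implementation):

```python
def remove_left_padding(ids, pad_id=0):
    # Drop leading pad tokens from a token-id sequence.
    # An all-padding sample yields an empty result; indexing element [0]
    # of such an empty/zero-dimensional result is the kind of edge case
    # that triggers the IndexError in the traceback above.
    i = 0
    while i < len(ids) and ids[i] == pad_id:
        i += 1
    return ids[i:]

# A normal sample keeps its content tokens:
assert remove_left_padding([0, 0, 7, 8]) == [7, 8]

# An all-padding sample collapses to nothing; downstream code must
# guard before indexing into it:
assert remove_left_padding([0, 0]) == []
```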

@chongxiaoc
Contributor Author

Converted the change back to ecd mode by moving lora outside.
Ray 2.3 hit the error below on 1 GPU:

ray.exceptions.RayTaskError(AssertionError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=27661, ip=10.78.204.13, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f227c0c83d0>)
  File "/usr/local/lib/python3.9/dist-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.9/dist-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/backend/ray.py", line 479, in <lambda>
    lambda config: train_fn(**config),
  File "/usr/local/lib/python3.9/dist-packages/ludwig/backend/ray.py", line 200, in train_fn
    trainer = remote_trainer_cls(
  File "/usr/local/lib/python3.9/dist-packages/ludwig/trainers/trainer.py", line 1309, in __init__
    super().__init__(**kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/trainers/trainer.py", line 191, in __init__
    self.prepare()
  File "/usr/local/lib/python3.9/dist-packages/ludwig/trainers/trainer.py", line 210, in prepare
    self.dist_model, self.optimizer = self.distributed.prepare(
  File "/usr/local/lib/python3.9/dist-packages/ludwig/distributed/deepspeed.py", line 101, in prepare
    model_engine, optimizer, _, _ = deepspeed.initialize(
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py", line 1173, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py", line 1463, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage3.py", line 256, in __init__
    self._create_fp16_partitions_with_defragmentation(self.trainable_param_groups)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage3.py", line 600, in _create_fp16_partitions_with_defragmentation
    device_buffer = __class__.defragment(parameter_partitions)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage3.py", line 436, in defragment
    assert len(set(t.device for t in tensors)) == 1
AssertionError
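The AssertionError comes from DeepSpeed's ZeRO stage-3 defragment step, which requires every parameter partition to live on a single device before flattening them into one buffer. A simplified sketch of that invariant (plain strings stand in for torch.device objects; this is not DeepSpeed's code):

```python
def all_on_one_device(devices):
    # Mirrors the spirit of DeepSpeed's check:
    #   assert len(set(t.device for t in tensors)) == 1
    # i.e. mixed CPU/GPU placement of partitions trips the assertion.
    return len(set(devices)) == 1

# Uniform placement passes:
assert all_on_one_device(["cuda:0", "cuda:0", "cuda:0"])

# Mixed placement (e.g. some parameters left on CPU) fails,
# producing a bare AssertionError like the one above:
assert not all_on_one_device(["cpu", "cuda:0"])
```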

@arnavgarg1
Contributor

@chongxiaoc can you try bumping DeepSpeed to 0.9.2 and see whether this error persists? From what I remember, this error used to show up on older versions of DeepSpeed.

@chongxiaoc
Contributor Author

@arnavgarg1 Just adding a type for lora fixes it.

Training is running now:

2023-07-13 21:37:28,897 WARNING util.py:244 -- The `callbacks.on_trial_result` operation took 0.669 s, which may be a performance bottleneck.
2023-07-13 21:37:28,932 WARNING util.py:244 -- The `process_trial_result` operation took 1.328 s, which may be a performance bottleneck.
2023-07-13 21:37:28,932 WARNING util.py:244 -- Processing trial results took 1.328 s, which may be a performance bottleneck. Please consider reporting results less frequently to Ray Tune.
2023-07-13 21:37:28,932 WARNING util.py:244 -- The `process_trial_result` operation took 1.485 s, which may be a performance bottleneck.
Training:   5%|████▉   

@arnavgarg1
Contributor

Nice! So deepspeed isn't a problem then? What version of deepspeed are you on?

Thanks for working through this so patiently!

@chongxiaoc
Contributor Author

chongxiaoc commented Jul 13, 2023

> Nice! So deepspeed isn't a problem then? What version of deepspeed are you on?
>
> Thanks for working through this so patiently!

deepspeed==0.9.2. This is the version I have been using since opening this PR.

@arnavgarg1
Contributor

Perfect, thanks!

@arnavgarg1 arnavgarg1 merged commit 5c92f62 into ludwig-ai:master Jul 13, 2023
11 of 14 checks passed