
[Example] Fix llm_finetune example json part #3461

Merged
merged 1 commit into from
Jul 13, 2023

Conversation

chongxiaoc
Contributor

The llm_finetune example uses an outdated JSON format; this PR fixes it.

However, my test environment uses Ray 2.5 and still hits a failure in the model-training part.

@chongxiaoc chongxiaoc changed the title [Example] Fix llm_finetune example [Example] Fix llm_finetune example json part Jul 13, 2023
@arnavgarg1
Contributor

Hey! Thanks for the contribution.

What was the issue you ran into when using the LLM as an encoder? It should work as-is as long as the model type is ECD.

Currently, Ludwig master only supports Ray 2.4.

@chongxiaoc
Contributor Author

chongxiaoc commented Jul 13, 2023

> What was the issue you ran into when using the LLM as an encoder? It should work as-is as long as the model type is ECD.

JSON validation error.
In that case, maybe only this part is necessary:

trainer:
  type: finetune

@chongxiaoc chongxiaoc force-pushed the fix_llm_example branch 2 times, most recently from 634f20f to 85f4b42 Compare July 13, 2023 00:24
@github-actions

github-actions bot commented Jul 13, 2023

Unit Test Results

  4 files  ±0    4 suites  ±0   55m 4s ⏱️ -51s
34 tests ±0  27 ✔️  - 2    7 💤 +2  0 ±0 
68 runs  ±0  54 ✔️  - 4  14 💤 +4  0 ±0 

Results for commit bcdf349. ± Comparison against base commit 60f1416.

This pull request skips 2 tests.
tests.regression_tests.benchmark.test_model_performance ‑ test_performance[ames_housing.gbm.yaml]
tests.regression_tests.benchmark.test_model_performance ‑ test_performance[mercedes_benz_greener.gbm.yaml]

♻️ This comment has been updated with latest results.

@arnavgarg1
Contributor

Hmm, I'm not sure you need to add trainer: type: finetune here, since that is specifically reserved for LLM model types. In this case, the config defaults to ECD, which has just one trainer type.

What is the exact issue you're running into with the existing config?

@arnavgarg1 arnavgarg1 self-requested a review July 13, 2023 15:56
@chongxiaoc
Contributor Author

chongxiaoc commented Jul 13, 2023

@arnavgarg1
The error comes from the lora field in this example, so I had to convert it to the llm training case. (I reverted my changes, since ecd is not working.)

The above exception was the direct cause of the following exception:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /repo/llm_ray/train_imdb_ray.py:43 in <module>                                                   │
│                                                                                                  │
│   40 )                                                                                           │
│   41                                                                                             │
│   42 # Define Ludwig model object that drive model training                                      │
│ ❱ 43 model = LudwigModel(config=config, logging_level=logging.INFO)                              │
│   44                                                                                             │
│   45 # initiate model training                                                                   │
│   46 (                                                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.9/dist-packages/ludwig/api.py:312 in __init__                             │
│                                                                                                  │
│    309 │   │   self._user_config = upgrade_config_dict_to_latest_version(config_dict)            │
│    310 │   │                                                                                     │
│    311 │   │   # Initialize the config object                                                    │
│ ❱  312 │   │   self.config_obj = ModelConfig.from_dict(self._user_config)                        │
│    313 │   │                                                                                     │
│    314 │   │   # setup logging                                                                   │
│    315 │   │   self.set_logging_level(logging_level)                                             │
│                                                                                                  │
│ /usr/local/lib/python3.9/dist-packages/ludwig/schema/model_types/base.py:136 in from_dict        │
│                                                                                                  │
│   133 │   │   # TODO: Set `additionalProperties=False` for all Ludwig schema, and look into pa   │
│   134 │   │   # marshmallow.load(), which raises an error for unknown fields during deserializ   │
│   135 │   │   # https://marshmallow.readthedocs.io/en/stable/marshmallow.schema.html#marshmall   │
│ ❱ 136 │   │   check_schema(config)                                                               │
│   137 │   │                                                                                      │
│   138 │   │   cls = model_type_schema_registry[model_type]                                       │
│   139 │   │   schema = cls.get_class_schema()()                                                  │
│                                                                                                  │
│ /usr/local/lib/python3.9/dist-packages/ludwig/config_validation/validation.py:76 in check_schema │
│                                                                                                  │
│   73 │   │   │   error = e                                                                       │
│   74 │                                                                                           │
│   75 │   if error is not None:                                                                   │
│ ❱ 76 │   │   raise ConfigValidationError(f"Failed to validate JSON schema for config. Error:     │
│   77                                                                                             │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ConfigValidationError: Failed to validate JSON schema for config. Error: 'lora' is not of type 'object'
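Based only on the error message above, the config likely supplied lora as a scalar where the schema expects an object. A hypothetical before/after sketch (the key name and nesting are illustrative, not copied from the example file):

```yaml
# Hypothetical illustration of the mismatch reported by check_schema.
# Before: a scalar value, which fails with "'lora' is not of type 'object'"
adapter: lora

# After: an object with an explicit type field, which satisfies the schema
adapter:
  type: lora
```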

@chongxiaoc
Contributor Author

chongxiaoc commented Jul 13, 2023

I got a Ray 2.3 Docker image; with the above config changed to llm, it failed in padding as below:

ray.exceptions.RayTaskError(IndexError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=3597, ip=10.78.204.13, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f24880c83a0>)
  File "/usr/local/lib/python3.9/dist-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.9/dist-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/backend/ray.py", line 479, in <lambda>
    lambda config: train_fn(**config),
  File "/usr/local/lib/python3.9/dist-packages/ludwig/backend/ray.py", line 206, in train_fn
    results = trainer.train(train_shard, val_shard, test_shard, return_state_dict=True, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/distributed/base.py", line 152, in wrapped
    res = fn(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/trainers/trainer.py", line 761, in train
    should_break = self._train_loop(
  File "/usr/local/lib/python3.9/dist-packages/ludwig/trainers/trainer.py", line 891, in _train_loop
    loss, all_losses = self.train_step(inputs, targets, should_step=should_step)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/trainers/trainer.py", line 263, in train_step
    model_outputs = self.dist_model((inputs, targets))
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py", line 1724, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/models/llm.py", line 288, in forward
    self.model_inputs, self.attention_masks = generate_merged_ids(
  File "/usr/local/lib/python3.9/dist-packages/ludwig/utils/llm_utils.py", line 282, in generate_merged_ids
    target_id_sample_no_padding = remove_left_padding(target_id_sample, tokenizer)[0]
  File "/usr/local/lib/python3.9/dist-packages/ludwig/utils/llm_utils.py", line 99, in remove_left_padding
    input_ids_no_bos = input_ids_no_padding[bos_idx:].unsqueeze(0)
IndexError: Dimension specified as 0 but tensor has no dimensions
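The IndexError above comes from indexing into a degenerate tensor after padding removal. A minimal pure-Python sketch of the failure mode (this is an illustrative analogue, not Ludwig's actual remove_left_padding implementation):

```python
def remove_left_padding(ids, pad_id=0):
    # Drop leading pad tokens from a token-id sequence.
    # An all-padding sample yields an empty result; indexing element [0]
    # of such an empty/zero-dimensional result is the kind of edge case
    # that triggers the IndexError in the traceback above.
    i = 0
    while i < len(ids) and ids[i] == pad_id:
        i += 1
    return ids[i:]

# A normal sample keeps its content tokens:
assert remove_left_padding([0, 0, 7, 8]) == [7, 8]

# An all-padding sample collapses to nothing; downstream code must
# guard before indexing into it:
assert remove_left_padding([0, 0]) == []
```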

@chongxiaoc
Contributor Author

Converted the change back to ecd mode by moving lora outside.
Ray 2.3 hit the error below on 1 GPU:

ray.exceptions.RayTaskError(AssertionError): ray::RayTrainWorker._RayTrainWorker__execute() (pid=27661, ip=10.78.204.13, repr=<ray.train._internal.worker_group.RayTrainWorker object at 0x7f227c0c83d0>)
  File "/usr/local/lib/python3.9/dist-packages/ray/train/_internal/worker_group.py", line 31, in __execute
    raise skipped from exception_cause(skipped)
  File "/usr/local/lib/python3.9/dist-packages/ray/train/_internal/utils.py", line 129, in discard_return_wrapper
    train_func(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/backend/ray.py", line 479, in <lambda>
    lambda config: train_fn(**config),
  File "/usr/local/lib/python3.9/dist-packages/ludwig/backend/ray.py", line 200, in train_fn
    trainer = remote_trainer_cls(
  File "/usr/local/lib/python3.9/dist-packages/ludwig/trainers/trainer.py", line 1309, in __init__
    super().__init__(**kwargs)
  File "/usr/local/lib/python3.9/dist-packages/ludwig/trainers/trainer.py", line 191, in __init__
    self.prepare()
  File "/usr/local/lib/python3.9/dist-packages/ludwig/trainers/trainer.py", line 210, in prepare
    self.dist_model, self.optimizer = self.distributed.prepare(
  File "/usr/local/lib/python3.9/dist-packages/ludwig/distributed/deepspeed.py", line 101, in prepare
    model_engine, optimizer, _, _ = deepspeed.initialize(
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/__init__.py", line 165, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py", line 308, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py", line 1173, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/engine.py", line 1463, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage3.py", line 256, in __init__
    self._create_fp16_partitions_with_defragmentation(self.trainable_param_groups)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage3.py", line 600, in _create_fp16_partitions_with_defragmentation
    device_buffer = __class__.defragment(parameter_partitions)
  File "/usr/local/lib/python3.9/dist-packages/deepspeed/runtime/zero/stage3.py", line 436, in defragment
    assert len(set(t.device for t in tensors)) == 1
AssertionError
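The AssertionError comes from DeepSpeed's ZeRO stage-3 defragment step, which requires every parameter partition to live on a single device before flattening them into one buffer. A simplified sketch of that invariant (plain strings stand in for torch.device objects; this is not DeepSpeed's code):

```python
def all_on_one_device(devices):
    # Mirrors the spirit of DeepSpeed's check:
    #   assert len(set(t.device for t in tensors)) == 1
    # i.e. mixed CPU/GPU placement of partitions trips the assertion.
    return len(set(devices)) == 1

# Uniform placement passes:
assert all_on_one_device(["cuda:0", "cuda:0", "cuda:0"])

# Mixed placement (e.g. some parameters left on CPU) fails,
# producing a bare AssertionError like the one above:
assert not all_on_one_device(["cpu", "cuda:0"])
```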

@arnavgarg1
Contributor

@chongxiaoc can you try bumping DeepSpeed to 0.9.2 and see whether this error persists? From what I remember, this error used to show up on older versions of DeepSpeed.

@chongxiaoc
Contributor Author

@arnavgarg1 Just adding a type for lora fixes it.

Training is running now:

2023-07-13 21:37:28,897 WARNING util.py:244 -- The `callbacks.on_trial_result` operation took 0.669 s, which may be a performance bottleneck.
2023-07-13 21:37:28,932 WARNING util.py:244 -- The `process_trial_result` operation took 1.328 s, which may be a performance bottleneck.
2023-07-13 21:37:28,932 WARNING util.py:244 -- Processing trial results took 1.328 s, which may be a performance bottleneck. Please consider reporting results less frequently to Ray Tune.
2023-07-13 21:37:28,932 WARNING util.py:244 -- The `process_trial_result` operation took 1.485 s, which may be a performance bottleneck.
Training:   5%|████▉   

@arnavgarg1
Contributor

Nice! So deepspeed isn't a problem then? What version of deepspeed are you on?

Thanks for working through this so patiently!

@chongxiaoc
Contributor Author

chongxiaoc commented Jul 13, 2023

> Nice! So deepspeed isn't a problem then? What version of deepspeed are you on?
>
> Thanks for working through this so patiently!

deepspeed==0.9.2. This is the version I have been using since opening this PR.

@arnavgarg1
Contributor

Perfect, thanks!

@arnavgarg1 arnavgarg1 merged commit 5c92f62 into ludwig-ai:master Jul 13, 2023
11 of 14 checks passed