
glm-4-9b-chat-hf fine-tuning error #615

Open
1 of 2 tasks
ajsterling123 opened this issue Oct 28, 2024 · 1 comment
Comments

@ajsterling123

System Info

torch 2.4.0
torchvision 0.19.0
tqdm 4.66.6
transformers 4.46.0

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts and tasks

Reproduction

(glm4_finetune) ai@ai-MS-7D27:~/PycharmProjects/pythonProject/GLM-4-main/finetune_demo$ OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc_per_node=2 finetune.py data /home/ai/.cache/modelscope/hub/ZhipuAI/glm-4-9b-chat-hf configs/lora.yaml
[rank0]: ╭────────────────────────────────────────────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────────────────────────────────────────────╮
[rank0]: │ /home/ai/PycharmProjects/pythonProject/GLM-4-main/finetune_demo/finetune.py:406 in main │
[rank0]: │ │
[rank0]: │ 403 │ │ ), │
[rank0]: │ 404 ): │
[rank0]: │ 405 │ ft_config = FinetuningConfig.from_file(config_file) │
[rank0]: │ ❱ 406 │ tokenizer, model = load_tokenizer_and_model(model_dir, peft_config=ft_config.peft_co │
[rank0]: │ 407 │ data_manager = DataManager(data_dir, ft_config.data_config) │
[rank0]: │ 408 │ │
[rank0]: │ 409 │ train_dataset = data_manager.get_dataset( │
[rank0]: │ │
[rank0]: │ /home/ai/PycharmProjects/pythonProject/GLM-4-main/finetune_demo/finetune.py:353 in load_tokenizer_and_model │
[rank0]: │ │
[rank0]: │ 350 ): │
[rank0]: │ 351 │ tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True) │
[rank0]: │ 352 │ if peft_config is not None: │
[rank0]: │ ❱ 353 │ │ model = AutoModelForCausalLM.from_pretrained( │
[rank0]: │ 354 │ │ │ model_dir, │
[rank0]: │ 355 │ │ │ trust_remote_code=True, │
[rank0]: │ 356 │ │ │ empty_init=False, │
[rank0]: │ │
[rank0]: │ /home/ai/anaconda3/envs/glm4_finetune/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py:564 in from_pretrained │
[rank0]: │ │
[rank0]: │ 561 │ │ │ ) │
[rank0]: │ 562 │ │ elif type(config) in cls._model_mapping.keys(): │
[rank0]: │ 563 │ │ │ model_class = _get_model_class(config, cls._model_mapping) │
[rank0]: │ ❱ 564 │ │ │ return model_class.from_pretrained( │
[rank0]: │ 565 │ │ │ │ pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, │
[rank0]: │ 566 │ │ │ ) │
[rank0]: │ 567 │ │ raise ValueError( │
[rank0]: │ │
[rank0]: │ /home/ai/anaconda3/envs/glm4_finetune/lib/python3.11/site-packages/transformers/modeling_utils.py:4096 in from_pretrained │
[rank0]: │ │
[rank0]: │ 4093 │ │ │
[rank0]: │ 4094 │ │ with ContextManagers(init_contexts): │
[rank0]: │ 4095 │ │ │ # Let's make sure we don't run the init function of buffer modules │
[rank0]: │ ❱ 4096 │ │ │ model = cls(config, *model_args, **model_kwargs) │
[rank0]: │ 4097 │ │ │
[rank0]: │ 4098 │ │ # make sure we use the model's config since the init call might have copied │
[rank0]: │ 4099 │ │ config = model.config │
[rank0]: ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
[rank0]: TypeError: GlmForCausalLM.__init__() got an unexpected keyword argument 'empty_init'
[rank0]:[W1028 21:09:06.090629288 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W1028 21:09:06.562000 139893055230464 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 22789 closing signal SIGTERM
E1028 21:09:06.675000 139893055230464 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 22788) of binary: /home/ai/anaconda3/envs/glm4_finetune/bin/python
Traceback (most recent call last):
  File "/home/ai/anaconda3/envs/glm4_finetune/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ai/anaconda3/envs/glm4_finetune/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/ai/anaconda3/envs/glm4_finetune/lib/python3.11/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/ai/anaconda3/envs/glm4_finetune/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/ai/anaconda3/envs/glm4_finetune/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ai/anaconda3/envs/glm4_finetune/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

finetune.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-10-28_21:09:06
host : ai-MS-7D27
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 22788)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Expected behavior

Is this caused by the transformers version?

@zRzRzRzRzRzRzR
Member

The fine-tuning code has not been tested against transformers 4.46 yet; we will follow up as soon as we can. The main repository will be gradually upgraded to 4.46, starting with inference.
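Editor's note on the failure mode, as a minimal self-contained sketch: the remote-code ChatGLM implementation accepted an `empty_init` keyword, while the native `GlmForCausalLM` added to transformers does not, so forwarding it through `from_pretrained` raises the `TypeError` above. The stand-in class and the `filter_init_kwargs` helper below are hypothetical (not part of transformers or the GLM-4 repo); they only illustrate how unsupported constructor kwargs could be dropped before loading.

```python
import inspect

class NativeGlmForCausalLM:
    """Stand-in mimicking the native GlmForCausalLM signature, which
    takes only a config and rejects the ChatGLM-specific empty_init."""
    def __init__(self, config):
        self.config = config

def filter_init_kwargs(model_cls, kwargs):
    """Keep only the kwargs that model_cls.__init__ actually accepts.

    If __init__ takes **kwargs, everything passes through unchanged;
    otherwise unsupported keys (e.g. empty_init) are silently dropped.
    """
    params = inspect.signature(model_cls.__init__).parameters
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k in params}

# empty_init is not in the native signature, so it is filtered out:
print(filter_init_kwargs(NativeGlmForCausalLM, {"empty_init": False}))  # → {}
```

Equivalently, simply deleting `empty_init=False` from the `AutoModelForCausalLM.from_pretrained(...)` call in `finetune.py` (line 356 in the traceback) should get past this particular error when loading the `-hf` checkpoint, though the rest of the script remains untested on 4.46 per the comment above.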

@zRzRzRzRzRzRzR zRzRzRzRzRzRzR self-assigned this Oct 28, 2024