Fine-tuning error in conda environment without docker image #1538

Open
LalchandPandia opened this issue Sep 21, 2024 · 7 comments
Labels
bug Something isn't working

Comments

@LalchandPandia

Environment

python 3.11.9
cuda 11.8
torch 2.4.0+cu118

PyTorch information

PyTorch version: 2.4.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.3
Libc version: glibc-2.31

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-192-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100 80GB PCIe
Nvidia driver version: 550.54.14
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn.so.8.9.4
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.4
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.4
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.4
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.9.4
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.9.4
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.9.4
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn.so.8.9.4
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.4
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.4
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.4
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.9.4
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.9.4
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.9.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 1
Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] onnx==1.16.2
[pip3] onnxruntime==1.19.0
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.4.0+cu118
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==2.4.0+cu118
[pip3] torchmetrics==1.4.0.post0
[pip3] torchvision==0.19.0+cu118
[pip3] triton==3.0.0
[conda] numpy 1.26.3 pypi_0 pypi
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch 2.4.0+cu118 pypi_0 pypi
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torchaudio 2.4.0+cu118 pypi_0 pypi
[conda] torchmetrics 1.4.0.post0 pypi_0 pypi
[conda] torchvision 0.19.0+cu118 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi

Composer information

Composer Version: 0.24.1
Composer Commit Hash: None
CPU Model: AMD EPYC 7542 32-Core Processor
CPU Count: 32
Number of Nodes: 1
GPU Model: NVIDIA A100 80GB PCIe
GPUs per Node: 1
GPU Count: 1
CUDA Device Count: 1


To reproduce

Steps to reproduce the behavior:

1. pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu118 --force-reinstall
2. pip install -e .
3. cd scripts/train
4. composer train.py finetune_example/gpt2-arc-easy--cpu.yaml
When run on CPU, this gives the following error: omegaconf.errors.InterpolationKeyError: Interpolation key 'global_seed' not found
5. composer train.py finetune_example/mpt-7b-arc-easy--gpu.yaml
When run on GPU, this gives the following error: ValueError: Unused parameters ['global_seed'] found in cfg. Please check your yaml to ensure these parameters are necessary. Please place any variables under the variables key.

Expected behavior

Fine-tuning should run without these errors.

Additional context

@LalchandPandia LalchandPandia added the bug Something isn't working label Sep 21, 2024
@LalchandPandia
Author

This is resolved. In the YAML file, change ${global_seed} to ${variables.global_seed}.
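
For anyone applying the same rename to several configs, here is a minimal sketch of it as a text substitution. The helper name `patch_global_seed` and the sample YAML are illustrative assumptions, not part of LLM Foundry. The sketch also assumes `global_seed` is already defined under the `variables` key, as the GPU error message asks.

```python
import re

# Matches a bare ${global_seed} reference; ${variables.global_seed} is left alone.
BARE_SEED = re.compile(r"\$\{\s*global_seed\s*\}")


def patch_global_seed(yaml_text: str) -> str:
    """Point bare global_seed interpolations at the variables section."""
    return BARE_SEED.sub("${variables.global_seed}", yaml_text)


# Illustrative config fragment (hypothetical values, not the shipped YAML).
sample = """\
variables:
  global_seed: 17
seed: ${global_seed}
"""

patched = patch_global_seed(sample)
print(patched)
```

Running the helper a second time leaves the text unchanged, so it should be safe to apply to files that were already fixed by hand.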

@dakinggg
Collaborator

Hi, thanks for the issue! Happy to accept a PR fixing this if you like, otherwise we will update it!

@dakinggg dakinggg reopened this Sep 22, 2024
@LalchandPandia
Author

Hi,
Either way is fine. Would it be possible to update the README, along with the fix, to state which flash-attention version is required to run the fine-tuning GPU example (and whether it requires alibi)?

@LalchandPandia
Author

A follow-up on the task. I can see that the input id is of the form: id(Question): - id(Options) -\n\n... \n id(Answer): . And label contains -100 for all the entries followed by the id of tokens in true answer, and then all other entries are filled with id of . My question is how do we compare the performance of the model? Do we create similar input for each of the other options where Answer is followed by id of the other options and finally compare the log-likelihood of each of the these with the log-likelihood of the input with true answer.
For example:
Q1 Answer: True_Answer --> L1
Q1 Answer: Option1 --> L2
Q1 Answer: Option2 --> L3
We then compare L1, L2, and L3 to see how the model does on evaluation, where L1, L2, and L3 are negative log-likelihoods.
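
The label layout described at the start of this comment can be sketched roughly as follows. This is only an illustration of the masking idea, not LLM Foundry's actual collator. The function name `build_labels`, the `fill_id` parameter (standing in for whichever token id fills the tail), and the toy ids are all assumptions.

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips targets with this value by default


def build_labels(prompt_ids, answer_ids, total_len, fill_id):
    """Mask the prompt, keep the answer ids, then fill the remainder with fill_id."""
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    if len(labels) > total_len:
        raise ValueError("prompt + answer is longer than total_len")
    return labels + [fill_id] * (total_len - len(labels))


# Toy ids, made up for illustration.
labels = build_labels(prompt_ids=[11, 12, 13], answer_ids=[21, 22], total_len=8, fill_id=0)
print(labels)  # [-100, -100, -100, 21, 22, 0, 0, 0]
```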

@dakinggg
Collaborator

Hi, yes, the multiple choice ICL tasks in LLM Foundry do evaluation the way you described.
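
As a rough sketch of that comparison: given a negative log-likelihood per candidate answer, the prediction is the candidate with the lowest value, and accuracy is the fraction of questions where that candidate is the true answer. The function names and NLL values below are made up for illustration. Whether LLM Foundry applies any extra normalization is not stated in this thread, so this compares raw values only.

```python
def pick_option(nll_by_option):
    """Return the option with the lowest negative log-likelihood."""
    return min(nll_by_option, key=nll_by_option.get)


def accuracy(examples):
    """examples: list of (nll_by_option, true_option) pairs."""
    correct = sum(pick_option(nlls) == truth for nlls, truth in examples)
    return correct / len(examples)


# Made-up NLL values for two questions with three candidates each.
examples = [
    ({"True_Answer": 1.2, "Option1": 2.5, "Option2": 3.1}, "True_Answer"),
    ({"True_Answer": 2.9, "Option1": 1.7, "Option2": 3.3}, "True_Answer"),
]
print(accuracy(examples))  # 0.5
```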

@LalchandPandia
Author

Thanks for confirming that. Just a suggestion regarding the README in scripts/eval: it would be good to have a section describing how data processing happens under the hood when executing composer eval.py. The description of composer train.py in scripts/train/finetune_example/README.md is quite helpful.

@dakinggg
Collaborator

Wondering, does this readme help? Or still missing the information you are looking for?


2 participants