Fine-tuning error in conda environment without docker image #1538

LalchandPandia · 2024-09-21T05:50:39Z

Environment

python 3.11.9
cuda 11.8
torch 2.4.0+cu118

PyTorch information

PyTorch version: 2.4.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.3
Libc version: glibc-2.31

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-192-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100 80GB PCIe
Nvidia driver version: 550.54.14
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn.so.8.9.4
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.4
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.4
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.4
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.9.4
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.9.4
/usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.9.4
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn.so.8.9.4
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.4
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.4
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.4
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.9.4
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.9.4
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.9.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 43 bits physical, 48 bits virtual
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 1
NUMA node(s): 1
Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] onnx==1.16.2
[pip3] onnxruntime==1.19.0
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.4.0+cu118
[pip3] torch-optimizer==0.3.0
[pip3] torchaudio==2.4.0+cu118
[pip3] torchmetrics==1.4.0.post0
[pip3] torchvision==0.19.0+cu118
[pip3] triton==3.0.0
[conda] numpy 1.26.3 pypi_0 pypi
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch 2.4.0+cu118 pypi_0 pypi
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torchaudio 2.4.0+cu118 pypi_0 pypi
[conda] torchmetrics 1.4.0.post0 pypi_0 pypi
[conda] torchvision 0.19.0+cu118 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi

Composer information

Composer Version: 0.24.1
Composer Commit Hash: None
CPU Model: AMD EPYC 7542 32-Core Processor
CPU Count: 32
Number of Nodes: 1
GPU Model: NVIDIA A100 80GB PCIe
GPUs per Node: 1
GPU Count: 1
CUDA Device Count: 1

To reproduce

Steps to reproduce the behavior:

1. pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu118 --force-reinstall
2. pip install -e .
3. cd scripts/train
4. composer train.py finetune_example/gpt2-arc-easy--cpu.yaml
   When run on CPU, it gives the following error: omegaconf.errors.InterpolationKeyError: Interpolation key 'global_seed' not found
5. composer train.py finetune_example/mpt-7b-arc-easy--gpu.yaml
   When run on GPU, it gives the following error: ValueError: Unused parameters ['global_seed'] found in cfg. Please check your yaml to ensure these parameters are necessary. Please place any variables under the variables key.

Expected behavior

The fine-tuning should work

Additional context

LalchandPandia · 2024-09-22T13:48:35Z

It is resolved. In the yaml file change ${global_seed} to ${variables.global_seed}
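
For anyone applying the same change to several YAML files, here is a minimal sketch of the substitution in Python. The helper name `fix_seed_interpolation` and the sample YAML text are hypothetical and only illustrate the edit; this is not part of LLM Foundry. It assumes `global_seed` is already defined under the `variables` key, as the GPU error message asks.

```python
import re

# Matches the bare ${global_seed} interpolation only. It does not match
# ${variables.global_seed}, so running the fix twice changes nothing.
_BARE_SEED = re.compile(r"\$\{\s*global_seed\s*\}")


def fix_seed_interpolation(yaml_text: str) -> str:
    """Rewrite ${global_seed} references to ${variables.global_seed}."""
    return _BARE_SEED.sub("${variables.global_seed}", yaml_text)


# Hypothetical YAML fragment, not copied from the real example configs.
example = """\
variables:
  global_seed: 17
seed: ${global_seed}
"""

fixed = fix_seed_interpolation(example)
print(fixed)
```

The substitution works on the raw text rather than a parsed config, so comments and key order in the YAML file are left untouched.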

dakinggg · 2024-09-22T23:23:24Z

Hi, thanks for the issue! Happy to accept a PR fixing this if you like, otherwise we will update it!

LalchandPandia · 2024-09-23T00:10:15Z

Hi,
Either way is fine. Along with the fix, would it be possible to update the README to state which flash-attention version is required to run the fine-tuning GPU example (does it require alibi)?

LalchandPandia · 2024-09-23T13:56:50Z

A follow-up on the task. I can see that the input ids have the form: id(Question): - id(Options) -\n\n... \n id(Answer): . The labels contain -100 for all of these entries, followed by the ids of the tokens in the true answer, and then all other entries are filled with the id of . My question is: how do we compare the performance of the model? Do we create a similar input for each of the other options, where Answer is followed by the ids of that option, and then compare the log-likelihood of each of these with the log-likelihood of the input with the true answer?
For example:
Q1 Answer: True_Answer --> L1
Q1 Answer: Option1 --> L2
Q1 Answer: Option2 --> L3
We then compare L1, L2 and L3 to see how the model does on evaluation. Here L1, L2 and L3 are negative log-likelihoods.

dakinggg · 2024-09-24T18:02:57Z

Hi, yes, the multiple choice ICL tasks in LLM Foundry do evaluation the way you described.
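
The comparison described above can be sketched roughly as follows. This is an illustration only, not LLM Foundry's implementation: the function names and the per-token log-probabilities are made up, and trailing padding positions are left out. It assumes that positions labelled -100 are excluded from the score and that the option with the lowest summed negative log-likelihood is taken as the prediction; whether the real evaluation also normalizes by answer length is not covered here.

```python
IGNORE_INDEX = -100  # label value marking positions excluded from the score


def answer_nll(token_logprobs, labels):
    """Sum the negative log-likelihood over the unmasked (answer) positions."""
    return -sum(
        lp for lp, label in zip(token_logprobs, labels) if label != IGNORE_INDEX
    )


def pick_answer(candidates):
    """Return (index of the lowest-NLL candidate, list of all NLL scores)."""
    scores = [answer_nll(logprobs, labels) for logprobs, labels in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy data: two prompt tokens (masked with -100) followed by two answer tokens.
# Each entry is (per-token log-probabilities, labels); all values are invented.
candidates = [
    ([-0.5, -0.7, -0.2, -0.1], [-100, -100, 11, 12]),  # Q1 Answer: True_Answer -> L1
    ([-0.5, -0.7, -1.5, -0.9], [-100, -100, 21, 22]),  # Q1 Answer: Option1 -> L2
    ([-0.5, -0.7, -2.0, -1.2], [-100, -100, 31, 32]),  # Q1 Answer: Option2 -> L3
]

best, scores = pick_answer(candidates)
print(best, scores)
```

The question counts as answered correctly when `best` is the index of the true answer. Accuracy over a dataset is then the fraction of questions where that holds.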

LalchandPandia · 2024-09-24T19:21:34Z

Thanks for confirming that. Just a suggestion regarding the README in scripts/eval: it would be good to have a section describing how the data processing happens under the hood when executing composer eval.py. The description of composer train.py in scripts/train/finetune_example/README.md is quite helpful.

dakinggg · 2024-09-30T20:46:32Z

Wondering, does this readme help? Or still missing the information you are looking for?

LalchandPandia added the bug label Sep 21, 2024

LalchandPandia closed this as completed Sep 22, 2024

dakinggg reopened this Sep 22, 2024
