Unable to train the llama-7b in a machine with two Tesla T4 GPU's using DeepSpeed integration #3784

Closed
Ragul-Ramdass opened this issue Nov 17, 2023 · 2 comments

Comments

@Ragul-Ramdass

Hi,
I'm trying to run distributed training of llama-7b on a VM with two Tesla T4 GPUs using the native DeepSpeed integration, and I'm hitting the following error:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
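
For anyone trying to narrow this down: the error means one operation received tensors living on different GPUs. A minimal sketch for listing which parameters sit on which device is below. The helper name `group_by_device` is hypothetical, it is not part of Ludwig or DeepSpeed, and it only assumes each item exposes a `.device` attribute, as the parameters yielded by `torch.nn.Module.named_parameters()` do. The stand-in objects exist only so the snippet runs without a GPU.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_device(named_tensors):
    """Group tensor names by the string form of their .device attribute.

    `named_tensors` is any iterable of (name, tensor-like) pairs, for example
    the result of `model.named_parameters()` on a torch.nn.Module.
    """
    groups = defaultdict(list)
    for name, tensor in named_tensors:
        groups[str(tensor.device)].append(name)
    return dict(groups)


# Stand-in objects (assumption: real parameters would come from the model).
fake_params = [
    ("embed_tokens.weight", SimpleNamespace(device="cuda:0")),
    ("lm_head.weight", SimpleNamespace(device="cuda:1")),
]
placement = group_by_device(fake_params)

# More than one key means the model is spread across devices.
if len(placement) > 1:
    for device, names in placement.items():
        print(f"{device}: {len(names)} parameter(s), e.g. {names[0]}")
```

If a single training worker reports more than one device here, the model was probably sharded across both GPUs before DeepSpeed wrapped it, which would be one possible source of the mismatch.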

OS: Ubuntu 20.04
Python version: 3.10.13
model.yaml:

base_model: /root/CodeLlama-7b-Python-hf

quantization:
  bits: 4

adapter:
  type: lora

prompt:
  template: |
    ### Instruction:
    {Instruction}

    ### Context:
    {Context}

    ### Input:
    {Input}

    ### Response:

input_features:
  - name: prompt
    type: text
    preprocessing:
      max_sequence_length: 2048

output_features:
  - name: Response
    type: text
    preprocessing:
      max_sequence_length: 2048


trainer:
  type: finetune
  learning_rate: 0.0001
  batch_size: 1
  max_batch_size: 1
  gradient_accumulation_steps: 1
  enable_gradient_checkpointing: true
  epochs: 3
  learning_rate_scheduler:
    warmup_fraction: 0.01

preprocessing:
  sample_ratio: 1.0

backend:
  type: ray
  trainer:
    use_gpu: true
    strategy: deepspeed

Environment:

absl-py                       2.0.0
accelerate                    0.24.1
aiohttp                       3.8.6
aiosignal                     1.3.1
asttokens                     2.4.1
async-timeout                 4.0.3
attrs                         23.1.0
backports.functools-lru-cache 1.6.5
bitsandbytes                  0.40.2
bleach                        6.1.0
blessed                       1.20.0
blis                          0.7.11
cachetools                    5.3.2
catalogue                     2.0.10
certifi                       2023.7.22
charset-normalizer            3.3.2
click                         8.1.7
cloudpathlib                  0.16.0
comm                          0.1.4
commonmark                    0.9.1
confection                    0.1.3
cymem                         2.0.8
Cython                        3.0.5
dataclasses-json              0.6.2
datasets                      2.15.0
debugpy                       1.6.7
decorator                     5.1.1
deepspeed                     0.12.3
dill                          0.3.7
distlib                       0.3.7
entrypoints                   0.4
et-xmlfile                    1.1.0
exceptiongroup                1.1.3
executing                     2.0.1
filelock                      3.13.1
frozenlist                    1.4.0
fsspec                        2023.9.2
getdaft                       0.1.20
google-auth                   2.23.4
google-auth-oauthlib          1.1.0
gpustat                       1.1.1
grpcio                        1.59.2
h5py                          3.10.0
hjson                         3.1.0
html5lib                      1.1
huggingface-hub               0.19.4
idna                          3.4
ipykernel                     6.26.0
ipython                       8.17.2
jedi                          0.19.1
Jinja2                        3.1.2
joblib                        1.3.2
jsonschema                    4.6.2
jupyter-client                7.3.4
jupyter_core                  5.5.0
kaggle                        1.5.16
langcodes                     3.3.0
lightning-utilities           0.9.0
loguru                        0.7.2
ludwig                        0.9.dev0
lxml                          4.9.3
Markdown                      3.5.1
MarkupSafe                    2.1.3
marshmallow                   3.20.1
marshmallow-dataclass         8.5.4
marshmallow-jsonschema        0.13.0
matplotlib-inline             0.1.6
mpi4py                        3.1.4
mpmath                        1.3.0
msgpack                       1.0.7
multidict                     6.0.4
multiprocess                  0.70.15
murmurhash                    1.0.10
mypy-extensions               1.0.0
nest-asyncio                  1.5.8
networkx                      3.2.1
ninja                         1.11.1.1
nltk                          3.8.1
numpy                         1.26.2
nvidia-cublas-cu12            12.1.3.1
nvidia-cuda-cupti-cu12        12.1.105
nvidia-cuda-nvrtc-cu12        12.1.105
nvidia-cuda-runtime-cu12      12.1.105
nvidia-cudnn-cu12             8.9.2.26
nvidia-cufft-cu12             11.0.2.54
nvidia-curand-cu12            10.3.2.106
nvidia-cusolver-cu12          11.4.5.107
nvidia-cusparse-cu12          12.1.0.106
nvidia-ml-py                  12.535.133
nvidia-nccl-cu12              2.18.1
nvidia-nvjitlink-cu12         12.3.101
nvidia-nvtx-cu12              12.1.105
oauthlib                      3.2.2
openpyxl                      3.1.2
packaging                     23.2
pandas                        2.1.3
parso                         0.8.3
peft                          0.6.2
pexpect                       4.8.0
pickleshare                   0.7.5
Pillow                        10.1.0
pip                           23.3
platformdirs                  3.11.0
preshed                       3.0.9
prompt-toolkit                3.0.41
protobuf                      3.20.3
psutil                        5.9.0
ptyprocess                    0.7.0
pure-eval                     0.2.2
py                            1.11.0
py-cpuinfo                    9.0.0
pyarrow                       14.0.1
pyarrow-hotfix                0.5
pyasn1                        0.5.0
pyasn1-modules                0.3.0
pydantic                      1.10.13
Pygments                      2.16.1
pynvml                        11.5.0
pyrsistent                    0.20.0
python-dateutil               2.8.2
python-slugify                8.0.1
pytz                          2023.3.post1
pyxlsb                        1.0.10
PyYAML                        6.0
pyzmq                         25.1.0
ray                           2.3.1
regex                         2023.10.3
requests                      2.31.0
requests-oauthlib             1.3.1
retry                         0.9.2
rich                          12.4.4
rsa                           4.9
sacremoses                    0.1.1
safetensors                   0.4.0
scikit-learn                  1.3.2
scipy                         1.11.3
sentencepiece                 0.1.99
setuptools                    68.0.0
six                           1.16.0
smart-open                    6.4.0
spacy                         3.7.2
spacy-legacy                  3.0.12
spacy-loggers                 1.0.5
srsly                         2.4.8
stack-data                    0.6.2
sympy                         1.12
tabulate                      0.9.0
tensorboard                   2.15.1
tensorboard-data-server       0.7.2
text-unidecode                1.3
thinc                         8.2.1
threadpoolctl                 3.2.0
tokenizers                    0.15.0
torch                         2.1.1
torchaudio                    2.1.1
torchdata                     0.7.1
torchinfo                     1.8.0
torchmetrics                  1.2.0
torchtext                     0.16.1
torchvision                   0.16.1
tornado                       6.1
tqdm                          4.66.1
traitlets                     5.13.0
transformers                  4.35.2
triton                        2.1.0
typer                         0.9.0
typing_extensions             4.8.0
typing-inspect                0.9.0
tzdata                        2023.3
urllib3                       2.1.0
virtualenv                    20.24.6
wasabi                        1.1.2
wcwidth                       0.2.10
weasel                        0.3.4
webencodings                  0.5.1
Werkzeug                      3.0.1
wheel                         0.41.2
xlrd                          2.0.1
XlsxWriter                    3.1.9
xlwt                          1.3.0
xxhash                        3.4.1
yarl                          1.9.2
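
Most of the list above is unrelated to the failure. A short sketch for reporting only the packages that plausibly matter is below. It uses the standard-library `importlib.metadata`; the choice of package names in `RELEVANT` is an assumption about what is worth checking, not something the maintainers asked for.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed shortlist of packages involved in distributed LLM fine-tuning.
RELEVANT = (
    "ludwig",
    "ray",
    "deepspeed",
    "torch",
    "transformers",
    "peft",
    "bitsandbytes",
    "accelerate",
)


def collect_versions(packages=RELEVANT):
    """Return {package: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in collect_versions().items():
        print(f"{name:<14}{ver or 'not installed'}")
```

Pasting this shorter output into a report makes version mismatches easier to spot than a full `pip list`.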

Can you guide me in solving this?
Thanks in advance!!

@alexsherstinsky
Collaborator

Hi @Ragul-Ramdass -- thank you for reporting this issue and the one in #3783 -- please give us a few business days to look into it and get back to you (I left a similar message in the above mentioned issue as well). Thank you.

@alexsherstinsky self-assigned this Nov 17, 2023
@alexsherstinsky added the bug (Something isn't working) label Nov 17, 2023
@Ragul-Ramdass
Author

Hi @alexsherstinsky,
Thanks for looking into it. Please let me know if you need any other information. My aim is to achieve distributed training using DeepSpeed in Ludwig, so if you can suggest a workaround, that would also be great. Thanks

@mhabedank closed this as not planned Oct 19, 2024