Missing folder exception with Google Cloud Storage checkpointing #18044

Closed
bilelomrani1 opened this issue Jul 10, 2023 · 4 comments · Fixed by #18088
Labels: 3rd party (Related to a 3rd-party), bug (Something isn't working), help wanted (Open to be worked on), logger: csv, ver: 2.0.x

Comments

@bilelomrani1
Contributor

bilelomrani1 commented Jul 10, 2023

Bug description

Training fails when a Google Cloud Storage bucket is set as the Trainer's default_root_dir. I am authenticated via gcloud auth application-default login, and the correct GCP project is set with gcloud config set project <my-project>. I can read from and write to the same bucket from the console and from other applications without problems.

What version are you seeing the problem on?

v2.0

How to reproduce the bug

import pytorch_lightning as pl
from pytorch_lightning.demos.boring_classes import BoringDataModule, BoringModel

if __name__ == "__main__":
    trainer = pl.Trainer(
        max_epochs=1,
        default_root_dir="gs://llm-bucket/checkpoints/debug/",
        limit_train_batches=8,
        limit_val_batches=8,
    )
    trainer.fit(BoringModel(), datamodule=BoringDataModule())

Error messages and logs

[2023-07-10 15:30:07,988] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:67: UserWarning: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `pytorch_lightning` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
  warning_cache.warn(
Missing logger folder: gs://llm-bucket/checkpoints/debug/lightning_logs

  | Name  | Type   | Params
---------------------------------
0 | layer | Linear | 66    
---------------------------------
66        Trainable params
0         Non-trainable params
66        Total params
0.000     Total estimated model params size (MB)
Traceback (most recent call last):
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 570, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 958, in _run
    _log_hyperparams(self)
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/pytorch_lightning/loggers/utilities.py", line 94, in _log_hyperparams
    logger.save()
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py", line 32, in wrapped_fn
    return fn(*args, **kwargs)
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/lightning_fabric/loggers/csv_logs.py", line 141, in save
    self.experiment.save()
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/pytorch_lightning/loggers/csv_logs.py", line 61, in save
    save_hparams_to_yaml(hparams_file, self.hparams)
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 292, in save_hparams_to_yaml
    raise RuntimeError(f"Missing folder: {os.path.dirname(config_yaml)}.")
RuntimeError: Missing folder: gs://llm-bucket/checkpoints/debug/lightning_logs/version_0.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/mre_pt_gcp.py", line 11, in <module>
    trainer.fit(BoringModel(), datamodule=BoringDataModule())
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 531, in fit
    call._call_and_handle_interrupt(
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 65, in _call_and_handle_interrupt
    logger.finalize("failed")
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py", line 32, in wrapped_fn
    return fn(*args, **kwargs)
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/lightning_fabric/loggers/csv_logs.py", line 149, in finalize
    self.save()
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/lightning_utilities/core/rank_zero.py", line 32, in wrapped_fn
    return fn(*args, **kwargs)
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/lightning_fabric/loggers/csv_logs.py", line 141, in save
    self.experiment.save()
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/pytorch_lightning/loggers/csv_logs.py", line 61, in save
    save_hparams_to_yaml(hparams_file, self.hparams)
  File "/Users/bilelomrani/Documents/ILLUIN.nosync/instructions-finetuning/.venv/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 292, in save_hparams_to_yaml
    raise RuntimeError(f"Missing folder: {os.path.dirname(config_yaml)}.")
RuntimeError: Missing folder: gs://llm-bucket/checkpoints/debug/lightning_logs/version_0.

Environment

Current environment
  • CUDA:
    - GPU: None
    - available: False
    - version: None
  • Lightning:
    - lightning-utilities: 0.9.0
    - pytorch-lightning: 2.0.4
    - torch: 2.0.1
    - torchmetrics: 0.11.4
  • Packages:
    - accelerate: 0.20.3
    - aiohttp: 3.8.4
    - aiosignal: 1.3.1
    - appdirs: 1.4.4
    - async-timeout: 4.0.2
    - attrs: 23.1.0
    - bitsandbytes: 0.39.1
    - black: 23.3.0
    - cachetools: 5.3.1
    - certifi: 2023.5.7
    - chardet: 5.1.0
    - charset-normalizer: 3.1.0
    - click: 8.1.3
    - configue: 4.2.0
    - coolname: 2.2.0
    - coverage: 7.2.7
    - datasets: 2.13.1
    - decorator: 5.1.1
    - deepspeed: 0.9.5
    - deptry: 0.12.0
    - dill: 0.3.6
    - docker: 6.1.3
    - docker-pycreds: 0.4.0
    - einops: 0.6.1
    - exceptiongroup: 1.1.2
    - filelock: 3.12.2
    - fire: 0.5.0
    - frozenlist: 1.3.3
    - fsspec: 2023.6.0
    - gcsfs: 2023.6.0
    - gitdb: 4.0.10
    - gitpython: 3.1.31
    - google-api-core: 2.11.1
    - google-auth: 2.21.0
    - google-auth-oauthlib: 1.0.0
    - google-cloud-core: 2.3.2
    - google-cloud-storage: 2.10.0
    - google-crc32c: 1.5.0
    - google-resumable-media: 2.5.0
    - googleapis-common-protos: 1.59.1
    - greenlet: 2.0.2
    - hjson: 3.1.0
    - huggingface-hub: 0.15.1
    - idna: 3.4
    - iniconfig: 2.0.0
    - instruction-finetuning: 0.1.dev89+g6a38a51.d20230710
    - instructions-finetuning: 0.1.dev75+gee76c64.d20230630
    - jinja2: 3.1.2
    - jiwer: 3.0.2
    - joblib: 1.3.1
    - lightning-utilities: 0.9.0
    - markupsafe: 2.1.3
    - mpmath: 1.3.0
    - multidict: 6.0.4
    - multiprocess: 0.70.14
    - mypy: 1.4.1
    - mypy-extensions: 1.0.0
    - networkx: 3.1
    - ninja: 1.11.1
    - numpy: 1.25.0
    - oauthlib: 3.2.2
    - packaging: 23.1
    - pandas: 2.0.3
    - pandasql: 0.7.3
    - pathspec: 0.11.1
    - pathtools: 0.1.2
    - peft: 0.4.0.dev0
    - platformdirs: 3.8.1
    - pluggy: 1.2.0
    - protobuf: 3.20.3
    - psutil: 5.9.5
    - py-cpuinfo: 9.0.0
    - pyarrow: 12.0.1
    - pyasn1: 0.5.0
    - pyasn1-modules: 0.3.0
    - pydantic: 1.10.11
    - pytest: 7.4.0
    - pytest-mock: 3.11.1
    - python-dateutil: 2.8.2
    - pytorch-lightning: 2.0.4
    - pytz: 2023.3
    - pyyaml: 5.4.1
    - rapidfuzz: 2.13.7
    - regex: 2023.6.3
    - requests: 2.31.0
    - requests-oauthlib: 1.3.1
    - rsa: 4.9
    - ruff: 0.0.277
    - safetensors: 0.3.1
    - scikit-learn: 1.3.0
    - scipy: 1.11.1
    - sentencepiece: 0.1.99
    - sentry-sdk: 1.26.0
    - setproctitle: 1.3.2
    - setuptools: 68.0.0
    - six: 1.16.0
    - smmap: 5.0.0
    - sqlalchemy: 2.0.17
    - sympy: 1.12
    - termcolor: 2.3.0
    - threadpoolctl: 3.1.0
    - tiktoken: 0.4.0
    - tokenizers: 0.13.3
    - tomli: 2.0.1
    - torch: 2.0.1
    - torchmetrics: 0.11.4
    - tqdm: 4.65.0
    - transformers: 4.30.2
    - types-google-cloud-ndb: 2.1.0.7
    - types-tqdm: 4.65.0.1
    - typing-extensions: 4.6.3
    - tzdata: 2023.3
    - urllib3: 1.26.16
    - wandb: 0.15.4
    - websocket-client: 1.6.1
    - xxhash: 3.2.0
    - yarl: 1.9.2
  • System:
    - OS: Darwin
    - architecture: 64bit
    - processor: i386
    - python: 3.9.17
    - release: 22.3.0
    - version: Darwin Kernel Version 22.3.0: Mon Jan 30 20:42:11 PST 2023; root:xnu-8792.81.3~2/RELEASE_X86_64

More info

No response

cc @Borda

@bilelomrani1 bilelomrani1 added bug Something isn't working needs triage Waiting to be triaged by maintainers labels Jul 10, 2023
@bilelomrani1
Contributor Author

I managed to trace it back to this issue in gcsfs. Essentially, the first call to fs.isdir returns False even when the 'folder' exists, while the second call properly returns True. I also noticed that a similar behavior was observed and fixed in Lightning back in 2021 in #7889.
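
To make the reported symptom concrete, here is a minimal sketch that reproduces it without touching GCS. `ColdCacheStore` is a hypothetical stand-in written for this illustration, not gcsfs, and `isdir_with_retry` is an assumed helper name, not a Lightning or fsspec API. It only shows why a single isdir check followed by a raise can fail on such a filesystem.

```python
class ColdCacheStore:
    """Hypothetical stand-in that mimics the behavior described above; not gcsfs."""

    def __init__(self, keys):
        self._keys = set(keys)
        self._warm = set()  # prefixes that have been queried at least once

    def isdir(self, path):
        prefix = path.rstrip("/")
        if prefix not in self._warm:
            # First query for this prefix: report False even if matching keys exist.
            self._warm.add(prefix)
            return False
        return any(key.startswith(prefix + "/") for key in self._keys)


def isdir_with_retry(fs, path, attempts=2):
    """Return True if any of `attempts` consecutive isdir calls returns True."""
    return any(fs.isdir(path) for _ in range(attempts))


folder = "llm-bucket/checkpoints/debug/lightning_logs/version_0"
keys = [folder + "/hparams.yaml"]

store = ColdCacheStore(keys)
first = store.isdir(folder)   # cold call
second = store.isdir(folder)  # warm call
print(first, second)

# A fresh store passes the tolerant check even though its first call is cold.
print(isdir_with_retry(ColdCacheStore(keys), folder))
```

Retrying is a workaround for the symptom rather than a fix for its cause, so it is shown here only to illustrate the non-idempotent check.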

@awaelchli awaelchli added 3rd party Related to a 3rd-party logger: csv help wanted Open to be worked on and removed needs triage Waiting to be triaged by maintainers labels Jul 10, 2023
@bilelomrani1
Contributor Author

bilelomrani1 commented Jul 10, 2023

@awaelchli I have a fix for this particular test case. I am not fully confident that it is correct, because I am quite confused by how gcsfs handles directories: not only is fs.isdir not idempotent, but fs.makedirs is a no-op, which breaks the CSVLogger (and probably other things). Still, if you are interested, I would be happy to contribute a pull request.
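
The no-op makedirs point can be sketched the same way. The example below assumes an object store where a "directory" is only a shared key prefix; `PrefixStore`, its `write_text` method, and `save_text` are hypothetical names for this illustration and do not describe the fix that was eventually merged. It shows that a create-then-check pattern fails, while writing the object directly makes the prefix appear.

```python
class PrefixStore:
    """Hypothetical in-memory object store: keys only, no real directories."""

    def __init__(self):
        self.objects = {}

    def makedirs(self, path, exist_ok=True):
        # Nothing to create: a "directory" is just a shared key prefix.
        return None

    def isdir(self, path):
        prefix = path.rstrip("/") + "/"
        return any(key.startswith(prefix) for key in self.objects)

    def write_text(self, path, text):
        self.objects[path] = text


def save_text(fs, path, text):
    """Write a file without requiring its parent 'folder' to exist beforehand."""
    parent = path.rsplit("/", 1)[0]
    fs.makedirs(parent, exist_ok=True)  # a no-op on this stub
    fs.write_text(path, text)


store = PrefixStore()
folder = "llm-bucket/checkpoints/debug/lightning_logs/version_0"

store.makedirs(folder)
before = store.isdir(folder)  # still False: makedirs created nothing

save_text(store, folder + "/hparams.yaml", "lr: 0.1\n")
after = store.isdir(folder)   # True once an object lives under the prefix
print(before, after)
```

Under these assumptions, a guard that raises when isdir is False right after makedirs would always fire, which matches the "Missing folder" error in the traceback.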

@awaelchli
Contributor

@bilelomrani1 I know absolutely nothing about gcsfs and most likely won't be able to test it myself. Feel free to send a PR. If the tests pass, it is probably a good sign that your fix works. If not, we need to see how much effort it would take. No promises, but thanks for looking into it! It is much appreciated.

@celpas

celpas commented Oct 20, 2023

I'm still having the same issue when using TensorBoardLogger with an S3 URI as the save directory.
