
FMS-HF-Tuning QLoRA fine-tuning crashes because it cannot access /.triton in OpenShift #367

Open
kpouget opened this issue Oct 8, 2024 · 1 comment · May be fixed by #370

Comments


kpouget commented Oct 8, 2024

Describe the bug

When running QLoRA fine-tuning on OpenShift, FMS-HF-Tuning crashes because it cannot access /.triton.

Platform

OpenShift AI, quay.io/modh/fms-hf-tuning:v2.0.1

Sample Code

Running this configuration:

+ cat /mnt/config/config.json
{
  "training_data_path": "/mnt/output/dataset.json",
  "model_name_or_path": "/mnt/storage/model/mistral-7b-v0.3-gptq",
  "response_template": " ### Label:",
  "output_dir": "/mnt/output/fine-tuning",
  "save_model_dir": "/mnt/output/save-model-dir",
  "accelerate_launch_args": {
    "num_processes": 8,
    "num_machines": 1,
    "mixed_precision": "no",
    "dynamo_backend": "no",
    "downcast_bf16": "no",
    "main_training_function": "main",
    "rdzv_backend": "static",
    "same_network": true,
    "tpu_use_sudo": false,
    "use_fsdp": true
  },
  "num_train_epochs": 1,
  "per_device_train_batch_size": 4,
  "per_device_eval_batch_size": 4,
  "gradient_accumulation_steps": 4,
  "eval_strategy": "no",
  "save_strategy": "no",
  "learning_rate": 0.00001,
  "weight_decay": 0,
  "lr_scheduler_type": "cosine",
  "max_seq_length": 1024,
  "include_tokens_per_second": true,
  "dataset_text_field": "output",
  "use_flash_attn": true,
  "auto_gptq": [
    "triton_v2"
  ],
  "fp16": true,
  "gradient_checkpointing": true,
  "lora_alpha": 16,
  "max_steps": -1,
  "packing": false,
  "peft_method": "lora",
  "r": 4,
  "target_modules": [
    "all-linear"
  ],
  "torch_dtype": "float16",
  "warmup_ratio": 0.03
}

Expected behavior

The container image should run fine-tuning without requiring any extra environment variables.

Observed behavior

  File "/home/tuning/.local/lib/python3.11/site-packages/triton/runtime/cache.py", line 64, in __init__
    os.makedirs(self.cache_dir, exist_ok=True)
  File "<frozen os>", line 215, in makedirs
  File "<frozen os>", line 215, in makedirs
  File "<frozen os>", line 225, in makedirs
PermissionError: [Errno 13] Permission denied: '/.triton'

pod.log
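
For context, the crash comes from Triton's on-disk cache: it honors TRITON_CACHE_DIR when set and otherwise falls back to a .triton directory under the user's home. Under OpenShift the pod runs with an arbitrary UID, so HOME commonly resolves to /, the fallback becomes /.triton, and os.makedirs fails on the read-only root filesystem. A minimal sketch of that resolution (an approximation of triton/runtime/cache.py, not the library's exact code):

import os

def triton_cache_dir() -> str:
    # Approximation of how triton/runtime/cache.py picks its cache
    # directory; not the library's exact code.
    explicit = os.environ.get("TRITON_CACHE_DIR")
    if explicit:
        return explicit
    # Fallback: a dot-directory under the user's home. With OpenShift's
    # arbitrary UID, HOME is commonly "/" (or unset), so this expands to
    # "/.triton/cache" and os.makedirs fails on the intermediate
    # "/.triton" component, matching the traceback above.
    home = os.environ.get("HOME", "/")
    return os.path.join(home, ".triton", "cache")

print(triton_cache_dir())   # "/.triton/cache" when HOME="/"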

Additional context

The workaround is to define these environment variables and make them point to a writable directory:

            env:
            - name: TRITON_HOME
              value: "/mnt/output"
            - name: TRITON_DUMP_DIR
              value: "/mnt/output"
            - name: TRITON_CACHE_DIR
              value: "/mnt/output"
            - name: TRITON_OVERRIDE_DIR
              value: "/mnt/output"
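
The same defaults could also be applied programmatically before anything imports triton, for example in a launch wrapper. This is only a sketch of that idea, not necessarily how the linked fix in #370 implements it; /mnt/output is just the writable mount used in this job:

import os

# Example writable location; in this setup /mnt/output is mounted read-write.
writable_dir = "/mnt/output"

# Only set variables that are not already defined, so an explicit
# user-provided value still wins.
for var in ("TRITON_HOME", "TRITON_DUMP_DIR",
            "TRITON_CACHE_DIR", "TRITON_OVERRIDE_DIR"):
    os.environ.setdefault(var, writable_dir)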
fabianlim (Collaborator) commented

@kpouget thanks for finding this. Yeah, there are a lot of these cache directories that we don't even pay attention to until something crashes.

anhuong linked pull request #370 on Oct 9, 2024 that will close this issue