
FMS-HF-Tuning QLoRA fine-tuning crashes because it cannot access /.triton in OpenShift #367

Open
kpouget opened this issue Oct 8, 2024 · 1 comment · May be fixed by #370

Comments


kpouget commented Oct 8, 2024

Describe the bug

When running QLoRA fine-tuning on OpenShift, FMS-HF-Tuning crashes because it cannot access /.triton.

Platform

OpenShift AI, quay.io/modh/fms-hf-tuning:v2.0.1

Sample Code

Running this configuration:

+ cat /mnt/config/config.json
{
  "training_data_path": "/mnt/output/dataset.json",
  "model_name_or_path": "/mnt/storage/model/mistral-7b-v0.3-gptq",
  "response_template": " ### Label:",
  "output_dir": "/mnt/output/fine-tuning",
  "save_model_dir": "/mnt/output/save-model-dir",
  "accelerate_launch_args": {
    "num_processes": 8,
    "num_machines": 1,
    "mixed_precision": "no",
    "dynamo_backend": "no",
    "downcast_bf16": "no",
    "main_training_function": "main",
    "rdzv_backend": "static",
    "same_network": true,
    "tpu_use_sudo": false,
    "use_fsdp": true
  },
  "num_train_epochs": 1,
  "per_device_train_batch_size": 4,
  "per_device_eval_batch_size": 4,
  "gradient_accumulation_steps": 4,
  "eval_strategy": "no",
  "save_strategy": "no",
  "learning_rate": 0.00001,
  "weight_decay": 0,
  "lr_scheduler_type": "cosine",
  "max_seq_length": 1024,
  "include_tokens_per_second": true,
  "dataset_text_field": "output",
  "use_flash_attn": true,
  "auto_gptq": [
    "triton_v2"
  ],
  "fp16": true,
  "gradient_checkpointing": true,
  "lora_alpha": 16,
  "max_steps": -1,
  "packing": false,
  "peft_method": "lora",
  "r": 4,
  "target_modules": [
    "all-linear"
  ],
  "torch_dtype": "float16",
  "warmup_ratio": 0.03
}

Expected behavior

The container image should run fine-tuning without requiring any extra environment variables.

Observed behavior

  File "/home/tuning/.local/lib/python3.11/site-packages/triton/runtime/cache.py", line 64, in __init__
    os.makedirs(self.cache_dir, exist_ok=True)
  File "<frozen os>", line 215, in makedirs
  File "<frozen os>", line 215, in makedirs
  File "<frozen os>", line 225, in makedirs
PermissionError: [Errno 13] Permission denied: '/.triton'

pod.log
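
For context, the crash comes from Triton's on-disk cache: it honors TRITON_CACHE_DIR when set and otherwise falls back to a .triton directory under the user's home. Under OpenShift the pod runs with an arbitrary UID, so HOME commonly resolves to /, the fallback becomes /.triton, and os.makedirs fails on the read-only root filesystem. A minimal sketch of that resolution (an approximation of triton/runtime/cache.py, not the library's exact code):

import os

def triton_cache_dir() -> str:
    # Approximation of how triton/runtime/cache.py picks its cache
    # directory; not the library's exact code.
    explicit = os.environ.get("TRITON_CACHE_DIR")
    if explicit:
        return explicit
    # Fallback: a dot-directory under the user's home. With OpenShift's
    # arbitrary UID, HOME is commonly "/" (or unset), so this expands to
    # "/.triton/cache" and os.makedirs fails on the intermediate
    # "/.triton" component, matching the traceback above.
    home = os.environ.get("HOME", "/")
    return os.path.join(home, ".triton", "cache")

print(triton_cache_dir())   # "/.triton/cache" when HOME="/"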

Additional context

The workaround is to define these environment variables and make them point to a writable directory:

            env:
            - name: TRITON_HOME
              value: "/mnt/output"
            - name: TRITON_DUMP_DIR
              value: "/mnt/output"
            - name: TRITON_CACHE_DIR
              value: "/mnt/output"
            - name: TRITON_OVERRIDE_DIR
              value: "/mnt/output"
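
The same defaults could also be applied programmatically before anything imports triton, for example in a launch wrapper. This is only a sketch of that idea, not necessarily how the linked fix in #370 implements it; /mnt/output is just the writable mount used in this job:

import os

# Example writable location; in this setup /mnt/output is mounted read-write.
writable_dir = "/mnt/output"

# Only set variables that are not already defined, so an explicit
# user-provided value still wins.
for var in ("TRITON_HOME", "TRITON_DUMP_DIR",
            "TRITON_CACHE_DIR", "TRITON_OVERRIDE_DIR"):
    os.environ.setdefault(var, writable_dir)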
fabianlim (Collaborator) commented

@kpouget thanks for finding this. Yeah, there are a lot of these cache directories that we don't even pay attention to until something crashes.

anhuong linked pull request #370 on Oct 9, 2024 that will close this issue