High CUDA Memory Usage in ONNX Runtime with Inconsistent Memory Release #2069

niyathimariya · 2024-10-19T02:45:54Z

System Info

Optimum version: 1.22.0
Platform: Linux (Ubuntu 22.04.4 LTS)
Python version: 3.12.2
ONNX Runtime Version: 1.19.2
CUDA Version: 12.1
CUDA Execution Provider: Yes (CUDA 12.1)

Who can help?

@JingyaHuang @echarlaix

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

import logging

import onnxruntime as ort
import torch
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer

logger = logging.getLogger(__name__)

# Both functions are methods of a wrapper class that holds
# self.tokenizer, self.model, and self.device.
def load_model(self, model_name):
    session_options = ort.SessionOptions()
    session_options.add_session_config_entry('cudnn_conv_use_max_workspace', '0')
    session_options.enable_mem_pattern = False
    session_options.arena_extend_strategy = "kSameAsRequested"
    session_options.gpu_mem_limit = 10 * 1024 * 1024 * 1024  # 10 GiB

    model = ORTModelForSeq2SeqLM.from_pretrained(model_name, provider="CUDAExecutionProvider", session_options=session_options)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer, model

def inference(self, batch, doc_id='-1'):
    responses, status = '', False
    try:
        # Tokenize the batch and move the tensors to the target device
        encodings = self.tokenizer(batch, padding=True, truncation=True, max_length=8192, return_tensors="pt").to(self.device)
        with torch.no_grad():
            generated_ids = self.model.generate(
                encodings.input_ids,
                max_new_tokens=1024
            )
            responses = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
            status = True
    except Exception as e:
        logger.error(f"Failed to do inference on LLM, error: {e}")

    # Release memory cached by PyTorch's allocator after every call
    torch.cuda.empty_cache()
    return status, responses
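
For comparison, ONNX Runtime documents arena_extend_strategy, gpu_mem_limit, and cudnn_conv_use_max_workspace as options of the CUDA execution provider rather than as fields of SessionOptions. A minimal sketch of building such an options dictionary follows. The helper name cuda_provider_options is hypothetical, and whether these settings change the behavior reported here is untested.

```python
def cuda_provider_options(gpu_mem_limit_gib: int = 10, device_id: int = 0) -> dict:
    """Build a CUDA execution provider options dict (hypothetical helper)."""
    return {
        "device_id": device_id,
        # Extend the arena by the requested size rather than by powers of two.
        "arena_extend_strategy": "kSameAsRequested",
        # Size limit of the provider's device memory arena, in bytes.
        "gpu_mem_limit": gpu_mem_limit_gib * 1024 ** 3,
        # Same value as in the reproduction above.
        "cudnn_conv_use_max_workspace": "0",
    }


options = cuda_provider_options()
print(options["gpu_mem_limit"])  # 10737418240
```

With Optimum, such a dictionary would presumably be passed as provider_options=options alongside provider="CUDAExecutionProvider" in from_pretrained. That is an assumption about the intended configuration, not something verified in this report.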

Expected behavior

I expect CUDA memory usage to decrease, with memory being released after smaller inputs are processed, so that it is available for subsequent inputs.
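
To make the observation reproducible, device memory can be sampled before and after each call. torch.cuda.empty_cache() only returns memory cached by PyTorch's own allocator, so PyTorch's counters would not reflect memory held by ONNX Runtime. The sketch below reads nvidia-smi instead. It assumes nvidia-smi is on the PATH, and the helper names are hypothetical.

```python
import subprocess


def parse_used_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB in use) per non-empty line, one line per GPU."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def gpu_memory_used_mib() -> list[int]:
    """Return the memory in use on each GPU, in MiB, as reported by nvidia-smi."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        text=True,
    )
    return parse_used_mib(output)
```

Calling gpu_memory_used_mib() before and after each inference call, with inputs of different lengths, would show whether usage falls once smaller inputs are processed.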


niyathimariya added the bug (Something isn't working) label on Oct 19, 2024
