System Info

Optimum version: 1.22.0
Platform: Linux (Ubuntu 22.04.4 LTS)
Python version: 3.12.2
ONNX Runtime version: 1.19.2
CUDA version: 12.1
CUDA Execution Provider: Yes (CUDA 12.1)
Who can help?

@JingyaHuang @echarlaix
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction (minimal, reproducible, runnable)
import logging

import torch
import onnxruntime as ort
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSeq2SeqLM

logger = logging.getLogger(__name__)


def load_model(self, model_name):
    # Session options intended to limit ONNX Runtime's GPU memory usage.
    session_options = ort.SessionOptions()
    session_options.add_session_config_entry('cudnn_conv_use_max_workspace', '0')
    session_options.enable_mem_pattern = False
    session_options.arena_extend_strategy = "kSameAsRequested"
    session_options.gpu_mem_limit = 10 * 1024 * 1024 * 1024
    model = ORTModelForSeq2SeqLM.from_pretrained(
        model_name,
        provider="CUDAExecutionProvider",
        session_options=session_options,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer, model


def inference(self, batch, doc_id='-1'):
    responses, status = '', False
    try:
        encodings = self.tokenizer(
            batch,
            padding=True,
            truncation=True,
            max_length=8192,
            return_tensors="pt",
        ).to(self.device)
        with torch.no_grad():
            generated_ids = self.model.generate(
                encodings.input_ids,
                max_new_tokens=1024,
            )
        responses = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)
        status = True
    except Exception as e:
        logger.error(f"Failed to do inference on LLM, error: {e}")
    # Only frees PyTorch's caching allocator, not ONNX Runtime's arena.
    torch.cuda.empty_cache()
    return status, responses
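For what it's worth, arena_extend_strategy and gpu_mem_limit look like CUDA execution provider options rather than SessionOptions attributes, so they may not take effect when set as in the snippet above. A minimal sketch of passing them through provider_options instead (the model id is a placeholder, the 10 GiB limit is just the value from the snippet, and this is untested on the reported setup):

import onnxruntime as ort
from optimum.onnxruntime import ORTModelForSeq2SeqLM

model_name = "google/flan-t5-small"  # placeholder; the issue does not name the model

session_options = ort.SessionOptions()
session_options.enable_mem_pattern = False

# CUDA EP options: grow the arena only as much as requested and cap its size.
provider_options = {
    "arena_extend_strategy": "kSameAsRequested",
    "gpu_mem_limit": str(10 * 1024 * 1024 * 1024),  # bytes, passed as a string
}

model = ORTModelForSeq2SeqLM.from_pretrained(
    model_name,
    provider="CUDAExecutionProvider",
    provider_options=provider_options,
    session_options=session_options,
)

ONNX Runtime also has a per-run arena shrinkage switch (the "memory.enable_memory_arena_shrinkage" RunOptions config entry), but as far as I can tell Optimum's generate() does not expose RunOptions, so it is not shown here.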
Expected behavior

I expect the CUDA memory to decrease and be released after processing smaller inputs, optimizing memory usage for subsequent inputs.
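Also worth noting: torch.cuda.empty_cache() only releases memory cached by PyTorch's allocator and does not touch ONNX Runtime's CUDA arena. To check whether the arena actually shrinks between batches, device-level usage can be sampled with pynvml (a rough sketch; the engine object and batches are hypothetical stand-ins for the methods above, and nvidia-ml-py plus GPU 0 are assumed):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0 assumed

def log_gpu_mem(tag):
    # Device-wide usage, so it includes ORT's arena as well as PyTorch's cache.
    used_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / (1024 ** 2)
    print(f"[{tag}] GPU memory used: {used_mib:.0f} MiB")

log_gpu_mem("before large batch")
# status, responses = engine.inference(large_batch)   # hypothetical call
log_gpu_mem("after large batch")
# status, responses = engine.inference(small_batch)   # hypothetical call
log_gpu_mem("after small batch")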