RuntimeError: nrt_load_collectives status=4 message="Allocation Failure" #96
Hi @myang-tech42, thanks for filing this issue; we'll take a look on our end. Could you also provide the following info:
Hi @myang-tech42, thanks for clarifying! Using the info I have, I wrote the script below based on the notebook example you shared:

from transformers_neuronx import LlamaForSampling
# Load the Llama 3.1 8B Instruct checkpoint and compile it for Neuron:
# n_positions=16384 sets the maximum sequence length, context_length_estimate covers
# bucketed context lengths from 64 to 16384, and tp_degree=8 shards the model across 8 NeuronCores.
model_path = "Meta-Llama-3.1-8B-Instruct"
context_length_estimate = [2**(i+6) for i in range(9)]
neuron_model = LlamaForSampling.from_pretrained(model_path, n_positions=16384, context_length_estimate=context_length_estimate, batch_size=1, tp_degree=8, amp='bf16')
neuron_model.to_neuron()
import time
import torch
from transformers import AutoTokenizer
import requests, re
# construct a tokenizer and encode prompt text
# The original notebook builds the prompt from a recent publication (HTML format), strips the HTML tags
# to convert it to text, and asks the model to summarize the paper (a 26k+ token input); a short test prompt is used here instead.
tokenizer = AutoTokenizer.from_pretrained(model_path)
#prompt = re.sub('<[^<]+?>', '', requests.get("https://arxiv.org/html/2402.19427v1").text) # strip html tags
#prompt += "\n\n========================THE END======================\n"
#prompt += "A 10 point summary of the paper in simple words: "
prompt = "What is the capital of France, and could you give a detailed history of teh capital"
# put in prompt format https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/#prompt-format
prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|> {prompt} <|eot_id|><|start_header_id|>assistant<|end_header_id|>"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
num_input_tokens = len(input_ids[0])  # token count of the encoded prompt
print(f"num_input_tokens: {num_input_tokens}")
# run inference with top-k sampling
with torch.inference_mode():
    start = time.time()
    generated_sequences = neuron_model.sample(input_ids, sequence_length=32768, top_k=10)
    elapsed = time.time() - start
# display the new generated tokens
generated_sequences = [tokenizer.decode(seq[num_input_tokens:]) for seq in generated_sequences]
print(f'generated sequence {generated_sequences[0]} in {elapsed} seconds')

But I was unable to reproduce the issue you ran into.
Here are my dependencies. I tried the below, but it doesn't get me to 2.21.1.
No, by release 2.21.1 I'm referring to the Neuron SDK, which is the entire collection of software associated with a release. That said, thanks for listing the pip dependencies; they do appear to be part of the 2.21.1 Neuron SDK. However, it looks like the SageMaker instance is not an Ubuntu instance.
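For reference, here is a minimal sketch of how one could print the installed Neuron pip package versions and the host OS to confirm this; the package names below are assumptions based on a typical transformers-neuronx setup, not taken from this thread.

import platform
from importlib import metadata

# Hypothetical list of Neuron-related pip packages to check; adjust to your environment.
packages = ["neuronx-cc", "torch-neuronx", "transformers-neuronx", "libneuronxla"]

print(f"OS: {platform.platform()}")  # helps confirm whether the instance is Ubuntu-based
for name in packages:
    try:
        print(f"{name}=={metadata.version(name)}")
    except metadata.PackageNotFoundError:
        print(f"{name}: not installed")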
Looks like the runtime versions are older than expected. We will try to reproduce with the specified runtime versions.
We found that the runtime packages might not be compatible with the compiler used. I suggest upgrading your runtime packages to the latest public ones; this should fix the issue.
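To see which Neuron runtime system packages are installed before and after the upgrade, a small sketch like the one below can help; it assumes a Debian/Ubuntu-based instance with dpkg available, and the package names mentioned in the comments come from the public Neuron install docs rather than this thread.

import subprocess

# Sketch: list installed Neuron system packages (e.g. aws-neuronx-runtime-lib,
# aws-neuronx-collectives) on a Debian/Ubuntu host. Assumes dpkg is available.
result = subprocess.run(["dpkg", "-l"], capture_output=True, text=True, check=True)
for line in result.stdout.splitlines():
    if "neuron" in line.lower():
        print(line)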
Running into RuntimeError: nrt_load_collectives status=4 message="Allocation Failure" when trying to run
neuron_model.to_neuron()
in this notebook. Instead of 32k I adjusted to 16k with tp_degree of 8 on an ml.inf2.24xlarge.