
Inference with longer context (16k) outputs nonsensical numbers and symbols #1544

Closed
ignaceHelsen opened this issue Jan 15, 2025 · 7 comments


@ignaceHelsen

ignaceHelsen commented Jan 15, 2025

Hello,

I've been enjoying using Unsloth, and I've trained my first LoRA with a training context length of 32768.
I have been running inference tests with shorter context lengths, and the output is normal text that matches my fine-tuning.

However, once the input goes over ~12k tokens, the output becomes the following (capped at the first 24 tokens for this showcase, but it goes on):

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
 ( (�. 1. 1. 1. 1. 2. 1. 1

My code:

from unsloth import FastLanguageModel
from transformers import TextStreamer

max_seq_length = 32768
dtype = None
load_in_4bit = True
use_gradient_checkpointing = "unsloth"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="outputs/lora_model",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

FastLanguageModel.for_inference(model)

# `query` holds the long prompt (~21k tokens in this example).
messages = [
    {
        "role": "user",
        "content": query,
    },
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to("cuda")

text_streamer = TextStreamer(tokenizer, skip_prompt=True)
output = model.generate(
    input_ids,
    streamer=text_streamer,
    max_new_tokens=24,
    temperature=0.5,
    pad_token_id=tokenizer.eos_token_id,
)
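
Side note: the attention-mask warning itself can be silenced by having `apply_chat_template` also return the mask and passing it to `generate`. A minimal variant of the call above, with the same model, tokenizer, and query:

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,  # also returns attention_mask
).to("cuda")

output = model.generate(
    **inputs,  # passes input_ids and attention_mask
    streamer=text_streamer,
    max_new_tokens=24,
    temperature=0.5,
    pad_token_id=tokenizer.eos_token_id,
)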

I counted the number of tokens in `query`, which is 21090 for this example. I checked my prompt for any strange words that might cause this, but the query seems fine.

Could this be because I'm close to filling the context length of my trained LoRA (32k)?
I have been looking around for issues describing similar problems but couldn't find any, hence my post here :)

Any help is greatly appreciated!

@danielhanchen
Contributor

Would you happen to know which model it is? Does the original base model support long context? If not, then sadly this is expected - if your dataset does not contain long examples, then using very long sequence lengths won't work.

A trick is to mix your dataset with some very long context examples from Hugging Face public datasets
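
A rough sketch of that mixing with the `datasets` library (the dataset names and file paths here are placeholders, not specific recommendations):

from datasets import load_dataset, concatenate_datasets

# Placeholder names - substitute your own fine-tuning data and any public
# long-context dataset of your choice.
own_data = load_dataset("json", data_files="my_finetune_data.jsonl", split="train")
long_data = load_dataset("some-org/long-context-dataset", split="train[:1000]")

# Both datasets must share the same columns/format before concatenating.
mixed = concatenate_datasets([own_data, long_data]).shuffle(seed=42)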

@ignaceHelsen
Author

ignaceHelsen commented Jan 16, 2025

@danielhanchen Thank you for your response :)

The model used is unsloth/Llama-3.3-70B-Instruct-bnb-4bit, so this seems okay.
I saw this page, which states that Unsloth supports 89K context for this model.
The dataset I used for training also contains longer contexts.

Edit: I will add some much longer questions and see if it improves. I will update here.

@danielhanchen
Contributor

@ignaceHelsen Oh wait, another possibility is that our inference engine is somehow broken on very long sequences - would it be possible to run inference natively through Hugging Face and see if it works fine (vs. Unsloth's fast inference)?
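
A sketch of what that comparison could look like - loading the base model with plain Transformers and attaching the LoRA adapter via PEFT instead of Unsloth's fast inference path (model and adapter paths assumed from the snippets above, and the tokenizer assumed to be saved alongside the adapter):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Sketch only: load the pre-quantized base model natively, then attach the LoRA.
base = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.3-70B-Instruct-bnb-4bit",
    device_map="auto",
)
hf_model = PeftModel.from_pretrained(base, "outputs/lora_model")
tokenizer = AutoTokenizer.from_pretrained("outputs/lora_model")

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(hf_model.device)
output = hf_model.generate(**inputs, max_new_tokens=24, temperature=0.5)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))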

@danielhanchen
Contributor

@Erland366 Could you check if inference on sequences longer than 16K works as expected - thanks - maybe by telling it to count from 1 to 1000000, etc.
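
A quick sketch of such a check, assuming the `model` and `tokenizer` from the original snippet - push the prompt itself well past 16K tokens and see whether the continuation stays coherent:

# Build a prompt that is comfortably longer than 16K tokens.
long_prompt = "Continue counting upwards from where this list stops: " + " ".join(
    str(i) for i in range(1, 10000)
)
messages = [{"role": "user", "content": long_prompt}]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to("cuda")
print("prompt tokens:", inputs["input_ids"].shape[1])  # expect well over 16K

output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))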

@ignaceHelsen
Author

ignaceHelsen commented Jan 20, 2025

@danielhanchen I finally had the chance to test it out again, sorry for the delay. My previous dataset had samples that were at most ~5.4k tokens long, shorter than I thought. I added some samples that are around 25k and 40k tokens long, and testing it now, the model outputs normal text. I will test using vLLM (AWQ) and, if time permits, a GGUF soon. Will keep you updated.
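
For reference, the vLLM test would look roughly like this (assuming the LoRA is first merged and exported to an AWQ-quantized checkpoint at a placeholder path):

from vllm import LLM, SamplingParams

# Placeholder path - an AWQ-quantized export of the merged model, not something
# produced by the snippets above.
llm = LLM(model="outputs/merged_awq", quantization="awq", max_model_len=32768)
params = SamplingParams(temperature=0.5, max_tokens=24)

# `query` is passed as a raw prompt here; apply the chat template first if the
# model expects chat formatting.
result = llm.generate([query], params)
print(result[0].outputs[0].text)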

@ignaceHelsen
Author

ignaceHelsen commented Jan 23, 2025

Update

AWQ's output did not seem to make sense at first, so I switched to GGUF with very satisfying results. Everything seems fine.
For me, this issue can be closed.

Thanks once again for the responses :)

@danielhanchen
Contributor

Ok great! Glad you solved the issue!
