Bad quality answers (repetition, non-stop generation, ...) when using Llama3.1-8B-Instruct and Triton #603
Reproduction
python3 test/TensorRT-LLM-12/examples/llama/convert_checkpoint.py --model_dir test/Meta-Llama-3.1-8B-Instruct/ --output_dir test/meta-chkpt --dtype bfloat16
trtllm-build --checkpoint_dir test/meta-chkpt/ \
    --output_dir test/llama-3.1-engine/ \
    --use_fused_mlp \
    --gemm_plugin bfloat16 \
    --gpt_attention_plugin bfloat16 \
    --context_fmha enable \
    --max_seq_len 12288
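The report does not show how the engine was served or queried. A request equivalent to the payload shown under "actual behavior" would look roughly like the sketch below, assuming the Triton TensorRT-LLM backend's HTTP generate endpoint with the default ensemble model name and port 8000 (both are assumptions, not taken from the report):

# Hypothetical inference call; model name "ensemble" and port 8000 are assumptions.
curl -s -X POST localhost:8000/v2/models/ensemble/generate -d '{
  "text_input": "Q: What is the capital of France?. Answer:",
  "parameters": {
    "max_tokens": 50,
    "bad_words": [""],
    "stop_words": [""]
  }
}'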
Expected behavior
The model provides accurate answers to the questions.
Actual behavior
The model repeats the question inside its answer, keeps generating far more tokens than expected without stopping, and is repetitive. Example request:
{
  "text_input": "Q: What is the capital of France?. Answer:",
  "parameters": {
    "max_tokens": 50,
    "bad_words": [""],
    "stop_words": [""]
  }
}
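For contrast only (not part of the report): Llama-3.1-8B-Instruct is normally prompted through its chat template and stopped on the <|eot_id|> token, whereas the request above sends a raw completion-style prompt with empty stop_words. A chat-template-formatted request would look roughly like the following sketch (same assumed endpoint and model name as above; whether this changes the reported behaviour is not established here):

# Illustrative only: prompt wrapped in the Llama 3.1 Instruct chat template,
# with <|eot_id|> supplied as a stop word (endpoint and model name are assumptions).
curl -s -X POST localhost:8000/v2/models/ensemble/generate -d '{
  "text_input": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nWhat is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
  "parameters": {
    "max_tokens": 50,
    "stop_words": ["<|eot_id|>"]
  }
}'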
Additional notes
I have tried different dtypes when building the engine (bfloat16 and float16), but the same problem occurs.
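The commands for the non-bfloat16 attempt are not included in the report; assuming the other dtype was float16, they would presumably mirror the ones above, e.g.:

# Presumed float16 variant of the build above (output directory names are made up for illustration).
python3 test/TensorRT-LLM-12/examples/llama/convert_checkpoint.py --model_dir test/Meta-Llama-3.1-8B-Instruct/ --output_dir test/meta-chkpt-fp16 --dtype float16
trtllm-build --checkpoint_dir test/meta-chkpt-fp16/ \
    --output_dir test/llama-3.1-engine-fp16/ \
    --use_fused_mlp \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --context_fmha enable \
    --max_seq_len 12288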