
[ WARNING]: context_size + max_new_tokens=33562 which is greather than self.max_length=32768. Truncating context to 0 tokens. #310

CHNtentes opened this issue Feb 13, 2025 · 9 comments

@CHNtentes

Args:
MODEL=/hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.95,tensor_parallel_size=$NUM_GPUS"
lighteval vllm $MODEL_ARGS "custom|math_500|0|0" --custom-tasks src/open_r1/evaluate.py --use-chat-template --output-dir $OUTPUT_DIR

Output:
[2025-02-13 16:26:29,410] [ INFO]: Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8. (utils.py:148)
[2025-02-13 16:26:29,410] [ INFO]: NumExpr defaulting to 8 threads. (utils.py:161)
[2025-02-13 16:26:29,655] [ INFO]: PyTorch version 2.5.1+cu124 available. (config.py:54)
[2025-02-13 16:26:29,656] [ INFO]: JAX version 0.4.34 available. (config.py:125)
INFO 02-13 16:26:32 init.py:190] Automatically detected platform cuda.
[2025-02-13 16:26:32,423] [ INFO]: --- LOADING MODEL --- (pipeline.py:178)
[2025-02-13 16:26:36,965] [ INFO]: This model supports multiple tasks: {'score', 'classify', 'embed', 'generate', 'reward'}. Defaulting to 'generate'. (config.py:542)
[2025-02-13 16:26:36,966] [ INFO]: Initializing a V0 LLM engine (v0.7.2) with config: model='/hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='/hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config=None, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=/hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, (llm_engine.py:234)
[2025-02-13 16:26:37,756] [ INFO]: Using Flash Attention backend. (cuda.py:230)
[2025-02-13 16:26:37,984] [ INFO]: Starting to load model /hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B... (model_runner.py:1110)
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.89it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.88it/s]

[2025-02-13 16:26:38,583] [ INFO]: Loading model weights took 3.3460 GB (model_runner.py:1115)
[2025-02-13 16:26:40,308] [ INFO]: Memory profiling takes 1.57 seconds
the current vLLM instance can use total_gpu_memory (23.68GiB) x gpu_memory_utilization (0.95) = 22.50GiB
model weights take 3.35GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 2.02GiB; the rest of the memory reserved for KV Cache is 17.07GiB. (worker.py:267)
[2025-02-13 16:26:40,464] [ INFO]: # CUDA blocks: 39960, # CPU blocks: 9362 (executor_base.py:110)
[2025-02-13 16:26:40,464] [ INFO]: Maximum concurrency for 32768 tokens per request: 19.51x (executor_base.py:115)
[2025-02-13 16:26:42,042] [ INFO]: Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage. (model_runner.py:1434)
Capturing CUDA graph shapes: 100%|███████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:11<00:00, 2.92it/s]
[2025-02-13 16:26:54,043] [ INFO]: Graph capturing finished in 12 secs, took 0.20 GiB (model_runner.py:1562)
[2025-02-13 16:26:54,043] [ INFO]: init engine (profile, create kv cache, warmup model) took 15.46 seconds (llm_engine.py:431)
[2025-02-13 16:26:54,314] [ INFO]: --- LOADING TASKS --- (pipeline.py:205)
[2025-02-13 16:26:54,350] [ WARNING]: If you want to use extended_tasks, make sure you installed their dependencies using pip install -e .[extended_tasks]. (registry.py:136)
[2025-02-13 16:26:54,350] [ INFO]: Found 4 custom tasks in /home/v2x/.cache/huggingface/modules/datasets_modules/datasets/evaluate/9b9f9758846da6d31571ff29e2b3ba5204d29838565fa53ebe3238e83a3d2801/evaluate.py (registry.py:141)
[2025-02-13 16:26:54,352] [ INFO]: HuggingFaceH4/MATH-500 default (lighteval_task.py:187)
[2025-02-13 16:26:54,352] [ WARNING]: Careful, the task custom|math_500 is using evaluation data to build the few shot examples. (lighteval_task.py:261)
[2025-02-13 16:27:03,164] [ INFO]: --- INIT SEEDS --- (pipeline.py:234)
[2025-02-13 16:27:03,164] [ INFO]: --- RUNNING MODEL --- (pipeline.py:439)
[2025-02-13 16:27:03,164] [ INFO]: Running RequestType.GREEDY_UNTIL requests (pipeline.py:443)
[2025-02-13 16:27:03,214] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:260)
Splits: 0%| | 0/1 [00:00<?, ?it/s][2025-02-13 16:27:03,253] [ WARNING]: context_size + max_new_tokens=33562 which is greather than self.max_length=32768. Truncating context to 0 tokens. (vllm_model.py:268)
Processed prompts: 100%|███████████████████████████████████| 500/500 [00:02<00:00, 193.02it/s, est. speed input: 14271.87 toks/s, output: 3088.31 toks/s]
Splits: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.66s/it]
......
[2025-02-13 16:27:07,421] [ INFO]: --- DISPLAYING RESULTS --- (pipeline.py:517)

Task Version Metric Value Stderr
all extractive_match 0.006 ± 0.0035
custom:math_500:0 1 extractive_match 0.006 ± 0.0035
@eldarkurtic

You can set max_model_length to a slightly larger value, for example: max_model_length=38000
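
Applied to the command from the original post, that would look roughly like this (a sketch; only max_model_length changes, and 38000 is simply a value above the 33562 reported in the warning):

MODEL=/hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B
# max_model_length raised above context_size + max_new_tokens (33562) so the context is not truncated to 0
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=38000,gpu_memory_utilization=0.95,tensor_parallel_size=$NUM_GPUS"
lighteval vllm $MODEL_ARGS "custom|math_500|0|0" --custom-tasks src/open_r1/evaluate.py --use-chat-template --output-dir $OUTPUT_DIR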

@SuperXiang commented Feb 13, 2025

Same problem here. Besides, the 'predictions' always contain only 16 tokens with these parameter settings, which leads to a score of 0 since nothing matches the 'gold' label. Any help solving this?

@CHNtentes (Author)

@SuperXiang Maybe you can try installing every dependency with the exact version required in setup.py. I found out I had installed the lighteval main branch, but a specific commit is actually required. Now I get the expected score of 83.4 on the math_500 test.
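
A minimal sketch of that, assuming you install open-r1 itself in editable mode so pip resolves the exact pins (including the specific lighteval commit) declared in its setup.py:

git clone https://github.com/huggingface/open-r1.git
cd open-r1
# editable install pulls the pinned dependency versions from setup.py
pip install -e .
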

@ChengXu0001

Can you share the requirements.txt of your project? Thank you!

@CHNtentes (Author)

> Can you share the requirements.txt of your project? Thank you!

https://github.com/huggingface/open-r1/blob/main/setup.py#43

@ChengXu0001

Can you tell me your lighteval version? Thank you very much!

@CHNtentes (Author)

> Can you tell me your lighteval version? Thank you very much!

https://github.com/huggingface/open-r1/blob/main/setup.py#57

@zhengdong914

Same problem. Any solutions?

@rawsh commented Feb 14, 2025

I'm seeing the same issue. Reproduce:

conda create -n eval python=3.11
conda activate eval
pip install vllm==0.7.2 git+https://github.com/huggingface/lighteval.git#egg=lighteval math-verify==0.5.2

I have also tried with the specific lighteval version pinned in setup.py:

pip install vllm==0.7.2 git+https://github.com/huggingface/lighteval.git@86f62259f105ae164f655e0b91c92a823a742724#egg=lighteval[math] math-verify==0.5.2

EDIT:

For some reason this sequence works, but the one above does not:

pip install vllm==0.7.2
pip install git+https://github.com/huggingface/lighteval.git@86f62259f105ae164f655e0b91c92a823a742724#egg=lighteval[math] math-verify==0.5.2
