
[ WARNING]: context_size + max_new_tokens=33562 which is greather than self.max_length=32768. Truncating context to 0 tokens. #310

CHNtentes opened this issue Feb 13, 2025 · 9 comments

@CHNtentes

Args:
MODEL=/hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.95,tensor_parallel_size=$NUM_GPUS"
lighteval vllm $MODEL_ARGS "custom|math_500|0|0" --custom-tasks src/open_r1/evaluate.py --use-chat-template --output-dir $OUTPUT_DIR

Output:
[2025-02-13 16:26:29,410] [ INFO]: Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8. (utils.py:148)
[2025-02-13 16:26:29,410] [ INFO]: NumExpr defaulting to 8 threads. (utils.py:161)
[2025-02-13 16:26:29,655] [ INFO]: PyTorch version 2.5.1+cu124 available. (config.py:54)
[2025-02-13 16:26:29,656] [ INFO]: JAX version 0.4.34 available. (config.py:125)
INFO 02-13 16:26:32 init.py:190] Automatically detected platform cuda.
[2025-02-13 16:26:32,423] [ INFO]: --- LOADING MODEL --- (pipeline.py:178)
[2025-02-13 16:26:36,965] [ INFO]: This model supports multiple tasks: {'score', 'classify', 'embed', 'generate', 'reward'}. Defaulting to 'generate'. (config.py:542)
[2025-02-13 16:26:36,966] [ INFO]: Initializing a V0 LLM engine (v0.7.2) with config: model='/hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='/hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config=None, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=/hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, (llm_engine.py:234)
[2025-02-13 16:26:37,756] [ INFO]: Using Flash Attention backend. (cuda.py:230)
[2025-02-13 16:26:37,984] [ INFO]: Starting to load model /hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B... (model_runner.py:1110)
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.89it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.88it/s]

[2025-02-13 16:26:38,583] [ INFO]: Loading model weights took 3.3460 GB (model_runner.py:1115)
[2025-02-13 16:26:40,308] [ INFO]: Memory profiling takes 1.57 seconds
the current vLLM instance can use total_gpu_memory (23.68GiB) x gpu_memory_utilization (0.95) = 22.50GiB
model weights take 3.35GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 2.02GiB; the rest of the memory reserved for KV Cache is 17.07GiB. (worker.py:267)
[2025-02-13 16:26:40,464] [ INFO]: # CUDA blocks: 39960, # CPU blocks: 9362 (executor_base.py:110)
[2025-02-13 16:26:40,464] [ INFO]: Maximum concurrency for 32768 tokens per request: 19.51x (executor_base.py:115)
[2025-02-13 16:26:42,042] [ INFO]: Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage. (model_runner.py:1434)
Capturing CUDA graph shapes: 100%|███████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:11<00:00, 2.92it/s]
[2025-02-13 16:26:54,043] [ INFO]: Graph capturing finished in 12 secs, took 0.20 GiB (model_runner.py:1562)
[2025-02-13 16:26:54,043] [ INFO]: init engine (profile, create kv cache, warmup model) took 15.46 seconds (llm_engine.py:431)
[2025-02-13 16:26:54,314] [ INFO]: --- LOADING TASKS --- (pipeline.py:205)
[2025-02-13 16:26:54,350] [ WARNING]: If you want to use extended_tasks, make sure you installed their dependencies using pip install -e .[extended_tasks]. (registry.py:136)
[2025-02-13 16:26:54,350] [ INFO]: Found 4 custom tasks in /home/v2x/.cache/huggingface/modules/datasets_modules/datasets/evaluate/9b9f9758846da6d31571ff29e2b3ba5204d29838565fa53ebe3238e83a3d2801/evaluate.py (registry.py:141)
[2025-02-13 16:26:54,352] [ INFO]: HuggingFaceH4/MATH-500 default (lighteval_task.py:187)
[2025-02-13 16:26:54,352] [ WARNING]: Careful, the task custom|math_500 is using evaluation data to build the few shot examples. (lighteval_task.py:261)
[2025-02-13 16:27:03,164] [ INFO]: --- INIT SEEDS --- (pipeline.py:234)
[2025-02-13 16:27:03,164] [ INFO]: --- RUNNING MODEL --- (pipeline.py:439)
[2025-02-13 16:27:03,164] [ INFO]: Running RequestType.GREEDY_UNTIL requests (pipeline.py:443)
[2025-02-13 16:27:03,214] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:260)
Splits: 0%| | 0/1 [00:00<?, ?it/s][2025-02-13 16:27:03,253] [ WARNING]: context_size + max_new_tokens=33562 which is greather than self.max_length=32768. Truncating context to 0 tokens. (vllm_model.py:268)
Processed prompts: 100%|███████████████████████████████████| 500/500 [00:02<00:00, 193.02it/s, est. speed input: 14271.87 toks/s, output: 3088.31 toks/s]
Splits: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.66s/it]
......
[2025-02-13 16:27:07,421] [ INFO]: --- DISPLAYING RESULTS --- (pipeline.py:517)

Task Version Metric Value Stderr
all extractive_match 0.006 ± 0.0035
custom:math_500:0 1 extractive_match 0.006 ± 0.0035
@eldarkurtic

You can set max_model_length to a slightly larger value, for example: max_model_length=38000
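
Applied to the command from the original post, that would look roughly like this (a sketch; only max_model_length changes, and 38000 is simply a value above the 33562 reported in the warning):

MODEL=/hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B
# max_model_length raised above context_size + max_new_tokens (33562) so the context is not truncated to 0
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=38000,gpu_memory_utilization=0.95,tensor_parallel_size=$NUM_GPUS"
lighteval vllm $MODEL_ARGS "custom|math_500|0|0" --custom-tasks src/open_r1/evaluate.py --use-chat-template --output-dir $OUTPUT_DIR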

@SuperXiang commented Feb 13, 2025

Same problem here. Besides, the 'predictions' always contain only 16 tokens with these parameter settings, which leads to a score of 0 since nothing matches the 'gold' label. Any help solving this?

@CHNtentes (Author)

@SuperXiang Maybe you can try installing every dependency with the exact version required in setup.py. I found out I had installed the lighteval main branch, but a specific commit is actually required. Now I get the expected score of 83.4 on the math_500 test.
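
A minimal sketch of that, assuming you install open-r1 itself in editable mode so pip resolves the exact pins (including the specific lighteval commit) declared in its setup.py:

git clone https://github.com/huggingface/open-r1.git
cd open-r1
# editable install pulls the pinned dependency versions from setup.py
pip install -e .
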

@ChengXu0001

Can you share the requirements.txt of your project? Thank you!

@CHNtentes (Author)

> Can you share the requirements.txt of your project? Thank you!

https://github.com/huggingface/open-r1/blob/main/setup.py#43

@ChengXu0001

Can you tell me your lighteval version? Thank you very much!

@CHNtentes (Author)

> Can you tell me your lighteval version? Thank you very much!

https://github.com/huggingface/open-r1/blob/main/setup.py#57

@zhengdong914

Same problem. Any solutions?

@rawsh commented Feb 14, 2025

I'm seeing the same issue. Reproduce:

conda create -n eval python=3.11
conda activate eval
pip install vllm==0.7.2 git+https://github.com/huggingface/lighteval.git#egg=lighteval math-verify==0.5.2

I have also tried with the specific lighteval version pinned in setup.py:

pip install vllm==0.7.2 git+https://github.com/huggingface/lighteval.git@86f62259f105ae164f655e0b91c92a823a742724#egg=lighteval[math] math-verify==0.5.2

EDIT:

For some reason this sequence works, but the one above does not:

pip install vllm==0.7.2
pip install git+https://github.com/huggingface/lighteval.git@86f62259f105ae164f655e0b91c92a823a742724#egg=lighteval[math] math-verify==0.5.2
