[ WARNING]: context_size + max_new_tokens=33562 which is greather than self.max_length=32768. Truncating context to 0 tokens. #310
@SuperXiang Maybe you can try installing every dependency with the exact version required in setup.py. I found out I had installed the lighteval main branch, but a specific commit is actually required. Now I get the expected score of 83.4 on the math_500 test.
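For what it's worth, a minimal sketch of what installing the exact pinned versions from setup.py can look like (the clone path, virtual environment, and editable install below are illustrative assumptions, not the reporter's actual commands; if lighteval is declared under an optional extra rather than the core dependencies, the corresponding pip install -e ".[extra]" form would be needed instead):

# Hedged sketch: installing open-r1 in editable mode lets pip resolve the
# dependency pins declared in setup.py (including the specific lighteval
# commit), instead of pulling lighteval's main branch separately.
git clone https://github.com/huggingface/open-r1.git
cd open-r1
python -m venv openr1 && source openr1/bin/activate
pip install -e .   # picks up the version pins from setup.py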
Can you share the requirements.txt of your project? Thank you!
https://github.com/huggingface/open-r1/blob/main/setup.py#43
Can you tell me your lighteval version? Thank you very much!
https://github.com/huggingface/open-r1/blob/main/setup.py#57
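As a side note on the version question above, a generic way to check which lighteval revision is actually installed (standard pip commands, not specific to this setup):

pip show lighteval            # prints the installed lighteval version
pip freeze | grep lighteval   # shows the exact git commit if it was installed from a git URL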
Same problem. Any solutions?
I'm seeing the same issue. Reproduce:
I have also tried with the specific lighteval version from the setup.py.
EDIT: for some reason this sequence is working, but the above does not.
Args:
MODEL=/hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B
MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,max_model_length=32768,gpu_memory_utilization=0.95,tensor_parallel_size=$NUM_GPUS"
lighteval vllm $MODEL_ARGS "custom|math_500|0|0" --custom-tasks src/open_r1/evaluate.py --use-chat-template --output-dir $OUTPUT_DIR
Output:
[2025-02-13 16:26:29,410] [ INFO]: Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8. (utils.py:148)
[2025-02-13 16:26:29,410] [ INFO]: NumExpr defaulting to 8 threads. (utils.py:161)
[2025-02-13 16:26:29,655] [ INFO]: PyTorch version 2.5.1+cu124 available. (config.py:54)
[2025-02-13 16:26:29,656] [ INFO]: JAX version 0.4.34 available. (config.py:125)
INFO 02-13 16:26:32 __init__.py:190] Automatically detected platform cuda.
[2025-02-13 16:26:32,423] [ INFO]: --- LOADING MODEL --- (pipeline.py:178)
[2025-02-13 16:26:36,965] [ INFO]: This model supports multiple tasks: {'score', 'classify', 'embed', 'generate', 'reward'}. Defaulting to 'generate'. (config.py:542)
[2025-02-13 16:26:36,966] [ INFO]: Initializing a V0 LLM engine (v0.7.2) with config: model='/hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='/hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=main, override_neuron_config=None, tokenizer_revision=main, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=1234, served_model_name=/hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False, (llm_engine.py:234)
[2025-02-13 16:26:37,756] [ INFO]: Using Flash Attention backend. (cuda.py:230)
[2025-02-13 16:26:37,984] [ INFO]: Starting to load model /hdd_2/ltg/DeepSeek-R1-Distill-Qwen-1.5B... (model_runner.py:1110)
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.89it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.88it/s]
[2025-02-13 16:26:38,583] [ INFO]: Loading model weights took 3.3460 GB (model_runner.py:1115)
[2025-02-13 16:26:40,308] [ INFO]: Memory profiling takes 1.57 seconds
the current vLLM instance can use total_gpu_memory (23.68GiB) x gpu_memory_utilization (0.95) = 22.50GiB
model weights take 3.35GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 2.02GiB; the rest of the memory reserved for KV Cache is 17.07GiB. (worker.py:267)
[2025-02-13 16:26:40,464] [ INFO]: # CUDA blocks: 39960, # CPU blocks: 9362 (executor_base.py:110)
[2025-02-13 16:26:40,464] [ INFO]: Maximum concurrency for 32768 tokens per request: 19.51x (executor_base.py:115)
[2025-02-13 16:26:42,042] [ INFO]: Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing gpu_memory_utilization or switching to eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage. (model_runner.py:1434)
Capturing CUDA graph shapes: 100%|███████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:11<00:00, 2.92it/s]
[2025-02-13 16:26:54,043] [ INFO]: Graph capturing finished in 12 secs, took 0.20 GiB (model_runner.py:1562)
[2025-02-13 16:26:54,043] [ INFO]: init engine (profile, create kv cache, warmup model) took 15.46 seconds (llm_engine.py:431)
[2025-02-13 16:26:54,314] [ INFO]: --- LOADING TASKS --- (pipeline.py:205)
[2025-02-13 16:26:54,350] [ WARNING]: If you want to use extended_tasks, make sure you installed their dependencies using pip install -e .[extended_tasks]. (registry.py:136)
[2025-02-13 16:26:54,350] [ INFO]: Found 4 custom tasks in /home/v2x/.cache/huggingface/modules/datasets_modules/datasets/evaluate/9b9f9758846da6d31571ff29e2b3ba5204d29838565fa53ebe3238e83a3d2801/evaluate.py (registry.py:141)
[2025-02-13 16:26:54,352] [ INFO]: HuggingFaceH4/MATH-500 default (lighteval_task.py:187)
[2025-02-13 16:26:54,352] [ WARNING]: Careful, the task custom|math_500 is using evaluation data to build the few shot examples. (lighteval_task.py:261)
[2025-02-13 16:27:03,164] [ INFO]: --- INIT SEEDS --- (pipeline.py:234)
[2025-02-13 16:27:03,164] [ INFO]: --- RUNNING MODEL --- (pipeline.py:439)
[2025-02-13 16:27:03,164] [ INFO]: Running RequestType.GREEDY_UNTIL requests (pipeline.py:443)
[2025-02-13 16:27:03,214] [ WARNING]: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring. (data.py:260)
Splits:   0%|          | 0/1 [00:00<?, ?it/s]
[2025-02-13 16:27:03,253] [ WARNING]: context_size + max_new_tokens=33562 which is greather than self.max_length=32768. Truncating context to 0 tokens. (vllm_model.py:268)
Processed prompts: 100%|███████████████████████████████████| 500/500 [00:02<00:00, 193.02it/s, est. speed input: 14271.87 toks/s, output: 3088.31 toks/s]
Splits: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.66s/it]
......
[2025-02-13 16:27:07,421] [ INFO]: --- DISPLAYING RESULTS --- (pipeline.py:517)
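To make the warning in the log concrete: assuming the task's generation budget (max_new_tokens) is the full 32768 tokens (an assumption, not stated explicitly in the log), any non-empty prompt pushes context_size + max_new_tokens past max_length, so lighteval truncates the context to 0 tokens. A quick back-of-the-envelope check on the logged numbers:

# Assumption: max_new_tokens == 32768, so the remainder is the prompt length.
echo $((33562 - 32768))   # 794 prompt tokens that no longer fit under max_length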