[Feature] Add support for Phi4 #3090
Comments
Thanks! @adarshxs and @ravi03071991 are on this.
Hey @Stealthwriter, thanks for raising the issue. Phi-4 is based on the Phi-3 architecture, which SGLang already supports. You can run the model by spinning up a server like this:

python -m sglang.launch_server --model-path microsoft/phi-4 --port 30000 --host 0.0.0.0
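Once the server is up, a quick way to check it is the OpenAI-compatible endpoint SGLang serves under /v1. A minimal sketch, assuming the server is listening on localhost:30000 as launched above (the prompt and the dummy api_key are illustrative, not from the original thread):

```python
# Minimal sanity check against the server launched above.
# Assumes the openai Python package (v1+) is installed and the server
# is on localhost:30000; "EMPTY" is a dummy key, not a real credential.
import openai

client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="microsoft/phi-4",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```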
I tried; it's not working. Can you try it, please?
@adarshxs Are you sure? 🤔
Yep, I just ran it, @zhaochenyang20. @Stealthwriter, can you share your error trace?
My logs after I spin up the server:

```
root@f19a14f0c9f9:/workspace/sglang# python -m sglang.launch_server --model-path microsoft/phi-4 --port 30000 --host 0.0.0.0
[2025-01-24 09:38:21] server_args=ServerArgs(model_path='microsoft/phi-4', tokenizer_path='microsoft/phi-4', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='microsoft/phi-4', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=764393099, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False)
config.json: 100% 820/820 [00:00<00:00, 5.65MB/s]
tokenizer_config.json: 100% 17.7k/17.7k [00:00<00:00, 39.9MB/s]
vocab.json: 100% 1.61M/1.61M [00:00<00:00, 10.9MB/s]
merges.txt: 100% 917k/917k [00:00<00:00, 44.1MB/s]
tokenizer.json: 100% 4.25M/4.25M [00:00<00:00, 24.1MB/s]
added_tokens.json: 100% 2.50k/2.50k [00:00<00:00, 33.0MB/s]
special_tokens_map.json: 100% 99.0/99.0 [00:00<00:00, 1.00MB/s]
[2025-01-24 09:38:28 TP0] Init torch distributed begin.
[2025-01-24 09:38:28 TP0] Load weight begin. avail mem=78.84 GB
[2025-01-24 09:38:29 TP0] Using model weights format ['*.safetensors']
model-00001-of-00006.safetensors: 100% 4.93G/4.93G [00:07<00:00, 675MB/s]
model-00002-of-00006.safetensors: 100% 4.95G/4.95G [00:06<00:00, 716MB/s]
model-00003-of-00006.safetensors: 100% 4.90G/4.90G [00:07<00:00, 649MB/s]
model-00004-of-00006.safetensors: 100% 4.77G/4.77G [00:06<00:00, 691MB/s]
model-00005-of-00006.safetensors: 100% 4.77G/4.77G [00:07<00:00, 672MB/s]
model-00006-of-00006.safetensors: 100% 4.99G/4.99G [00:08<00:00, 590MB/s]
model.safetensors.index.json: 100% 20.4k/20.4k [00:00<00:00, 48.5MB/s]
Loading safetensors checkpoint shards: 0% Completed | 0/6
[2025-01-24 09:39:20 TP0] Load weight end. type=Phi3ForCausalLM, dtype=torch.bfloat16, avail mem=51.36 GB
[2025-01-24 09:39:20 TP0] KV Cache is allocated. K size: 20.95 GB, V size: 20.95 GB.
[2025-01-24 09:39:20 TP0] Memory pool end. avail mem=9.08 GB
[2025-01-24 09:39:21 TP0] Capture cuda graph begin. This can take up to several minutes.
100% 23/23 [00:06<00:00, 3.37it/s]
[2025-01-24 09:39:27 TP0] Capture cuda graph end. Time elapsed: 6.83 s
[2025-01-24 09:39:28 TP0] max_total_num_tokens=219684, max_prefill_tokens=16384, max_running_requests=4097, context_len=16384
[2025-01-24 09:39:28] INFO: Started server process [568]
[2025-01-24 09:39:28] INFO: Waiting for application startup.
[2025-01-24 09:39:28] INFO: Application startup complete.
[2025-01-24 09:39:28] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-01-24 09:39:29] INFO: 127.0.0.1:45766 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-24 09:39:29 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-24 09:39:31] INFO: 127.0.0.1:45768 - "POST /generate HTTP/1.1" 200 OK
[2025-01-24 09:39:31] The server is fired up and ready to roll!
```

Running it:

```python
import subprocess, json

curl_command = """
curl -s http://localhost:30000/v1/chat/completions \
  -d '{"model": "microsoft/phi-4", "messages": [{"role": "user", "content": "What model are you?"}]}'
"""
response = json.loads(subprocess.check_output(curl_command, shell=True))
print(response)
```

Output:
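The same request can be made without shelling out to curl; a sketch of the equivalent call using the requests library, with the same endpoint and payload as above:

```python
# Equivalent of the curl call above, using requests directly.
import requests

payload = {
    "model": "microsoft/phi-4",
    "messages": [{"role": "user", "content": "What model are you?"}],
}
r = requests.post("http://localhost:30000/v1/chat/completions", json=payload)
print(r.json())
```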
Where Phi-3 (and Phi-4, since it uses the same architecture) is supported: sglang/python/sglang/srt/models/llama.py, line 571 at commit 3ed0a54.
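You can also confirm from the model config that Phi-4 declares the Phi-3 architecture; a minimal check with transformers (the expected output mirrors the type=Phi3ForCausalLM line in the server log above):

```python
# Inspect the architecture that microsoft/phi-4 declares in its config.json.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("microsoft/phi-4")
print(cfg.architectures)  # expected: ['Phi3ForCausalLM'], as in the server log
```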
@zhaochenyang20 @Stealthwriter Ensure you have a clean installation of SGLang; you can follow the docs here. My system details:
@adarshxs Great! Thanks for the help. And you can open a PR to update the docs:
Checklist
Motivation
Please add support for Phi-4; it's very powerful, and vLLM already supports it.
Related resources
No response