
[Feature] Add support for Phi4 #3090

Open
Stealthwriter opened this issue Jan 23, 2025 · 7 comments
Labels: help wanted (Extra attention is needed)

Comments

@Stealthwriter

Checklist

Motivation

Please add support for Phi-4; it's very powerful, and vLLM already has it.

Related resources

No response

@zhaochenyang20
Collaborator

Thanks! @adarshxs and @ravi03071991 are on this.

@zhaochenyang20 zhaochenyang20 self-assigned this Jan 24, 2025
@zhaochenyang20 zhaochenyang20 added the help wanted Extra attention is needed label Jan 24, 2025
@adarshxs
Contributor

Hey @Stealthwriter, thanks for raising the issue. Phi-4 is based on the Phi-3 architecture, which is already supported by sglang. You can run the model by spinning up a server like this:

python -m sglang.launch_server --model-path microsoft/phi-4 --port 30000 --host 0.0.0.0
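
Once the server is up, a quick smoke test against the native /generate endpoint looks like this (a minimal sketch, assuming the host/port from the command above and the requests package; the payload shape follows SGLang's native generate API):

import requests

# Send a short completion request to the locally running server.
resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
)
resp.raise_for_status()
print(resp.json()["text"])  # the generated continuation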

@Stealthwriter
Author

Stealthwriter commented Jan 24, 2025 via email

@zhaochenyang20
Collaborator

@adarshxs Are you sure? 🤔

@adarshxs
Contributor

Yep, I just ran it, @zhaochenyang20. @Stealthwriter, can you share your error trace?

@adarshxs
Contributor

adarshxs commented Jan 24, 2025

My logs after I spin up the server with python -m sglang.launch_server --model-path microsoft/phi-4 --port 30000 --host 0.0.0.0:

root@f19a14f0c9f9:/workspace/sglang# python -m sglang.launch_server --model-path microsoft/phi-4 --port 30000 --host 0.0.0.0
[2025-01-24 09:38:21] server_args=ServerArgs(model_path='microsoft/phi-4', tokenizer_path='microsoft/phi-4', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='microsoft/phi-4', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=764393099, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False)
config.json: 100%| 820/820 [00:00<00:00, 5.65MB/s]
tokenizer_config.json: 100%| 17.7k/17.7k [00:00<00:00, 39.9MB/s]
vocab.json: 100%| 1.61M/1.61M [00:00<00:00, 10.9MB/s]
merges.txt: 100%| 917k/917k [00:00<00:00, 44.1MB/s]
tokenizer.json: 100%| 4.25M/4.25M [00:00<00:00, 24.1MB/s]
added_tokens.json: 100%| 2.50k/2.50k [00:00<00:00, 33.0MB/s]
special_tokens_map.json: 100%| 99.0/99.0 [00:00<00:00, 1.00MB/s]
[2025-01-24 09:38:28 TP0] Init torch distributed begin.
[2025-01-24 09:38:28 TP0] Load weight begin. avail mem=78.84 GB
[2025-01-24 09:38:29 TP0] Using model weights format ['*.safetensors']
model-00001-of-00006.safetensors: 100%| 4.93G/4.93G [00:07<00:00, 675MB/s]
model-00002-of-00006.safetensors: 100%| 4.95G/4.95G [00:06<00:00, 716MB/s]
model-00003-of-00006.safetensors: 100%| 4.90G/4.90G [00:07<00:00, 649MB/s]
model-00004-of-00006.safetensors: 100%| 4.77G/4.77G [00:06<00:00, 691MB/s]
model-00005-of-00006.safetensors: 100%| 4.77G/4.77G [00:07<00:00, 672MB/s]
model-00006-of-00006.safetensors: 100%| 4.99G/4.99G [00:08<00:00, 590MB/s]
model.safetensors.index.json: 100%| 20.4k/20.4k [00:00<00:00, 48.5MB/s]
Loading safetensors checkpoint shards: 0% Completed | 0/6
[2025-01-24 09:39:20 TP0] Load weight end. type=Phi3ForCausalLM, dtype=torch.bfloat16, avail mem=51.36 GB
[2025-01-24 09:39:20 TP0] KV Cache is allocated. K size: 20.95 GB, V size: 20.95 GB.
[2025-01-24 09:39:20 TP0] Memory pool end. avail mem=9.08 GB
[2025-01-24 09:39:21 TP0] Capture cuda graph begin. This can take up to several minutes.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 23/23 [00:06<00:00, 3.37it/s]
[2025-01-24 09:39:27 TP0] Capture cuda graph end. Time elapsed: 6.83 s
[2025-01-24 09:39:28 TP0] max_total_num_tokens=219684, max_prefill_tokens=16384, max_running_requests=4097, context_len=16384
[2025-01-24 09:39:28] INFO: Started server process [568]
[2025-01-24 09:39:28] INFO: Waiting for application startup.
[2025-01-24 09:39:28] INFO: Application startup complete.
[2025-01-24 09:39:28] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-01-24 09:39:29] INFO: 127.0.0.1:45766 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-24 09:39:29 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-24 09:39:31] INFO: 127.0.0.1:45768 - "POST /generate HTTP/1.1" 200 OK
[2025-01-24 09:39:31] The server is fired up and ready to roll!

Running it:

import subprocess, json

# Query the server's OpenAI-compatible chat endpoint with curl and
# capture the raw JSON response body.
curl_command = """
curl -s http://localhost:30000/v1/chat/completions \
  -d '{"model": "microsoft/phi-4", "messages": [{"role": "user", "content": "What model are you?"}]}'
"""

response = json.loads(subprocess.check_output(curl_command, shell=True))
print(response)

Output:

{
  "id": "688ac6440bdc4e6c950807838432f9ef",
  "object": "chat.completion",
  "created": 1737711769,
  "model": "nicrosoft/phi-4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I am a language model developed by Microsoft, known as Phi. My design is geared towards understanding and generating human-like text to assist with a wide range of inquiries, from answering questions to providing explanations on various topics. If you have any questions or need assistance, feel free to ask!",
        "tool_calls": null,
        "logprobs": null
      },
      "finish_reason": "stop",
      "matched_stop": 100257
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 7,
    "completion_tokens": 59,
    "prompt_tokens_details": null
  }
}
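
For completeness, the same request can also go through the OpenAI-compatible Python client instead of curl (a sketch, assuming the openai package v1+ is installed and the server above is running; the placeholder api_key is only there because the client requires one):

from openai import OpenAI

# Point the client at the local SGLang server's OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="microsoft/phi-4",
    messages=[{"role": "user", "content": "What model are you?"}],
)
print(completion.choices[0].message.content)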

Where Phi-3 (and Phi-4, which shares the same architecture) is supported:

class Phi3ForCausalLM(LlamaForCausalLM):
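
You can confirm the mapping yourself from the model config (a sketch, assuming transformers is installed and the Hugging Face Hub is reachable):

from transformers import AutoConfig

# Phi-4's config declares the Phi-3 architecture class, which is why
# sglang's existing Phi3ForCausalLM implementation picks it up.
config = AutoConfig.from_pretrained("microsoft/phi-4")
print(config.architectures)  # ['Phi3ForCausalLM'], matching the load log above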

@zhaochenyang20 @Stealthwriter

Make sure you have a clean installation of SGLang. You can follow the docs here

My system details:
Python 3.10.12
NVIDIA A100-SXM4-80GB (Driver Version: 565.57.01, CUDA Version: 12.7)
sglang: 0.4.1.post7
torch: 2.5.1+cu124
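
If it helps when comparing environments, the same details can be printed from Python (a sketch, assuming both packages import cleanly):

import torch
import sglang

# Report the library and CUDA versions used for this run.
print("sglang:", sglang.__version__)
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)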

@zhaochenyang20
Collaborator

@adarshxs Great, thanks for the help! Could you open a PR to update the docs:

https://docs.sglang.ai/references/supported_models.html
