
[Feature] Add support for Phi4 #3090

Open
Stealthwriter opened this issue Jan 23, 2025 · 7 comments
Labels: help wanted (Extra attention is needed)

Comments

@Stealthwriter

Checklist

Motivation

Please add support for Phi-4; it's very powerful, and vLLM already has it.

Related resources

No response

@zhaochenyang20
Collaborator

Thanks! @adarshxs and @ravi03071991 are on this.

@zhaochenyang20 zhaochenyang20 self-assigned this Jan 24, 2025
@zhaochenyang20 zhaochenyang20 added the help wanted Extra attention is needed label Jan 24, 2025
@adarshxs
Contributor

Hey @Stealthwriter, thanks for raising the issue. Phi-4 is based on the Phi-3 architecture, which is already supported by sglang. You can run the model by spinning up a server like this:

python -m sglang.launch_server --model-path microsoft/phi-4 --port 30000 --host 0.0.0.0
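
Once the server is up, a quick smoke test against the native /generate endpoint looks like this (a minimal sketch, assuming the host/port from the command above and the requests package; the payload shape follows SGLang's native generate API):

import requests

# Send a short completion request to the locally running server.
resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
)
resp.raise_for_status()
print(resp.json()["text"])  # the generated continuation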

@Stealthwriter
Author

Stealthwriter commented Jan 24, 2025 via email

@zhaochenyang20
Collaborator

@adarshxs Are you sure? 🤔

@adarshxs
Contributor

Yep, I just ran it, @zhaochenyang20. @Stealthwriter, can you share your error trace?

@adarshxs
Contributor

adarshxs commented Jan 24, 2025

My logs after I spin up the server with python -m sglang.launch_server --model-path microsoft/phi-4 --port 30000 --host 0.0.0.0:

root@f19a14f0c9f9:/workspace/sglang# python -m sglang.launch_server --model-path microsoft/phi-4 --port 30000 --host 0.0.0.0
[2025-01-24 09:38:21] server_args=ServerArgs(model_path='microsoft/phi-4', tokenizer_path='microsoft/phi-4', tokenizer_mode='auto', load_format='auto', trust_remote_code=False, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='microsoft/phi-4', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='0.0.0.0', port=30000, mem_fraction_static=0.88, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=False, tp_size=1, stream_interval=1, random_seed=764393099, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=False, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False)
config.json: 100%| 820/820 [00:00<00:00, 5.65MB/s]
tokenizer_config.json: 100%| 17.7k/17.7k [00:00<00:00, 39.9MB/s]
vocab.json: 100%| 1.61M/1.61M [00:00<00:00, 10.9MB/s]
merges.txt: 100%| 917k/917k [00:00<00:00, 44.1MB/s]
tokenizer.json: 100%| 4.25M/4.25M [00:00<00:00, 24.1MB/s]
added_tokens.json: 100%| 2.50k/2.50k [00:00<00:00, 33.0MB/s]
special_tokens_map.json: 100%| 99.0/99.0 [00:00<00:00, 1.00MB/s]
[2025-01-24 09:38:28 TP0] Init torch distributed begin.
[2025-01-24 09:38:28 TP0] Load weight begin. avail mem=78.84 GB
[2025-01-24 09:38:29 TP0] Using model weights format ['*.safetensors']
model-00001-of-00006.safetensors: 100%| 4.93G/4.93G [00:07<00:00, 675MB/s]
model-00002-of-00006.safetensors: 100%| 4.95G/4.95G [00:06<00:00, 716MB/s]
model-00003-of-00006.safetensors: 100%| 4.90G/4.90G [00:07<00:00, 649MB/s]
model-00004-of-00006.safetensors: 100%| 4.77G/4.77G [00:06<00:00, 691MB/s]
model-00005-of-00006.safetensors: 100%| 4.77G/4.77G [00:07<00:00, 672MB/s]
model-00006-of-00006.safetensors: 100%| 4.99G/4.99G [00:08<00:00, 590MB/s]
model.safetensors.index.json: 100%| 20.4k/20.4k [00:00<00:00, 48.5MB/s]
Loading safetensors checkpoint shards: 0% Completed | 0/6
[2025-01-24 09:39:20 TP0] Load weight end. type=Phi3ForCausalLM, dtype=torch.bfloat16, avail mem=51.36 GB
[2025-01-24 09:39:20 TP0] KV Cache is allocated. K size: 20.95 GB, V size: 20.95 GB.
[2025-01-24 09:39:20 TP0] Memory pool end. avail mem=9.08 GB
[2025-01-24 09:39:21 TP0] Capture cuda graph begin. This can take up to several minutes.
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 23/23 [00:06<00:00, 3.37it/s]
[2025-01-24 09:39:27 TP0] Capture cuda graph end. Time elapsed: 6.83 s
[2025-01-24 09:39:28 TP0] max_total_num_tokens=219684, max_prefill_tokens=16384, max_running_requests=4097, context_len=16384
[2025-01-24 09:39:28] INFO: Started server process [568]
[2025-01-24 09:39:28] INFO: Waiting for application startup.
[2025-01-24 09:39:28] INFO: Application startup complete.
[2025-01-24 09:39:28] INFO: Uvicorn running on http://0.0.0.0:30000 (Press CTRL+C to quit)
[2025-01-24 09:39:29] INFO: 127.0.0.1:45766 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-24 09:39:29 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-24 09:39:31] INFO: 127.0.0.1:45768 - "POST /generate HTTP/1.1" 200 OK
[2025-01-24 09:39:31] The server is fired up and ready to roll!

Running it:

import subprocess, json

# Query the server's OpenAI-compatible chat endpoint with curl and
# capture the raw JSON response body.
curl_command = """
curl -s http://localhost:30000/v1/chat/completions \
  -d '{"model": "microsoft/phi-4", "messages": [{"role": "user", "content": "What model are you?"}]}'
"""

response = json.loads(subprocess.check_output(curl_command, shell=True))
print(response)

Output:

{
  "id": "688ac6440bdc4e6c950807838432f9ef",
  "object": "chat.completion",
  "created": 1737711769,
  "model": "nicrosoft/phi-4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I am a language model developed by Microsoft, known as Phi. My design is geared towards understanding and generating human-like text to assist with a wide range of inquiries, from answering questions to providing explanations on various topics. If you have any questions or need assistance, feel free to ask!",
        "tool_calls": null,
        "logprobs": null
      },
      "finish_reason": "stop",
      "matched_stop": 100257
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 7,
    "completion_tokens": 59,
    "prompt_tokens_details": null
  }
}
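
For completeness, the same request can also go through the OpenAI-compatible Python client instead of curl (a sketch, assuming the openai package v1+ is installed and the server above is running; the placeholder api_key is only there because the client requires one):

from openai import OpenAI

# Point the client at the local SGLang server's OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="microsoft/phi-4",
    messages=[{"role": "user", "content": "What model are you?"}],
)
print(completion.choices[0].message.content)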

Where Phi-3 (and Phi-4, which shares the same architecture) is supported:

class Phi3ForCausalLM(LlamaForCausalLM):
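
You can confirm the mapping yourself from the model config (a sketch, assuming transformers is installed and the Hugging Face Hub is reachable):

from transformers import AutoConfig

# Phi-4's config declares the Phi-3 architecture class, which is why
# sglang's existing Phi3ForCausalLM implementation picks it up.
config = AutoConfig.from_pretrained("microsoft/phi-4")
print(config.architectures)  # ['Phi3ForCausalLM'], matching the load log above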

@zhaochenyang20 @Stealthwriter

Make sure you have a clean installation of SGLang. You can follow the docs here

My system details:
Python 3.10.12
NVIDIA A100-SXM4-80GB (Driver Version: 565.57.01, CUDA Version: 12.7)
sglang: 0.4.1.post7
torch: 2.5.1+cu124
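
If it helps when comparing environments, the same details can be printed from Python (a sketch, assuming both packages import cleanly):

import torch
import sglang

# Report the library and CUDA versions used for this run.
print("sglang:", sglang.__version__)
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)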

@zhaochenyang20
Collaborator

@adarshxs Great, thanks for the help! Could you open a PR to update the docs:

https://docs.sglang.ai/references/supported_models.html
