Draft: API Server #18
base: v1.1
Conversation
fearnworks commented Oct 24, 2023
- Implements OpenAI mock API server (a sketch of such an endpoint appears after this list)
- Implements basic API server
- Examples for consumption through Jupyter notebook and Gradio
- Includes streaming response example
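For context, a mock OpenAI-compatible chat-completions endpoint with optional streaming might look like the following. This is a minimal sketch assuming FastAPI and Pydantic, with illustrative names (`ChatRequest`, `token_stream`) rather than the PR's actual code:

```python
# Hypothetical sketch of an OpenAI-compatible mock endpoint; not the PR's code.
import json
import time

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()


class ChatRequest(BaseModel):
    model: str
    messages: list[dict]
    stream: bool = False


@app.post("/v1/chat/completions")
async def chat_completions(req: ChatRequest):
    reply = "This is a canned mock response."
    if not req.stream:
        # Non-streaming: return a single OpenAI-shaped JSON body.
        return {
            "id": "chatcmpl-mock",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": req.model,
            "choices": [{
                "index": 0,
                "message": {"role": "assistant", "content": reply},
                "finish_reason": "stop",
            }],
        }

    async def token_stream():
        # Streaming: emit server-sent events, one chunk per token.
        for token in reply.split():
            chunk = {
                "object": "chat.completion.chunk",
                "choices": [{"index": 0, "delta": {"content": token + " "}}],
            }
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(token_stream(), media_type="text/event-stream")
```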
Hey, quick question. Since vLLM doesn't support LoRA, how are you planning to have different experts loaded at the same time? I'm asking because I've been trying to figure out the same thing, but I haven't gotten anywhere.
We are currently exploring adding a custom mistral_moe model to vLLM to handle loading the LoRA weights and any changes we need to apply to the forward passes.
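One plausible shape for that custom model path is to fold each expert's LoRA delta into the corresponding base weights at load time, so the forward pass stays a plain matmul. Below is a minimal sketch of the standard LoRA merge, W' = W + (alpha / r) * B @ A, using hypothetical names (`merge_lora`, `lora_A`, `lora_B`) rather than the repo's actual mistral_moe code:

```python
# Hedged sketch of the standard LoRA weight merge; not the repo's code.
# Each expert would carry its own (lora_A, lora_B) pair applied against
# a shared base weight.
import torch


def merge_lora(base_weight: torch.Tensor,  # (out_features, in_features)
               lora_A: torch.Tensor,       # (rank, in_features)
               lora_B: torch.Tensor,       # (out_features, rank)
               alpha: float,
               rank: int) -> torch.Tensor:
    """Return the merged weight W' = W + (alpha / rank) * B @ A."""
    scaling = alpha / rank
    return base_weight + scaling * (lora_B @ lora_A)


# Usage: pick an expert's adapter at load time and merge it once.
# merged = merge_lora(w, a_expert0, b_expert0, alpha=16.0, rank=8)
```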
Ah, I see. Makes sense. Would that work with tensor parallelism too?
Pivoting this to a general FastAPI server.
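A general FastAPI server like this could then be consumed from a notebook with a streaming HTTP client. A minimal sketch, assuming the mock endpoint above is running at localhost:8000 and using httpx (the URL and payload shape are assumptions, not the repo's actual examples):

```python
# Hypothetical client-side sketch for consuming the streaming endpoint;
# assumes the server above is running at localhost:8000.
import httpx

payload = {
    "model": "mock",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": True,
}

with httpx.stream("POST", "http://localhost:8000/v1/chat/completions",
                  json=payload, timeout=30.0) as response:
    for line in response.iter_lines():
        # Each SSE line carries one JSON chunk until the [DONE] sentinel.
        if line.startswith("data: ") and line != "data: [DONE]":
            print(line.removeprefix("data: "))
```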
This reverts commit e0dd9dc.