Streamline LLM import/compile/serve user experience #691

Open
2 of 8 tasks
ScottTodd opened this issue Dec 12, 2024 · 2 comments
Labels
documentation (Improvements or additions to documentation), enhancement (New feature or request)

Comments

@ScottTodd
Member

ScottTodd commented Dec 12, 2024

Here is our current documentation for running llama models through shark-ai: https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/e2e_llama8b_mi300x.md (permalink).

Our current steps are convoluted compared to similar documentation in other projects.

Comparison with other projects

vLLM

TensorRT-LLM

MLC LLM

Ollama

TorchServe

Feedback

I have some inlined comments on one file in this commit on my fork. Here I'll try to summarize them as tasks:

  • Only one line in that guide is actually device-specific: --iree-hip-target=gfx942. The file should be generalized.
  • Any action that creates files (downloading, importing, compiling, etc.) should use a consistent, structured default location. Similarly, tool arguments should have sensible default values, so the documentation does not need to specify 9+ environment variables.
  • The sharktank.utils.hf_datasets module is a tool for developer convenience and testing. Users should be directed to download from Hugging Face directly, using standard tools and APIs. For most cases this means either the huggingface-cli tool (https://huggingface.co/docs/huggingface_hub/main/en/guides/cli) or the huggingface_hub library (https://huggingface.co/docs/hub/en/models-downloading).
    • We should refer to models with their standard "organization/repository" naming, not the shorthand from our hf_datasets file, e.g. SanctumAI/Meta-Llama-3.1-8B-Instruct-GGUF and not llama3_8B_fp16
    • That script does group files from multiple repos (typically the quantized .gguf from one repo and the tokenizer .json from another). We should be more flexible there... maybe include a standard/default tokenizer config ourselves?
    • Whatever tooling we have should work across multiple LLM models (llama 2, llama 3, mistral, mixtral, gemma, etc.). If models don't work then the tools should produce good error messages. If models require special export code then that needs some project architecture work. We shouldn't be directing users to a script called sharktank.examples.export_paged_llm_v1.
  • We could group some subsets of the [download, export to mlir, compile, serve] steps into a new tool, similar to how the projects linked above handle it. This could leverage iree.build for some or all steps: Find More General and Easier to use Alternative For Compiling Models for Shortfin LLM Server #402
    • Such a script could default to the current GPU device(s) and include options for multi-device sharding and other high level configurations that users/developers may want to use
  • The golden path through the documentation should be free of text like "if you see this error, run this to fix it" or "make sure you have these files here before running this command". The error messages should explain how to fix them and the defaults should work for most users.
  • Our current server seems to have its own API? We should use industry standard APIs like OpenAI's API: [tracking] Production Grade Shortfin-LLM #245
  • Demo commands do not need to use a python interpreter to send HTTP requests. Other projects provide a client library or use curl.
  • Server shutdown needs a better story than kill -9 $shortfin_process: Improve Method for Shutting Down Shortfin LLM Server #490
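
The download and default-location tasks above could be sketched roughly as follows. This is only an illustration under stated assumptions: `hf_hub_download` is the real huggingface_hub call, but the helper names and the `~/.cache/shark-ai` root are hypothetical and do not exist in shark-ai today.

```python
from pathlib import Path
from typing import Optional


def default_artifact_dir(repo_id: str, root: Optional[Path] = None) -> Path:
    """Return a per-model directory for exported and compiled files.

    The "~/.cache/shark-ai" root is a hypothetical default, not a path the
    project uses today.
    """
    org, _, repo = repo_id.partition("/")
    if not org or not repo:
        # Reject shorthand names so the error explains what is expected.
        raise ValueError(
            f"expected an 'organization/repository' id, got {repo_id!r}"
        )
    base = root if root is not None else Path.home() / ".cache" / "shark-ai"
    return base / org / repo


def download_model_file(repo_id: str, filename: str) -> Path:
    """Download one file from the Hugging Face Hub into its standard cache."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return Path(hf_hub_download(repo_id=repo_id, filename=filename))
```

With this shape, a user passes the standard repository id (for example SanctumAI/Meta-Llama-3.1-8B-Instruct-GGUF) and a file name taken from that repository's file listing. Downloads land in the Hugging Face cache, and later export and compile steps can write next to each other under one per-model directory.
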
ScottTodd added the documentation and enhancement labels Dec 12, 2024
@ScottTodd
Member Author

ScottTodd commented Dec 19, 2024

I just went through the current docs at https://github.com/nod-ai/shark-ai/blob/main/docs/shortfin/llm/user/e2e_llama8b_mi300x.md. Good news: everything worked on this machine, using Python 3.11 on Ubuntu 22.04 and a W7900 (with --iree-hip-target=gfx1100).

A few more notes:

  • Export failed on Python 3.10 with

      File "/home/nod/dev/projects/shark-ai/3.11.llama.venv/lib/python3.10/site-packages/sharktank/types/tensors.py", line 499
        self.as_torch()[*key] = unbox_tensor(value)
                        ^
    

    Docs already suggest 3.11 as the minimum, but we can do better.

  • The environment variables, if we keep them, should put more context in the file names:

    -    export MLIR_PATH=$EXPORT_DIR/model.mlir
    -    export VMFB_PATH=$EXPORT_DIR/model.vmfb
    +    export MLIR_PATH=$EXPORT_DIR/llama_8b_fp16_bs1_bs4.mlir
    +    export VMFB_PATH=$EXPORT_DIR/llama_8b_fp16_bs1_bs4_rocm_gfx1100.vmfb

    Building tools to group the download/import/compile steps would help keep the names and metadata organized

  • The sharktank.examples.export_paged_llm_v1 script (I'd like to at least rename it...) has no progress indicators and takes around 1 minute for the provided 8b example on this machine. We could at least log some expectation about how long it will take, so users know how long to give it before giving up.

  • As mentioned in some other points, running the server with python -m shortfin_apps.llm.server, grabbing the process ID, and redirecting output to a file is awkward. We can add a console script entry point like shortfin_server, build logging redirection into the script itself, and add a better shutdown method.

  • The output for
    payload = {"text": "Name the capital of the United States.", "sampling_params": {"max_completion_tokens": 50}}
    is
    data: Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington
    on my machine. Should it be repeated that much? Maybe we should pick a different example prompt.

    edit: tried again and got a different response in 1/4 runs:
    Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, D.C.! Washington, Delsingelsingelsingelsingelsingelsingelsingeroneroneroneron이에 andija centrif blush blush blush blush and and and and and and and and and
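
The file-naming suggestion in the notes above could be captured in a small helper, so that names are derived from the export settings instead of typed by hand. This is a minimal sketch: `ExportConfig`, `mlir_path`, and `vmfb_path` are hypothetical names, not part of sharktank or shortfin.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Tuple


@dataclass(frozen=True)
class ExportConfig:
    """Settings that should be visible in artifact file names (hypothetical)."""

    model: str                    # e.g. "llama_8b"
    dtype: str                    # e.g. "fp16"
    batch_sizes: Tuple[int, ...]  # e.g. (1, 4)

    def stem(self) -> str:
        # Join every batch size as "bs<N>" so the name records all of them.
        bs = "_".join(f"bs{n}" for n in self.batch_sizes)
        return f"{self.model}_{self.dtype}_{bs}"


def mlir_path(export_dir: Path, cfg: ExportConfig) -> Path:
    """The exported MLIR depends only on the model settings."""
    return export_dir / f"{cfg.stem()}.mlir"


def vmfb_path(export_dir: Path, cfg: ExportConfig, backend: str, target: str) -> Path:
    """The compiled module also depends on the backend and device target."""
    return export_dir / f"{cfg.stem()}_{backend}_{target}.vmfb"
```

For `ExportConfig("llama_8b", "fp16", (1, 4))` with backend `rocm` and target `gfx1100`, this produces the same two file names as the diff above.
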

ScottTodd added a commit that referenced this issue Dec 19, 2024
Progress on #691, trying to
simplify a few steps before putting this into release notes for 3.1.0.

* Add suggested `export/` directory to `.gitignore` (I'd prefer for the
tools to default to a path in the user's homedir, but this is a less
invasive change)
* Remove `sharktank` from install instructions as it is included in
`shark-ai` nightly releases now
* Rework "Verify server" section to start with a health check then use
`curl`. Keep the Python sample code for now, though similar projects
usually also have a Python API for interfacing with LLMs. We also don't
use a standardized HTTP API yet (like the OpenAI API). Maybe the SGLang
integration will be more natural for users.
ScottTodd added a commit that referenced this issue Dec 19, 2024
Progress on #691.

* Generalize guide to any llama model on any accelerator, mentioning
specifics of platform/model support where it matters
* Restructure the introduction section with more context and an overview
of the rest of the guide (more still to do here, explaining this tech
stack)
* Add prerequisites section, modeled after the [user
guide](https://github.com/nod-ai/shark-ai/blob/main/docs/user_guide.md)
* Add more "why" explanations for many steps (more still to do here)
* Start trimming environment variable instructions
@ScottTodd
Member Author

[ ] Our current server seems to have its own API? We should use industry standard APIs like OpenAI's API: [tracking] Production Grade Shortfin-LLM #245

Quoting https://sgl-project.github.io/frontend/frontend.html,

The frontend language can be used with local models or API models. It is an alternative to the OpenAI API. You may find it easier to use for complex prompting workflow.

We should be able to support both / multiple APIs, but focusing on SGLang for now makes sense.

monorimet pushed a commit that referenced this issue Jan 8, 2025
monorimet pushed a commit that referenced this issue Jan 8, 2025