Merge branch 'main' into angelayi/aoti_metadata
Jack-Khuu authored Nov 26, 2024
2 parents 452dc72 + de2507b commit 75908b8
Showing 16 changed files with 253 additions and 233 deletions.
20 changes: 20 additions & 0 deletions .ci/scripts/run-docs
@@ -91,3 +91,23 @@ if [ "$1" == "evaluation" ]; then
echo "*******************************************"
bash -x ./run-evaluation.sh
fi

if [ "$1" == "multimodal" ]; then

# Expecting that this might fail this test as-is, because
# it's the first on-pr test depending on githib secrets for access with HF token access

echo "::group::Create script to run multimodal"
python3 torchchat/utils/scripts/updown.py --file docs/multimodal.md > ./run-multimodal.sh
# for good measure, if something happened to updown processor,
# and it did not error out, fail with an exit 1
echo "exit 1" >> ./run-multimodal.sh
echo "::endgroup::"

echo "::group::Run multimodal"
echo "*******************************************"
cat ./run-multimodal.sh
echo "*******************************************"
bash -x ./run-multimodal.sh
echo "::endgroup::"
fi
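
The trailing `exit 1` guard added above can be sketched in isolation. This is a minimal illustration, not the project's code: the two generator functions are hypothetical stand-ins for `updown.py`, and it assumes that a healthy generated script ends with its own `exit 0`, so the appended `exit 1` is only reached when the generator silently produced nothing.

```shell
#!/usr/bin/env bash
# Sketch of the "trailing exit 1" guard. The generators below are
# hypothetical stand-ins for updown.py; a healthy run is ASSUMED to
# end its output with `exit 0`.

workdir="$(mktemp -d)"

# Healthy generator: emits commands and finishes with its own `exit 0`.
generate_ok() {
  printf '%s\n' 'echo "running doc commands"' 'exit 0'
}

# Broken generator: emits nothing, yet does not report an error.
generate_broken() {
  :
}

generate_ok > "$workdir/ok.sh"
echo "exit 1" >> "$workdir/ok.sh"        # never reached: the script exits 0 first

generate_broken > "$workdir/broken.sh"
echo "exit 1" >> "$workdir/broken.sh"    # the only line, so the run fails

ok_status=0
bash "$workdir/ok.sh" > /dev/null || ok_status=$?

broken_status=0
bash "$workdir/broken.sh" > /dev/null || broken_status=$?

echo "ok=$ok_status broken=$broken_status"
rm -rf "$workdir"
```

The guard costs one line and turns a silent "nothing was tested" outcome into a failed job.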
45 changes: 44 additions & 1 deletion .github/workflows/run-readme-pr.yml
@@ -243,4 +243,47 @@ jobs:
echo "::group::Completion"
echo "tests complete"
echo "*******************************************"
echo "::endgroup::"
echo "::endgroup::"
test-multimodal-any:
uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
with:
runner: linux.g5.4xlarge.nvidia.gpu
gpu-arch-type: cuda
gpu-arch-version: "12.1"
timeout: 60
script: |
echo "::group::Print machine info"
uname -a
echo "::endgroup::"
echo "::group::Install newer objcopy that supports --set-section-alignment"
yum install -y devtoolset-10-binutils
export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
echo "::endgroup::"
.ci/scripts/run-docs multimodal
echo "::group::Completion"
echo "tests complete"
echo "*******************************************"
echo "::endgroup::"
test-multimodal-cpu:
uses: pytorch/test-infra/.github/workflows/linux_job.yml@main
with:
runner: linux.g5.4xlarge.nvidia.gpu
gpu-arch-type: cuda
gpu-arch-version: "12.1"
timeout: 60
script: |
echo "::group::Print machine info"
uname -a
echo "::endgroup::"
echo "::group::Install newer objcopy that supports --set-section-alignment"
yum install -y devtoolset-10-binutils
export PATH=/opt/rh/devtoolset-10/root/usr/bin/:$PATH
echo "::endgroup::"
TORCHCHAT_DEVICE=cpu .ci/scripts/run-docs multimodal
4 changes: 4 additions & 0 deletions .gitignore
@@ -19,6 +19,10 @@ runner-et/cmake-out/*
runner-aoti/cmake-out/*
cmake-out/

# Example project Android Studio ignore
torchchat/edge/android/torchchat/.idea/*


# pte files
*.pte

90 changes: 18 additions & 72 deletions docs/ADVANCED-USERS.md
@@ -18,10 +18,10 @@ Torchchat is currently in a pre-release state and under extensive development.
[shell default]: TORCHCHAT_ROOT=${PWD} ./torchchat/utils/scripts/install_et.sh


This is the advanced users guide, if you're looking to get started
This is the advanced users' guide, if you're looking to get started
with LLMs, please refer to the README at the root directory of the
torchchat distro. This is an advanced user guide, so we will have
many more concepts and options to discuss and taking advantage of them
many more concepts and options to discuss and take advantage of them
may take some effort.

We welcome community contributions of all kinds. If you find
@@ -41,7 +41,7 @@ While we strive to support a broad range of models, we can't test them
all. We classify supported models as tested ✅, work in progress 🚧 or
some restrictions ❹.

We invite community contributions of new model suport and test results!
We invite community contributions of new model support and test results!

| Model | Tested | Eager | torch.compile | AOT Inductor | ExecuTorch | Fits on Mobile |
|-----|--------|-------|-----|-----|-----|-----|
@@ -86,7 +86,7 @@ Server C++ runtime | n/a | run.cpp model.pte | ✅ |
Mobile C++ runtime | n/a | app model.pte | ✅ |
Mobile C++ runtime | n/a | app + AOTI | 🚧 |

**Getting help:** Each command implements the --help option to give addititonal information about available options:
**Getting help:** Each command implements the --help option to give additional information about available options:

[skip default]: begin
```
@@ -96,8 +96,8 @@ python3 torchchat.py [ export | generate | chat | eval | ... ] --help

Exported models can be loaded back into torchchat for chat or text
generation, letting you experiment with the exported model and valid
model quality. The python interface is the same in all cases and is
used for testing nad test harnesses too.
model quality. The Python interface is the same in all cases and is
used for testing and test harnesses, too.

Torchchat comes with server C++ runtimes to execute AOT Inductor and
ExecuTorch models. A mobile C++ runtimes allow you to deploy
@@ -115,7 +115,7 @@ Some common models are recognized by torchchat based on their filename
through `Model.from_name()` to perform a fuzzy match against a
table of known model architectures. Alternatively, you can specify the
index into that table with the option `--params-table ${INDEX}` where
the index is the lookup key key in the [the list of known
the index is the lookup key in the [the list of known
pconfigurations](https://github.com/pytorch/torchchat/tree/main/torchchat/model_params)
For example, for the stories15M model, this would be expressed as
`--params-table stories15M`. (We use the model constructor
@@ -237,7 +237,7 @@ which chooses the best 16-bit floating point type.

The virtual device fast and virtual floating point data types fast and
fast16 are best used for eager/torch.compiled execution. For export,
specify the your device choice for the target system with --device for
specify your device choice for the target system with --device for
AOTI-exported DSO models, and using ExecuTorch delegate selection for
ExecuTorch-exported PTE models.

@@ -250,8 +250,7 @@ python3 torchchat.py generate [--compile] --checkpoint-path ${MODEL_PATH} --prom
To improve performance, you can compile the model with `--compile`
trading off the time to first token processed with time per token. To
improve performance further, you may also compile the prefill with
`--compile_prefill`. This will increase further compilation times though. The
`--compile-prefill` option is not compatible with `--prefill-prefill`.
`--compile-prefill`. This will increase further compilation times though.
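
As a rough illustration of combining the two flags described above, the sketch below composes a `generate` invocation with both `--compile` and `--compile-prefill`. The `MODEL_PATH` default and the `DRY_RUN` switch are assumptions introduced for this example only; by default the command is printed rather than executed, since running it requires a torchchat checkout and a real checkpoint.

```shell
#!/usr/bin/env bash
# Dry-run sketch: compose a generate command with both compile flags.
# MODEL_PATH's default and the DRY_RUN switch are illustrative assumptions,
# not torchchat options.

MODEL_PATH="${MODEL_PATH:-/path/to/model.pth}"
DRY_RUN="${DRY_RUN:-1}"

# --compile compiles the model; --compile-prefill additionally compiles
# the prefill, at the cost of longer compilation time.
cmd=(python3 torchchat.py generate
     --compile
     --compile-prefill
     --checkpoint-path "$MODEL_PATH"
     --prompt "Once upon a time")

if [ "$DRY_RUN" = "1" ]; then
  # Print the command instead of running it.
  printf '%s\n' "${cmd[*]}"
else
  "${cmd[@]}"
fi
```

Setting `DRY_RUN=0` runs the composed command from the torchchat root; expect a longer start-up before the first token when both flags are set.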

Parallel prefill is not yet supported by exported models, and may be
supported in a future release.
@@ -265,7 +264,7 @@ the introductory README.
In addition to running eval on models in eager mode and JIT-compiled
mode with `torch.compile()`, you can also load dso and pte models back
into the PyTorch to evaluate the accuracy of exported model objects
(e.g., after applying quantization or other traqnsformations to
(e.g., after applying quantization or other transformations to
improve speed or reduce model size).

Loading exported models back into a Python-based Pytorch allows you to
@@ -297,14 +296,14 @@ for ExecuTorch.)

We export the stories15M model with the following command for
execution with the ExecuTorch runtime (and enabling execution on a
wide range of community and vendor supported backends):
wide range of community and vendor-supported backends):

```
python3 torchchat.py export --checkpoint-path ${MODEL_PATH} --output-pte-path ${MODEL_NAME}.pte
```

Alternatively, we may generate a native instruction stream binary
using AOT Inductor for CPU oor GPUs (the latter using Triton for
using AOT Inductor for CPU or GPUs (the latter using Triton for
optimizations such as operator fusion):

```
@@ -319,10 +318,10 @@ the exported model artifact back into a model container with a
compatible API surface for the `model.forward()` function. This
enables users to test, evaluate and exercise the exported model
artifact with familiar interfaces, and in conjunction with
pre-exiisting Python model unit tests and common environments such as
pre-existing Python model unit tests and common environments such as
Jupyter notebooks and/or Google colab.

Here is how to load an exported model into the python environment on the example of using an exported model with `generate.oy`.
Here is how to load an exported model into the Python environment using an exported model with the `generate` command.

```
python3 torchchat.py generate --checkpoint-path ${MODEL_PATH} --pte-path ${MODEL_NAME}.pte --device cpu --prompt "Once upon a time"
@@ -452,7 +451,7 @@ strategies:
You can find instructions for quantizing models in
[docs/quantization.md](file:///./quantization.md). Advantageously,
quantization is available in eager mode as well as during export,
enabling you to do an early exploration of your quantization setttings
enabling you to do an early exploration of your quantization settings
in eager mode. However, final accuracy should always be confirmed on
the actual execution target, since all targets have different build
processes, compilers, and kernel implementations with potentially
@@ -464,9 +463,8 @@ significant impact on accuracy.

## Native (Stand-Alone) Execution of Exported Models

Refer to the [README](README.md] for an introduction toNative
execution on servers, desktops and laptops is described under
[runner-build.md]. Mobile and Edge executipon for Android and iOS are
Refer to the [README](README.md] for an introduction to native
execution on servers, desktops, and laptops. Mobile and Edge execution for Android and iOS are
described under [torchchat/edge/docs/Android.md] and [torchchat/edge/docs/iOS.md], respectively.


@@ -475,7 +473,7 @@ described under [torchchat/edge/docs/Android.md] and [torchchat/edge/docs/iOS.md

PyTorch and ExecuTorch support a broad range of devices for running
PyTorch with python (using either eager or eager + `torch.compile`) or
in a python-free environment with AOT Inductor and ExecuTorch.
in a Python-free environment with AOT Inductor and ExecuTorch.


| Hardware | OS | Eager | Eager + Compile | AOT Compile | ET Runtime |
@@ -499,58 +497,6 @@ in a python-free environment with AOT Inductor and ExecuTorch.
*Key*: n/t -- not tested


## Runtime performance with Llama 7B, in tokens per second (4b quantization)

| Hardware | OS | eager | eager + compile | AOT compile | ET Runtime |
|-----|------|-----|-----|-----|-----|
| x86 | Linux | ? | ? | ? | ? |
| x86 | macOS | ? | ? | ? | ? |
| aarch64 | Linux | ? | ? | ? | ? |
| aarch64 | macOS | ? | ? | ? | ? |
| AMD GPU | Linux | ? | ? | ? | ? |
| Nvidia GPU | Linux | ? | ? | ? | ? |
| MPS | macOS | ? | ? | ? | ? |
| MPS | iOS | ? | ? | ? | ? |
| aarch64 | Android | ? | ? | ? | ? |
| Mobile GPU (Vulkan) | Android | ? | ? | ? | ? |
| CoreML | iOS | | ? | ? | ? | ? |
| Hexagon DSP | Android | | ? | ? | ? | ? |
| Raspberry Pi 4/5 | Raspbian | ? | ? | ? | ? |
| Raspberry Pi 4/5 | Android | ? | ? | ? | ? |
| ARM 32b (up to v7) | any | | ? | ? | ? | ? |


## Runtime performance with Llama3, in tokens per second (4b quantization)

| Hardware | OS | eager | eager + compile | AOT compile | ET Runtime |
|-----|------|-----|-----|-----|-----|
| x86 | Linux | ? | ? | ? | ? |
| x86 | macOS | ? | ? | ? | ? |
| aarch64 | Linux | ? | ? | ? | ? |
| aarch64 | macOS | ? | ? | ? | ? |
| AMD GPU | Linux | ? | ? | ? | ? |
| Nvidia GPU | Linux | ? | ? | ? | ? |
| MPS | macOS | ? | ? | ? | ? |
| MPS | iOS | ? | ? | ? | ? |
| aarch64 | Android | ? | ? | ? | ? |
| Mobile GPU (Vulkan) | Android | ? | ? | ? | ? |
| CoreML | iOS | | ? | ? | ? | ? |
| Hexagon DSP | Android | | ? | ? | ? | ? |
| Raspberry Pi 4/5 | Raspbian | ? | ? | ? | ? |
| Raspberry Pi 4/5 | Android | ? | ? | ? | ? |
| ARM 32b (up to v7) | any | | ? | ? | ? | ? |




# CONTRIBUTING to torchchat

We welcome any feature requests, bug reports, or pull requests from
the community. See the [CONTRIBUTING](CONTRIBUTING.md) for
instructions how to contribute to torchchat.



# LICENSE

Torchchat is released under the [BSD 3 license](./LICENSE). However
12 changes: 11 additions & 1 deletion docs/multimodal.md
@@ -14,9 +14,11 @@ This page goes over the different commands you can run with LLama 3.2 11B Vision

While we strongly encourage you to use the Hugging Face checkpoint (which is the default for torchchat when utilizing the commands with the argument `llama3.2-11B`), we also provide support for manually providing the checkpoint. This can be done by replacing the `llama3.2-11B` argument in the commands below with the following:

[skip default]: begin
```
--checkpoint-path <file.pth> --tokenizer-path <tokenizer.model> --params-path torchchat/model_params/Llama-3.2-11B-Vision.json
```
[skip default]: end

## Generation
This generates text output based on a text prompt and (optional) image prompt.
@@ -39,6 +41,9 @@ python3 torchchat.py server llama3.2-11B
```
[skip default]: end

[shell default]: python3 torchchat.py server llama3.2-11B & server_pid=$!


In another terminal, query the server using `curl`. This query might take a few minutes to respond.

<details>
@@ -71,10 +76,13 @@ curl http://127.0.0.1:5000/v1/chat/completions \
"max_tokens": 300
}'
```

[skip default]: begin
```
{"id": "chatcmpl-cb7b39af-a22e-4f71-94a8-17753fa0d00c", "choices": [{"message": {"role": "assistant", "content": "The image depicts a simple black and white cartoon-style drawing of an animal face. It features a profile view, complete with two ears, expressive eyes, and a partial snout. The animal looks to the left, with its eye and mouth implied, suggesting that the drawn face might belong to a rabbit, dog, or pig. The graphic face has a bold black outline and a smaller, solid black nose. A small circle, forming part of the face, has a white background with two black quirkly short and long curved lines forming an outline of what was likely a mouth, complete with two teeth. The presence of the curve lines give the impression that the animal is smiling or speaking. Grey and black shadows behind the right ear and mouth suggest that this face is looking left and upwards. Given the prominent outline of the head and the outline of the nose, it appears that the depicted face is most likely from the side profile of a pig, although the ears make it seem like a dog and the shape of the nose makes it seem like a rabbit. Overall, it seems that this image, possibly part of a character illustration, is conveying a playful or expressive mood through its design and positioning."}, "finish_reason": "stop"}], "created": 1727487574, "model": "llama3.2", "system_fingerprint": "cpu_torch.float16", "object": "chat.completion"}%
```
[skip default]: end

[shell default]: kill ${server_pid}

</details>
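
The `[shell default]` lines added in this file start the server in the background, record its PID via `$!`, and later stop it with `kill ${server_pid}`. A minimal sketch of that pattern follows, using `sleep 300` as a stand-in for the long-running server (an assumption for illustration; the real command is `python3 torchchat.py server llama3.2-11B`).

```shell
#!/usr/bin/env bash
# Sketch of the start-in-background / kill-by-PID pattern.
# `sleep 300` is a stand-in for the long-running server process.

sleep 300 & server_pid=$!    # $! holds the PID of the most recent background job

# ... queries against the server (for example with curl) would go here ...

kill "${server_pid}"         # sends SIGTERM to the background process

# Reap the process so it does not linger; `wait` returns a non-zero
# status because the process was terminated by a signal.
stopped=0
wait "${server_pid}" 2>/dev/null || stopped=$?
echo "server exit status: ${stopped}"
```

Capturing the PID on the same line as the `&` avoids picking up the PID of some later background job by mistake.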

@@ -90,6 +98,8 @@ First, follow the steps in the Server section above to start a local server. The
streamlit run torchchat/usages/browser.py
```

[skip default]: end

---

# Future Work
6 changes: 3 additions & 3 deletions docs/native-execution.md
@@ -16,14 +16,14 @@ The 'llama runner' is a native standalone application capable of
running a model exported and compiled ahead-of-time with either
Executorch (ET) or AOT Inductor (AOTI). Which model format to use
depends on your requirements and preferences. Executorch models are
optimized for portability across a range of decices, including mobile
optimized for portability across a range of devices, including mobile
and edge devices. AOT Inductor models are optimized for a particular
target architecture, which may result in better performance and
efficiency.

Building the runners is straightforward with the included cmake build
files and is covered in the next sections. We will showcase the
runners using ~~stories15M~~ llama2 7B and llama3.
runners using llama2 7B and llama3.

## What can you do with torchchat's llama runner for native execution?

@@ -160,7 +160,7 @@ and native execution environments, respectively.

After exporting a model, you will want to verify that the model
delivers output of high quality, and works as expected. Both can be
achieved with the Python environment. All torchchat Python comands
achieved with the Python environment. All torchchat Python commands
can work with exported models. Instead of loading the model from a
checkpoint or GGUF file, use the `--dso-path model.so` and
`--pte-path model.pte` for loading both types of exported models. This