Support for sdxl pipeline (testing) #152
Conversation
iree_tests/pytorch/models/sdxl-prompt-encoder-tank/model.mlirbc
iree_tests/pytorch/models/sdxl-scheduled-unet-3-tank/real_weights_data_flags.txt
iree_tests/conftest.py
def test_benchmark(self):
    # Launch the hardcoded benchmark command from the test case's directory.
    proc = subprocess.run(self.benchmark_args, capture_output=True, cwd=self.test_cwd)
    if proc.returncode != 0:
        raise IreeRunException(proc, self.test_cwd, self.compile_args)
    # Surface the iree-benchmark-module output in the pytest log.
    outs = proc.stdout.decode("utf-8")
    print(f"Stdout benchmark:\n{outs}\n")
Before this sort of change lands, let's think a bit about what we actually want coverage for. I'm skeptical about having benchmarks built into the same testing flow... though the suite that Stella set up has these:
Just because we care so much about sdxl perf, I think it would be great to have it included in this flow. I didn't look into adding a whole separate flow for it or making it very scalable, because I doubt we will be adding benchmarks for anything else. That's why I also just went with hardcoded commands (the flags also have to live in conftest.py because some flag values are path names relative to other directories that we figure out there). I was thinking we can evaluate and iterate if this becomes a bigger utility, but for now I just went for the easiest/simplest add for benchmarking. Here is an example log with everything running: https://github.com/nod-ai/SHARK-TestSuite/actions/runs/8543035287/job/23405926540
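For reference, the hardcoded command ends up looking roughly like this. This is a simplified sketch rather than the exact code in conftest.py: the attribute names, file names, and flag values are illustrative, and it assumes test_cwd is a pathlib.Path.

# Assembled inside conftest.py because several flag values are paths that are
# relative to the test case directory, which is only known here.
vmfb_path = self.test_cwd / "sdxl_scheduled_unet.vmfb"    # illustrative artifact name
weights_path = self.test_cwd / "real_weights.irpa"        # illustrative parameter file
self.benchmark_args = [
    "iree-benchmark-module",
    f"--module={vmfb_path}",
    f"--parameters=model={weights_path}",
    "--device=hip",                 # illustrative; depends on the target backend
    "--function=main",              # illustrative entry point name
    "--benchmark_repetitions=10",
]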
The value of continuous benchmarks is clear, but I want to be careful about how we integrate them. For this PR, can you leave the benchmarks off and focus on just adding the sdxl models? A follow-up PR can then add benchmarking.
I'd at least like to take some time to work through the specific requirements before jumping straight to an implementation. For example:
- What metrics/artifacts do we want from benchmarking?
  - Each model in isolation? Full pipeline latency? Just dispatch time?
- What do we want done with benchmark results / artifacts?
  - The in-tree benchmarks in IREE submit results to a dashboard (that should use a queryable database...), upload Tracy files to cloud storage, and comment on pending pull requests with results summaries
- Where do we want benchmarks to run?
  - Right after tests, on presubmit to IREE?
  - In a separate job, on separate runners?
I'm also wondering if we want to use pytest as the benchmark runner (either with the existing conftest.py or a forked one), or if we would want to use another runner (we could start with a pile of scripts that just use the same test suite source files).
It might be reasonable to start with `pytest -k benchmark` or `pytest iree_tests/benchmarks` that just runs `iree-benchmark-module` instead of `iree-run-module`, and then let developers dig through the GitHub Actions logs to see results, but I'm worried about going down the path of building an entirely new benchmark "framework" when we already have https://github.com/openxla/iree-comparative-benchmark and https://github.com/openxla/iree/tree/main/build_tools/benchmarks (building something new is likely going to make sense, at least in the short term, but this stuff gets complicated very quickly).
this stuff gets complicated very quickly
For example: https://github.com/openxla/community/blob/main/rfcs/20230505-benchmarking-strategy.md
(think only way to split by runner is having them on different jobs).
I believe that can be done with a matrix too: https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs#example-using-a-multi-dimension-matrix
Could then check other matrix parameters to choose which steps to run... maybe like this:
jobs:
  iree_tests:
    strategy:
      matrix:
        configuration:
          - runner: ubuntu-22.04
            test_command: "pytest iree_tests/simple --durations=0"
          - runner: ubuntu-22.04
            test_command: "pytest iree_tests/onnx/node/generated -n auto -rpfE --timeout=30 --retries 2 --retry-delay 5 --durations=10"
    runs-on: ${{ matrix.runner }}
    steps:
      ...
I'm also referencing https://github.com/openxla/iree/blob/573ff1ff02347266ed747dd316cefaeb4c710396/.github/workflows/ci.yml#L749-L784 (probably tons of other files to reference across GitHub, but that's what I know already...)
If we wanted to plug in to the in-tree benchmark infrastructure that IREE has, we'd want a PR like iree-org/iree#16965. That would feed into https://perf.iree.dev/ and PR comments, but it doesn't also test correctness out of the box, can be tricky to update (multiple Python files, coupled with GitHub Actions), and puts input files / parameters behind a few levels of abstraction that make it harder to run locally.
Yeah, I also wonder how that would work, because we'd essentially need to compile multiple submodels and then use their vmfbs for the pipeline's vmfb. Not sure if it is set up for a pipeline structure.
Also, I've placed the onnx and model tests in different jobs. I think that's best for this suite. Because they don't depend on each other and run independently on different machines, I don't think we need sequential steps. This way we get parallel execution, which helps with scalability in the future. Once we get more machines, splitting on models would also be great :)
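Roughly the shape of that split, with no `needs:` edge between the jobs so they run in parallel (job names, runner labels, and commands here are illustrative, not the exact workflow):

jobs:
  test_onnx:
    runs-on: ubuntu-22.04         # illustrative runner
    steps:
      - uses: actions/checkout@v4
      - run: pytest iree_tests/onnx/node/generated -n auto -rpfE --timeout=30 --retries 2 --retry-delay 5 --durations=10

  test_models:
    runs-on: self-hosted-gpu      # hypothetical runner label
    steps:
      - uses: actions/checkout@v4
      - run: pytest iree_tests/pytorch/models -s -n 4 -k real_weights -rpfE --timeout=1200 --retries 2 --retry-delay 5 --durations=0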
Here is the PR for benchmarking that should be landed after this one: #155. Feel free to add on notes there for future reference
# TODO(scotttodd): add a local cache for these large files to a persistent runner
- name: "Downloading remote files for real weight model tests"
  run: |
    source ${VENV_DIR}/bin/activate
    python3 iree_tests/download_remote_files.py
    python3 iree_tests/download_remote_files.py --root-dir pytorch/models
Oof this is taking 12 minutes to download: https://github.com/nod-ai/SHARK-TestSuite/actions/runs/8546974154/job/23418304591?pr=152#step:7:14
We'll really want that to be cached for presubmit.
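On GitHub-hosted runners one option would be actions/cache keyed on the test config files, so the downloads only rerun when the remote file lists change; on a persistent self-hosted runner a local directory cache would do the same job. A minimal sketch (the path and cache key are illustrative):

- name: "Cache real weight model files"
  uses: actions/cache@v4
  with:
    # Hypothetical location: wherever download_remote_files.py writes the .bin/.irpa files.
    path: iree_tests/pytorch/models
    # Re-download only when the remote file lists in the test configs change.
    key: real-weights-${{ hashFiles('iree_tests/pytorch/models/**/*.json') }}
- name: "Downloading remote files for real weight model tests"
  run: |
    source ${VENV_DIR}/bin/activate
    python3 iree_tests/download_remote_files.py --root-dir pytorch/models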
          ]
        }
      ]
    }
  ]
}
Seeing more of these files, I'm still thinking about how to keep them easy to update. Might refactor to separate JSON files like
test_case_splats.json:

{
  "name": "splats",
  "runtime_flagfile": "splat_data_flags.txt",
  "remote_files": []
}

test_case_real_weights.json:

{
  "name": "real_weights",
  "runtime_flagfile": "real_weights_data_flags.txt",
  "remote_files": [
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_input.0.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_input.1.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_input.2.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_input.3.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_output.0.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_output.1.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/real_weights.irpa"
  ]
}
Hmm yeah, this is probably easier to decode/update
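A rough sketch of how the collection side could pick those files up (the helper name is hypothetical; the field names follow the example JSON above):

import json
from pathlib import Path

def load_test_cases(test_dir: Path):
    """Collect one test case per test_case_*.json file in a model's directory."""
    cases = []
    for case_file in sorted(test_dir.glob("test_case_*.json")):
        with open(case_file) as f:
            data = json.load(f)
        cases.append(
            {
                "name": data["name"],
                "runtime_flagfile": test_dir / data["runtime_flagfile"],
                "remote_files": data.get("remote_files", []),
            }
        )
    return cases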
.github/workflows/test_iree.yml
pytest iree_tests -n auto -k real_weights -rpfE --timeout=600 --retries 2 --retry-delay 5 --durations=0
pytest iree_tests/pytorch/models -s -n auto -k real_weights -rpfE --timeout=1200 --retries 2 --retry-delay 5 --durations=0
We probably don't want `-n auto`. That's using 64 workers: https://github.com/nod-ai/SHARK-TestSuite/actions/runs/8546974154/job/23418304591?pr=152#step:8:23
It miiiight work out, but multiple large program compiles will compete for the CPU and multiple large program runs will compete for the GPU. I guess ideally the compiler could be using the CPU while the runtime is using the GPU. We also need to be very careful with that for benchmarks - we don't want one benchmark influencing another.
One thing we can do is have multi-stage builds where a CPU machine runs the compiler and a GPU machine runs the tests/benchmarks. IREE's in-tree benchmarks follow that model. As long as the large model weights are externalized into parameter files and the .vmfb files are small, we can limit where we upload/download large files to just where they are needed. If we want to get really fancy, we could spin up GPU machines at the same time as CPU machines and have the GPU machines start to fetch weights (orrr just have them cached already) while the CPU machines work on compiling.
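A rough sketch of that two-stage shape, assuming the weights stay externalized in .irpa parameter files so only the small .vmfb files move between jobs (job names, runner labels, and the pytest selections are illustrative, and this would need a compile-only mode in the test suite since today compile and run happen in one test):

jobs:
  compile_models:
    runs-on: cpu-builder                 # hypothetical CPU runner label
    steps:
      - uses: actions/checkout@v4
      - name: "Compile sdxl submodels"
        run: pytest iree_tests/pytorch/models -m compile_only --durations=0   # hypothetical marker
      - uses: actions/upload-artifact@v4
        with:
          name: sdxl-vmfbs
          path: iree_tests/pytorch/models/**/*.vmfb

  run_models:
    needs: compile_models
    runs-on: self-hosted-gpu             # hypothetical GPU runner label
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: sdxl-vmfbs
          path: iree_tests/pytorch/models
      - name: "Run / benchmark sdxl submodels"
        run: pytest iree_tests/pytorch/models -n 4 -k real_weights -rpfE --timeout=1200 --durations=0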
Yeah, I removed `-n auto` on the benchmarking part for that reason. But sure, we can remove it here too and figure out how we want to proceed in the future. A pipeline on the runner side of things would be cool (the CPU machines compile and pass the vmfbs to the GPU runners to run), but that might also just make things complicated. Thinking in terms of how long the CI run will take: instead of 64, do you think lowering it to 4 or so would work? (Or is that also too much competition?)
Yeah I've been using 4.
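For reference, that would make the model test invocation from above:

pytest iree_tests/pytorch/models -s -n 4 -k real_weights -rpfE --timeout=1200 --retries 2 --retry-delay 5 --durations=0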
@ScottTodd is this good to merge now?
This commit adds the V0 support for testing all sdxl submodels. This PR is not a long-term solution for benchmarking, as Scott and I discussed here: #152. It is the result of a request from our team to get sdxl benchmarking in ASAP. Because it is high priority to add this to the sdxl testing while our team lands patches in IREE, this simply gets the implementation in and working. Scott and I discussed some more intensive and well-structured ways to add benchmarking, which either of us may implement in the future. Also, this PR depends on #152 in terms of landing (hence the CI failure).

Notes for the future if we decide we need a stronger implementation:
1. Maybe something like iree-org/iree#16965, which would feed into https://perf.iree.dev/.
2. This is the benchmarking framework we already have: https://github.com/openxla/iree-comparative-benchmark and https://github.com/openxla/iree/tree/main/build_tools/benchmarks
3. Some questions Scott raised to keep in mind for a future implementation:
   * What metrics/artifacts do we want from benchmarking?
     * Each model in isolation? Full pipeline latency? Just dispatch time?
   * What do we want done with benchmark results / artifacts?
     * The in-tree benchmarks in IREE submit results to a dashboard (that should use a queryable database...), upload Tracy files to cloud storage, and comment on pending pull requests with results summaries
   * Where do we want benchmarks to run?
     * Right after tests, on presubmit to IREE?
     * In a separate job, on separate runners?

If we decide benchmarking needs changes, we will address all of these and come up with a more structured, methodical implementation that either creates a new benchmarking flow here or plugs into the IREE benchmarking setup.
This commit adds support for testing all sdxl submodels
These aren't currently used in either this repository or [iree-org/iree](https://github.com/iree-org/iree). https://github.com/nod-ai/SHARK-TestSuite/blob/25ba2648b76cca75733a435c59d5224e3f397e3b/.github/workflows/test_iree.yml#L82-L89 In fact, I'm not sure if these were ever used? They were branched from other files in #152.