Support for sdxl pipeline (testing) #152
Conversation
iree_tests/pytorch/models/sdxl-prompt-encoder-tank/model.mlirbc
iree_tests/pytorch/models/sdxl-scheduled-unet-3-tank/real_weights_data_flags.txt
iree_tests/conftest.py
def test_benchmark(self):
    # Launch the hardcoded benchmark command from the test case's directory.
    proc = subprocess.run(self.benchmark_args, capture_output=True, cwd=self.test_cwd)
    if proc.returncode != 0:
        raise IreeRunException(proc, self.test_cwd, self.compile_args)
    # Surface the iree-benchmark-module output in the pytest log.
    outs = proc.stdout.decode("utf-8")
    print(f"Stdout benchmark:\n{outs}\n")
Before this sort of change lands, let's think a bit about what we actually want coverage for. I'm skeptical about having benchmarks built into the same testing flow... though the suite that Stella set up has these:
Just because we care so much about sdxl perf, I think it would be great to have it included in this flow. I didn't look into adding a whole separate flow for it or making it very scalable, because I doubt we will be adding benchmarks for anything else. That's why I also just went with hardcoded commands (the flags also have to live in conftest.py because some flag values are path names relative to other directories that we figure out there). I was thinking we can evaluate and iterate if this becomes a bigger utility, but for now I just went for the easiest/simplest add for benchmarking. Here is an example log with everything running: https://github.com/nod-ai/SHARK-TestSuite/actions/runs/8543035287/job/23405926540
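For reference, the hardcoded command ends up looking roughly like this. This is a simplified sketch rather than the exact code in conftest.py: the attribute names, file names, and flag values are illustrative, and it assumes test_cwd is a pathlib.Path.

# Assembled inside conftest.py because several flag values are paths that are
# relative to the test case directory, which is only known here.
vmfb_path = self.test_cwd / "sdxl_scheduled_unet.vmfb"    # illustrative artifact name
weights_path = self.test_cwd / "real_weights.irpa"        # illustrative parameter file
self.benchmark_args = [
    "iree-benchmark-module",
    f"--module={vmfb_path}",
    f"--parameters=model={weights_path}",
    "--device=hip",                 # illustrative; depends on the target backend
    "--function=main",              # illustrative entry point name
    "--benchmark_repetitions=10",
]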
The value of continuous benchmarks is clear, but I want to be careful about how we integrate them. For this PR, can you leave the benchmarks off and focus on just adding the sdxl models? A follow-up PR can then add benchmarking.
I'd at least like to take some time to work through the specific requirements before jumping straight to an implementation. For example:
- What metrics/artifacts do we want from benchmarking?
  - Each model in isolation? Full pipeline latency? Just dispatch time?
- What do we want done with benchmark results / artifacts?
  - The in-tree benchmarks in IREE submit results to a dashboard (that should use a queryable database...), upload Tracy files to cloud storage, and comment on pending pull requests with results summaries
- Where do we want benchmarks to run?
  - Right after tests, on presubmit to IREE?
  - In a separate job, on separate runners?
I'm also wondering if we want to use pytest as the benchmark runner (either with the existing conftest.py or a forked one), or if we would want to use another runner (we could start with a pile of scripts that just use the same test suite source files).
It might be reasonable to start with `pytest -k benchmark` or `pytest iree_tests/benchmarks` that just runs `iree-benchmark-module` instead of `iree-run-module`, and then let developers dig through the GitHub Actions logs to see results, but I'm worried about going down the path of building an entirely new benchmark "framework" when we already have https://github.com/openxla/iree-comparative-benchmark and https://github.com/openxla/iree/tree/main/build_tools/benchmarks (building something new is likely going to make sense, at least in the short term, but this stuff gets complicated very quickly).
this stuff gets complicated very quickly
For example: https://github.com/openxla/community/blob/main/rfcs/20230505-benchmarking-strategy.md
(think only way to split by runner is having them on different jobs).
I believe that can be done with a matrix too: https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs#example-using-a-multi-dimension-matrix
Could then check other matrix parameters to choose which steps to run... maybe like this:
jobs:
  iree_tests:
    strategy:
      matrix:
        configuration:
          - runner: ubuntu-22.04
            test_command: "pytest iree_tests/simple --durations=0"
          - runner: ubuntu-22.04
            test_command: "pytest iree_tests/onnx/node/generated -n auto -rpfE --timeout=30 --retries 2 --retry-delay 5 --durations=10"
    runs-on: ${{ matrix.runner }}
    steps:
      ...
I'm also referencing https://github.com/openxla/iree/blob/573ff1ff02347266ed747dd316cefaeb4c710396/.github/workflows/ci.yml#L749-L784 (probably tons of other files to reference across GitHub, but that's what I know already...)
If we wanted to plug in to the in-tree benchmark infrastructure that IREE has, we'd want a PR like iree-org/iree#16965. That would feed into https://perf.iree.dev/ and PR comments, but it doesn't also test correctness out of the box, can be tricky to update (multiple Python files, coupled with GitHub Actions), and puts input files / parameters behind a few levels of abstraction that make it harder to run locally.
Yeah, I also wonder how that would work, because we'd essentially need to compile multiple submodels and then use their vmfbs for the pipeline's vmfb. Not sure if it is set up for a pipeline structure.
Also, I've placed the onnx and model tests in different jobs. I think that's best for this suite. Because they don't depend on each other and run independently on different machines, I don't think we need sequential steps. This way we get parallel execution, which helps with scalability in the future. Once we get more machines, splitting on models would also be great :)
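Roughly the shape of that split, with no `needs:` edge between the jobs so they run in parallel (job names, runner labels, and commands here are illustrative, not the exact workflow):

jobs:
  test_onnx:
    runs-on: ubuntu-22.04         # illustrative runner
    steps:
      - uses: actions/checkout@v4
      - run: pytest iree_tests/onnx/node/generated -n auto -rpfE --timeout=30 --retries 2 --retry-delay 5 --durations=10

  test_models:
    runs-on: self-hosted-gpu      # hypothetical runner label
    steps:
      - uses: actions/checkout@v4
      - run: pytest iree_tests/pytorch/models -s -n 4 -k real_weights -rpfE --timeout=1200 --retries 2 --retry-delay 5 --durations=0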
Here is the PR for benchmarking that should be landed after this one: #155. Feel free to add on notes there for future reference
# TODO(scotttodd): add a local cache for these large files to a persistent runner
- name: "Downloading remote files for real weight model tests"
  run: |
    source ${VENV_DIR}/bin/activate
    python3 iree_tests/download_remote_files.py
    python3 iree_tests/download_remote_files.py --root-dir pytorch/models
Oof this is taking 12 minutes to download: https://github.com/nod-ai/SHARK-TestSuite/actions/runs/8546974154/job/23418304591?pr=152#step:7:14
We'll really want that to be cached for presubmit.
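On GitHub-hosted runners one option would be actions/cache keyed on the test config files, so the downloads only rerun when the remote file lists change; on a persistent self-hosted runner a local directory cache would do the same job. A minimal sketch (the path and cache key are illustrative):

- name: "Cache real weight model files"
  uses: actions/cache@v4
  with:
    # Hypothetical location: wherever download_remote_files.py writes the .bin/.irpa files.
    path: iree_tests/pytorch/models
    # Re-download only when the remote file lists in the test configs change.
    key: real-weights-${{ hashFiles('iree_tests/pytorch/models/**/*.json') }}
- name: "Downloading remote files for real weight model tests"
  run: |
    source ${VENV_DIR}/bin/activate
    python3 iree_tests/download_remote_files.py --root-dir pytorch/models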
          ]
        }
      ]
    }
  ]
}
Seeing more of these files, I'm still thinking about how to keep them easy to update. Might refactor to separate JSON files like
test_case_splats.json:

{
  "name": "splats",
  "runtime_flagfile": "splat_data_flags.txt",
  "remote_files": []
}

test_case_real_weights.json:

{
  "name": "real_weights",
  "runtime_flagfile": "real_weights_data_flags.txt",
  "remote_files": [
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_input.0.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_input.1.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_input.2.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_input.3.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_output.0.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/inference_output.1.bin",
    "https://sharkpublic.blob.core.windows.net/sharkpublic/sai/sdxl-prompt-encoder/real_weights.irpa"
  ]
}
Hmm yeah, this is probably easier to decode/update
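A rough sketch of how the collection side could pick those files up (the helper name is hypothetical; the field names follow the example JSON above):

import json
from pathlib import Path

def load_test_cases(test_dir: Path):
    """Collect one test case per test_case_*.json file in a model's directory."""
    cases = []
    for case_file in sorted(test_dir.glob("test_case_*.json")):
        with open(case_file) as f:
            data = json.load(f)
        cases.append(
            {
                "name": data["name"],
                "runtime_flagfile": test_dir / data["runtime_flagfile"],
                "remote_files": data.get("remote_files", []),
            }
        )
    return cases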
.github/workflows/test_iree.yml
pytest iree_tests -n auto -k real_weights -rpfE --timeout=600 --retries 2 --retry-delay 5 --durations=0
pytest iree_tests/pytorch/models -s -n auto -k real_weights -rpfE --timeout=1200 --retries 2 --retry-delay 5 --durations=0
We probably don't want `-n auto`. That's using 64 workers: https://github.com/nod-ai/SHARK-TestSuite/actions/runs/8546974154/job/23418304591?pr=152#step:8:23
It miiiight work out, but multiple large program compiles will compete for the CPU and multiple large program runs will compete for the GPU. I guess ideally the compiler could be using the CPU while the runtime is using the GPU. We also need to be very careful with that for benchmarks - we don't want one benchmark influencing another.
One thing we can do is have multi-stage builds where a CPU machine runs the compiler and a GPU machine runs the tests/benchmarks. IREE's in-tree benchmarks follow that model. As long as the large model weights are externalized into parameter files and the .vmfb files are small, we can limit where we upload/download large files to just where they are needed. If we want to get really fancy, we could spin up GPU machines at the same time as CPU machines and have the GPU machines start to fetch weights (orrr just have them cached already) while the CPU machines work on compiling.
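A rough sketch of that two-stage shape, assuming the weights stay externalized in .irpa parameter files so only the small .vmfb files move between jobs (job names, runner labels, and the pytest selections are illustrative, and this would need a compile-only mode in the test suite since today compile and run happen in one test):

jobs:
  compile_models:
    runs-on: cpu-builder                 # hypothetical CPU runner label
    steps:
      - uses: actions/checkout@v4
      - name: "Compile sdxl submodels"
        run: pytest iree_tests/pytorch/models -m compile_only --durations=0   # hypothetical marker
      - uses: actions/upload-artifact@v4
        with:
          name: sdxl-vmfbs
          path: iree_tests/pytorch/models/**/*.vmfb

  run_models:
    needs: compile_models
    runs-on: self-hosted-gpu             # hypothetical GPU runner label
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: sdxl-vmfbs
          path: iree_tests/pytorch/models
      - name: "Run / benchmark sdxl submodels"
        run: pytest iree_tests/pytorch/models -n 4 -k real_weights -rpfE --timeout=1200 --durations=0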
Yeah, I removed `-n auto` on the benchmarking part for that reason. But sure, we can remove it here too and figure out how we want to proceed in the future. A pipeline on the runner side of things would be cool (the CPU machines compile and pass the vmfbs to the GPU runners to run), but that might also just make things complicated. Thinking in terms of how long the CI run will take: instead of 64, do you think lowering it to 4 or so would work? (Or is that also too much competition?)
Yeah I've been using 4.
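For reference, that would make the model test invocation from above:

pytest iree_tests/pytorch/models -s -n 4 -k real_weights -rpfE --timeout=1200 --retries 2 --retry-delay 5 --durations=0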
@ScottTodd is this good to merge now?
This commit adds the V0 support for testing all sdxl submodels. This PR is not a long-term solution for benchmarking, as Scott and I discussed here: #152. It is the result of a request from our team to get sdxl benchmarking in ASAP. Because it is high priority to add this to the sdxl testing while our team lands patches in IREE, this simply gets the implementation in and working. Scott and I discussed some more intensive and well-structured ways to add benchmarking, which either of us may implement in the future. Also, this PR depends on #152 in terms of landing (hence the CI failure).

Notes for the future if we decide we need a stronger implementation:
1. Maybe something like iree-org/iree#16965, which would feed into https://perf.iree.dev/.
2. This is the benchmarking framework we already have: https://github.com/openxla/iree-comparative-benchmark and https://github.com/openxla/iree/tree/main/build_tools/benchmarks
3. Some questions Scott raised to keep in mind for a future implementation:
   * What metrics/artifacts do we want from benchmarking?
     * Each model in isolation? Full pipeline latency? Just dispatch time?
   * What do we want done with benchmark results / artifacts?
     * The in-tree benchmarks in IREE submit results to a dashboard (that should use a queryable database...), upload Tracy files to cloud storage, and comment on pending pull requests with results summaries
   * Where do we want benchmarks to run?
     * Right after tests, on presubmit to IREE?
     * In a separate job, on separate runners?

If we decide benchmarking needs changes, we will address all of these and come up with a more structured, methodical implementation that either creates a new benchmarking flow here or plugs into the IREE benchmarking setup.
This commit adds support for testing all sdxl submodels
These aren't currently used in either this repository or [iree-org/iree](https://github.com/iree-org/iree). https://github.com/nod-ai/SHARK-TestSuite/blob/25ba2648b76cca75733a435c59d5224e3f397e3b/.github/workflows/test_iree.yml#L82-L89 In fact, I'm not sure if these were ever used? They were branched from other files in #152.