
src: gpu: intel: jit: conv: add reorder-based precomputed zero points #2267

Merged: 2 commits merged into main on Jan 10, 2025

Conversation

@hidefromkgb (Contributor) commented Dec 13, 2024

Description

This PR adds a novel approach to calculating zero points ahead of time: the precalculated results are appended to the convolution weights buffers when they are either uploaded to the GPU or reordered on the GPU to fit the layout requested by the conv primitive.
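
For readers skimming the diff, here is a hedged reference sketch of what the precomputed term amounts to for a scalar source zero point (plain C++, no dilation or groups, and a border convention chosen for illustration; the actual precompute is emitted as part of the GPU reorder, not this loop):

```cpp
// Hypothetical reference loop (not the PR's GPU kernel): for a scalar source
// zero point zp_src, an int8 convolution accumulates
//   dst = sum(src * w) - zp_src * sum(w over taps that land inside src),
// so the second term can be precomputed per (oc, od, oh, ow) and appended to
// the weights buffer. Near padded borders the set of valid taps shrinks,
// which is why the buffer carries full DHW spatials, not one value per oc.
#include <cstdint>
#include <vector>

std::vector<int32_t> precompute_zp_comp(
        const std::vector<int8_t> &wei, // OC x IC x KD x KH x KW
        int OC, int IC, int KD, int KH, int KW, int ID, int IH, int IW,
        int OD, int OH, int OW, int SD, int SH, int SW, int PD, int PH, int PW,
        int32_t zp_src) {
    std::vector<int32_t> comp((size_t)OC * OD * OH * OW, 0);
    for (int oc = 0; oc < OC; ++oc)
        for (int od = 0; od < OD; ++od)
            for (int oh = 0; oh < OH; ++oh)
                for (int ow = 0; ow < OW; ++ow) {
                    int32_t acc = 0;
                    for (int ic = 0; ic < IC; ++ic)
                        for (int kd = 0; kd < KD; ++kd)
                            for (int kh = 0; kh < KH; ++kh)
                                for (int kw = 0; kw < KW; ++kw) {
                                    // Skip taps that fall into the zero padding;
                                    // they contribute no correction under this
                                    // (assumed) border convention.
                                    int id = od * SD - PD + kd;
                                    int ih = oh * SH - PH + kh;
                                    int iw = ow * SW - PW + kw;
                                    if (id < 0 || id >= ID || ih < 0 || ih >= IH
                                            || iw < 0 || iw >= IW)
                                        continue;
                                    acc += wei[((((size_t)oc * IC + ic) * KD + kd)
                                                       * KH + kh) * KW + kw];
                                }
                    comp[(((size_t)oc * OD + od) * OH + oh) * OW + ow]
                            = zp_src * acc;
                }
    return comp; // appended to the weights buffer by the reorder
}
```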

Perf results on YOLO-v8n:

| Platform | FP16, ms/iter | INT8+ZP, ms/iter | INT8 no ZP, ms/iter |
|---|---|---|---|
| ATSM/DG2 | 3.3633792 | 2.6763831 | 2.4270376 |
| BMG | 1.46378 | 1.42166 | 1.2961 |

Addresses MFDNN-12548.

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

@hidefromkgb hidefromkgb requested review from a team as code owners December 13, 2024 19:37
@github-actions bot added the platform:cpu-x64, platform:gpu-intel, and platform:gpu-generic labels Dec 13, 2024
@dzarukin (Contributor) left a comment


What if we don't compute this partial summation on the library side but instead provide an API to request compensations and accept them as a per-channel array?

Review threads (resolved):
  • src/common/memory_desc.cpp
  • src/common/memory_desc.hpp
  • src/gpu/gpu_reorder_pd.hpp (3 threads, outdated)
  • src/gpu/gpu_utils.hpp (outdated)
@hidefromkgb (Contributor, Author) commented

> What if we don't compute this partial summation on the library side but instead provide an API to request compensations and accept them as a per-channel array?

An important remark: this is not just a per-channel array but a full-fledged hidden buffer with DHW spatials.
Still, it's definitely possible, in theory, to expose ZP precompute as a part of oneDNN API — but maybe not right now.
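
To make the shape argument concrete (a sketch of the usual int8 zero-point algebra, with the border convention assumed rather than taken from this PR): for a scalar source zero point $z_{\mathrm{src}}$, the term subtracted from the accumulator is

$$\mathrm{comp}(oc, od, oh, ow) = z_{\mathrm{src}} \sum_{ic} \sum_{(kd,kh,kw)\,\in\,V(od,oh,ow)} W(oc, ic, kd, kh, kw),$$

where $V(od,oh,ow)$ is the set of kernel taps whose input coordinates fall inside the source tensor. In the interior, $V$ is the whole kernel and the value depends only on $oc$; near padded borders $V$ shrinks, which is why the hidden buffer needs the full DHW output spatials.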

In fact, exposing this rather than hiding it would be more appropriate from an architectural standpoint, as the current approach is all about tackling a graph-level problem with primitive-level tools.
Furthermore, it'd lift the scalar-ZP-only requirement, a serious limitation of the proposed PR.

On the downside, though, the users would then become responsible for allocating network nodes to handle the ahead-of-time ZP precalculation.

@karturov @dzarukin What do you think of this?

@hidefromkgb hidefromkgb force-pushed the aguskov/reorder_zp_precomp branch 4 times, most recently from 2a37b6b to b74e8bb on December 17, 2024 06:17
Review threads (resolved):
  • src/common/memory_desc_wrapper.hpp
  • src/common/primitive_hashing.cpp
  • src/gpu/intel/jit/reorder/gen_reorder.cpp (outdated)
@echeresh (Contributor) commented

> In fact, exposing this rather than hiding it would be more appropriate from an architectural standpoint, as the current approach is all about tackling a graph-level problem with primitive-level tools.
> Furthermore, it'd lift the scalar-ZP-only requirement, a serious limitation of the proposed PR.
>
> On the downside, though, the users would then become responsible for allocating network nodes to handle the ahead-of-time ZP precalculation.

@hidefromkgb What do you mean exactly by "expose"? Exposing the structure of the memory_extra_desc_t and its contents? If so, what is the potential benefit?

I assume we would still provide some API (e.g., reorder) to compute a compensation buffer in this form. It can't be easily or efficiently pre-computed by the user, as it's not really trivial, so we have to expose it from the library in some form; right now it's done under reorder. Then I would say it's not really important for users to have any details on the buffer contents, since the flow is: 1) allocate an extra buffer, 2) call the reorder (or whatever) primitive, 3) pass this buffer to the convolution. There is no need to expose more details.
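
For illustration only, a minimal sketch of that three-step flow with the standard oneDNN v3 C++ API, under the assumption that the compensation simply rides inside the weights reordered into the layout the convolution requests (shapes, data types, and the scalar zero-point setup are placeholders; none of this reflects the internal memory_extra_desc_t plumbing):

```cpp
// A sketch only: standard oneDNN v3 C++ API with illustrative shapes. Under
// the design discussed here, the compensation is appended to the weights
// during the reorder into the conv-preferred layout, so the user never
// touches it directly.
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

void int8_conv_with_src_zero_point() {
    engine eng(engine::kind::gpu, 0);
    stream strm(eng);

    memory::dims src_dims {1, 64, 56, 56}, wei_dims {64, 64, 3, 3},
            dst_dims {1, 64, 56, 56};
    memory::desc src_md(src_dims, memory::data_type::u8, memory::format_tag::nhwc);
    memory::desc dst_md(dst_dims, memory::data_type::u8, memory::format_tag::nhwc);
    // Let the convolution pick its preferred weights layout.
    memory::desc wei_any(wei_dims, memory::data_type::s8, memory::format_tag::any);

    primitive_attr attr;
    attr.set_zero_points_mask(DNNL_ARG_SRC, /* mask = */ 0); // scalar src zero point

    convolution_forward::primitive_desc conv_pd(eng, prop_kind::forward_inference,
            algorithm::convolution_direct, src_md, wei_any, dst_md,
            /* strides */ {1, 1}, /* padding_l */ {1, 1}, /* padding_r */ {1, 1},
            attr);

    // 1) "Allocate an extra buffer": weights in the layout the conv wants
    //    (the compensation rides along inside this buffer under this design).
    memory wei_user({wei_dims, memory::data_type::s8, memory::format_tag::oihw}, eng);
    memory wei_conv(conv_pd.weights_desc(), eng);

    // 2) Call reorder, which also precomputes the zero-point term.
    reorder::primitive_desc r_pd(eng, wei_user.get_desc(), eng, conv_pd.weights_desc());
    reorder(r_pd).execute(strm, {{DNNL_ARG_FROM, wei_user}, {DNNL_ARG_TO, wei_conv}});

    // 3) Pass the reordered weights (and the runtime zero point) to the conv.
    memory src(src_md, eng), dst(dst_md, eng);
    memory zp_src({{1}, memory::data_type::s32, memory::format_tag::x}, eng);
    convolution_forward(conv_pd).execute(strm,
            {{DNNL_ARG_SRC, src}, {DNNL_ARG_WEIGHTS, wei_conv},
                    {DNNL_ARG_DST, dst},
                    {DNNL_ARG_ATTR_ZERO_POINTS | DNNL_ARG_SRC, zp_src}});
    strm.wait();
}
```

The point being that the user-visible sequence stays create/reorder/execute, so nothing about where or how the compensation is stored needs to leak into the API.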

> Furthermore, it'd lift the scalar-ZP-only requirement, a serious limitation of the proposed PR.

This can be implemented under the current design, right? The user can pass vector zero points as an additional reorder argument, and the resulting compensation buffer would then be computed accordingly. This should require only minor API changes at worst: something like an extra convolution attribute for the user to opt in to this feature.
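
As a hedged aside on the math (same notation as the equation above), a per-channel source zero point only moves the zero point inside the input-channel sum, so the buffer shape stays the same and the reorder would merely need the zero-point vector as an extra input:

$$\mathrm{comp}(oc, od, oh, ow) = \sum_{ic} z_{\mathrm{src}}(ic) \sum_{(kd,kh,kw)\,\in\,V(od,oh,ow)} W(oc, ic, kd, kh, kw).$$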

@hidefromkgb hidefromkgb force-pushed the aguskov/reorder_zp_precomp branch 5 times, most recently from f43badc to 96023e3 on January 3, 2025 17:31
@hidefromkgb hidefromkgb force-pushed the aguskov/reorder_zp_precomp branch 4 times, most recently from e09deff to 2249844 on January 7, 2025 21:43
@hidefromkgb hidefromkgb force-pushed the aguskov/reorder_zp_precomp branch from 2249844 to 04e5d54 on January 7, 2025 23:39
Review threads (resolved, outdated):
  • src/common/memory_desc.hpp
  • src/common/memory_desc.cpp
  • src/gpu/gpu_reorder_pd.hpp
  • src/gpu/gpu_zero_points_conv.hpp
@hidefromkgb hidefromkgb force-pushed the aguskov/reorder_zp_precomp branch 3 times, most recently from 6cd8e9a to d2fb32b on January 9, 2025 07:30
@hidefromkgb hidefromkgb force-pushed the aguskov/reorder_zp_precomp branch from d2fb32b to b49f67b on January 9, 2025 21:23
@hidefromkgb hidefromkgb force-pushed the aguskov/reorder_zp_precomp branch from b49f67b to ce80081 on January 10, 2025 02:00
@hidefromkgb (Contributor, Author) commented

make test
disable benchdnn_all
enable benchdnn_nightly
enable benchdnn_conv
enable benchdnn_deconv
enable benchdnn_pool
enable benchdnn_reorder
enable benchdnn_rnn

@hidefromkgb (Contributor, Author) commented

make test perf-gpu
set primitive=conv

@hidefromkgb hidefromkgb merged commit 4d5ec0f into main Jan 10, 2025
16 of 19 checks passed
@hidefromkgb hidefromkgb deleted the aguskov/reorder_zp_precomp branch January 10, 2025 14:19