
Dynamic selection of extended instruction sets at runtime for x86 architecture #1261

Open: wants to merge 1 commit into base: master


@didzis (Contributor) commented Sep 8, 2023

This PR aims to solve the problem of distributing a single, highly performant binary executable to x86 target machines by detecting the available extended instruction sets at runtime. With this approach, there is no longer a trade-off between optimal performance on each target machine and running on as many machines as possible without illegal instruction errors. Until now, the only safe option was to build for the instruction sets common to all targets, which sacrifices performance on the more capable machines.
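To illustrate what detecting the available extended instruction sets at runtime involves, here is a minimal C sketch of x86 feature detection. It uses the CPUID and XGETBV instructions through GCC/Clang's `<cpuid.h>`. The struct and function names (`cpu_features_t`, `detect_cpu_features`) are hypothetical and not taken from the PR's cpu_features.c. AVX-family flags are reported only when the OS has enabled saving of the wider register state, because the CPU bit alone does not make those instructions safe to use. `avx512f` stands in for the "avx512" family as its foundation subset.

```c
/* Minimal x86 CPU feature detection sketch (GCC/Clang, <cpuid.h>).
 * Hypothetical names; not the PR's actual cpu_features.c. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool sse3, ssse3, fma, f16c, avx, avx2, avx512f;
} cpu_features_t;

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>

/* XCR0 reports which register states the OS saves on context switch. */
static uint64_t read_xcr0(void) {
    uint32_t lo, hi;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    return ((uint64_t)hi << 32) | lo;
}

cpu_features_t detect_cpu_features(void) {
    cpu_features_t f = {0};
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 1: SSE3, SSSE3, FMA, OSXSAVE, AVX and F16C flags are in ECX. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return f;
    f.sse3  = (ecx >> 0) & 1;
    f.ssse3 = (ecx >> 9) & 1;

    /* Execute XGETBV only if the OS advertises OSXSAVE; otherwise it faults. */
    bool osxsave = (ecx >> 27) & 1;
    uint64_t xcr0 = osxsave ? read_xcr0() : 0;
    bool os_ymm = (xcr0 & 0x06) == 0x06;  /* XMM + YMM state enabled */
    bool os_zmm = (xcr0 & 0xE6) == 0xE6;  /* plus opmask and ZMM state */

    f.avx  = os_ymm && ((ecx >> 28) & 1);
    f.fma  = f.avx && ((ecx >> 12) & 1);
    f.f16c = f.avx && ((ecx >> 29) & 1);

    /* Leaf 7, subleaf 0: AVX2 and AVX-512F flags are in EBX. */
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        f.avx2    = f.avx && ((ebx >> 5) & 1);
        f.avx512f = os_zmm && ((ebx >> 16) & 1);
    }
    return f;
}
#else
/* Non-x86 build: report no x86 extensions. */
cpu_features_t detect_cpu_features(void) {
    cpu_features_t f = {0};
    return f;
}
#endif
```

The result could be computed once at startup and consulted whenever an optimized code path is chosen.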

Unfortunately, due to some limitations, the solution is not very clean. Specifically, the same functionality is duplicated for different instruction sets, and also for combinations of instruction sets, to achieve the best performance in the context of whisper.cpp.

Still, I believe this could offer significant practical value to some users, so I am submitting this PR for your consideration.

@Macoron (Contributor) commented Sep 13, 2023

I didn't review the implementation, but this feature is really important for shipping built apps to end customers. Great work, I hope it gets merged!

@didzis force-pushed the dynamic-arch branch 3 times, most recently from 46cfc26 to add74c2 on September 30, 2023 21:53

@z11h commented Nov 6, 2023

Any update on getting this merged?

…86 architecture

This avoids illegal instruction errors when moving a compiled binary
between x86 machines, while retaining almost the full performance
available on each particular target machine.

Implemented using per-function selection of target instruction sets
(sse3, ssse3, f16c, fma, avx, avx2, avx512). To make this patch easier
to apply, some parts are moved out to a separate header file. Due to
certain limitations, and to keep maximum performance in each setting,
there is a degree of code redundancy.

There may still be some runtime overhead, because calling a
dynamically selected optimized function can prevent inlining. To reduce
the possible performance hit, optimized functions are introduced for
some combinations of instruction sets.
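The inlining concern arises because a compiler cannot inline a call made through a runtime-selected function pointer. If dispatch happens once per small kernel call, every call pays for an indirect jump. The sketch below shows a general mitigation, which is not necessarily what this patch does. It selects a variant once, caches the choice, and dispatches at a coarser granularity, so that a whole hot loop is compiled per target and a small `static inline` helper can be inlined into each variant. The matrix-vector routine and all names here are hypothetical.

```c
#include <stddef.h>

/* Helper without a target attribute: GCC/Clang may inline it into callers
 * compiled for a superset ISA, so each variant gets its own copy built
 * with that variant's instruction set. */
static inline float dot_row(const float *row, const float *x, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += row[i] * x[i];
    return sum;
}

/* y = M * x for a rows-by-cols row-major matrix; baseline build. */
static void matvec_generic(const float *m, const float *x, float *y,
                           size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++)
        y[r] = dot_row(m + r * cols, x, cols);
}

#if defined(__x86_64__) || defined(__i386__)
/* Same loop compiled for AVX2+FMA: the indirect call happens once per
 * matvec rather than once per row, and dot_row can be inlined here. */
__attribute__((target("avx2,fma")))
static void matvec_avx2_fma(const float *m, const float *x, float *y,
                            size_t rows, size_t cols) {
    for (size_t r = 0; r < rows; r++)
        y[r] = dot_row(m + r * cols, x, cols);
}
#endif

typedef void (*matvec_fn)(const float *, const float *, float *, size_t, size_t);

/* Resolve the variant once and cache it (not thread-safe as written). */
static matvec_fn matvec_impl(void) {
    static matvec_fn cached = NULL;
    if (cached == NULL) {
        cached = matvec_generic;
#if defined(__x86_64__) || defined(__i386__)
        __builtin_cpu_init();
        if (__builtin_cpu_supports("avx2") && __builtin_cpu_supports("fma"))
            cached = matvec_avx2_fma;
#endif
    }
    return cached;
}
```

In a multithreaded program the cached pointer should be initialized before worker threads start, or guarded by a once-style initializer.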

Runtime selection of extended instruction sets is enabled by setting
the DYNAMIC_ARCH directive.
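Here is a sketch of how a build-time switch like DYNAMIC_ARCH might choose between the two modes. When the switch is defined, the instruction set comes from what the CPU reports at runtime. Otherwise the choice falls back to the compiler's predefined macros (`__AVX512F__`, `__AVX2__` and so on), which reflect the `-m`/`-march` flags used at build time. Only the DYNAMIC_ARCH name comes from the commit message. How it is actually set is an assumption (for example, passing `-DDYNAMIC_ARCH` to the compiler), and the `selected_isa()` helper is hypothetical.

```c
/* Returns the name of the widest instruction set this hypothetical build
 * would use. With DYNAMIC_ARCH, ask the CPU at runtime; without it, use
 * whatever the compiler was told to target (-mavx2, -march=native, ...). */
const char *selected_isa(void) {
#if defined(DYNAMIC_ARCH) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) return "avx512f";
    if (__builtin_cpu_supports("avx2"))    return "avx2";
    if (__builtin_cpu_supports("avx"))     return "avx";
    if (__builtin_cpu_supports("ssse3"))   return "ssse3";
    if (__builtin_cpu_supports("sse3"))    return "sse3";
    return "baseline";
#elif defined(__AVX512F__)
    return "avx512f";
#elif defined(__AVX2__)
    return "avx2";
#elif defined(__AVX__)
    return "avx";
#elif defined(__SSSE3__)
    return "ssse3";
#elif defined(__SSE3__)
    return "sse3";
#else
    return "baseline";
#endif
}
```

Without the switch, a binary built with `-mavx2` assumes AVX2 everywhere and fails with an illegal instruction on older CPUs. With the switch, the same binary adapts to the machine it runs on.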

Works on Linux and macOS (x86_64) using GCC and Clang compilers.

CPU feature detection is placed in a separate file, but can easily be
inlined into ggml.c.

@ngxson (Contributor) commented Sep 7, 2024

Hi, your implementation of cpu_features.c looks very good. Do you need help finishing this PR?

4 participants