Use _mm512_popcnt_epi64 to speedup hamming distance evaluation. #4020

mulugetam · 2024-11-06T22:45:52Z

The _mm512_popcnt_epi64 intrinsic is used to accelerate Hamming distance calculations in HammingComputerDefault and HammingComputer64.

Benchmarking with bench_hamming_computer on AWS r7i instance shows a performance improvement of up to 30% compared to AVX-2.

This PR depends on PR#4025

alexanderguzhva · 2024-11-07T15:26:41Z

@mulugetam Sure, _mm512_popcnt_epi64() requires VPOPCNTDQ (at least Intel IceLake CPUs), but Faiss AVX512 for Linux builds defaults to F, CD, VL, DQ, BW extensions (https://github.com/facebookresearch/faiss/blob/main/faiss/CMakeLists.txt). So, the performance gain from this PR may be seen only if someone builds Faiss by himself using an MSVC compiler.

mengdilin · 2024-11-07T18:43:19Z

@mulugetam FYI we are also working with other Intel collaborators (@guangzegu and @xtangxtang) to incorporate AMX to FAISS from #3266 @asadoughi suggested perhaps we should have a meeting and sync up on the collaboration with Intel

mengdilin · 2024-11-07T19:00:15Z

Thanks for your contributions :) Question: What is the performance for Hamming distance calculations when VPOPCNTDQ is not supported compared to AVX2?

mulugetam · 2024-11-07T21:52:38Z

@alexanderguzhva

VPOPCNTDQ

Any downside in using the scheme below? I check if the compiler and the underlying machine support VPOPCNTDQ and then add the -mavx512vpopcntdq flag to the target_compile_options based on that.

function(check_avx512vpopcntdq_support result_var)
    set(TEST_PROGRAM "${CMAKE_BINARY_DIR}/check_avx512vpopcntdq.cpp")

    file(WRITE ${TEST_PROGRAM} "
    int main() {
        #ifdef __has_builtin
            #if __has_builtin(__builtin_cpu_supports)
                if (__builtin_cpu_supports(\"avx512vpopcntdq\")) {
                    return 0;
                }
            #endif
        #endif
        return 1;
    }
    ")

    try_run(RUN_RESULT COMPILE_RESULT
            ${CMAKE_BINARY_DIR}
            ${TEST_PROGRAM})

    if(RUN_RESULT EQUAL 0)
        set(${result_var} TRUE PARENT_SCOPE)
    else()
        set(${result_var} FALSE PARENT_SCOPE)
    endif()
endfunction()

include(CheckCXXCompilerFlag)
check_cxx_compiler_flag("-mavx512vpopcntdq" COMPILER_SUPPORTS_AVX512VPOPCNTDQ)

if(COMPILER_SUPPORTS_AVX512VPOPCNTDQ)
        check_avx512vpopcntdq_support(AVX512VPOPCNTDQ_SUPPORT)
        add_definitions(-DHAS_AVX512VPOPCNTDQ_SUPPORT)
endif()

// Line 259 gets modified
if(HAS_AVX512VPOPCNTDQ_SUPPORT)
    target_compile_options(faiss_avx512 PRIVATE
        $<$<COMPILE_LANGUAGE:CXX>:-mavx2 -mfma -mf16c -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mpopcnt -mavx512vpopcntdq>)
else()
    target_compile_options(faiss_avx512 PRIVATE
        $<$<COMPILE_LANGUAGE:CXX>:-mavx2 -mfma -mf16c -mavx512f -mavx512cd -mavx512vl -mavx512dq -mavx512bw -mpopcnt>)
endif()

mulugetam · 2024-11-07T22:12:34Z

Thanks for your contributions :) Question: What is the performance for Hamming distance calculations when VPOPCNTDQ is not supported compared to AVX2?

@mengdilin If I benchmark the AVX-512 build without VPOPCNTDQ and compare it to the AVX2 build, I’d see about a 1% - 5% speedup (depending on the code_size) for AVX-512.

mulugetam · 2024-11-07T22:24:46Z

@mulugetam FYI we are also working with other Intel collaborators (@guangzegu and @xtangxtang) to incorporate AMX to FAISS from #3266 @asadoughi suggested perhaps we should have a meeting and sync up on the collaboration with Intel

@mengdilin
That's nice. I'm approaching this from an OpenSearch perspective, where FAISS is used as one of the engines. We've observed significant performance improvements in vector indexing and search with our previous PR, which added AVX-512 support for ScalarQuantizer.

alexanderguzhva · 2024-11-07T23:15:10Z

@mulugetam Technically, cmake-based check for vpopcntdq may lead to the following situation. Say, a build is performed on a most recent CPU get, and then the built python package is put into conda / pip repository. This package gets spread across the world and leads to problems for all ppl who has Intel Cascade Lake CPUs.
More preferrable solution at this stage would be to add a new architecture mode (currently, there are 'generic', 'avx2' and 'avx512'), it is up to Faiss team to decide whether to accept another arch.
Also, it is possible to have dynamic kernel selection via function pointers, but the kernel is tiny (function pointer overhead will be huge), and Faiss does not currently support such an approach.

mulugetam · 2024-11-08T00:13:22Z

@mulugetam Technically, cmake-based check for vpopcntdq may lead to the following situation. Say, a build is performed on a most recent CPU get, and then the built python package is put into conda / pip repository. This package gets spread across the world and leads to problems for all ppl who has Intel Cascade Lake CPUs. More preferrable solution at this stage would be to add a new architecture mode (currently, there are 'generic', 'avx2' and 'avx512'), it is up to Faiss team to decide whether to accept another arch. Also, it is possible to have dynamic kernel selection via function pointers, but the kernel is tiny (function pointer overhead will be huge), and Faiss does not currently support such an approach.

@alexanderguzhva Thanks! In addition to VPOPCNTDQ, we plan to make additional PRs to speed up the scalar quantizer with FP16 instructions. I will modify this PR and introduce an 'avx512_advanced' mode. Hopefully, the team will accept the changes, given that it offers a performance boost.

alexanderguzhva · 2024-11-08T16:26:46Z

@mulugetam FP16 means avx512fp16 or f16c?

mulugetam · 2024-11-08T17:15:25Z

@alexanderguzhva avx512fp16.

https://networkbuilders.intel.com/solutionslibrary/intel-avx-512-fp16-instruction-set-for-intel-xeon-processor-based-products-technology-guide

alexanderguzhva · 2024-11-08T19:00:12Z

@mulugetam Well, there are two more problems then.

fp16 is not supported on AMD, so this means that the proposed arch should be called something like avx512-sr. Which is beneficial in case if FAISS finally gets AMX support, bcz AMX also requires SR as a minimum.
fp16 might lack the needed precision range. As of now, FP16 version of Faiss ScalarQuantizer assumes that the data is stored in fp16, but computations are performed in fp32. Performing pure computations in fp16 (say, via _mm512_fmadd_ph()) might produce out-of-range problems for the final result. So, I'd recommend adding some specific flag to a ScalarQuantizer class that lets it use pure fp16-based computations.

Nevertheless, please write and benchmark the code, and then we'll decide how to integrate it into Faiss properly. :) Thanks

mulugetam · 2024-11-12T17:21:12Z

@alexanderguzhva @mengdilin I have created new PR#4025 that adds 'avx512-sr' architecture mode and marked this PR to be dependent on it.

Summary: This PR adds a new architecture mode to support the new extensions to AVX512, namely [AVX512-FP16](https://networkbuilders.intel.com/solutionslibrary/intel-avx-512-fp16-instruction-set-for-intel-xeon-processor-based-products-technology-guide), which have been available since Intel® Sapphire Rapids. This PR is a prerequisite for [PR#4020](#4020) that speeds up hamming distance evaluations. Pull Request resolved: #4025 Reviewed By: pankajsingh88 Differential Revision: D67524575 Pulled By: mengdilin fbshipit-source-id: f3a09943b062d720b241f95aef2f390923ffd779

mengdilin · 2024-12-23T17:30:29Z

@mulugetam the dependent PR was just merged. Do you mind rebasing this PR on top to kick off the new CI?

Signed-off-by: Mulugeta Mammo <[email protected]>

mulugetam · 2024-12-23T22:47:56Z

Thank you @mengdilin

facebook-github-bot added the CLA Signed label Nov 6, 2024

mengdilin added the Performance label Nov 7, 2024

mulugetam mentioned this pull request Nov 12, 2024

Add a new architecture mode: 'avx512_spr'. #4025

Closed

Use _mm512_popcnt_epi64 to speedup hamming distance evaluation.

5a9a98f

Signed-off-by: Mulugeta Mammo <[email protected]>

mulugetam force-pushed the avx512_for_hamming_distance branch from 953112e to 5a9a98f Compare December 23, 2024 21:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use _mm512_popcnt_epi64 to speedup hamming distance evaluation. #4020

Use _mm512_popcnt_epi64 to speedup hamming distance evaluation. #4020

mulugetam commented Nov 6, 2024 •

edited

Loading

alexanderguzhva commented Nov 7, 2024

mengdilin commented Nov 7, 2024

mengdilin commented Nov 7, 2024

mulugetam commented Nov 7, 2024

mulugetam commented Nov 7, 2024

mulugetam commented Nov 7, 2024

alexanderguzhva commented Nov 7, 2024 •

edited

Loading

mulugetam commented Nov 8, 2024

alexanderguzhva commented Nov 8, 2024

mulugetam commented Nov 8, 2024

alexanderguzhva commented Nov 8, 2024

mulugetam commented Nov 12, 2024

mengdilin commented Dec 23, 2024

mulugetam commented Dec 23, 2024

Use _mm512_popcnt_epi64 to speedup hamming distance evaluation. #4020

Are you sure you want to change the base?

Use _mm512_popcnt_epi64 to speedup hamming distance evaluation. #4020

Conversation

mulugetam commented Nov 6, 2024 • edited Loading

alexanderguzhva commented Nov 7, 2024

mengdilin commented Nov 7, 2024

mengdilin commented Nov 7, 2024

mulugetam commented Nov 7, 2024

mulugetam commented Nov 7, 2024

mulugetam commented Nov 7, 2024

alexanderguzhva commented Nov 7, 2024 • edited Loading

mulugetam commented Nov 8, 2024

alexanderguzhva commented Nov 8, 2024

mulugetam commented Nov 8, 2024

alexanderguzhva commented Nov 8, 2024

mulugetam commented Nov 12, 2024

mengdilin commented Dec 23, 2024

mulugetam commented Dec 23, 2024

mulugetam commented Nov 6, 2024 •

edited

Loading

alexanderguzhva commented Nov 7, 2024 •

edited

Loading