Add cache rotation inputs and CPU kernel implementation for cache rotation #27088

vshampor · 2024-10-16T13:04:23Z

Tickets:
153783

slyalin · 2024-10-28T07:13:59Z

src/core/src/op/paged_attention.cpp

+                          get_input_size() == 15,
+                          "PagedAttensionExtension expects 15 inputs, but it has ",


These don't look like optional inputs. According to the spec they could be omitted. Replacing the check with get_input_size() == 13 || get_input_size() == 15 wouldn't be a big code modification, but it would unlock a bit of flexibility in the transition period, when various mixes of main ov and genai may happen. Since we keep the PA op internal and are not very particular about op version numbering, a bit of backward-compatibility care would be nice.

Done, but the alibi parameter doesn't seem to follow that approach
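
The relaxed check suggested above can be sketched as a pair of standalone helpers. This is only an illustration, not the op's actual code: the names is_supported_input_count and has_rotation_inputs are hypothetical, and only the 13-or-15 rule comes from the review comment.

```cpp
#include <cstddef>

// Hypothetical helper: accepts the older 13-input form as well as the
// 15-input form that carries the two cache rotation inputs.
constexpr bool is_supported_input_count(std::size_t input_count) {
    return input_count == 13 || input_count == 15;
}

// Hypothetical helper: the rotation inputs sit at indices 13 and 14,
// so they can only be present in the 15-input form.
constexpr bool has_rotation_inputs(std::size_t input_count) {
    return input_count == 15;
}

// Compile-time sanity checks of the rule itself.
static_assert(is_supported_input_count(13), "legacy form must be accepted");
static_assert(!is_supported_input_count(14), "a lone rotation input is rejected");
```

With a check of this kind, a graph built by an older producer (13 inputs) would still pass validation, which is the transition-period flexibility the comment asks for.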

slyalin · 2024-10-28T07:17:35Z

...mon/transformations/src/transformations/sdpa_to_paged_attention/state_management_pattern.cpp

+            pa_arguments.insert(pa_arguments.begin() + 13, v0::Constant::create(element::f32, Shape{0}, {}));
+            pa_arguments.insert(pa_arguments.begin() + 14, v0::Constant::create(element::i32, Shape{0}, {}));


If you make these inputs really optional, these two lines are not required.

slyalin · 2024-10-28T07:19:57Z

src/core/src/op/paged_attention.cpp

+            get_input_partial_shape(13).rank().is_dynamic() ||
+            get_input_partial_shape(13).rank().get_length() == 0 ||
+            get_input_partial_shape(13).rank().get_length() == 1,
+            "Input `rotation_coefficients` should either have an empty shape or rank 1, but it has rank ",


Suggested change

"Input `rotation_coefficients` should either have an empty shape or rank 1, but it has rank ",

"Input `rotation_coefficients` should either have rank 1 or omitted, but it has rank ",

An "empty" shape means [0] here, which has rank 1.

slyalin · 2024-10-28T07:20:39Z

src/core/src/op/paged_attention.cpp

+    NODE_VALIDATION_CHECK(
+            this,
+            get_input_partial_shape(13).rank().is_dynamic() ||
+            get_input_partial_shape(13).rank().get_length() == 0 ||


Suggested change

get_input_partial_shape(13).rank().get_length() == 0 ||
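
The distinction behind these suggestions is between rank and element count: a shape of [0] has rank 1 and zero elements, while a rank-0 shape is a scalar with one element. A minimal standalone sketch follows, using std::vector as a stand-in for the real shape class; the helper names rank_of, element_count and is_placeholder are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Stand-in for a static tensor shape: one entry per dimension.
using Shape = std::vector<std::size_t>;

// Rank is the number of dimensions.
inline std::size_t rank_of(const Shape& shape) {
    return shape.size();
}

// Element count is the product of all dimensions (1 for a rank-0 scalar).
inline std::size_t element_count(const Shape& shape) {
    return std::accumulate(shape.begin(), shape.end(), std::size_t{1},
                           std::multiplies<std::size_t>());
}

// Hypothetical presence check: a rank-1 input with zero elements is
// treated as a placeholder for an omitted input.
inline bool is_placeholder(const Shape& shape) {
    return rank_of(shape) == 1 && element_count(shape) == 0;
}
```

Under this reading, the placeholder constants created with Shape{0} already satisfy a rank-1 check, so a separate rank-0 branch would only admit scalars.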

slyalin · 2024-10-28T07:23:08Z

src/core/src/op/paged_attention.cpp

+            get_input_partial_shape(14).rank().get_length() == 0 ||
+            get_input_partial_shape(14).rank().get_length() == 1,
+            "Input `rotated_block_indices` should either have an empty shape or rank 1 but it has rank ",


The same comments apply here as for input 13 above.

slyalin · 2024-10-28T07:27:03Z

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp

@@ -1576,6 +1591,11 @@ struct AttentionExecutor : public PagedAttentionExecutor {
        if (alibi_slopes) {
            alibi_slopes.assert_dims({H});
        }
+
+        if (rotated_block_indices) {
+            // Rotation, and cache eviction, is limited to cases when Q, K and V embedding sizes are equal, e.g. S == Sv


We already have cases where they are not: minicpm-3

Removed - realized that we don't need that limitation for cache rotation since we only rotate the K values

slyalin · 2024-10-28T07:30:07Z

src/plugins/intel_gpu/src/plugin/ops/paged_attention.cpp

@@ -58,6 +59,10 @@ static void CreatePagedAttentionExtensionOp(ProgramBuilder& p, const std::shared
    OPENVINO_ASSERT(alibi_const != nullptr);
    prim.has_alibi = ov::shape_size(alibi_const->get_output_shape(0)) > 0;

+    std::shared_ptr<ov::op::v0::Constant> rotation_coefficients_const = std::dynamic_pointer_cast<ov::op::v0::Constant>(op->get_input_node_shared_ptr(rotation_coefficients_idx));
+    OPENVINO_ASSERT(rotation_coefficients_const != nullptr);
+    prim.has_rotation_coefficients = ov::shape_size(alibi_const->get_output_shape(0)) > 0;


alibi_const shouldn't be used here -- bad copy&paste?

Fixed, thanks.

dmitry-gorokhov · 2024-11-15T07:52:52Z

@luo-cheng2021 Please review CPU PA changes.

luo-cheng2021 · 2024-11-18T03:14:23Z

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/common.hpp

+#if defined(HAVE_AVX2) || defined(HAVE_AVX512F)
+    inline __m128i get_8bit_tail_mask_for_16bit_elts(size_t num_16bit_tail_elts) {
+        // num_tail_elts may take from 0 to 8
+        static __m128i masks[] = {


Please do not use static __m128i, which causes runtime initialization; an int8_t [][16] array could be used instead.

Done, fixed also for get_mask() below
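
One possible shape of the byte-table approach is sketched below. It is an assumption-laden illustration rather than the code that landed: the names TailMaskTable, tail_mask_row and load_tail_mask are hypothetical, and the mask layout (row n enables the first 2 * n bytes, i.e. n 16-bit elements) is assumed from the function name in the diff.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Plain byte table built by a constexpr constructor, so it is
// constant-initialized and needs no initializer that runs at startup
// or on first call.
// Assumed layout: row n has its first 2 * n bytes set to 0xFF.
struct TailMaskTable {
    std::int8_t rows[9][16];
    constexpr TailMaskTable() : rows{} {
        for (std::size_t n = 0; n < 9; ++n)
            for (std::size_t b = 0; b < 16; ++b)
                rows[n][b] = (b < 2 * n) ? std::int8_t{-1} : std::int8_t{0};
    }
};

static constexpr TailMaskTable k_tail_masks{};

// Returns the 16 mask bytes for a tail of 0..8 16-bit elements.
inline const std::int8_t* tail_mask_row(std::size_t num_16bit_tail_elts) {
    return k_tail_masks.rows[num_16bit_tail_elts];
}

#if defined(__SSE2__)
// The vector register is produced by an unaligned load at the point of use.
inline __m128i load_tail_mask(std::size_t num_16bit_tail_elts) {
    return _mm_loadu_si128(
        reinterpret_cast<const __m128i*>(tail_mask_row(num_16bit_tail_elts)));
}
#endif
```

The table lives in read-only data and the only per-call cost is the load, which is the trade-off the review comment points at.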

luo-cheng2021 · 2024-11-18T03:19:50Z

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp

@@ -9,6 +9,8 @@
 #include <limits>
 #include <type_traits>

+#include <csignal>


luo-cheng2021 · 2024-11-18T03:20:35Z

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp

@@ -1137,31 +1141,60 @@ struct MHAHelper {
                cvt_copy(output_emb.ptr<DATA_TYPE>(pq, h * _SV), _output.ptr<float>(ithr, pq, h), _SV);
    }

+


Redundant blank line.

Formatted with clang-tidy

luo-cheng2021 · 2024-11-18T04:28:20Z

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp

@@ -769,6 +772,7 @@ static void pack_32NxK(float* dst, T* src, float* tmp, size_t N, size_t K, size_
    OPENVINO_THROW("pack_32NxK: should not be called.");
 }

+


Formatted with clang-tidy

luo-cheng2021 · 2024-11-18T04:31:11Z

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp

+            float* rotation_coefficient_block_data = rotation_coefficients.ptr<float>() + i * head_chunk_size;
+            KVCACHE_TYPE* cache_block_ptr = key_cache.ptr<KVCACHE_TYPE>(rotated_block_index);
+            rotate_kv_cache_block(cache_block_ptr, rotation_coefficient_block_data, num_heads, _block_size, embedding_size);
+


Redundant blank line.

Formatted with clang-tidy

luo-cheng2021 · 2024-11-18T06:54:51Z

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/cache_rotation.hpp

+                CT cache_value_1 = *cache_value_1_ptr;
+
+                *cache_value_0_ptr = cache_value_0 * rotation_value_cos - cache_value_1 * rotation_value_sin;
+                *cache_value_1_ptr = cache_value_0 * rotation_value_sin + cache_value_1 * rotation_value_cos;


Is the algorithm the same as in the following code?

openvino/src/plugins/intel_cpu/src/nodes/rope.cpp

Lines 158 to 161 in c4d6d2b

auto src0 = src[i];
auto src1 = src[i + half_rotary_dims];
dst[i] = cos[i] * src0 - sin[i] * src1;
dst[i + half_rotary_dims] = cos[i + half_rotary_dims] * src1 + sin[i + half_rotary_dims] * src0;

If so, the following code can be used as reference:

openvino/src/plugins/intel_cpu/src/nodes/rope.cpp

Lines 35 to 102 in c4d6d2b

static std::shared_ptr<kernel::JitKernelBase> createJitKernel(const jit_rotary_compile_params& param, bool check_vec_size2 = false) {
    std::shared_ptr<kernel::JitKernelBase> res;

    MAYBE_UNUSED(param);
    MAYBE_UNUSED(check_vec_size2);

#if defined(OPENVINO_ARCH_X86_64)

    if (dnnl::impl::cpu::x64::mayiuse(dnnl::impl::cpu::x64::avx512_core)) {
        bool flag = true;
        if (check_vec_size2) {
            auto vec_size = jit_rotary_kernel<dnnl::impl::cpu::x64::avx512_core>::vec_size;
            if (param.rotary_ndims % (vec_size * 2) != 0)
                flag = false;
        }
        if (flag)
            res = std::make_shared<jit_rotary_kernel<dnnl::impl::cpu::x64::avx512_core>>(param);
    } else if (dnnl::impl::cpu::x64::mayiuse(dnnl::impl::cpu::x64::avx2)) {
        bool flag = true;
        if (check_vec_size2) {
            auto vec_size = jit_rotary_kernel<dnnl::impl::cpu::x64::avx2>::vec_size;
            if (param.rotary_ndims % (vec_size * 2) != 0)
                flag = false;
        }
        if (flag)
            res = std::make_shared<jit_rotary_kernel<dnnl::impl::cpu::x64::avx2>>(param);
    }

    if (res)
        res->create_kernel();

#endif // OPENVINO_ARCH_X86_64

    return res;
}

static void execJitKernel(const std::shared_ptr<kernel::JitKernelBase>& ker, const void* src, void* dst, const float* cos, const float* sin) {
    MAYBE_UNUSED(ker);
    MAYBE_UNUSED(src);
    MAYBE_UNUSED(dst);
    MAYBE_UNUSED(cos);
    MAYBE_UNUSED(sin);

#if defined(OPENVINO_ARCH_X86_64)

    jit_rotary_call_args call_args;
    call_args.src = src;
    call_args.cos = cos;
    call_args.sin = sin;
    call_args.dst = dst;
    (*ker)(&call_args);

#endif // OPENVINO_ARCH_X86_64
}

template <typename T>
struct RoPE::RoPEExecutorRotateHalf : public RoPE::Executor {
    const op::internal::RoPE::Config& m_config;
    std::shared_ptr<kernel::JitKernelBase> m_rotaryKernel;

    RoPEExecutorRotateHalf(const op::internal::RoPE::Config& config) : m_config(config) {
        jit_rotary_compile_params jcp;
        jcp.src_prc = precision_of<T>::value;
        jcp.dst_prc = precision_of<T>::value;
        jcp.rotary_ndims = config.rotary_ndims;
        jcp.interleave = false;
        m_rotaryKernel = createJitKernel(jcp);
    }

I have already written and tested my implementation. Besides, the code you've sent me probably cannot be reused without modifications or bulky instantiations.
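
The per-pair arithmetic discussed in this thread is a plain 2D rotation. A minimal scalar sketch follows, using the formula from the cache_rotation.hpp hunk with one cosine/sine pair per element pair. The function name rotate_half_inplace and the pairing of element i with element i + half (taken from the rope.cpp excerpt) are assumptions about layout, not a statement of how the PR stores its data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Rotates a vector in place in a "rotate half" layout: element i is
// paired with element i + half, and each pair is rotated by the angle
// whose cosine and sine are cos_vals[i] and sin_vals[i].
inline void rotate_half_inplace(std::vector<float>& values,
                                const std::vector<float>& cos_vals,
                                const std::vector<float>& sin_vals) {
    assert(values.size() % 2 == 0);
    const std::size_t half = values.size() / 2;
    assert(cos_vals.size() == half && sin_vals.size() == half);
    for (std::size_t i = 0; i < half; ++i) {
        // Read both inputs before writing either output.
        const float v0 = values[i];
        const float v1 = values[i + half];
        values[i] = v0 * cos_vals[i] - v1 * sin_vals[i];
        values[i + half] = v0 * sin_vals[i] + v1 * cos_vals[i];
    }
}
```

For example, rotating {1, 2, 3, 4} with cosines {0, 1} and sines {1, 0} turns the first pair (1, 3) into (-3, 1) and leaves the second pair (2, 4) unchanged.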

luo-cheng2021 · 2024-11-18T07:03:26Z

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/cache_rotation.hpp

+
+template<class CT>
+inline static void rotate_kv_cache_block_hw(CT* cache_block_ptr, float* block_rotation_coefficients_ptr, size_t num_heads, size_t block_size, size_t embedding_size) {
+#if !defined(HAVE_AVX2) && !defined(HAVE_AVX512F)


It would be cleaner to merge rotate_kv_cache_block_hw and rotate_kv_cache_block_sw and let rotate_kv_cache_chunk_xxx handle the tails.

I need HW and SW available as separate functions for testing purposes.
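
Keeping two entry points makes it possible to check the optimized path against a scalar reference on the same input. The sketch below only illustrates that testing pattern: rotate_sw, rotate_chunked and max_abs_diff are hypothetical names, and the chunked function is a scalar stand-in for a SIMD kernel (fixed-width chunks followed by a tail loop), not the PR's implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Rotates one pair (i, i + half) by the angle given through cos/sin.
inline void rotate_pair(float* data, const float* cos_vals, const float* sin_vals,
                        std::size_t i, std::size_t half) {
    const float v0 = data[i];
    const float v1 = data[i + half];
    data[i] = v0 * cos_vals[i] - v1 * sin_vals[i];
    data[i + half] = v0 * sin_vals[i] + v1 * cos_vals[i];
}

// Scalar reference path ("SW").
inline void rotate_sw(float* data, const float* cos_vals, const float* sin_vals,
                      std::size_t half) {
    for (std::size_t i = 0; i < half; ++i)
        rotate_pair(data, cos_vals, sin_vals, i, half);
}

// Stand-in for the vectorized path ("HW"): full chunks first, then the tail.
// A real kernel would replace the inner loop with SIMD intrinsics.
inline void rotate_chunked(float* data, const float* cos_vals, const float* sin_vals,
                           std::size_t half) {
    constexpr std::size_t chunk = 8;
    std::size_t i = 0;
    for (; i + chunk <= half; i += chunk)
        for (std::size_t j = i; j < i + chunk; ++j)
            rotate_pair(data, cos_vals, sin_vals, j, half);
    for (; i < half; ++i)  // tail shorter than one chunk
        rotate_pair(data, cos_vals, sin_vals, i, half);
}

// Largest element-wise difference between two result buffers.
inline float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    return worst;
}
```

A test would run both functions on copies of the same buffer, with a length that is not a multiple of the chunk size so the tail path is exercised, and compare the results within a small tolerance.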

luo-cheng2021 · 2024-11-18T07:08:07Z

src/plugins/intel_cpu/src/nodes/kernels/scaled_attn/executor_pa.cpp

        auto B = past_lens.size(0);
        auto q_len = query.size(2);
        auto kv_len_in_blocks = div_up(max_context_len, _block_size);

        // aligned to cache line (64bytes=16*sizeof(float)) to avoid false sharing
        _weight_bhl.resize<float>({B, _H, q_len, rnd_up(max_context_len, std::max(_block_size, size_t{16}))});

+        // TODO (vshampor): implement cache rotation at this spot
+        if (rotation_coefficients) {


The logic should also be added to the more frequently used path, exec_loop_mixed.

Added, thanks!
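
The block-level step that both execution paths need can be sketched as a loop over the rotated block indices, mirroring the pointer arithmetic in the executor_pa.cpp hunk above. Everything about memory layout here is an assumption made for illustration: a float key cache stored as [block][head][token][embedding], coefficients stored per rotated block as [token][embedding] with cosines in the first half of each row and sines in the second. The names CacheLayout, rotate_block and rotate_listed_blocks are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one key cache block.
struct CacheLayout {
    std::size_t num_heads;
    std::size_t block_size;      // tokens per block
    std::size_t embedding_size;  // assumed even
};

// Rotates every token of every head inside one block.
// Assumed coefficient row per token: [cos_0..cos_{half-1}, sin_0..sin_{half-1}].
inline void rotate_block(float* block, const float* coeffs, const CacheLayout& l) {
    const std::size_t half = l.embedding_size / 2;
    for (std::size_t h = 0; h < l.num_heads; ++h) {
        for (std::size_t t = 0; t < l.block_size; ++t) {
            float* token = block + (h * l.block_size + t) * l.embedding_size;
            const float* cos_vals = coeffs + t * l.embedding_size;
            const float* sin_vals = cos_vals + half;
            for (std::size_t i = 0; i < half; ++i) {
                const float v0 = token[i];
                const float v1 = token[i + half];
                token[i] = v0 * cos_vals[i] - v1 * sin_vals[i];
                token[i + half] = v0 * sin_vals[i] + v1 * cos_vals[i];
            }
        }
    }
}

// Applies rotation only to the listed blocks of the key cache; the i-th
// listed block uses the i-th slice of the coefficient buffer.
inline void rotate_listed_blocks(std::vector<float>& key_cache,
                                 const std::vector<float>& rotation_coefficients,
                                 const std::vector<std::int32_t>& rotated_block_indices,
                                 const CacheLayout& l) {
    const std::size_t block_elems = l.num_heads * l.block_size * l.embedding_size;
    const std::size_t coeffs_per_block = l.block_size * l.embedding_size;
    for (std::size_t i = 0; i < rotated_block_indices.size(); ++i) {
        const auto block_index = static_cast<std::size_t>(rotated_block_indices[i]);
        float* block = key_cache.data() + block_index * block_elems;
        const float* coeffs = rotation_coefficients.data() + i * coeffs_per_block;
        rotate_block(block, coeffs, l);
    }
}
```

Factoring the loop into one function of this kind would let each execution path call it once before attention is computed, and only the key cache is touched, in line with the earlier remark that only K values are rotated.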

github-actions bot added category: Core OpenVINO Core (aka ngraph) category: GPU OpenVINO GPU plugin category: CPU OpenVINO CPU plugin category: transformations OpenVINO Runtime library - Transformations category: CPP API OpenVINO CPP API bindings labels Oct 16, 2024

vshampor force-pushed the token_rotation branch from 2a172b2 to c071571 Compare October 26, 2024 00:52

slyalin requested changes Oct 28, 2024

github-actions bot added the category: build OpenVINO cmake script / infra label Oct 30, 2024

vshampor changed the title ~~Add cache rotation inputs~~ Add cache rotation inputs and CPU kernel implementation for cache rotation Nov 12, 2024

vshampor mentioned this pull request Nov 12, 2024

Token rotation openvinotoolkit/openvino.genai#987

Open

vshampor force-pushed the token_rotation branch from d90e212 to ed46cfe Compare November 12, 2024 20:30

dmitry-gorokhov assigned luo-cheng2021 Nov 15, 2024

luo-cheng2021 requested changes Nov 18, 2024

luo-cheng2021 reviewed Nov 18, 2024

vshampor requested a review from slyalin November 18, 2024 10:11

vshampor force-pushed the token_rotation branch from afa851c to 8a355f3 Compare November 18, 2024 18:10

github-actions bot added category: Python API OpenVINO Python bindings category: TF FE OpenVINO TensorFlow FrontEnd category: PyTorch FE OpenVINO PyTorch Frontend category: JAX FE OpenVINO JAX FrontEnd labels Nov 18, 2024

Add cache rotation inputs in transformations, CPU and GPU plugins

a33f255

vshampor force-pushed the token_rotation branch from 8a355f3 to a33f255 Compare November 19, 2024 10:13

github-actions bot added category: CI OpenVINO public CI github_actions Pull requests that update GitHub Actions code category: NPU OpenVINO NPU plugin labels Nov 19, 2024

Fix warnings

a0818c2

vshampor force-pushed the token_rotation branch from 138de47 to a0818c2 Compare November 19, 2024 12:12

vshampor requested a review from luo-cheng2021 November 19, 2024 12:20

Remove diag message

12079b4

vshampor marked this pull request as ready for review November 19, 2024 12:22

vshampor requested review from a team as code owners November 19, 2024 12:22

vshampor requested review from itikhono and removed request for a team November 19, 2024 12:22

vshampor added 3 commits November 19, 2024 14:09

Fix more warnings

d4f31a0

Compile stub test if not on x86

1fb5c80

Use 16-core executor

cfc777b
