[onert] Share memory for Reshape, ExpandDims and Squeeze #14057

Draft

mbencer wants to merge 25 commits into Samsung:master from mbencer:mbencer/ReshapeAvoidCopy

Contributor

mbencer commented Sep 23, 2024 (edited)

This commit extends the current tensor memory management infrastructure to allow tensors to share memory where possible.

ONE-DCO-1.0-Signed-off-by: Mateusz Bencer [email protected]

Issue: #12836

mbencer requested a review from hseok-oh

September 23, 2024 12:10

mbencer mentioned this pull request

[onert] In-place Reshape and ExpandDims operation #12836

Open

glistening reviewed

runtime/onert/backend/cpu/KernelGenerator.cc Outdated

runtime/onert/backend/cpu/ops/OperationUtils.cc Outdated


          [onert] Share memory for Reshape, ExpandDims and Squeeze

c8d8a75

This commit extends current tensor memory management infrastructure to allow tensor memory sharing if possible.

ONE-DCO-1.0-Signed-off-by: Mateusz Bencer [email protected]

mbencer force-pushed the mbencer/ReshapeAvoidCopy branch from c5ddcc5 to c8d8a75 Compare

October 2, 2024 13:28

Contributor Author

mbencer commented Oct 2, 2024

@glistening @hseok-oh Thank you for reviewing the previous version. In the current version I've completely changed the approach. Memory sharing is now handled during tensor allocation.

mbencer requested a review from glistening

October 2, 2024 13:29

mbencer changed the title ~~[onert] Optimize Reshape, ExpandDims and Squeeze~~ [onert] Share memory for Reshape, ExpandDims and Squeeze

mbencer added 12 commits

October 2, 2024 16:33


          fixed trix and xnpack build

7e3bfae


          skip for other backend

e2283b1


          fix skipping not supported backends

ae54f84


          first unit tests

b901aae


          handling additional const input case + more tests

2cbe01d


          styles applied

9ded99d


          remove unnecessary file

0152e12


          added reassign_indexes_to_single_sources

0b2c4ee


          more tests

b634fbf


          test names refactor

132e237


          styles applied

fe371a6


          Merge remote-tracking branch 'upstream/master' into mbencer/ReshapeAvoidCopy

1ea3bac

Contributor

glistening commented Oct 11, 2024

@mbencer Is it possible to split this PR into 3 PRs, one for each operator (i.e. Reshape, ExpandDims and Squeeze)? If possible, could you please kindly create smaller PRs? It is a bit big for me to review as a whole.

cc @hseok-oh, @ragmani

ragmani reviewed


Contributor

ragmani left a comment

@mbencer
This seems to be my last review on this work because I will be away from the office for a long time. I'm sorry, I won't be able to give feedback anymore.

runtime/onert/core/src/compiler/ExecutorFactory.cc Outdated

+                      }
+                    }
+                    reassign_indexes_to_single_sources(data.shared_memory_operand_map);
+                  }

Contributor

ragmani Oct 11, 2024

In-place is dependent on specific operations, and the kernel implementations may vary for each backend. Also, the kernel implementations for these specific operations currently exist only in the cpu backend. So it would be better to move the creation of this map into the cpu backend.
I think a better place to create and append to the map is KernelGenerator. However, in the cpu backend tensor registration is currently called before KernelGenerator, which makes it difficult to simply move the map into KernelGenerator. You may need to unify BackendContext::genTensors() and BackendContext::genKernels(), as the train backend does.

Contributor Author

mbencer Oct 15, 2024

I see your point. My intention was to make this mechanism more global, but I see that it may not be applicable to other backends.
As you noticed, it's very problematic to move it to KernelGenerator: we need to pass this information to the TensorBuilder ctor (a setter does not seem to be a good approach) and to initConsts (here the only possibility seems to be the local backend context).
My proposal (already implemented) is to call it in runtime/onert/backend/cpu/Backend.h
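
For context on the `reassign_indexes_to_single_sources(data.shared_memory_operand_map)` call in the excerpt under review, here is a minimal sketch of what such a step could look like. It assumes the map goes from a shared operand index to its source operand index (as the `[_, source_ind]` loop later in this thread suggests) and that chains such as Reshape -> Reshape -> Reshape should collapse onto one root source. `OperandIndex` and `reassignToSingleSources` are hypothetical stand-ins, not the PR's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical stand-in for ir::OperandIndex.
using OperandIndex = std::uint32_t;

// Rewrites a shared -> source map so that every shared operand points at the
// root source of its chain, i.e. an operand that is not itself a key of the map.
// Assumption: the map contains no cycles.
void reassignToSingleSources(std::unordered_map<OperandIndex, OperandIndex> &shared_to_source)
{
  for (auto &entry : shared_to_source)
  {
    OperandIndex &source = entry.second;
    // Follow the chain while the current source is itself a shared operand.
    auto it = shared_to_source.find(source);
    while (it != shared_to_source.end())
    {
      source = it->second;
      it = shared_to_source.find(source);
    }
  }
}
```

With a chain 3 -> 2 -> 1 -> 0, every entry ends up pointing at operand 0, so a single buffer would back the whole chain.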

tests/nnfw_api/src/GenModelTests/MemorySharingModels.test.cc Outdated

Comment on lines 20 to 21

		#include "GenModelTest.h"
		#include "CircleGen.h"

Contributor

ragmani Oct 11, 2024

Suggested change

-              #include "GenModelTest.h"
-              #include "CircleGen.h"
+              #include "CircleGen.h"
+              #include "GenModelTest.h"

tests/nnfw_api/src/GenModelTests/MemorySharingModels.test.cc Outdated

+              #include "GenModelTest.h"
+              #include "CircleGen.h"
+              TEST_F(GenModelTest, optimized_reshape_inference)

Contributor

ragmani Oct 11, 2024 (edited)

I think this test is a reshape test, not a test of the reshape optimization (probably in-place). It would be better to rename this test and add tests that verify the in-place implementation. If it's difficult to add them as nnfw_api tests, please add gtests in the directory of the implementation instead.

Contributor Author

mbencer Oct 16, 2024

I see, I've renamed those tests and introduced in-place tests in https://github.com/Samsung/ONE/pull/14057/files#diff-4071f79a2a4bc3fd841a69a64ada0e01827c2e6202bf7c446dec530aa9f0bc6b
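
As an illustration of the distinction raised here (a plain reshape test versus a test of the in-place behaviour), the sketch below shows a reshape-like copy that is skipped when input and output share a buffer, which is the idea named in the title of #14229. `reshapeCopyIfNeeded` is a hypothetical helper written for this example; it is not the cpu backend's kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical reshape-like kernel. Reshape never reorders elements, so the
// copy can be skipped when input and output already point at the same buffer.
// Returns true if a copy was performed, false if the call was in-place.
bool reshapeCopyIfNeeded(const void *input, void *output, std::size_t size_bytes)
{
  assert(input != nullptr && output != nullptr);
  if (input == output)
    return false; // buffers are shared: nothing to do
  std::memcpy(output, input, size_bytes);
  return true;
}
```

An in-place test can then assert on the buffer identity (or on the returned flag) rather than only on the output values, which a copying implementation would also satisfy.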

tests/nnfw_api/src/GenModelTests/MemorySharingModels.test.cc Outdated

+                SUCCEED();
+              }
+              TEST_F(GenModelTest, optimized_expand_dims_inference)

Contributor

ragmani Oct 11, 2024

ditto

tests/nnfw_api/src/GenModelTests/MemorySharingModels.test.cc Outdated

+                SUCCEED();
+              }
+              TEST_F(GenModelTest, optimized_squeeze_inference)

Contributor

ragmani Oct 11, 2024

ditto

tests/nnfw_api/src/GenModelTests/MemorySharingModels.test.cc Outdated

+                SUCCEED();
+              }
+              TEST_F(GenModelTest, optimized_reshape_reshape_reshape_chain_inference)

Contributor

ragmani Oct 11, 2024

ditto

tests/nnfw_api/src/GenModelTests/MemorySharingModels.test.cc

+                SUCCEED();
+              }
+              TEST_F(GenModelTest, reshape_input_model_input_inference)

Contributor

ragmani Oct 11, 2024

ditto

tests/nnfw_api/src/GenModelTests/MemorySharingModels.test.cc

+                SUCCEED();
+              }
+              TEST_F(GenModelTest, reshape_input_model_output_inference)

Contributor

ragmani Oct 11, 2024

ditto

tests/nnfw_api/src/GenModelTests/MemorySharingModels.test.cc

+                SUCCEED();
+              }
+              TEST_F(GenModelTest, reshape_output_model_output_inference)

Contributor

ragmani Oct 11, 2024

ditto

runtime/onert/core/src/compiler/ExecutorFactory.cc Outdated

                 // Create contexts
                 auto whole_op_order = lgraph.graph().topolSortOperations();
+                const std::unordered_set<std::string> memory_sharing_supported_backends = {"cpu", "builtin"};

Contributor

ragmani Oct 11, 2024

Please remove the builtin backend from this set. The kernels in the builtin backend deal with data transmission between other backends, so there is no need to apply in-place for this task. The in-place optimization required there has already been applied in another way.

Contributor Author

mbencer Oct 16, 2024

I see - I rewrote the implementation to be local to the cpu backend.

mbencer added 2 commits

October 11, 2024 16:11


          claim and release memory improvements

12d33dc


          styles applied

b4e655f

Contributor Author

mbencer commented Oct 11, 2024

@mbencer Is it possible to split this PR into 3 PRs, one for each operator (i.e. Reshape, ExpandDims and Squeeze)? If possible, could you please kindly create smaller PRs? It is a bit big for me to review as a whole.

cc @hseok-oh, @ragmani

@glistening Thank you for the response. Sure, I'll try to split the PR, but let me first address the review requests from @ragmani.
I am not sure whether splitting by operator (Reshape, ExpandDims) makes sense here, but splitting into backend and core parts (taking #14057 (comment) into account) should probably make the review easier ;)

Contributor Author

mbencer commented Oct 11, 2024

@mbencer This seems to be my last review on this work because I will be away from the office for a long time. I'm sorry, I won't be able to give feedback anymore.

I see. Anyway, thank you for the very useful feedback! ;)

mbencer added 4 commits

October 15, 2024 09:05


          extract findSharedMemoryOperandsIndexes

14a4c61


          styles applied

c3199da


          added SharedMemoryOperands tests

b19064b


          test name refactor

f0afe39

mbencer added 2 commits

October 16, 2024 11:42


          styles applied

828cd72


          last names refactor

44244d1

mbencer requested a review from ragmani

October 16, 2024 10:17

mbencer added 3 commits

October 16, 2024 20:23


          names refactor

fc4b1c9


          styles applied

50036d0


          Merge remote-tracking branch 'upstream/master' into mbencer/ReshapeAvoidCopy

c790a5e

Contributor Author

mbencer commented Oct 16, 2024

@mbencer Is it possible to split this PR into 3 PRs, one for each operator (i.e. Reshape, ExpandDims and Squeeze)? If possible, could you please kindly create smaller PRs? It is a bit big for me to review as a whole.

cc @hseok-oh, @ragmani

I've split part of the implementation into smaller PRs:

| PR link | Description |
| --- | --- |
| #14227 | [onert] Introduce tests for Reshape, Squeeze and ExpandDims |
| #14228 | [onert] Introduce capabilities to find operands which can share memory |
| #14229 | [onert/cpu] [Reshape, ExpandDims] Avoid copying memory if possible |
| #14230 | [onert] Propagate shared memory operand indexes to cpu backend |

The rest of the changes are deeply interdependent, so I'll push them later.

This was referenced Oct 16, 2024

[onert] Introduce tests for Reshape, Squeeze and ExpandDims #14227

Open

[onert] Introduce capabilities to find operands which can share memory #14228

Open

[onert/cpu] [Reshape, ExpandDims] Avoid copying memory if possible #14229

Open

[onert] Propagate shared memory operand indexes to cpu backend #14230

Open

Contributor Author

mbencer commented Oct 17, 2024

Some timing results (50 repeats) from my dev machine. Note: do NOT treat these as official results:

From current branch

mobilenet v2:

MODEL_LOAD   takes 4.141 ms
PREPARE      takes 11.126 ms
EXECUTE      takes 6.264 ms
- MEAN     :  6.264 ms
- MAX      :  7.880 ms
- MIN      :  6.060 ms
- GEOMEAN  :  6.260 ms

mnist

MODEL_LOAD   takes 0.186 ms
PREPARE      takes 1.314 ms
EXECUTE      takes 0.220 ms
- MEAN     :  0.220 ms
- MAX      :  1.688 ms
- MIN      :  0.153 ms
- GEOMEAN  :  0.195 ms

From master

mobilenet v2:

MODEL_LOAD   takes 4.051 ms
PREPARE      takes 11.240 ms
EXECUTE      takes 6.298 ms
- MEAN     :  6.298 ms
- MAX      :  8.260 ms
- MIN      :  6.031 ms
- GEOMEAN  :  6.292 ms

mnist

MODEL_LOAD   takes 0.210 ms
PREPARE      takes 1.353 ms
EXECUTE      takes 0.233 ms
- MEAN     :  0.233 ms
- MAX      :  1.711 ms
- MIN      :  0.151 ms
- GEOMEAN  :  0.204 ms

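Conclusion: compared with master, preparation time on this branch is about 1% lower for mobilenet v2 (11.126 ms vs 11.240 ms) and almost 3% lower for mnist (1.314 ms vs 1.353 ms). Mean execution time is about 0.54% lower for mobilenet v2 and 5.6% lower for mnist.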
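
The percentages can be recomputed from the timings listed above. A minimal sketch, assuming the change is measured relative to master; `relativeChangePercent` is a helper written for this example only:

```cpp
#include <cassert>
#include <cmath>

// Relative change of the branch timing with respect to the master timing,
// in percent. A negative value means the branch measurement is lower.
double relativeChangePercent(double master_ms, double branch_ms)
{
  assert(master_ms > 0.0);
  return (branch_ms - master_ms) / master_ms * 100.0;
}
```

For example, `relativeChangePercent(6.298, 6.264)` gives roughly -0.54 for the mobilenet v2 mean execution time, and `relativeChangePercent(0.233, 0.220)` gives roughly -5.6 for mnist. These are single-machine numbers, so differences of this size may be within run-to-run noise.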

Contributor Author

mbencer commented Oct 21, 2024

@mbencer Is it possible to split this PR into 3 PRs, one for each operator (i.e. Reshape, ExpandDims and Squeeze)? If possible, could you please kindly create smaller PRs? It is a bit big for me to review as a whole.
cc @hseok-oh, @ragmani

I've split part of the implementation into smaller PRs:

| PR link | Description |
| --- | --- |
| #14227 | [onert] Introduce tests for Reshape, Squeeze and ExpandDims |
| #14228 | [onert] Introduce capabilities to find operands which can share memory |
| #14229 | [onert/cpu] [Reshape, ExpandDims] Avoid copying memory if possible |
| #14230 | [onert] Propagate shared memory operand indexes to cpu backend |

The rest of the changes are deeply interdependent, so I'll push them later.

@hseok-oh, @ragmani, @zetwhite If you have a moment, please take a look at the PRs to review ;)

Contributor

zetwhite commented Oct 22, 2024

If you find a moment please take a look for PRs to review ;)

Thanks for the notice. I'll take a look :)

Contributor

zetwhite commented Oct 24, 2024

If you find a moment please take a look for PRs to review ;)

Thanks for the notice. I'll take a look :)

I read the draft and understood the overall direction. I could review your PR.
But I'm afraid that some runtime members (@Samsung/one_onert ) are out of the office until the middle of November, so it might be hard to get others' reviews.

zetwhite reviewed


runtime/onert/core/include/backend/basic/BackendContextHelpers.h Outdated

Comment on lines 186 to 202

+                std::vector<ir::OperandIndex> registered_source_ind;
+                for (const auto &[_, source_ind] : tensor_builder->getSharedMemoryOperandIndexes())
+                {
+                  if (ctx.external_operands().contains(source_ind))
+                    continue;
+                  if (tensor_builder->isRegistered(source_ind)) // some tensors can have the same source
+                    continue;
+                  tensor_builder->registerTensorInfo(source_ind, graph.operands().at(source_ind).info());
+                  registered_source_ind.emplace_back(source_ind);
+                }
                 graph.operands().iterate([&](const ir::OperandIndex &ind, const ir::Operand &obj) {
                   if (ctx.external_operands().contains(ind))
                     return;
+                  if (std::find(std::begin(registered_source_ind), std::end(registered_source_ind), ind) !=
+                      std::end(registered_source_ind)) // skip tensors already registered
+                    return;

Contributor

zetwhite Oct 30, 2024 (edited)

While reviewing #14228, I re-read this PR.

I'm a bit confused about this part.
In genTensors(), is it sufficient just to register the source_ind first?

I thought this draft tried to allocate only source_operand and avoid allocating shared_operand.
(source_operand: the operand matched with source_ind; shared_operand: the operand matched with share_ind)
But I failed to understand how this code removes the allocation of the shared operand.

@mbencer I guess there is something I missed. Could you help me understand?

Contributor Author

mbencer Oct 30, 2024

Note that registerTensorInfo is only responsible for buildTensor (for both non-const and const tensors).

The second step is the call to allocateNonconsts, where we call tensor->setBuffer. Only at this point do we pass the same buffer to the source tensor and the shared tensor. This of course requires special handling of the lifetime of that memory. The lifetime is controlled by StaticTensorManager::claimPlan, called on the first use of the memory buffer, and StaticTensorManager::releasePlan, called on the last use of the memory buffer (the graph has to be topologically sorted).

Conclusion: we create tensors for both source_operand and shared_operand; they simply share the memory buffer passed via the setBuffer method.

Registering source tensors first is needed here to properly handle cases where a source tensor is constant. In such a case the shared tensor also has to be constant (of ExternalTensor type).
Without this additional code we have no guarantee that source operands will be processed first.
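
The mechanism described in this reply can be sketched roughly as follows. This is a simplified model under stated assumptions, not onert's StaticTensorManager: `Tensor`, `SharedPlanner`, `claim` and `release` are hypothetical names, the planner reference-counts users of a source buffer, and memory is freed only after the last user releases it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these are onert's real classes.
using OperandIndex = std::uint32_t;

struct Tensor
{
  std::uint8_t *buffer = nullptr;
  void setBuffer(std::uint8_t *b) { buffer = b; }
};

class SharedPlanner
{
public:
  // The map goes from a shared operand index to its source operand index.
  explicit SharedPlanner(std::unordered_map<OperandIndex, OperandIndex> shared_to_source)
    : _shared_to_source(std::move(shared_to_source))
  {
  }

  // First use of an operand: allocate only for the first user of a source,
  // then hand the same buffer to every tensor that shares it.
  void claim(OperandIndex ind, std::size_t size, Tensor &tensor)
  {
    Entry &entry = _entries[sourceOf(ind)];
    if (entry.ref_count++ == 0)
      entry.storage.resize(size);
    assert(entry.storage.size() >= size);
    tensor.setBuffer(entry.storage.data());
  }

  // Last use of an operand: free the buffer once no user remains.
  void release(OperandIndex ind)
  {
    auto it = _entries.find(sourceOf(ind));
    assert(it != _entries.end() && it->second.ref_count > 0);
    if (--it->second.ref_count == 0)
      _entries.erase(it);
  }

  bool isAlive(OperandIndex ind) const { return _entries.count(sourceOf(ind)) != 0; }

private:
  struct Entry
  {
    std::size_t ref_count = 0;
    std::vector<std::uint8_t> storage;
  };

  OperandIndex sourceOf(OperandIndex ind) const
  {
    auto it = _shared_to_source.find(ind);
    return it == _shared_to_source.end() ? ind : it->second;
  }

  std::unordered_map<OperandIndex, OperandIndex> _shared_to_source;
  std::unordered_map<OperandIndex, Entry> _entries;
};
```

In this model both the source tensor and the shared tensor exist as separate objects, but `claim` gives them the same pointer, and releasing the source before the shared tensor's last use does not free the buffer.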

Contributor

zetwhite Oct 31, 2024 (edited)

Aha, thank you very much for your kind explanation 👍
I missed the changes in StaticTensorManager.cc. Now I clearly understand it :)

mbencer requested a review from zetwhite

October 30, 2024 12:28


          BackendContext refactor

732e33b
