Loading time is really slow with large thread count once again #293
Marian is not designed to work with memory shared across multiple graphs. mmap was used as a hack to achieve that in master, but it's not compatible with models that are prepared on demand (which we have to do if we want efficient multiplication on a wide range of hardware). Making graphs work with the same shared (constant) memory is doable, but hacky with the current implementation.
I hadn't thought about the pivoting use case. That would not work well with the std::thread() approach. std::async should still be safe, but I agree that it isn't ideal as you give away control (and what about freeing the …).

I'd have a small preference for not having …

For wasm it might be easier to have loading happen somewhere in …

One thing I was thinking of was just passing in empty translation requests, which would trigger that initial load. It would also solve the issue of having to retain the model instance until it is loaded, because the request would own a shared_ptr to it. But the batcher doesn't have a way to force me to send a request to each worker specifically. So that's a no-go.

I think the best way to go right now is to do initialisation (…
My understanding is that pivoting should not conflict with any edits here. If I understand correctly, you're concerned about the first translation for the user. I think going from lazy to eager is achievable externally by an empty translation and discard, but laziness has to be implemented internally. So I like the empty-translation approach followed by making things lazy, deferring the expensive op until it is absolutely required.
This sounds like a way to go. A factory for …
I remember visiting a few translateLocally threads where this was part of the discussion, and I understand there's value in this - especially to developers and enthusiasts. However, my thinking is that the average user will not care about wps or how things are happening - all they want is the translation in front of them. Let's open an issue for this and take it forward.
Yep, that was what I was thinking! Only I'm a bit sceptical that just an empty translation is sufficient, as I didn't see an easy way to guarantee that all workers receive a batch. The batcher doesn't seem to be aware of which worker is requesting a batch, so there is no easy way to get an empty batch to each of them. But just having lazy loading would already be a huge improvement on its own. We can worry about making it eager (but parallel!) via some tricks afterwards.
The main problem as it stands currently is that if you try to load translateLocally (especially on Windows) on a gaming machine with 24 cores, you get a very annoying and inexplicable 6-second lag until you can do the first translation. I don't quite see the merit of lazy initialisation: I think most of the time the user would want to translate more than n sentences, where n is the number of locally available threads, and initialising n workers in parallel would take about as much time as initialising one worker (discarding factors such as single-thread turbo boost).
It's a bit of a misnomer in this case, maybe. The problem right now is that all workers are initialised sequentially on the main thread. Delaying that initialisation until it's necessary is just an easy way to move the initialisation into the worker threads. The worker threads can then each do their own initialisation, so the initialisation can happen in parallel.
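A minimal sketch of that idea, assuming a hypothetical `Backend` type and `loadModel()` helper (not the actual bergamot-translator API): each worker loads its own backend on first use, so the expensive loads run concurrently in the worker threads instead of sequentially on the main thread.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-worker state; stands in for a graph plus scorers.
struct Backend {
  bool initialised = false;
  void loadModel() { /* expensive: build graph, load parameters */ }
};

void workerLoop(Backend &backend /*, BatchQueue &queue */) {
  // First use triggers the load inside the worker thread itself.
  if (!backend.initialised) {
    backend.loadModel();
    backend.initialised = true;
  }
  // ... pull batches and translate with this backend ...
}

int main() {
  std::size_t numWorkers = std::thread::hardware_concurrency();
  if (numWorkers == 0) numWorkers = 1;
  std::vector<Backend> backends(numWorkers);
  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < numWorkers; ++i)
    workers.emplace_back(workerLoop, std::ref(backends[i]));
  for (auto &worker : workers) worker.join();
}
```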
One thing I'm worried about is that when you have 24 threads and 24 workers but a very small workload (i.e. you just started typing in translateLocally and you have translate-as-you-type turned on), will those sentences end up in the mostly-idle worker that has the model already loaded, or will they end up at one of the other 23 workers that first have to do the (lazy) loading of the model and only then provide a translation? Context: translateLocally currently still makes sure there is only one translation request at a time, but such a request could of course contain multiple batches.
We should put energy into solving the underlying problem by loading the model once and sharing the memory across threads, rather than adding kludges on top.
Until this is done, though, can we have a quick fix?
From what I've gathered so far:
So solving the loading speed regression would mean loading in the worker threads, as there is no way to avoid initialising N graphs and N scorers. Either through lazy loading, or through a somewhat more complex threadpool (there's one in marian's 3rd_party/threadpool.h) that can handle arbitrary tasks. Marian just opens a new threadpool for loading and another for translating, which guarantees that loading has finished before any translation tasks are started. It's pretty similar to my …

Sidetracked: I noticed that we now run a graph with a certain device id always in the same thread, but I can't find any code in tensor/cpu/backend.h that specifically calls for any affinity with that device id; it's just used to seed the random generator.

There might be opportunities to reduce memory consumption a bit by hooking the loading from memory bundles into the mmap loading path. Maybe that's already happening, not sure yet. Edit: when loading from a bundle, it's already using the same …

Slightly related to this, there is the reduce-memory-consumption-by-sharing-a-workspace work that might speed up loading a second model, because we can re-use the workspace.
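For the eager-but-parallel route (loading in worker threads without waiting for the first translation), a rough sketch using plain `std::async`; `loadOneBackend` is a stand-in for whatever builds one graph and scorer ensemble, not an existing function:

```cpp
#include <cstddef>
#include <future>
#include <vector>

// Placeholder for the per-replica load (graph construction, parameter load,
// scorer setup). In bergamot-translator terms this would be one loadBackend(idx).
void loadOneBackend(std::size_t idx) { (void)idx; /* expensive work here */ }

void loadAllBackendsInParallel(std::size_t replicas) {
  std::vector<std::future<void>> pending;
  pending.reserve(replicas);
  for (std::size_t idx = 0; idx < replicas; ++idx)
    pending.push_back(std::async(std::launch::async, loadOneBackend, idx));
  // Block until every backend is ready, so no translation task can race a load.
  for (auto &f : pending) f.get();
}

int main() { loadAllBackendsInParallel(4); }
```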
Quick test in translateLocally:
Looks like parallel loading benefits from using the bundle already.

bundle-load (patch for translateLocally):

diff --git a/src/MarianInterface.cpp b/src/MarianInterface.cpp
index 378b08402941b2a45f8907dc4109d0fb39a8759f..0b9d2af438c3bff09286788cb2e1ac048e79fb61 100644
--- a/src/MarianInterface.cpp
+++ b/src/MarianInterface.cpp
@@ -112,7 +113,10 @@ MarianInterface::MarianInterface(QObject *parent)
// service is done with it, which it is since all translation
// requests are effectively blocking in this thread.
auto modelConfig = makeOptions(modelChange->config_file, modelChange->settings);
- model = std::make_shared<marian::bergamot::TranslationModel>(modelConfig, marian::bergamot::MemoryBundle{}, modelChange->settings.cpu_threads);
+
+ auto bundle = marian::bergamot::getMemoryBundleFromConfig(modelConfig);
+
+ model = std::make_shared<marian::bergamot::TranslationModel>(modelConfig, std::move(bundle), modelChange->settings.cpu_threads);
} else if (input) {
if (model) {
std::future<int> wordCount = std::async(countWords, *input); // @TODO we're doing an "unnecessary" string copy here (necessary because we std::move input into service->translate)

parallel-init (patch for bergamot-translator):

diff --git a/src/translator/translation_model.h b/src/translator/translation_model.h
index 6d2169494ee1ec1d66c0275b56a17a6d4e1e16ff..f9400751bed3c316db1f06ec5b0c6229ea2bec61 100644
--- a/src/translator/translation_model.h
+++ b/src/translator/translation_model.h
@@ -107,6 +107,7 @@ class TranslationModel {
Graph graph;
ScorerEnsemble scorerEnsemble;
+ bool initialised{false};
};
// ShortlistGenerator is purely const, we don't need one per thread.
diff --git a/src/translator/translation_model.cpp b/src/translator/translation_model.cpp
index 9d2eb0cdb73526584d53e5cc2e32facfffc9650e..df2d413f966fd2000a42580058438457570d460a 100644
--- a/src/translator/translation_model.cpp
+++ b/src/translator/translation_model.cpp
@@ -44,21 +44,17 @@ TranslationModel::TranslationModel(const Config &options, MemoryBundle &&memory
srcIdx, trgIdx, shared_vcb);
}
}
-
- for (size_t idx = 0; idx < replicas; idx++) {
- loadBackend(idx);
- }
}
void TranslationModel::loadBackend(size_t idx) {
auto &graph = backend_[idx].graph;
auto &scorerEnsemble = backend_[idx].scorerEnsemble;
@@ -172,6 +168,12 @@ Ptr<marian::data::CorpusBatch> TranslationModel::convertToMarianBatch(Batch &bat
void TranslationModel::translateBatch(size_t deviceId, Batch &batch) {
auto &backend = backend_[deviceId];
+
+ if (!backend.initialised) {
+ loadBackend(deviceId);
+ backend.initialised = true;
+ }
+
BeamSearch search(options_, backend.scorerEnsemble, vocabs_.target());
Histories histories = search.search(backend.graph, convertToMarianBatch(batch));
batch.completeBatch(histories);
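One note on the patch above: the plain `bool initialised` flag is only safe if a given `deviceId` is always served by the same worker thread (which the earlier observation about device ids and threads suggests, but hasn't been verified here). If that guarantee ever went away, a per-backend `std::once_flag` would keep the lazy load race-free; a hedged sketch with placeholder types, not the real class:

```cpp
#include <cstddef>
#include <mutex>

// Stand-in for the per-replica state from the diff above.
struct Backend {
  std::once_flag loadOnce;  // replaces the plain bool flag
  // Graph graph; ScorerEnsemble scorerEnsemble;  // real members elided
};

// Placeholder for the expensive per-replica load.
void loadBackend(std::size_t /*idx*/) { /* build graph, load parameters */ }

void translateBatch(Backend &backend, std::size_t deviceId) {
  // call_once runs loadBackend(deviceId) exactly once per backend,
  // even if two threads were ever to race on the same deviceId.
  std::call_once(backend.loadOnce, loadBackend, deviceId);
  // ... run beam search against backend.graph ...
}
```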
The bergamot-translator should do the bundle loading inside its constructor. Once it reads the config file, it has access to the memory information and can just load it before initialising workers. There is no reason why the bundle loading should happen outside bergamot-translator.
Could we please have a simple fix until the proper one? I would like to push a build that doesn't take 8 seconds to start on my PC...
I've opened #303, which is basically the diff above. I've got no better short-term solution (nor long-term, really).
Yes please, that would be helpful. I didn't add it to the pull request yet because I couldn't see an easy way of checking whether the MemoryBundle argument was a default-constructed one or not. Ideally I could just overload the TranslationModel constructor for the didn't-pass-in-a-memory-bundle case, but the bundle is not the last argument; thread count is, which you do want to pass in the translateLocally use case.
diff --git a/src/translator/translation_model.h b/src/translator/translation_model.h
index 3e79fdb..1861bc9 100644
--- a/src/translator/translation_model.h
+++ b/src/translator/translation_model.h
@@ -58,6 +58,10 @@ class TranslationModel {
/// ShortlistGenerator, Vocabs and SentenceSplitter.
TranslationModel(const Config& options, MemoryBundle&& memory = MemoryBundle{}, size_t replicas = 1);
+ // Path for translateLocally
+ TranslationModel(const Config& options, size_t replicas = 1)
+ : TranslationModel(options, getMemoryBundleFromConfig(options), replicas) {}
+
/// Make a Request to be translated by this TranslationModel instance.
/// @param [in] requestId: Unique identifier associated with this request, available from Service.
/// @param [in] source: Source text to be translated. Ownership is accepted and eventually returned to the client in
Something like this?
Exactly like that. Does that compile? If you only pass in options, how does the compiler know which of the two constructors to use? (Don't know why I hadn't thought of that; I somehow had it in my head that all constructors had to have the same arguments, just with more or fewer of them optional.)
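To make the overload question concrete (a standalone illustration with dummy types, not the real classes): when the `MemoryBundle&&` parameter has a default argument, a one-argument call matches both constructors and the compiler reports an ambiguity; dropping that default, as in the diff below, leaves exactly one viable overload for a single-argument call.

```cpp
#include <cstddef>

struct Config {};
struct MemoryBundle {};

struct Ambiguous {
  // Both constructors accept a single Config argument once defaults kick in.
  Ambiguous(const Config&, MemoryBundle&& = MemoryBundle{}, std::size_t = 1) {}
  Ambiguous(const Config&, std::size_t = 1) {}
};

struct Resolved {
  // No default for the bundle: a one-argument call has exactly one match.
  Resolved(const Config&, MemoryBundle&&, std::size_t = 1) {}
  Resolved(const Config&, std::size_t = 1) {}
};

int main() {
  Config config;
  // Ambiguous broken(config);              // error: call of overloaded constructor is ambiguous
  Resolved lazy(config);                     // picks the (Config, size_t) overload
  Resolved eager(config, MemoryBundle{}, 4); // explicit bundle and replica count
  (void)lazy; (void)eager;
}
```

The class itself compiles either way; the ambiguity only shows up at a single-argument call site, which is why removing the default on the memory parameter resolves it.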
diff --git a/src/translator/translation_model.h b/src/translator/translation_model.h
index 3e79fdb..3c8de81 100644
--- a/src/translator/translation_model.h
+++ b/src/translator/translation_model.h
@@ -6,6 +6,7 @@
#include "batch.h"
#include "batching_pool.h"
+#include "byte_array_util.h"
#include "cache.h"
#include "common/utils.h"
#include "data/shortlist.h"
@@ -56,7 +57,11 @@ class TranslationModel {
/// @param [in] options: Marian options object.
/// @param [in] memory: MemoryBundle object holding memory buffers containing parameters to build MarianBackend,
/// ShortlistGenerator, Vocabs and SentenceSplitter.
- TranslationModel(const Config& options, MemoryBundle&& memory = MemoryBundle{}, size_t replicas = 1);
+ TranslationModel(const Config& options, MemoryBundle&& memory, size_t replicas = 1);
+
+ // Path for translateLocally
+ TranslationModel(const Config& options, size_t replicas = 1)
+ : TranslationModel(options, getMemoryBundleFromConfig(options), replicas) {}
/// Make a Request to be translated by this TranslationModel instance.
/// @param [in] requestId: Unique identifier associated with this request, available from Service.

The above compiles, still unverified as a fix though.
This was identified as a bergamot-translator issue in XapaJIaMnu/translateLocally#76.

A nice solution may involve shared model memory across worker threads (avoiding intgemm/shortlist preprocessing placing stuff in the graph [needs verification]). This memory will be owned by TranslationModel. Everything transient will remain in the workspace, and the workspace will be attached to the worker. This issue is closely related to #257.

A temporary workaround provided by @jelmervdl is: …

Unsure about putting std::thread or std::async within TranslationModel; the threading and delegations should ideally be within Service. As part of resolving this, we should ideally check in something on var through BRT which checks that model-loading speeds remain unaffected hereafter.
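A hedged sketch of the ownership split described above, with placeholder types rather than actual marian/bergamot classes, and assuming the shared parameter blob really can stay read-only (the intgemm/shortlist caveat above still needs verification): the model owns one immutable copy of the parameters, while each worker keeps only its own mutable workspace.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Immutable, loaded once, shared by every worker (placeholder type).
using SharedModelMemory = std::shared_ptr<const std::vector<char>>;

// Mutable scratch space; one per worker, never shared.
struct Workspace {
  std::vector<char> scratch;
  explicit Workspace(std::size_t bytes) : scratch(bytes) {}
};

struct Worker {
  SharedModelMemory model;  // refcounted view of the single parameter blob
  Workspace workspace;      // transient tensors live here only
  Worker(SharedModelMemory m, std::size_t workspaceBytes)
      : model(std::move(m)), workspace(workspaceBytes) {}
};

int main() {
  auto modelMemory = std::make_shared<const std::vector<char>>(/* parameter bytes */);
  std::vector<Worker> workers;
  for (std::size_t i = 0; i < 4; ++i)
    workers.emplace_back(modelMemory, 64 * 1024 * 1024);  // 64 MiB workspace each
}
```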